Professor Taguchi, Department of Physics, Faculty of Science and Technology at Chuo University and colleague Professor Mototake discuss their research on ‘Signal identification without signal formulation’ in this Open Access Government Q&A
Professor Taguchi’s main research interests are in bioinformatics, specifically in multi-omics data analysis using linear algebra. Having edited one book and published another on bioinformatics, along with more than 100 journal papers, book chapters and conference papers, Taguchi was identified as one of the top 2% of researchers in the field of bioinformatics by Stanford University in 2021 and 2022.
Professor Taguchi’s and Professor Mototake’s current research centres around what they describe as signal identification without signal formulation. Here, we examine the topic more closely with this exclusive Q&A.
In your research, you mentioned the assumption that most data, including gene expression data, is potentially generated by dynamical systems. Can you elaborate on this assumption and how it relates to the results of signal extraction?
There is no direct way to evaluate this assumption. However, we could successfully demonstrate that data sets such as gene expression, which lack a transparent dynamical system, can be processed with essentially the same procedure by which we process data generated from an explicit dynamical system (i.e., the globally coupled map (GCM)).
This is strong evidence that data sets such as gene expression, which lack a clear dynamical system, nonetheless harbour hidden dynamical systems. Because of this assumption, we believe that a data set without an apparent dynamical system can be processed with a procedure similar to the one used for a data set generated from a dynamical system. In addition, non-quantum systems obey dynamical determinism, so given infinitely many samples, all such data can be treated as originating from a dynamical system.
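To make the reference system concrete, here is a minimal sketch of a globally coupled map of logistic elements, the kind of dynamical system the answer refers to. The parameter values (nonlinearity `a`, coupling `eps`, the number of maps and of steps) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def simulate_gcm(n_maps=100, n_steps=1000, a=1.8, eps=0.1, seed=0):
    """Iterate a globally coupled map (GCM):
        x_i(t+1) = (1 - eps) * f(x_i(t)) + (eps / N) * sum_j f(x_j(t)),
    with the logistic map f(x) = 1 - a * x**2.
    Returns the trajectory as an (n_steps, n_maps) array."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_maps)      # random initial condition
    traj = np.empty((n_steps, n_maps))
    for t in range(n_steps):
        fx = 1.0 - a * x**2                      # element-wise logistic map
        x = (1.0 - eps) * fx + eps * fx.mean()   # mean-field (global) coupling
        traj[t] = x
    return traj

traj = simulate_gcm()
print(traj.shape)  # (1000, 100)
```

Depending on `a` and `eps`, the maps may synchronise into a small number of clusters; a clustered regime is what gives rise to the multi-state "signal" variables discussed below.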
Could you provide more details about the method you developed to detect signal variables based on the structural change of histograms as the number of samples decreases? How did you validate the accuracy of this method?
When a data set lacks structure, i.e., when it is effectively a set of random numbers, the histogram is entirely flat. If structure is present, a sharp peak appears in the region of P ~ 0, which corresponds to signal variables. For the data set generated from the GCM, since we know the correct answer (the three-state solution), we can evaluate the method’s performance by checking how accurately it detects these three-state variables.
Furthermore, for gene expression, we could validate the selected genes from a biological point of view. However, we did not do this, since the paper was submitted to a physics journal.
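The histogram test described above can be sketched as follows. Each variable is assigned a P-value under a null model; pure noise yields a near-flat histogram of P-values, while signal variables pile up in the first bin (P ~ 0). The null model used here (a standard-normal test on each variable's mean) and the synthetic data are illustrative assumptions, not necessarily the exact recipe of the paper.

```python
import math
import numpy as np

def p_value_histogram(data, n_bins=20):
    """data: (n_variables, n_samples). Attribute a two-sided P-value to
    each variable's mean under a standard-normal null, then histogram
    the P-values over [0, 1]. Returns (p_values, bin_counts)."""
    n = data.shape[1]
    z = data.mean(axis=1) * math.sqrt(n)               # z-score of each mean
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    hist, _ = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    return p, hist

rng = np.random.default_rng(0)
noise = rng.normal(size=(1000, 10))     # 1000 variables, 10 samples: pure noise
signal = noise.copy()
signal[:50] += 2.0                      # 50 "signal" variables with shifted mean
_, h_noise = p_value_histogram(noise)
_, h_sig = p_value_histogram(signal)
# Noise gives a near-flat histogram; the signal data adds a sharp
# excess of counts in the first bin (P ~ 0).
print(h_noise[0], h_sig[0])
```

Flagging the variables that fall in the P ~ 0 peak (e.g., after multiple-testing correction) then recovers the signal set without ever formulating what the signal looks like.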
How do you envision the practical applications or potential impact of your research on signal identification without signal formulation in various domains? Are there specific fields or industries where this approach could be particularly valuable?
It is always necessary to detect signals within large amounts of noise. For example, suppose you have a set of sounds recorded at a party; you need to detect human voices amid the surrounding noise. Or you might have a time sequence of stock prices and be able to detect hidden structures with which to predict future prices. In this sense, the possible applications of our method are unlimited. Any company or field that handles vast amounts of data can employ our method to detect something that might be regarded as a signal.
Did you encounter any limitations or challenges during your research? If so, how did you address or mitigate them?
The limitation is how to relate a static data set lacking apparent time points to data with explicit time points. However, we can overcome this difficulty by assuming that the time interval between subsequent observations is inversely proportional to the number of observations (samples) in the static data set. This assumption seems to work very well.
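The assumption above can be sketched in a few lines: the n samples of a static data set are spread over a fixed observation window T, so the pseudo-time spacing dt = T / n shrinks as n grows, i.e., it is inversely proportional to the number of samples. The choice T = 1.0 is arbitrary and purely illustrative.

```python
import numpy as np

def pseudo_times(n_samples, total_time=1.0):
    """Assign pseudo-time stamps to a static data set: n_samples
    observations spread evenly over a window of length total_time,
    so the spacing dt = total_time / n_samples is inversely
    proportional to the sample count."""
    dt = total_time / n_samples
    return dt, np.arange(n_samples) * dt

dt_small, _ = pseudo_times(10)
dt_large, _ = pseudo_times(1000)
print(dt_small, dt_large)  # spacing shrinks 100-fold with 100x more samples
```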
Is there any advice or key takeaway you would like to share with researchers or practitioners interested in signal identification, or those working with small-sample, high-dimensional data?
One should note that it is essential to let data sets speak for themselves, without human-made assumptions. We often impose subjective criteria on what a signal is; detecting signals tends to require more subjective criteria than detecting noise, and this can prevent us from detecting the true signals. We are simply not clever enough to know such things about signals a priori.