Shinichi Tokuno from the Department of Verbal Analysis of Pathophysiology, Graduate School of Medicine, at the University of Tokyo, lifts the lid on voice analysis technology for healthcare
It is possible to tell from a person’s voice that he or she is unwell. This is true not only of the familiar voices of family members and friends, but also of the voices of strangers, which suggests that the voice has characteristics that change with one’s physical condition. Apart from the diseases of otolaryngology, which directly affect the voice, it has long been noted that patients with other conditions, including depression, have a characteristic vocal quality (Newman et al., 1938). Over time, attempts have been made to evaluate the speech characteristics specific to these diseases objectively.
Initial research analysed the speaking rate, switching pauses and pause rate of patients with depression (Weintraub et al., 1967). Then, with the development of computers, research began on the fundamental frequency of speech (Nilsonne et al., 1988) and on the formant frequencies formed by resonance as vocal cord vibrations pass through the vocal tract (Flint et al., 1993).
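To make these measures concrete, below is a minimal sketch of how fundamental frequency and pause rate might be extracted from a recording. It uses the open-source librosa library; the file name, sampling rate and silence threshold are illustrative assumptions, not details taken from the studies cited above.

```python
# Illustrative only: estimate fundamental frequency (F0) and pause rate
# from a speech recording. The file name and thresholds are assumptions.
import numpy as np
import librosa

# Hypothetical mono speech recording, resampled to 16 kHz.
y, sr = librosa.load("speech_sample.wav", sr=16000)

# F0 contour via the YIN algorithm, restricted to a typical speech range.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
print(f"Mean F0: {np.mean(f0):.1f} Hz (SD {np.std(f0):.1f} Hz)")

# Pause rate: share of the recording quieter than -30 dB relative to peak.
intervals = librosa.effects.split(y, top_db=30)  # non-silent spans
voiced = sum(end - start for start, end in intervals)
print(f"Pause rate: {1.0 - voiced / len(y):.1%}")
```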
Recent changes in the human-machine interface have increased the opportunities for people to operate devices by voice; in other words, devices now capture speech far more often. Attention has therefore turned to using speech as a biomarker, and the latest work has moved beyond simple frequency analysis, with research flourishing in computation-intensive fields such as machine learning and deep learning (Fang et al., 2018).
Unlike biomarkers obtained from other specimens, voice is cost-effective because it requires no special devices or reagents and can be measured repeatedly by simple, non-invasive means. It is therefore suited not only to screening but also to monitoring at home, which may enable the early detection of disease. Taking a slightly different approach, we focused on the reduced emotional expression that manifests when a person is stressed or depressed and developed ‘MIMOSYS (Mind Monitoring System)’, an application that automatically monitors the smartphone owner’s mental health from their voice during phone conversations (Tokuno, 2018). The system is based on voice Sensibility Technology (ST), which detects emotions from changes in the fundamental frequency of the voice. MIMOSYS calculates ‘Vitality’, an index of mental health, from the proportions of emotion that ST detects within the measured conversation.
Because Vitality varies considerably with the content of the conversation and the conversation partner, we devised a second indicator, ‘Mental Activity’, based on the variation in Vitality over a two-week period and the mean Vitality over that period. We verified both indicators with more than 10,000 people in a variety of situations and confirmed their effectiveness.
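The ST engine and the exact MIMOSYS scoring are proprietary, so the sketch below only mirrors the published description: a per-call Vitality derived from the proportion of emotional expression detected in a conversation, and a Mental Activity score combining the mean and variability of Vitality over two weeks. The inputs, weights and combination rule are all hypothetical and do not reflect the commercial implementation.

```python
# Loose illustration of the two-level scoring described above; all
# weights and the combination rule are hypothetical stand-ins.
import numpy as np

def vitality(expressive_fraction: float) -> float:
    """Per-call Vitality: here simply the fraction of the conversation in
    which the (proprietary) ST engine detected emotional expression."""
    return float(np.clip(expressive_fraction, 0.0, 1.0))

def mental_activity(vitality_series) -> float:
    """Two-week Mental Activity combining the mean Vitality level and its
    day-to-day stability; the 50/50 weighting is a made-up example."""
    v = np.asarray(vitality_series, dtype=float)
    level = v.mean()
    stability = 1.0 - min(v.std(), 1.0)  # low variability treated as favourable
    return 0.5 * level + 0.5 * stability

# Example: one Vitality score per day over two weeks.
daily = [vitality(f) for f in
         [0.62, 0.58, 0.66, 0.50, 0.55, 0.61, 0.64,
          0.59, 0.57, 0.63, 0.60, 0.54, 0.65, 0.62]]
print(f"Mental Activity over two weeks: {mental_activity(daily):.2f}")
```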
Mental Activity accurately tracked users’ mental fluctuations; in other words, it captured trends in their degree of stress. The algorithm we developed has already been commercialised for occupational health in Japan, and applications using it come preinstalled on a number of smartphones for personal use.
We are currently adding a function that analyses set phrases read aloud, for users who make few telephone calls, and we are verifying the applications’ effectiveness in multiple languages. Beyond general health, we are also considering uses such as mental healthcare for athletes.
Because MIMOSYS works through the mediation of emotion, it makes human-like judgements, but it struggles to capture the subtle differences between diseases. We are therefore also developing methods that compare voice features directly, without using emotion. We have already achieved results for depression, Parkinson’s disease and dementia, and we are working to extend the approach to other diseases.
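We do not detail here which voice features these direct methods compare. As a loose illustration of the general pattern in this literature, the sketch below summarises each recording as cepstral (MFCC) statistics and trains a conventional classifier; the file names, labels, features and model are generic stand-ins, not the methods we actually use.

```python
# Generic sketch of a "direct" voice-feature classifier (patient vs
# control). MFCCs and logistic regression are illustrative stand-ins;
# the file list and labels are hypothetical.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def summarise(path: str) -> np.ndarray:
    """Represent one recording as the mean and SD of 13 MFCCs (26 dims)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical recordings: label 1 = patient, 0 = healthy control.
paths = ["patient_01.wav", "patient_02.wav",
         "control_01.wav", "control_02.wav"]
labels = np.array([1, 1, 0, 0])

X = np.stack([summarise(p) for p in paths])
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=2)  # tiny toy split
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```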
As stated above, voice biomarkers have a broad range of applications and they have great potential.
References
Newman SS, et al. (1938). Analysis of spoken language of patients with affective disorders. American Journal of Psychiatry, 94, 912-942.
Weintraub W, et al. (1967). The application of verbal behavior analysis to the study of psychological defense mechanisms: IV. Speech patterns associated with depressive behavior. Journal of Nervous and Mental Disease, 144, 22-28.
Nilsonne Å, et al. (1988). Measuring the rate of change of voice fundamental frequency in fluent speech during mental depression. The Journal of the Acoustical Society of America, 83(2), 716-728.
Flint AJ, et al. (1993). Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression. Journal of Psychiatric Research, 27(3), 309-319.
Fang SH, et al. (2018). Detection of pathological voice using cepstrum vectors: A deep learning approach. Journal of Voice.
Tokuno S. (2018). Pathophysiological voice analysis for diagnosis and monitoring of depression. In Understanding Depression (pp. 83-95). Springer, Singapore.
Please note: This is a commercial profile