What is voice recognition?
Write with your voice, operate with your voice.
New communication between humans and machines
For many people, the easiest way to communicate is through voice.
Voice recognition technology, which converts speech into text, removes the need for cumbersome input methods such as typing on a keyboard or pressing buttons.
It enables natural, human-centered communication.
Basic mechanism of speech recognition
In everyday conversation we recognize other people's voices and understand their meaning without effort, so the task does not seem difficult. For a computer, however, recognizing human speech correctly is not easy: even utterances with the same content can differ greatly from person to person and from situation to situation, depending on factors such as the speaker's gender, speaking habits, and choice of words. Speech recognition is a technology that converts speech into text by closely combining acoustic information with linguistic information.
Acoustic Analysis
For example, even when hearing the same "ah" sound, the waveform of the voice will change depending on the gender and age of the speaker, the microphone used for recording, and other factors.
Therefore, instead of inputting the voice waveform data directly into the recognition decoder, the sound characteristics are quantified through acoustic analysis, and these numerical values (feature quantities) are input into the recognition decoder.
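As a rough illustration of what "quantifying sound characteristics" can look like, the following minimal sketch splits a waveform into short overlapping frames and computes two simple values per frame: log energy and zero-crossing rate. This is only an assumption chosen for simplicity; practical systems commonly use richer features such as MFCCs or log-mel filterbank outputs. The function name frame_features, the frame sizes, and the synthetic test tone are all hypothetical and not taken from any particular product.

```python
import math

def frame_features(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and return a list of
    (log_energy, zero_crossing_rate) tuples, one per frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Average power of the frame; the small constant avoids log(0).
        energy = sum(x * x for x in frame) / frame_len
        log_energy = math.log(energy + 1e-10)
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        zcr = crossings / (frame_len - 1)
        features.append((log_energy, zcr))
    return features

# Synthetic 440 Hz tone, 0.1 s at 16 kHz (a stand-in for recorded speech).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
feats = frame_features(tone)
```

With 400-sample frames and a 160-sample hop (25 ms and 10 ms at 16 kHz), each frame yields one small feature vector, and it is this sequence of vectors, not the raw waveform, that would be passed on to the recognition decoder.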
Recognition Decoder
The features extracted by acoustic analysis are input into a recognition decoder, which outputs the speech recognition results.
Recognition decoders can be broadly divided into two types: "DNN-HMM hybrid type" and "End-to-End type." The "DNN-HMM hybrid type" is a recognition decoder that combines "DNN (Deep Neural Network)" and "HMM (Hidden Markov Model)," and is composed of three parts: "acoustic model," "language model," and "pronunciation dictionary." On the other hand, the "End-to-End type" is characterized by its simple structure, in which the recognition decoder is composed only of a neural network.
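To make the hybrid-type structure more concrete, here is a minimal sketch of how a decoder might choose between candidate results by adding an acoustic score and a weighted language score. The candidate sentences, the log-score values, and the names best_hypothesis and lm_weight are invented for illustration; a real decoder searches over a very large number of hypotheses rather than a hand-written list.

```python
def best_hypothesis(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined log score:
    acoustic score + lm_weight * language score."""
    return max(
        candidates,
        key=lambda c: c["acoustic"] + lm_weight * c["language"],
    )

# Hypothetical log scores for two similar-sounding candidates.
candidates = [
    {"text": "recognize speech", "acoustic": -12.0, "language": -4.0},
    {"text": "wreck a nice beach", "acoustic": -11.5, "language": -9.0},
]

with_lm = best_hypothesis(candidates)                    # uses both scores
without_lm = best_hypothesis(candidates, lm_weight=0.0)  # acoustic score only
```

In this toy setting the second candidate fits the audio slightly better, but the language score tips the decision toward the more plausible word sequence, which is the reason for combining the two kinds of information.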
Acoustic model
The acoustic model calculates an "acoustic score" that indicates which phonemes the features extracted through acoustic analysis are likely to correspond to.
A typical acoustic model is trained on thousands of hours of speech collected from thousands of speakers.
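As a small sketch of what an "acoustic score" can look like, the code below turns raw per-phoneme outputs for a single frame into probabilities with a softmax and then takes their logarithm. The raw values and the three-phoneme set are made up for illustration, and the softmax step is an assumption about a typical neural-network output layer rather than a description of any specific system.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {p: math.exp(s - peak) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

# Hypothetical raw network outputs for one frame of features.
raw_outputs = {"a": 2.0, "i": 0.5, "u": -1.0}

posteriors = softmax(raw_outputs)
# Log probabilities are convenient because scores across frames can be added.
acoustic_scores = {p: math.log(v) for p, v in posteriors.items()}
most_likely = max(acoustic_scores, key=acoustic_scores.get)
```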
Language model
The language model is used to evaluate whether a string of characters or words is plausible as Japanese. It is built by collecting a large amount of Japanese text and processing it statistically.
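One simple way to picture "collecting text and processing it statistically" is a bigram model, which counts how often each word follows another. The sketch below is an assumption for illustration only: the three-sentence romanized corpus is invented, add-one smoothing is chosen for brevity, and modern systems often use neural language models instead.

```python
import math
from collections import Counter

# Tiny invented corpus of word-segmented, romanized Japanese sentences.
corpus = [
    ["kyou", "wa", "hare", "desu"],
    ["ashita", "wa", "ame", "desu"],
    ["kyou", "wa", "ame", "desu"],
]

history_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    history_counts.update(tokens[:-1])              # words that have a successor
    bigram_counts.update(zip(tokens, tokens[1:]))   # adjacent word pairs

vocab = {w for s in corpus for w in s} | {"</s>"}

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

def sentence_log_prob(sentence):
    """Sum of log bigram probabilities, including sentence boundaries."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))

natural = sentence_log_prob(["kyou", "wa", "ame", "desu"])
scrambled = sentence_log_prob(["desu", "ame", "wa", "kyou"])
```

A word order that resembles the training text receives a higher score than a scrambled one, which is the sense in which the model judges whether a word string is appropriate.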
Pronunciation dictionary
The pronunciation dictionary links words to their pronunciations. For example, the Japanese words meaning "sorrow," "pleading," and "love" are linked to the phonetic representations "aware," "aigaN," and "ai," respectively. The pronunciation dictionary makes it possible to represent each word as a phoneme string (a sequence of phonemes).
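A pronunciation dictionary can be pictured as a simple lookup table. The sketch below reuses the three example entries from the text, keyed by their English glosses as stand-ins for the Japanese words. Treating each character of the phonetic representation as one phoneme is an assumption made for this illustration, and the names pronunciations, to_phonemes, and words_to_phonemes are hypothetical.

```python
# Word -> phonetic representation, using the examples from the text.
pronunciations = {
    "sorrow": "aware",
    "pleading": "aigaN",
    "love": "ai",
}

def to_phonemes(word):
    """Look up a word and return its phoneme string as a list.
    Assumes one character per phoneme in this romanization."""
    return list(pronunciations[word])

def words_to_phonemes(words):
    """Concatenate the phoneme strings of a sequence of words."""
    return [p for w in words for p in to_phonemes(w)]

phonemes = to_phonemes("pleading")
sequence = words_to_phonemes(["love", "sorrow"])
```

Once words are expressed as phoneme sequences in this way, they can be compared against the phoneme-level scores produced by the acoustic model.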
Deep learning technology
Deep learning is a machine learning method that uses neural networks, software models loosely inspired by the structure of the human brain, to learn the characteristics of data and perform recognition and classification.
In speech recognition, it is used in acoustic models and language models.
History of speech recognition
Kyoto University develops voice typewriter
Developing speech recognition based on statistical data