What is voice recognition?

Write with your voice, operate with your voice.
A new form of communication between humans and machines

For many people, speaking is the easiest way to communicate.
Voice recognition technology, which converts voice into text, removes the need for cumbersome input such as typing on a keyboard or pressing buttons.
The result is natural, human-centered communication.

Basic mechanism of speech recognition

We recognize and understand other people's speech so naturally in everyday conversation that it does not seem difficult. For computers, however, recognizing human speech correctly is not easy: even when the spoken content is the same, the voice differs greatly depending on the speaker's gender, speaking habits, and language, as well as on the situation. Speech recognition is a technology that converts speech into text by closely combining acoustic information and linguistic information.

Acoustic Analysis

For example, even for the same "ah" sound, the waveform of the voice changes depending on the gender and age of the speaker, the microphone used for recording, and other factors.
Therefore, instead of feeding the raw voice waveform directly into the recognition decoder, acoustic analysis first quantifies the characteristics of the sound, and the resulting numerical values (features) are input into the recognition decoder.
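
As a rough illustration of what quantifying sound characteristics can look like, the sketch below splits a waveform into short overlapping frames and computes two simple numbers per frame: log energy and zero-crossing rate. These two features are stand-ins chosen for brevity; practical systems typically use richer features such as mel filterbank outputs or MFCCs. The function names, the frame sizes (25 ms frames with a 10 ms hop at 16 kHz), and the synthetic sine-wave input are all assumptions made for this example.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a list of samples into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """Log of the mean squared amplitude (a small floor avoids log(0))."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(mean_sq + 1e-10)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def extract_features(samples, frame_len=400, hop_len=160):
    """Return one [log_energy, zero_crossing_rate] vector per frame."""
    return [
        [log_energy(f), zero_crossing_rate(f)]
        for f in frame_signal(samples, frame_len, hop_len)
    ]

# Synthetic input (assumption): 1 second of a 200 Hz sine sampled at 16 kHz.
SAMPLE_RATE = 16000
signal = [
    math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)
]
features = extract_features(signal)
```

Each entry of `features` is a small vector describing one frame, and it is a sequence of vectors like this, rather than the raw waveform, that would be passed on to the recognition decoder.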

Recognition Decoder

The features extracted by acoustic analysis are input into a recognition decoder, which outputs the speech recognition results.
Recognition decoders can be broadly divided into two types: "DNN-HMM hybrid type" and "End-to-End type." The "DNN-HMM hybrid type" is a recognition decoder that combines "DNN (Deep Neural Network)" and "HMM (Hidden Markov Model)," and is composed of three parts: "acoustic model," "language model," and "pronunciation dictionary." On the other hand, the "End-to-End type" is characterized by its simple structure, in which the recognition decoder is composed only of a neural network.
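
One common way to picture how a hybrid-type decoder uses its parts is that each candidate word sequence receives a score for how well it matches the audio and a score for how plausible it is as language, and the decoder outputs the candidate with the best combined score. The sketch below shows only that final selection step. The candidate list, the log-score values, the log-linear combination, and the `lm_weight` parameter are assumptions for illustration and are not taken from any particular product; a real decoder searches a far larger space of hypotheses than a fixed list.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # how well the audio matches (log scale)
    lm_logp: float        # how plausible the wording is (log scale)

def combined_score(hyp, lm_weight=1.0):
    """Weighted sum of the acoustic and language log scores."""
    return hyp.acoustic_logp + lm_weight * hyp.lm_logp

def pick_best(hyps, lm_weight=1.0):
    """Return the hypothesis with the highest combined score."""
    return max(hyps, key=lambda h: combined_score(h, lm_weight))

# Made-up scores: the second candidate fits the audio slightly better,
# but the first is far more plausible as a phrase.
candidates = [
    Hypothesis("recognize speech", -12.0, -4.0),
    Hypothesis("wreck a nice beach", -11.5, -9.0),
]
best = pick_best(candidates)
best_without_lm = pick_best(candidates, lm_weight=0.0)
```

With the language score included, the more plausible phrase wins; with `lm_weight=0.0` the decision rests on the acoustic score alone and the other candidate is chosen, which is why the two kinds of information are combined.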

Acoustic model

The acoustic model calculates an "acoustic score" that indicates which phonemes the features extracted through acoustic analysis are likely to correspond to.
A typical acoustic model is trained on thousands of hours of speech collected from thousands of speakers.
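
To make the idea of an "acoustic score" concrete, the sketch below maps one feature vector to a log-probability for each phoneme. It is deliberately tiny: a single linear layer with hand-picked weights followed by a softmax, whereas a real acoustic model is a deep neural network whose weights are learned from speech data. The three-phoneme set, the weights, and the function names are all hypothetical.

```python
import math

PHONEMES = ["a", "i", "u"]  # tiny hypothetical phoneme set

# Hypothetical weights: one row per phoneme, one column per feature.
WEIGHTS = [
    [2.0, -1.0],
    [-1.0, 2.0],
    [0.5, 0.5],
]
BIASES = [0.0, 0.0, 0.0]

def log_softmax(logits):
    """Convert raw scores into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def acoustic_scores(feature):
    """Return a log-probability for each phoneme given one feature vector."""
    logits = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    return dict(zip(PHONEMES, log_softmax(logits)))

scores = acoustic_scores([1.0, 0.0])
best_phoneme = max(scores, key=scores.get)
```

The dictionary `scores` plays the role of the acoustic score for a single frame: the higher the value for a phoneme, the more likely the frame corresponds to it.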

Language model

The language model is used to evaluate whether a character string or word string is plausible as Japanese. It is built by collecting Japanese text and processing it statistically.
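
A minimal sketch of the statistical idea is a bigram model: count how often each word follows another in a text collection, then score a new sentence by how familiar its word pairs are. The three-sentence romanized corpus, the add-one smoothing, and the function names below are assumptions chosen to keep the example short; production language models are trained on vastly more text and are often neural networks.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count context words and word pairs in whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])  # only words that act as a context
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return unigrams, bigrams, vocab

def sentence_logprob(sentence, model):
    """Sum of log P(word | previous word) over the sentence."""
    unigrams, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen pairs from scoring minus infinity.
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
        total += math.log(prob)
    return total

# Hypothetical toy corpus of romanized Japanese sentences.
corpus = [
    "kyou wa ii tenki desu",
    "kyou wa ame desu",
    "ashita wa ii tenki desu",
]
model = train_bigram(corpus)
natural = sentence_logprob("kyou wa ii tenki desu", model)
scrambled = sentence_logprob("tenki kyou desu wa ii", model)
```

The natural word order receives a higher (less negative) score than the scrambled one, which is the kind of judgment the language model contributes during recognition.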

Pronunciation dictionary

The pronunciation dictionary links words, such as the Japanese words for "sorrow," "pleading," and "love," with phonetic representations of their pronunciation, such as "aware," "aigaN," and "ai." The pronunciation dictionary makes it possible to represent words as phoneme strings (sequences of phonemes).
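
In code, a pronunciation dictionary can be pictured as a simple lookup table. The sketch below uses the three entries mentioned above, keyed by their English glosses for readability, and expands a word sequence into a phoneme sequence. Treating each character of the phonetic string as one phoneme symbol is an assumption that happens to work for these entries; it does not hold for phonetic notations in general, and the function names are hypothetical.

```python
# Word -> phonetic representation, taken from the examples in the text.
PRONUNCIATIONS = {
    "sorrow": "aware",
    "pleading": "aigaN",
    "love": "ai",
}

def to_phonemes(word):
    """Look up a word and split its phonetic string into phoneme symbols.

    Assumption: one character equals one phoneme symbol.
    """
    return list(PRONUNCIATIONS[word])

def words_to_phonemes(words):
    """Concatenate the phoneme sequences of several words."""
    phonemes = []
    for word in words:
        phonemes.extend(to_phonemes(word))
    return phonemes

love_phonemes = to_phonemes("love")
phrase_phonemes = words_to_phonemes(["love", "sorrow"])
```

Representing words as phoneme sequences in this way is what allows phoneme-level acoustic scores to be connected to word-level language scores.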

Deep learning technology

Deep learning is a machine learning method in which software loosely modeled on the structure of the human brain (a neural network) learns the characteristics of data and performs recognition and classification.
In speech recognition, it is used in both acoustic models and language models.

History of speech recognition

1950s
Research on speech recognition begins
1952
Bell Laboratories in the United States announces the spoken-digit recognition system "Audrey"
1962
IBM announces the world's first speech recognition computer "Shoebox"
Kyoto University develops voice typewriter
1970s
"DP matching method" developed in Japan and Russia, enabling continuous word recognition
1980s
Carnegie Mellon University applies the "hidden Markov model" (HMM), advancing speech recognition based on statistical data
1982
NEC releases Japan's first voice word processor "VWP-100"
1990s
Large-vocabulary speech recognition based on HMMs and large-scale speech data is established
1995
Microsoft introduces speech tools to Windows 95
1997
Advanced Media is established as Japan's first specialized voice recognition vendor.
2002
Advanced Media announces AmiVoice DSR, the world's first distributed voice recognition system that works over a communication network
Around 2010
Deep learning technology is applied to speech recognition
2011
Apple brings Siri to smartphones
2015
Research into "end-to-end" voice recognition begins
2017
"Transformer," a highly accurate and fast neural network architecture that can be used as a component technology for speech recognition, is introduced
Apple, Amazon, and Google unveil AI speakers
2019
Advanced Media releases "AmiVoice Cloud Platform," a development platform that provides a voice recognition API
2022
OpenAI releases "Whisper", a model capable of multilingual speech recognition

AmiVoiceⓇ: No.1 domestic share

*Source: ecarlate LLC "Voice Recognition Market Trends 2024"
Software/Cloud Services Market

AmiVoice's AI voice recognition is a technology that converts speech into text by closely combining sound information and linguistic information.