Three core technologies power Alexa, and not just Alexa: other voice assistants, such as Siri and Google Assistant, rely on the same building blocks. They are Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), and Text To Speech (TTS).
Automatic Speech Recognition (ASR)
ASR is a key part of Alexa’s system. ASR converts speech to text. It has to work out which words a person said, which can be tough. Some people have loud voices and others have quiet voices. People may have accents. Other people may be chatting in the same room. Picking out the words in those conditions is hard even for humans, and ASR is built to handle it.
Let’s take an example of one of my skills, Wheel of Fun, which gives you a random prize, such as a compliment, joke, poem, or fart sound.
First, Alexa will record what you said (also known as your utterance) and send it to ASR.
Then, ASR will convert the recording into some readable text.
Thanks to ASR, the rest of the process can continue.
Natural Language Understanding (NLU)
NLU is the next piece. Its main job is to work out what we actually mean when we say something like, “Alexa, tell me the time.”
NLU can also understand that there are many ways to phrase a command. For example, take my skill Wheel of Fun. Some people will say, “Alexa, open Wheel of Fun.” Others may say, “Alexa, spin the Wheel of Fun.” Still others may say, “Alexa, can you please spin the Wheel of Fun for me?” There are many possibilities, and it is practically impossible to list them all. NLU saves you from having to spend hours coding every utterance and saves users from having to commit a certain command to memory.
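To make the idea concrete, here is a toy sketch of mapping several phrasings to one intent. It is not how Alexa's NLU works (the real thing uses trained models); it only counts shared words against a few sample utterances. The intent name SpinWheelIntent, the sample phrases, and the filler-word list are all made up for illustration.

```python
import re

# Hypothetical sample utterances, keyed by intent name.
# "SpinWheelIntent" is an invented name; "AMAZON.HelpIntent" is a built-in intent.
SAMPLE_UTTERANCES = {
    "SpinWheelIntent": [
        "spin the wheel of fun",
        "spin the wheel",
    ],
    "AMAZON.HelpIntent": [
        "help",
        "i need help",
    ],
}

# Words that carry little meaning for this toy matcher (an assumption).
FILLER_WORDS = {"alexa", "please", "can", "you", "for", "me", "the"}


def normalize(text):
    """Lowercase the text, drop punctuation, and remove filler words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in FILLER_WORDS]


def resolve_intent(text):
    """Return the intent whose samples share the most words with text, or None."""
    spoken = set(normalize(text))
    best_intent, best_score = None, 0
    for intent, samples in SAMPLE_UTTERANCES.items():
        for sample in samples:
            score = len(spoken & set(normalize(sample)))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent


if __name__ == "__main__":
    for phrase in [
        "Alexa, spin the Wheel of Fun.",
        "Alexa, can you please spin the Wheel of Fun for me?",
        "Alexa, help",
    ]:
        print(phrase, "->", resolve_intent(phrase))
```

The point is only that different phrasings land on the same intent name, so the skill code can react to the intent rather than to the exact words.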
Text To Speech (TTS)
TTS does literally what the name implies: it converts text to speech! Near the end of the process, Alexa Voice Service (AVS) needs to convert the text into speech so that the Echo device can actually speak our skill’s response. In a sense, TTS and ASR are opposites: ASR turns speech into text, and TTS turns text back into speech.
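As a rough illustration of where that text comes from, here is a minimal sketch of the JSON a skill can send back, with the outputSpeech field holding the text that TTS reads aloud. The helper names build_response and text_for_tts are my own, and the prize text is a placeholder.

```python
import json


def build_response(speech_text):
    """Build a minimal skill response with plain-text speech."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": True,
        },
    }


def text_for_tts(response):
    """Pull out the text that would be handed to TTS (hypothetical helper)."""
    return response["response"]["outputSpeech"]["text"]


if __name__ == "__main__":
    reply = build_response("You won a compliment. You have great taste in skills.")
    print(json.dumps(reply, indent=2))
    print(text_for_tts(reply))
```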
Here is an example of the entire Alexa cycle:
You have the skill Wheel of Fun.
First, Alexa will record your utterance and send it to ASR. Next, ASR will convert the recording into text and send it to NLU. Then, NLU will figure out that you want to use the skill Wheel of Fun. Alexa will then send a JSON request, containing the intent, to your Lambda function. After that, your Lambda function will respond to Alexa Voice Service with a JSON response. Following this, TTS will convert the text in that JSON response into speech.
Finally, Alexa will speak, and depending on your device’s model, an image may appear on the screen as well.
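The Lambda step in the cycle above can be sketched roughly like this. It is a simplified stand-in, not the actual Wheel of Fun code: the intent name SpinWheelIntent, the prize texts, and the build_response helper are assumptions, and a real request carries many more fields than the sample event shown here.

```python
import random

# Placeholder prizes; the real skill's content is not shown in this post.
PRIZES = {
    "compliment": "You have excellent taste in Alexa skills.",
    "joke": "Why did the developer go broke? They used up all their cache.",
    "poem": "Roses are red, violets are blue, the wheel has spun, this poem is for you.",
}


def build_response(speech_text, end_session=True):
    """Wrap the speech text in the JSON shape Alexa expects back."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Route the incoming request by its type and intent name."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        # Sent when the user opens the skill without asking for anything specific.
        return build_response(
            "Welcome to Wheel of Fun. Say spin the wheel to play.",
            end_session=False,
        )
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "SpinWheelIntent":
        prize = random.choice(list(PRIZES))
        return build_response(f"You won a {prize}. {PRIZES[prize]}")
    return build_response("Sorry, I did not catch that.")


if __name__ == "__main__":
    # A trimmed-down example of the request NLU's result ends up in.
    sample_event = {
        "version": "1.0",
        "request": {"type": "IntentRequest", "intent": {"name": "SpinWheelIntent"}},
    }
    print(lambda_handler(sample_event, None))
```

Notice that the handler never sees audio or raw wording. By the time the request arrives, ASR and NLU have already reduced the utterance to a request type and an intent name.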
The difference between ASR and NLU comes up a lot. ASR converts the speech into text and sends it to NLU, and NLU figures out what we really mean. I hope this post clarified things for you!
HOW I KNOW THIS STUFF:
I learned most of the stuff above from this great video by Paul Cutsinger:
Here’s another good video that was very helpful by Nuance Developers Learning:
And finally, here is an amazing article by Amazon on the topic.