IEEE SB NITP: Into Alexa

We are now interacting with technology in the most natural way possible: by talking. People talk to their smart devices every day, and some even flirt with their virtual assistants (shout out to all the single boys out there). So today we will take a closer look at the technologies behind one of the most loved intelligent virtual assistants: Alexa.

Alexa is built on natural language processing (NLP), a procedure for converting speech into words, sounds, and ideas.

It starts with signal processing, which gives Alexa as many chances as possible to make sense of the audio by cleaning the signal. The idea is to enhance the target signal, which means identifying ambient noise, such as the TV, and minimizing it. To do this, seven microphones are used to identify roughly where the signal is coming from so the device can focus on it. Acoustic Echo Cancellation (AEC) can then subtract the sound coming from the device’s own loudspeaker so that only the important signal remains.
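To make the "where is the signal coming from" idea concrete, here is a minimal sketch, assuming just two microphones instead of seven. It estimates how many samples later the same sound reaches the second microphone by sliding one recording against the other and keeping the best match (cross-correlation). The function name `estimate_delay` and the synthetic test signal are made up for illustration; this is not how Amazon's devices are actually implemented.

```python
import random


def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) at which mic_b best lines up with mic_a.

    A positive result means the sound reached mic_b later than mic_a.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(mic_a)
    for lag in range(-max_lag, max_lag + 1):
        # Correlation score: multiply the two recordings at this offset and sum.
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic example: the same noise-like sound, arriving 3 samples later at mic_b.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 3
mic_a = source
mic_b = [0.0] * true_delay + source[:-true_delay]

print(estimate_delay(mic_a, mic_b, max_lag=10))
```

With the delay between microphone pairs known, a device can work out a rough direction for the speaker and favor sound arriving from there.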

In speech recognition systems, the term “acoustic echo” refers to the signal that is played out of a loudspeaker and captured by a microphone in the vicinity of the loudspeaker. The AEC algorithm functions by adaptively estimating the acoustic echo path (and thereby the acoustic echo) between the loudspeaker and microphone components. The estimated acoustic echo is then subtracted from the microphone signal to obtain a near echo-free microphone signal. An AEC-processed microphone signal is ideally free of acoustic echo.
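The adaptive estimation described above can be sketched with a normalized least-mean-squares (NLMS) filter, a common textbook choice for echo cancellation. This is only an illustration under assumptions: the post does not say which algorithm Alexa uses, and the echo path `[0.6, 0.3, 0.1]`, the step size `mu`, and the function name are invented for the demo.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from mic.

    far_end: samples played through the loudspeaker.
    mic:     samples captured by the microphone (echo plus anything else).
    Returns (echo-reduced signal, estimated echo path coefficients).
    """
    w = [0.0] * taps        # current estimate of the echo path
    history = [0.0] * taps  # most recent loudspeaker samples, newest first
    cleaned = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        error = d - echo_estimate  # what is left after removing the echo
        energy = sum(h * h for h in history) + eps
        # NLMS update: nudge the estimate in the direction that shrinks the error.
        w = [wi + mu * error * hi / energy for wi, hi in zip(w, history)]
        cleaned.append(error)
    return cleaned, w


# Synthetic example: the microphone hears only the loudspeaker's echo.
random.seed(1)
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, 0.1]  # hypothetical room response
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

cleaned, estimated_path = nlms_echo_cancel(far_end, mic)
print([round(c, 3) for c in estimated_path[:3]])
```

In this noise-free toy setup the first three estimated coefficients settle close to the hypothetical echo path, and the cleaned signal fades toward silence as the filter converges.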

The next task is “Wake Word Detection”. It determines whether the user has said one of the words the device is programmed to wake on, such as “Alexa”. The detector has to minimize both false positives and false negatives, which could lead to accidental purchases and angry customers. This is complicated because it has to cope with differences in pronunciation, and it has to do so on the device, which has limited CPU power.
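The trade-off between false positives and false negatives can be illustrated with a small sketch. It assumes a hypothetical detector that gives every audio clip a score between 0 and 1 and wakes the device when the score reaches a threshold. The scores and labels below are invented, and the real on-device detector is far more sophisticated.

```python
def count_errors(scores, labels, threshold):
    """Return (false_accepts, false_rejects) for one threshold.

    scores: detector confidence per clip (hypothetical values).
    labels: True if the clip really contained the wake word.
    """
    false_accepts = sum(
        1 for s, is_wake in zip(scores, labels) if s >= threshold and not is_wake
    )
    false_rejects = sum(
        1 for s, is_wake in zip(scores, labels) if s < threshold and is_wake
    )
    return false_accepts, false_rejects


# Made-up scores for six clips; the third is a TV ad that sounds like "Alexa".
scores = [0.95, 0.80, 0.62, 0.55, 0.30, 0.10]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.6, 0.7):
    print(threshold, count_errors(scores, labels, threshold))
```

A low threshold wakes the device on the TV ad, while a high threshold ignores the user who said the wake word quietly, so no single setting removes both kinds of error here.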

Any command to Alexa is broken into three main parts: the wake word, the invocation name, and the utterance. You already know about the wake word. The invocation name is the keyword used to trigger a specific “skill”. Skills are voice-driven Alexa capabilities. Utterances are the phrases users say when making a request to Alexa. Alexa identifies the user’s intent from the given utterance and responds accordingly.
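As a rough illustration of these three parts, here is a minimal sketch that splits an already-transcribed command into wake word, invocation name, and utterance. The skill names in `KNOWN_SKILLS` are hypothetical, the list of launch words is an assumption, and real Alexa does this with trained language models rather than string matching.

```python
WAKE_WORDS = ("alexa",)
LAUNCH_WORDS = ("ask", "open", "tell")               # assumed launch words
KNOWN_SKILLS = ("daily horoscopes", "space facts")   # hypothetical invocation names


def split_command(text):
    """Split a transcribed command into its three parts, or return None
    if it does not start with a wake word."""
    words = text.lower().replace(",", " ").split()
    if not words or words[0] not in WAKE_WORDS:
        return None
    rest = words[1:]
    # Skip a leading launch word such as "ask" when looking for a skill name.
    candidate = rest[1:] if rest and rest[0] in LAUNCH_WORDS else rest
    remainder = " ".join(candidate)
    for skill in KNOWN_SKILLS:
        if remainder == skill or remainder.startswith(skill + " "):
            return {
                "wake_word": words[0],
                "invocation_name": skill,
                "utterance": remainder[len(skill):].strip(),
            }
    # No skill named: the whole request is the utterance.
    return {"wake_word": words[0], "invocation_name": None, "utterance": " ".join(rest)}


print(split_command("Alexa, ask Daily Horoscopes for the horoscope for Gemini"))
```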

If the wake word is detected, Alexa-enabled devices send the user’s instruction to a cloud-based service called Alexa Voice Service (AVS). AVS is the brain of Alexa-enabled devices and performs all the complex operations, such as Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU).

Natural Language Processing (NLP) helps computers communicate using natural human language in many forms, including but not limited to speech and writing. Natural Language Understanding (NLU) is a subset of the wider world of NLP.

First, speech recognition converts the spoken natural-language input into a form the machine can work with: text. NLU (Natural Language Understanding) then processes this text to understand its meaning. It tries to work out, for each word, whether it is a noun or a verb, what tense is used, and so on. This process is called part-of-speech (POS) tagging.
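Part-of-speech tagging can be pictured with a toy sketch like the one below. It assumes a tiny hand-written lexicon and two crude suffix rules, all invented for illustration; production taggers are statistical models trained on large corpora, not lookup tables.

```python
# Tiny hand-written lexicon (hypothetical): word -> part of speech.
LEXICON = {
    "alexa": "NOUN",
    "play": "VERB",
    "the": "DET",
    "a": "DET",
    "song": "NOUN",
    "music": "NOUN",
}


def tag_words(sentence):
    """Return a list of (word, tag) pairs for a sentence."""
    tagged = []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?")
        if not word:
            continue
        if word in LEXICON:
            tag = LEXICON[word]
        elif word.endswith("ed") or word.endswith("ing"):
            tag = "VERB"   # crude guess from the suffix
        else:
            tag = "NOUN"   # fallback for unknown words
        tagged.append((word, tag))
    return tagged


print(tag_words("Alexa, play the song!"))
```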

NLP systems also have a lexicon (a vocabulary) and a set of grammar rules coded into the system. Modern NLP algorithms use statistical machine learning to apply these rules to the natural language and determine the most likely meaning behind what you said.

When Alexa makes a mistake in interpreting your request, that data is used, through machine learning, to make the system better the next time.

So, in a nutshell, here’s what happens when you make a request to Alexa:

• Amazon records your words. Because interpreting sounds takes a lot of computational power, the recording of your speech is sent to Amazon’s servers to be analyzed more efficiently.

• Amazon breaks down your “orders” into individual sounds. It then consults a database containing various words’ pronunciations to find which words most closely correspond to the combination of individual sounds.

• It then identifies important words to make sense of the tasks and carry out corresponding functions. For instance, if Alexa notices words like “sport” or “basketball”, it would open the sports app.

• Amazon’s servers send the information back to your device, and Alexa may speak. If Alexa needs to say anything back, it goes through the same process described above, but in reverse order.
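The two middle steps above (matching sounds against a pronunciation database, then acting on important words) can be sketched as follows. This is a toy under stated assumptions: the phoneme spellings, the three-word dictionary, and the action names such as `open_sports_app` are all hypothetical, and the edit-distance matching stands in for the statistical models a real recognizer would use.

```python
# Hypothetical pronunciation database: word -> list of sounds.
PRONUNCIATIONS = {
    "basketball": ["B", "AE", "S", "K", "AH", "T", "B", "AO", "L"],
    "weather": ["W", "EH", "DH", "ER"],
    "sport": ["S", "P", "AO", "R", "T"],
}

# Hypothetical mapping from important words to actions.
KEYWORD_ACTIONS = {
    "sport": "open_sports_app",
    "basketball": "open_sports_app",
    "weather": "read_forecast",
}


def edit_distance(a, b):
    """Number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def closest_word(sounds):
    """Pick the dictionary word whose pronunciation is nearest to the sounds heard."""
    return min(PRONUNCIATIONS, key=lambda w: edit_distance(sounds, PRONUNCIATIONS[w]))


def choose_action(words):
    """Return the action for the first recognized keyword, or None."""
    for word in words:
        if word in KEYWORD_ACTIONS:
            return KEYWORD_ACTIONS[word]
    return None


heard = ["S", "P", "AO", "T"]  # "sport" with one sound dropped
word = closest_word(heard)
print(word, choose_action([word]))
```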

A DMZ (demilitarized zone) is a physical or logical subnet that separates an internal LAN from other untrusted networks, usually the public internet.

Amazon is adding new capabilities to Alexa just about every day, with more skills and device compatibility. Beginning in 2020, Amazon is rolling out frustration detection features, so Alexa will be able to understand and acknowledge when you’re getting frustrated with her. If you want to learn more about Alexa, all you have to do is ask: “Alexa, what’s new with you?” and she’s happy to share.

Thursday, 14 May 2020
