With the rise of voice-driven virtual assistants over the years, the sight of people talking to various electronic devices in public and in private has become rather commonplace. While such voice-driven interfaces are decidedly useful in a range of situations, they also come with complications. One of these is the trigger phrases or wake words that voice assistants listen for when in standby. Much like in Star Trek, where uttering ‘Computer’ would get the computer’s attention, we have our ‘Siri’, ‘Cortana’ and a range of custom trigger phrases that enable the voice interface.
Unlike in Star Trek, however, our virtual assistants do not know when we really desire to interact. Unable to distinguish context, they’ll happily respond to someone on TV mentioning their trigger phrase, possibly followed by a ludicrous purchase order or other mischief. This shows just how complex voice-based interfaces have become, while still lacking any sense of self-awareness or intelligence.
Another issue is that the process of voice recognition itself is very resource-intensive, which limits the amount of processing that can be performed on the local device. This usually leads to voice assistants like Siri, Alexa, Cortana and others processing recorded voices in a data center, with obvious privacy implications.
Just Say My Name
The idea of a trigger word that activates a system is an old one, with one of the first known practical examples being roughly a hundred years old. This came in the form of a toy called Radio Rex, which featured a robot dog that would sit in its little dog house until its name was called. At that moment it’d hop outside to greet the person calling it.
The way that this was implemented was simple and rather limited, courtesy of the available technologies in the 1910s and 1920s. Essentially it used the acoustic energy of a formant corresponding roughly to the vowel [eh] in ‘Rex’. As noted by some, an issue with Radio Rex is that it was tuned for 500 Hz, which would be the [eh] vowel when spoken by an (average) adult male voice.
This tragically meant that for children and women Rex would usually refuse to come out of its dog house, unless they used a different vowel whose formant landed in the 500 Hz range for their voice. Even then they were likely to run into the other major issue with this toy, namely the sheer acoustic pressure required. Essentially this meant that some yelling might be required to make Rex move.
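In software terms, Rex boils down to a single-frequency energy detector with a threshold. A rough digital analogue can be sketched with the Goertzel algorithm, which measures the energy in one frequency bin. Keep in mind that the toy itself was purely mechanical: the sample rate, block size, threshold and all function names below are assumptions made up for illustration.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Return the normalised power in the DFT bin nearest to target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)      # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2        # second-order recurrence
        s_prev2, s_prev = s_prev, s
    # Squared magnitude of the selected bin, normalised by block length.
    power = s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
    return power / (n * n)

def tone(freq_hz, amplitude, sample_rate=8000, n=800):
    """Generate a test sine wave (a stand-in for a sustained vowel)."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

def rex_hops_out(samples, sample_rate=8000, threshold=0.05):
    """Hypothetical Rex: trigger only on enough energy around 500 Hz."""
    return goertzel_power(samples, sample_rate, 500.0) > threshold

print(rex_hops_out(tone(500, 1.0)))   # loud, on frequency
print(rex_hops_out(tone(500, 0.2)))   # on frequency, but too quiet
print(rex_hops_out(tone(700, 1.0)))   # loud, but off frequency
```

Only the first call triggers: the quiet tone lacks the energy, and the 700 Hz tone (an arbitrary stand-in for a higher-pitched voice) puts its energy in the wrong place, which mirrors both of Rex's shortcomings.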
What is interesting about this toy is that in many ways ol’ Rex isn’t too different from how modern-day Siri and friends work. The trigger word that wakes them up from standby is less crudely interpreted, using a microphone and signal processing hardware and software rather than a mechanical contraption, but the effect is the same. In the low-power trigger search mode the assistant’s software constantly compares the incoming sound samples’ formants for a match with the sound signature of the predefined trigger word(s).
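The constant comparison against a stored signature can be sketched as a sliding-window template match. This is a deliberately naive illustration, not how any particular assistant implements it: the feature frames are assumed to come from some earlier signal processing stage, and the `WakeWordSpotter` class, the cosine-similarity scoring and the threshold are all hypothetical.

```python
from collections import deque
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class WakeWordSpotter:
    """Compare the most recent frames against a stored trigger signature."""

    def __init__(self, signature, threshold=0.9):
        self.signature = signature               # list of feature frames
        self.threshold = threshold               # assumed match threshold
        self.window = deque(maxlen=len(signature))

    def feed(self, frame):
        """Add one frame; return True when the window matches the signature."""
        self.window.append(frame)
        if len(self.window) < len(self.signature):
            return False                          # not enough audio yet
        score = sum(cosine(f, s) for f, s in zip(self.window, self.signature))
        return score / len(self.signature) >= self.threshold

# Made-up three-frame signature and an incoming stream that ends with it.
signature = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
stream = [[1, 1, 1], [1, 1, 1], [0.5, 0, 0], [0, 0.5, 0], [0, 0, 0.5]]

spotter = WakeWordSpotter(signature)
results = [spotter.feed(frame) for frame in stream]
print(results)  # [False, False, False, False, True]
```

Because cosine similarity ignores overall loudness, the quieter copy of the signature at the end of the stream still matches, which is one thing the digital version does better than Rex.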
Once a match has been detected and the mechanism kicks into gear, the assistant will pop out of its digital house as it switches to its full voice processing mode. At this stage a stand-alone assistant, as one might find in e.g. older cars, may use a simple Hidden Markov Model (HMM) to try and piece together the intent of the user. Such a model is generally trained on a fairly simple vocabulary, and will be specific to a particular language and often a regional accent and/or dialect to increase accuracy.
Too Big For The Dog House
While it would be nice to run the entire natural language processing routine on the same system, the fact of the matter is that speech recognition remains very resource-intensive. Not just in terms of processing power, as even an HMM-based approach has to sift through thousands of probabilistic paths per utterance, but also in terms of memory. Depending on the vocabulary of the assistant, the in-memory model can range from dozens of megabytes to multiple gigabytes or even terabytes. This would obviously be rather impractical on the latest whizbang gadget, smartphone or smart TV, which is why this processing is generally moved to a data center.
When accuracy is considered to be even more of a priority, such as with the Google Assistant when it gets asked a complex query, the HMM approach is usually ditched for the newer Long Short-Term Memory (LSTM) approach. Although LSTM-based recurrent neural networks (RNNs) deal much better with longer phrases, they also come with much higher processing and memory usage requirements.
With the current state-of-the-art in speech recognition moving towards ever more complex neural network models, it would seem unlikely that such system requirements will be overtaken by technological progress.
As a reference point for what a basic lower-end system on the level of a single-board computer like a Raspberry Pi might be capable of with speech recognition, we can look at a project like CMU Sphinx, developed at Carnegie Mellon University. The version that is aimed at embedded systems is called PocketSphinx, and like its bigger versions it uses an HMM-based approach. In the Sphinx FAQ it’s mentioned explicitly that large vocabularies won’t work on SBCs like the Raspberry Pi due to the limited RAM and CPU power on these platforms.
When you limit the vocabulary to around a thousand words, however, the model may just fit in RAM and the processing will be fast enough to appear instantaneous to the user. This is fine if you only need the voice-driven interface to have decent accuracy, within the limits of the training data, while offering limited interaction. If the goal is to, say, allow the user to turn a handful of lights on or off, this may be sufficient. On the other hand, if this interface is called ‘Siri’ or ‘Alexa’, the expectations for such an interface are a lot higher.
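To give an idea of how little is needed for the light switch scenario, here is a minimal sketch of a restricted command grammar. It assumes that a recognizer (PocketSphinx or otherwise) has already turned the audio into a text transcript; the room names, filler words and the `parse_command` function are invented for this example and are not part of any library.

```python
# Hypothetical closed vocabulary for a light-control interface.
ROOMS = {"kitchen", "bedroom", "hallway"}
STATES = {"on": True, "off": False}
FILLER = {"turn", "switch", "the", "light", "lights", "please"}
VOCABULARY = ROOMS | set(STATES) | FILLER

def parse_command(transcript):
    """Return (room, state) for a valid command, or None if it doesn't fit."""
    words = transcript.lower().split()
    if any(word not in VOCABULARY for word in words):
        return None                  # out-of-vocabulary: reject, don't guess
    rooms = [word for word in words if word in ROOMS]
    states = [word for word in words if word in STATES]
    if len(rooms) != 1 or len(states) != 1:
        return None                  # ambiguous or incomplete command
    return rooms[0], STATES[states[0]]

print(parse_command("turn the kitchen light on"))            # ('kitchen', True)
print(parse_command("Switch bedroom lights off"))            # ('bedroom', False)
print(parse_command("where is the nearest bicycle store"))   # None
```

Anything outside the tiny grammar is simply rejected, which is exactly the kind of limited interaction that nobody would accept from a product with a name and a personality.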
Essentially, these virtual assistants are supposed to act like they understand natural language and the context in which it is used, and to reply in a way that is consistent with how the average civilized human interaction is expected to occur. Not surprisingly, this is a tough challenge to meet. Having the speech recognition part off-loaded to a remote data center, and using recorded voice samples to further train the model, are natural consequences of this demand.
No Smarts, Just Good Guesses
Something which we humans are naturally pretty good at, and which we get further nagged about during our school time, is called ‘part-of-speech tagging’, also called grammatical tagging. This is where we break a phrase down into its grammatical constituents, including nouns, verbs, articles, adjectives, and so on. Doing so is essential for understanding a sentence, as the meaning of words can change wildly depending on their grammatical classification, especially in languages like English with its common use of nouns as verbs and vice versa.
Using grammatical tagging we can then understand the meaning of the sentence. Yet this is not what these virtual assistants do. Instead, using the Viterbi algorithm (for HMMs) or an equivalent RNN approach, they determine the probability of the given input fitting a specific subset of the language model. As most of us are undoubtedly aware, this is an approach that feels almost magical when it works, and makes you realize that Siri is as dumb as a bag of bricks when it fails to get an appropriate match.
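To make the ‘good guess’ nature of this concrete, below is a minimal sketch of Viterbi decoding over a toy HMM. The two hidden states, the three observation symbols and every probability in the tables are made-up numbers for illustration; a real acoustic model has vastly more states and is trained on recorded speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p)
                             for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Pick the best final state, then walk the back-pointers to the start.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path

# Toy model: which kind of sound produced each (coarse) spectral observation?
STATES = ["vowel", "fricative"]
START_P = {"vowel": 0.6, "fricative": 0.4}
TRANS_P = {"vowel": {"vowel": 0.7, "fricative": 0.3},
           "fricative": {"vowel": 0.4, "fricative": 0.6}}
EMIT_P = {"vowel": {"low": 0.5, "mid": 0.4, "high": 0.1},
          "fricative": {"low": 0.1, "mid": 0.3, "high": 0.6}}

prob, path = viterbi(["low", "mid", "high"], STATES, START_P, TRANS_P, EMIT_P)
print(path)  # ['vowel', 'vowel', 'fricative']
```

For this input the best path comes out with a probability of about 0.015. Note that nothing here understands anything: the algorithm merely picks the path through the model that best explains the input, and it will return one even when every option is a poor fit.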
As demand for ‘smart’ voice-driven interfaces increases, engineers will undoubtedly work tirelessly to find more ingenious methods to improve the accuracy of today’s systems. The reality for the foreseeable future would appear to remain that of voice data being sent to data centers where powerful server systems can perform the requisite probability curve fitting, to figure out that you were asking ‘Hey Google’ where the nearest ice cream parlor is. Never mind that you were actually asking for the nearest bicycle store, but that’s technology for you.
Perhaps the slightly ironic part of the whole natural language and computer interaction experience is that speech synthesis is more or less a solved problem. As early as the late 1970s and early 1980s, the Texas Instruments TMS (of Speak & Spell fame) and the General Instrument SP0256 Linear Predictive Coding (LPC) speech chips used a fairly crude approximation of the human vocal tract in order to synthesize a human-sounding voice.
Over the intervening years, LPC has become ever more refined for use in speech synthesis, while also finding use in speech encoding and transmission. By using a real-life human’s voice as the basis for an LPC vocal tract, virtual assistants can also switch between voices, allowing Siri, Cortana, etc. to sound like whatever gender and ethnicity appeals the most to an end user.
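The core idea behind LPC synthesis fits in a few lines: an excitation signal (standing in for the vocal cords) is fed through an all-pole filter (standing in for the vocal tract). The sketch below is heavily simplified and rests on assumptions: it uses a single two-pole resonance around 500 Hz where real LPC speech uses a filter with many more coefficients that are updated many times per second, and the function names, sample rate and pole radius are all invented for the example.

```python
import math

def lpc_synthesize(excitation, coeffs):
    """All-pole filter: s[n] = e[n] + sum over k of a_k * s[n - k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]      # feedback from earlier output
        out.append(acc)
    return out

def impulse_train(n_samples, period):
    """Voiced excitation: one glottal 'pulse' every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator_coeffs(formant_hz, sample_rate, radius=0.95):
    """Predictor coefficients for a single formant (one complex pole pair)."""
    theta = 2.0 * math.pi * formant_hz / sample_rate
    return [2.0 * radius * math.cos(theta), -radius * radius]

SAMPLE_RATE = 8000                                   # assumed sample rate
coeffs = resonator_coeffs(500.0, SAMPLE_RATE)        # 'vocal tract' shape
pitch_period = SAMPLE_RATE // 100                    # 100 Hz pitch
voiced = lpc_synthesize(impulse_train(800, pitch_period), coeffs)
print(len(voiced), max(abs(x) for x in voiced))
```

Changing the voice then amounts to changing the numbers: a different pitch period and a different set of filter coefficients yield a different speaker, without touching the synthesis routine itself.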
Hopefully within the next few decades we can make speech recognition work as well as speech synthesis, and perhaps even grant these virtual assistants a modicum of true intelligence.