Apple has published a fascinating new entry in its Machine Learning Journal this month that explains in detail how the voice-activated ‘Hey Siri’ detector works. While many of these entries tend to be too in-depth for the average reader (i.e. me), October’s entry from the Siri team includes several interesting (and understandable!) tidbits about what happens behind the scenes when you use ‘Hey Siri’ on your iPhone and Apple Watch.
Apple explains that the iPhone and Apple Watch microphone “turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second” before the on-device detector decides whether you intended to invoke Siri with your voice:
A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, plus silence and other speech, for a total of about 20 sound classes.
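To put those numbers together, here is a minimal Python sketch of the framing arithmetic. The constants come from Apple's description; the function names are made up for illustration, and the assumption that the 20-frame window slides forward one frame at a time is mine, not something the quoted passage states.

```python
SAMPLE_RATE_HZ = 16_000    # "16000 per second", per Apple
FRAME_SECONDS = 0.01       # each frame covers "approximately 0.01 sec"
FRAMES_PER_WINDOW = 20     # "about twenty of these frames at a time"


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_seconds=FRAME_SECONDS):
    """Number of waveform samples summarized by one spectral frame."""
    return round(sample_rate_hz * frame_seconds)


def sliding_windows(num_frames, window=FRAMES_PER_WINDOW):
    """Yield (start, end) frame ranges for each window handed to the model.

    Assumption: the window advances one frame per step.
    """
    for start in range(0, num_frames - window + 1):
        yield (start, start + window)


frame_len = samples_per_frame()                    # samples in one frame
window_samples = frame_len * FRAMES_PER_WINDOW     # samples in one window
window_seconds = window_samples / SAMPLE_RATE_HZ   # audio covered by one window
```

Under these figures, one frame summarizes 160 samples, and each 20-frame window covers 3,200 samples, which matches the 0.2 seconds of audio Apple mentions.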
Apple also has a variable threshold for deciding whether or not you’re trying to invoke Siri:
We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time.
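The second-chance logic is easy to picture in code. The sketch below is only an illustration of the idea in Apple's description: the class name, the threshold values, and the three-second duration are hypothetical, and I am assuming the "more sensitive state" simply means the lower threshold is used for triggering during that period.

```python
from dataclasses import dataclass


@dataclass
class SecondChanceDetector:
    normal_threshold: float = 0.8    # hypothetical value
    lower_threshold: float = 0.6     # hypothetical value
    sensitive_seconds: float = 3.0   # "a few seconds"; exact figure assumed
    _sensitive_until: float = float("-inf")

    def update(self, score: float, now: float) -> bool:
        """Return True if this score should trigger Siri at time `now`."""
        sensitive = now < self._sensitive_until
        # Assumption: while sensitive, the lower threshold is enough to trigger.
        threshold = self.lower_threshold if sensitive else self.normal_threshold
        if score > threshold:
            self._sensitive_until = float("-inf")  # leave the sensitive state
            return True
        if score > self.lower_threshold:
            # Near miss: become more sensitive for a short time.
            self._sensitive_until = now + self.sensitive_seconds
        return False


detector = SecondChanceDetector()
first_try = detector.update(0.7, now=0.0)    # near miss, no trigger
second_try = detector.update(0.7, now=1.0)   # same score, repeated soon after
```

With these made-up numbers, the first attempt falls between the two thresholds and does not trigger, while the identical repeat one second later does. The same score arriving long after the sensitive period has expired would again be treated as a near miss.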
As we know, ‘Hey Siri’ on the iPhone relies on a co-processor to listen for the trigger phrase without requiring physical interaction or eating up battery life. The Apple Watch treats ‘Hey Siri’ differently, since it requires the display to be on. Apple explains that on the watch, the detector is allotted only about 5% of the compute budget:
The “Hey Siri” detector runs only when the watch motion coprocessor detects a wrist raise gesture, which turns the screen on. At that point there is a lot for WatchOS to do—power up, prepare the screen, etc.—so the system allocates “Hey Siri” only a small proportion (~5%) of the rather limited compute budget. It is a challenge to start audio capture in time to catch the start of the trigger phrase, so we make allowances for possible truncation in the way that we initialize the detector.
Finally, why did Apple choose the phrase ‘Hey Siri’ as the trigger?
Well before there was a Hey Siri feature, a small proportion of users would say “Hey Siri” at the start of a request, having started by pressing the button. We used such “Hey Siri” utterances for the initial training set for the US English detector model. We also included general speech examples, as used for training the main speech recognizer. In both cases, we used automatic transcription on the training phrases. Siri team members checked a subset of the transcriptions for accuracy.
We created a language-specific phonetic specification of the “Hey Siri” phrase. In US English, we had two variants, with different first vowels in “Siri”—one as in “serious” and the other as in “Syria.”
The full entry is a neat read, especially if you’re interested in speech recognition or use ‘Hey Siri’ on your iPhone or Apple Watch.