Why Voice Recognition Doesn’t Work for Transcription

You mean, I can’t just record the language sample right into SALT?

When talking with people new to SALT I frequently hear, “Oh, I have to transcribe the sample? Can’t I just record into SALT?” There is typically a look of disappointment when I reply that the speech recognition software available to us today just isn’t accurate enough for our purposes. The goal of LSA is to document functional, authentic, and verbatim expressive language. Unfortunately, that is precisely the type of language that computers are least able to capture reliably.

Today computer solutions like SALT provide a wealth of lightning-fast, reliable analysis outcomes to document language in action. But, back in the beginning, recorded samples were analyzed by hand after orthographically transcribing what was said. The background work included designing a definitive transcription protocol to identify each unit of analysis – e.g., word roots, bound morphemes, and utterances – so each transcript followed the same rules. These rules ensured each analysis was as accurate as it could be, albeit prone to human error as the words, morphemes, and utterances were hand-counted and summarized.

When personal computers were introduced in the 1980s, a new issue arose: how to give the computer the simple, consistent rules that it needs in order to perform analyses. We now have a clearly defined, simple set of transcription rules, conventions, and codes that have increased the reliability and validity of LSA tremendously. The computer is happy with where we currently reside. But it’s 2018 and most everyone has used speech recognition tools to draft emails or texts on their smartphone. C’mon SALT, let’s get on that train!

If you’ve spoken to your smartphone, at some point you’ve become frustrated or even embarrassed with the autocorrects from speech recognition. After the error, you take the time to resend a corrected message, often including a frown emoji and “d@#n autocorrect!” My last name is Andriacchi, pronounced “Ann-dree-ă-key”. Without fail, Siri spells it “Andrea Key” (two words). I speak at a typical rate and without misarticulating, yet even words that Siri has been fed many times over are routinely misspelled in my messages. She’s just not accurate enough to get the message written without errors. Imagine speech input from a very young speaker whose articulation and rules of syntax haven’t fully developed. Or, imagine input from a speaker with a language impairment, for whom word prediction software may be extremely inaccurate. This is where the human ear and perceptual skills still trump technology.

Below is an excerpt of a transcribed sample from an articulate speaker, age 10, retelling the story Dr. De Soto (Steig, 1982). Although the speaker is language impaired, his words are 100% intelligible to the human ear. With language impairment comes less predictable word and phrase production, as well as errors not typical from “normal” speakers. In August of 2016, SALT worked with programmers of a leading speech recognition software program. We provided the vocabulary from the story Dr. De Soto from multiple transcripts (from different speakers). The text was imported into the back end of the software to enhance its word recognition abilities before it was fed the audio sample for transcription. Below, you’ll first see what was actually said by the speaker – typed by a trained SALT transcriber (rated with 97% inter-transcriber word-level reliability). After that, see what the speech recognition software produced from the same audio sample.

SALT Manual Transcription
Note: *asterisk denotes a part-word

Dr. DeSoto is a dentist. He um does like animals his size. But for the bigger ones, they he sits um they sit on the ground when he u* u* he use sta* he uses a ladder to go to their mouth. And he’s famous for the really big ones because um he has a like a crane in a special room that his assis* pers* his assistant p* pulls him up to his mouth. Um his assistant is his wife. Um Dr. DeSoto um just did a n* his tools doesn’t hurt the larger animals because um they’re just they’re very tiny…

Speech Recognition Software

De Soto he does and since the ones he sits on the family room when he you use the letter to Hughes through home here cream the sisters consistent holes up to off system this is worth home several just to his tools hurt the order and will close on the just returning this memo in the server on the money changers animals are kept coming in June only a part was outside of the three and she will know in the city’s mercy for silver and their thinking through and amend some services through this and the Fox…


As you can see, the speech recognition software did not accurately capture the speaker’s message with all of the nuances and characteristics of language that we as clinicians are interested in exploring. The speech recognition software was not able to parse utterances (denote utterance boundaries). Nor could it transcribe part-words or filled pauses, and the content of the message was simply incorrect. The transcript generated by the software would have to undergo massive edits in order to be used diagnostically. Not surprisingly, our transcriber spent more time correcting the speech recognition output than she did on the first pass transcription (which was accurate and reliable). As part of the editing process the transcriber would also need to add bound morphemes, filled pauses, mazes, and other transcription conventions in order to be sensitive enough to capture language use for the purpose of thorough language sample analysis.

But… There is one scenario where speech recognition tools might be helpful to you. We’re not lazy, we’re just busy. SALT hasn’t put time and effort into this next suggestion, as our transcribers are trained to type everything verbatim and use all codes and conventions – our transcription services customers want the whole shebang. However, if you were interested in very narrow language analysis, such as outcomes on only the total number of words produced and MLU in words, you could listen to a recorded language sample through headphones and repeat what was said into a speech recognition software program that has been trained to recognize your voice and the nuances of your production style. This may be something of a time saver. It would likely take the real time of a short audio sample (excluding the elicitation process) plus a few extra minutes to start/stop the recording, repeat the utterances, and direct the software to include end punctuation. For analysis outcomes of a limited nature, speech recognition software may be quicker than transcription.
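For readers who like to see the arithmetic, here is a minimal Python sketch of that narrow analysis: total words and MLU in words, computed from dictated text that already has end punctuation. This is not SALT’s code and it does not follow SALT’s transcription conventions. The function names, the rule that an utterance ends at ".", "!", or "?", and the rule that a word is any space-separated token are all assumptions made for illustration. Note that an abbreviation such as "Dr." would be mis-split by this simple rule.

```python
import re


def split_utterances(text):
    """Split dictated text into utterances at end punctuation (. ! ?).

    Assumption: every utterance was dictated with end punctuation and
    the text contains no abbreviations that end in a period.
    """
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def count_words(utterance):
    """Count space-separated tokens that contain a letter or digit."""
    return sum(1 for tok in utterance.split() if re.search(r"\w", tok))


def summarize(text):
    """Return utterance count, total words, and MLU in words."""
    utterances = split_utterances(text)
    total_words = sum(count_words(u) for u in utterances)
    # MLU in words = total words divided by number of utterances.
    mlu_words = total_words / len(utterances) if utterances else 0.0
    return {
        "utterances": len(utterances),
        "total_words": total_words,
        "mlu_words": mlu_words,
    }


if __name__ == "__main__":
    # Hypothetical dictated sample, not taken from a real transcript.
    sample = "The mouse is a dentist. He uses a ladder. His assistant is his wife."
    print(summarize(sample))
```

With the hypothetical sample above, the sketch finds 3 utterances and 14 words, for an MLU in words of about 4.67. Anything beyond these counts (bound morphemes, mazes, part-words) still calls for a full transcript.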

And there is hope for the future. Not long ago SALT was informed by one of the leading technology companies in the world that the technology does exist to accurately mark utterance boundaries and to “turn off” word prediction (so part-words could be captured). However, our efforts to get any further, to learn more, and to have access to the technology were met with “no reply”. SALT won’t likely be the first small company to access the technology, but once it’s available, we’ll do everything we can to make it work for LSA!

As SLPs charged with analyzing language of the speakers we’re assessing, we must be careful to ensure that what is included in analysis is exactly what was said, to the best of our ability. Computer solutions like SALT have greatly streamlined the LSA process. The analysis-by-hand method is no longer a burden in the process. But yes, at present, we still need to transcribe the samples into the software in order to provide the most accurate analysis outcomes. Remember, there are transcription shortcuts that can save time as well. See our blog post about that HERE.

The day will come when speech recognition software accurate enough to produce valid and reliable transcription will be available to us. I CAN’T WAIT to be able to say, “Just feed your recorded language sample into SALT” and voila!
