PEPNet-Northeast
formerly the Northeast Technical Assistance Center (NETAC)

Special Applications of Automatic Speech Recognition (ASR) with Deaf and Hard-of-Hearing People: Part II

by Ross Stuckless

1997 Symposium on ASR

I once heard a deaf person say wistfully that he longed for a "little black box" he could carry around in his pocket to enable him to become more independent in his communication with hearing people. To date, widespread use of automatic speech recognition (ASR) in classrooms and other group settings in which deaf or hard-of-hearing people are participants, as discussed in Part I of this article (NTID Research Bulletin, 2(3), 1997), has been constrained by the need for a third-party operator. However, in 1997 we came a step closer to that "little black box" when two new ASR products came on the market, both capable of processing large vocabulary, speaker-adaptive, continuous speech.
    Fortuitously, Dragon Systems' NaturallySpeaking, Personal Edition and IBM's ViaVoice Gold were announced at the same time that the University of Rochester and the Rochester Institute of Technology hosted the Frank W. Lovejoy Symposium on Applications of Automatic Speech Recognition with Deaf and Hard of Hearing People in April 1997. Most of what follows in this article is based on observations from, and presentations and formal discussions during, that symposium.

Complexity of recognizing natural language

The first commercial applications of ASR were directed toward a more controlled form of language and not toward the more free-flowing form we commonly use in spoken conversation. Naturally-spoken language is much more difficult to recognize automatically than the formal, carefully organized language we use in dictating or reading aloud from text, resulting in a considerably higher error rate for the former. In his presentation, Michael Picheny (1997) talked about dysfluencies common to spontaneous speech--the "um's," "ah's" and "you know's," the false starts and the restarts, all of which complicate the task of speech recognition.
    However, as we move beyond the task of recognizing discrete words one at a time, we cannot depend on their acoustic characteristics alone. Harry Levitt (1997) indicated that while acoustic cues are important, they are not sufficient for accurate automatic recognition of naturally-spoken language and must be coupled in clever statistical/computational ways with linguistic cues. James Allen (1997) carried this point further. He distinguished among syntax, semantics, and pragmatics, all of which provide important cues, particularly in the recognition of spontaneously-spoken language, drawing close to what we regard as artificial intelligence.
    Another practical problem associated with efforts to record conversational and other informal forms of speech is that ASR can transcribe only what is spoken. No punctuation, new sentence, new paragraph, or other visual markers accompany the automatically-recognized text unless the speaker issues a voice command (or keystroke), such as "period" or "new paragraph," to instruct the computer to do so. Otherwise, the text would be displayed as an unbroken string of words.
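As a rough illustration of how spoken commands might become visual markers, the Python sketch below post-processes a flat stream of recognized words. The command vocabulary and the function name apply_voice_commands are hypothetical and chosen only for illustration; they are not taken from NaturallySpeaking, ViaVoice, or any other product.

```python
# Hypothetical command vocabulary; real products define their own.
PUNCTUATION = {"period": ".", "comma": ",", "question mark": "?"}
PARAGRAPH_COMMAND = "new paragraph"


def apply_voice_commands(words):
    """Turn a flat list of recognized words into punctuated paragraphs."""
    paragraphs = []
    current = ""
    capitalize = True  # capitalize the first word of each sentence
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2]).lower()  # two-word commands
        single = words[i].lower()                # one-word commands
        if pair == PARAGRAPH_COMMAND:
            if current:
                paragraphs.append(current)
            current, capitalize = "", True
            i += 2
        elif pair in PUNCTUATION:
            current += PUNCTUATION[pair]
            capitalize = PUNCTUATION[pair] in ".?"
            i += 2
        elif single in PUNCTUATION:
            current += PUNCTUATION[single]
            capitalize = PUNCTUATION[single] in ".?"
            i += 1
        else:
            word = words[i]
            if capitalize:
                word = word[0].upper() + word[1:]
            current += (" " if current else "") + word
            capitalize = False
            i += 1
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


spoken = "hello there comma how are you question mark new paragraph fine period"
print(apply_voice_commands(spoken.split()))
```

When the speaker issues no commands, the same function returns the words as a single unbroken run, which is the situation described above.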

Early generation continuous speech recognition

Mark Mandel (1997) of Dragon Systems extemporaneously demonstrated the pre-release alpha version of his company's first generation continuous speech product, NaturallySpeaking, Personal Edition. The resulting transcript appears in the sidebar on page 7 without correction (errors are underlined). The speaker added punctuation by voice as he spoke. Four errors appeared in this 129-word continuous speech production, making it better than 95% error-free and quite readable. A second version, NaturallySpeaking Deluxe, has since come on the market with several refinements.
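The arithmetic behind the "better than 95%" figure can be checked with a short Python sketch. It assumes the simplest possible measure, the share of words transcribed without error; the article does not say how errors were counted, and the function name percent_error_free is only illustrative.

```python
def percent_error_free(errors, total_words):
    """Percentage of words transcribed without error (simple word count)."""
    return 100.0 * (total_words - errors) / total_words


# Figures reported for the demonstration: 4 errors in 129 words.
accuracy = percent_error_free(errors=4, total_words=129)
print(f"{accuracy:.1f}% error-free")
```

By this measure, four errors in 129 words works out to roughly 97% error-free, consistent with the figure given above.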
    IBM has also released a second version of its continuous speech recognition product, ViaVoice Gold. I have used both products and both performed very well in a dictation or text-reading mode. I also used them in both lecture and conversational modes, where their performance was unsatisfactory. I do believe that with some speaker adaptations and the addition of an equipment accessory or two, performance could be considerably improved. However, whether it would be satisfactory would depend on what Kathryn Woodcock (1997) identified as an ergonomic model of the system's requirements, which combines the user and the device within an environment, performing a task.

What's next in speech recognition?

Michael Picheny (1997), an expert in ASR research, suggested that "the biggest research challenge over the next couple of years will be to come up with models for handling rapid conversational speech." He also talked about other priorities, including the need to deal with recognition problems posed by dysfluencies and accents in speech, background noise, and telephone characteristics, e.g., narrow bandwidth.
    These challenges notwithstanding, Picheny made several personal predictions, qualifying them by saying that their enabling systems "may have lots of errors, but technology should allow us to begin using them within the following timeframe." He felt that by the end of 1998, ASR systems should be capable of handling speech over the telephone at 20% error rates. In 1999, limited ASR systems capable of transcribing broadcast news (though perhaps not TV shows) should be available. By the year 2001, we may see ASR systems that transcribe meetings. There may also be hand-held devices (possibly in communication with more powerful remote ASR systems) that people can carry around in their hands to read transcriptions of actual ongoing conversations and events.

Personal thoughts from the symposium

Like others who participated in the symposium and/or have read its proceedings, I came away with new information and expectations. First, we need to proactively encourage major ASR developers to recognize deaf and hard-of-hearing people as members of a potential niche market. We should also take the initiative ourselves in adapting new systems and devices as needed.
    Second, the potential value of ASR in telephone communication involving deaf and hard-of-hearing people is quite significant, and efforts in this direction should be pursued vigorously. Third, as Kathryn Woodcock (1997) suggested, we should be attentive to the interaction of user, device, environment, and task in targeting ASR applications.
    And, finally, we need to understand that transcription errors in ASR cannot be eliminated in the foreseeable future. Nonetheless, we need to remember that deaf and hard-of-hearing people will not accept and use ASR-based systems and devices that do not meet their standards for accurate and reliable communication.

Conclusion

If I had a disappointment, it was that we had no time to delve more deeply into some topics and to open others. As a former teacher of deaf children, I wanted discussion about ASR's potential for English language learning, both at home and in school. As an educational researcher, I wanted discussion about how ASR might assist deaf and hard-of-hearing students in mainstreamed classes. And as a faculty member in a career-oriented college and university, I wanted to explore some thoughts about how ASR could be adapted to communication needs in the workplace. Another time...

References

Allen, J. (1997). Applications of automatic speech recognition to natural language and conversational speech. In R. Stuckless (Ed.), Frank W. Lovejoy Symposium on Applications of Automatic Speech Recognition with Deaf and Hard-of-Hearing People (pp. 33-39). Rochester, NY: Rochester Institute of Technology.
   Levitt, H. (1997). Automatic speech recognition: Exploring potential applications for people with hearing loss. Ibid. (pp. 3-14).
   Mandel, M. (1997). Discrete word recognition systems and beyond: Today and five-year projection. Ibid. (pp. 15-21).
   Picheny, M. (1997). Continuous speech recognition systems: Today and five-year projection. Ibid. (pp. 23-30).
   Woodcock, K. (1997). Ergonomics of applications of automatic speech recognition for deaf and hard of hearing people. Ibid. (pp. 41-53).

To obtain a copy of the proceedings from the Frank W. Lovejoy Symposium on Applications of Automatic Speech Recognition with Deaf and Hard of Hearing People, contact Ross Stuckless at ERSNVD@RIT.EDU and type "ASR Proceedings" on the subject line. You can also review and download the proceedings in their entirety at http://www.rit.edu/~ewcncp/Lovejoy.html