Speech Recognition and Natural Language Processing as a Highly Effective Means of Human-Computer Interaction
As computers have become more pervasive in many parts of society, it has become clear that
most people have great difficulty understanding and communicating with them. Often
users cannot simply state what they want done; instead, they must learn archaic commands or
non-intuitive procedures in order to get anything done. Furthermore, such communication is
often accomplished via slow, difficult-to-use devices such as mice or keyboards. It is
becoming clear that an easier, faster, and more intuitive method of communicating with
computers is needed. One proposed method is the combination of Speech Recognition and
Natural Language Processing software. Speech Recognition (SR) software
detects human speech and parses that speech in order to generate
a string of words, sounds, or phonemes representing what the person said. Natural Language
Processing (NLP) software processes the output of Speech Recognition
software and determines what the user meant. The NLP software can then translate what it
believes to be the user's command into an actual machine command and execute it.
How Speech Recognition and Natural Language Processing Work
Speech Recognition and Natural Language Processing systems are tremendously complex pieces
of software. While there are a variety of algorithms used to implement such systems, there
seems to be something of a standard understanding of the fundamental methods involved.
Speech Recognition works by disassembling sound into atomic units and then piecing them
back together, while Natural Language Processing attempts to translate words into ideas by
examining context, patterns, phrases, and so on.
Speech Recognition works by breaking down the sounds the hardware "hears" into smaller,
non-divisible sounds called phonemes. Phonemes are distinct, atomic units of sound. For
example, the word "those" is made up of three phonemes: the first is the "th" sound, the
second the hard "o" sound, and the final one the "s" sound. A series of phonemes makes up
a syllable, syllables make up words, and words make up sentences, which in turn represent
ideas and commands. Generally, a phoneme can be thought of as the sound made by one or
more letters in sequence with other letters. Once the SR software has broken sounds into
phonemes and syllables, a "best guess" algorithm maps those phonemes and syllables to
actual words.
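
To make the idea concrete, here is a small sketch in Python of how such a "best guess"
mapping might work. The tiny lexicon and the similarity scoring are invented purely for
illustration; real recognizers use much larger pronunciation dictionaries and statistical
acoustic models.

    # Toy "best guess" step: map a phoneme sequence to the most likely word.
    # The lexicon and scoring below are illustrative inventions, not a real
    # recognizer's data or algorithm.
    from difflib import SequenceMatcher

    LEXICON = {
        "those": ["th", "oh", "s"],
        "dose":  ["d", "oh", "s"],
        "toes":  ["t", "oh", "z"],
    }

    def best_guess(phonemes):
        """Return the lexicon word whose phoneme sequence best matches the input."""
        def similarity(word):
            return SequenceMatcher(None, phonemes, LEXICON[word]).ratio()
        return max(LEXICON, key=similarity)

    print(best_guess(["th", "oh", "s"]))  # -> "those"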
Once the Speech Recognition software has translated sound into words, Natural Language
Processing software takes over. NLP software parses strings of words into logical units
based on context, speech patterns, and more "best guess" algorithms. These logical units
of speech are then analyzed and finally translated, using the same principles that
produced them, into actual commands the computer can understand.
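
As a rough sketch of what this translation step might look like, consider the following
Python fragment, which maps recognized word strings onto machine commands with a few
hand-written patterns. The patterns and command names are hypothetical; a real NLP system
would rely on grammars and statistical models rather than a handful of rules.

    # Toy rule-based parser: recognized word strings are matched against
    # hand-written patterns and mapped to (hypothetical) machine commands.
    import re

    RULES = [
        (re.compile(r"open (?:the )?file (\w+)"), "OPEN_FILE {0}"),
        (re.compile(r"delete (?:the )?file (\w+)"), "DELETE_FILE {0}"),
        (re.compile(r"take a memo"), "START_DICTATION"),
    ]

    def parse(words):
        """Translate a recognized word string into a machine command, if any."""
        for pattern, template in RULES:
            match = pattern.search(words.lower())
            if match:
                return template.format(*match.groups())
        return None  # no rule matched; the utterance was not understood

    print(parse("Open the file budget"))  # -> "OPEN_FILE budget"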
Optimally, Speech Recognition and Natural Language Processing software can work with each
other in a non-linear fashion in order to facilitate better comprehension of what the user
says and means. For example, an SR package could ask an NLP package whether it thinks the
"tue" sound means "to", "two", or "too", or whether it is part of a larger word such as
"tutelage". The NLP system could make a suggestion to the SR system by analyzing what
makes the most sense given the context of what the user has previously said. It could work
the other way around as well. For example, an NLP system could query an SR system to see
if the user seemed to emphasize a certain word or phrase in a given sentence. If the NLP
system knows when the user emphasizes certain words, it may be able to determine more
accurately what the user wants (e.g., the sentence "I don't like that!" spoken with
emphasis on "that" differs subtly, yet importantly, from the same sentence spoken with
emphasis on "I"). SR systems may be able to determine which sounds or words were
emphasized by analyzing the volume, tone, and speed of the phonemes spoken by the user and
report that information back to the NLP system.
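
A minimal sketch of one direction of this feedback loop follows, assuming a toy bigram
language model: the SR side proposes candidate words for the ambiguous "tue" sound, and
the NLP side ranks them by how well they fit the preceding word. The counts are invented
for illustration.

    # NLP side: rank the SR system's candidate words by context fit, using
    # invented bigram counts as a stand-in for a real language model.
    BIGRAM_COUNTS = {
        ("want", "to"): 90, ("want", "two"): 2, ("want", "too"): 1,
        ("buy", "two"): 40, ("buy", "to"): 3,   ("buy", "too"): 2,
    }

    def disambiguate(previous_word, candidates):
        """Pick the candidate that best fits the word spoken just before it."""
        return max(candidates,
                   key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))

    # SR side: the "tue" sound could be any of these words.
    candidates = ["to", "two", "too"]
    print(disambiguate("want", candidates))  # -> "to"   ("I want to ...")
    print(disambiguate("buy", candidates))   # -> "two"  ("buy two ...")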
Problems with Speech Recognition and Natural Language Processing
Why, then, isn't the use of Speech Recognition and Natural Language Processing more
widespread? Thus far, it has proven too difficult and too impractical to provide these
services on most systems. SR has been plagued by problems stemming from the difficulties
of understanding different types of voices (e.g., male vs. female voices), parsing sounds
when people have different dialects (e.g., different accents), distinguishing between
background noise and commands issued to the computer, and so on. Moreover, if SR is to
work in real time, the software must have access to a large, fast database of known words
and the ability to add more words. Natural Language Processing software's problems are
even more difficult to overcome than Speech Recognition's. NLP must be able to understand
sentences peppered with verbal artifacts, slang, synonyms, ambiguities, and colloquialisms.
Speech Recognition packages have begun to deal with some of these problems, and the
results have been promising. Historically, SR software has been plagued by problems
stemming from differences in pronunciation, enunciation, and speech patterns. For example,
the way a child with a high-pitched voice and a southern drawl pronounces "gravel" may
differ significantly from how a deep-voiced man from the northeast pronounces the same
word; yet adept SR software should be able to determine that both people are speaking the
same word. This can be accomplished by allowing variable patterns of phonemes to make up a
given word. Of course, doing so increases the size of the database needed to map phonemes
to words. However, this issue is becoming less problematic as computers become faster and
cheaper. Indeed, this problem has become trivial enough that some computerized telephone
services have begun using speech recognition software to gather information from users
(admittedly, the vocabulary of these systems is extremely limited; e.g., a computer will
ask a user some simple questions with "yes" or "no" answers). The problem of
distinguishing speech directed at the computer from background noise has not been dealt
with as successfully. Currently, users of SR packages often must either work in an
environment with minimal background noise or wear a headset with a sampling microphone
inches from their mouths. Furthermore, users often have to "inform" the computer when it
is being spoken to, typically by pressing some button (e.g., a foot pedal) or taking some
similar action. Certainly, this is not the most desirable user interface: it is
inefficient, taxing, and not "user-centered."
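
The pronunciation-variant idea above can be sketched as a lexicon that stores several
phoneme patterns per word, so that different speakers map onto the same entry. The
variants shown are invented; real pronunciation dictionaries use a standard phonetic
alphabet.

    # Toy lexicon with multiple pronunciation variants per word. Storing more
    # variants makes matching more robust but enlarges the database, as the
    # text notes. The phoneme spellings are illustrative inventions.
    PRONUNCIATIONS = {
        "gravel": [
            ["g", "r", "ae", "v", "ax", "l"],  # one speaker's pronunciation
            ["g", "r", "ae", "v", "l"],        # a reduced variant
        ],
    }

    def lookup(phonemes):
        """Return the word matching any of its stored pronunciation variants."""
        for word, variants in PRONUNCIATIONS.items():
            if phonemes in variants:
                return word
        return None

    print(lookup(["g", "r", "ae", "v", "l"]))  # -> "gravel"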
Verbal artifacts must also be dealt with. Verbal artifacts are words or phrases that are
spoken but add little, if any, content to a sentence. For example, the sentence "Golly, I
sure love pudding!" contains two verbal artifacts: "Golly" and "sure". These two words add
little to the meaning the speaker was attempting to convey. NLP software must be able to
identify these types of words for what they are and react appropriately. NLP also needs to
recognize that human beings are capable of conveying a single idea in synonymous ways.
This is not as simple a task as it might initially seem. For instance, an NLP system must
be able to understand that a user saying "Take a memo." is conveying virtually the same
idea as a user saying "Why don't you record this for me?" The first sentence is direct and
rather unambiguous, but the second comes in the form of a question even though it is
really a request. Surely, one can understand where some confusion could arise in such a
situation. Researchers exploring NLP have not yet been able to develop systems robust
enough to handle these dilemmas. Currently, these problems are dealt with by simply
"hard-coding" certain phrases and words that are synonymous. Ultimately, NLP systems will
need to recognize and react to such synonyms based on the context in which they occur, the
user's habits (e.g., does the user normally make requests in the form of a question, or is
the user actually asking a question?), and so on. Moreover, NLP systems must have an
extraordinary understanding of grammatical rules, practices, and structures. Furthermore,
truly adept NLP systems would need to identify and react accordingly to sarcasm, humor,
rhetorical questions, and the like.
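
Two of the techniques described here, stripping verbal artifacts and "hard-coding"
synonymous phrasings, can be sketched as follows. The filler-word list and phrase table
are invented for illustration.

    # Toy sketch: strip filler words, then map the cleaned sentence onto a
    # hard-coded canonical command. All word lists and phrases are invented.
    import re

    ARTIFACTS = {"golly", "sure", "um", "uh", "well"}

    SYNONYMOUS_PHRASES = {
        "take a memo": "START_DICTATION",
        "why don't you record this for me": "START_DICTATION",
    }

    def strip_artifacts(sentence):
        """Drop punctuation and filler words that add little content."""
        words = re.findall(r"[a-z']+", sentence.lower())
        return " ".join(w for w in words if w not in ARTIFACTS)

    def interpret(sentence):
        """Map a cleaned sentence to its hard-coded command, if one exists."""
        return SYNONYMOUS_PHRASES.get(strip_artifacts(sentence))

    print(strip_artifacts("Golly, I sure love pudding!"))  # -> "i love pudding"
    print(interpret("Why don't you record this for me?"))  # -> "START_DICTATION"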
Benefits of Speech Recognition and Natural Language Processing
Why, then, should we try to implement Speech Recognition and Natural Language Processing
if it is so hard to do? Simply put, SR and NLP could revolutionize the entire field of
Human-Computer Interaction like nothing before. SR and NLP can completely abstract
Human-Computer Interaction, eliminating the need to understand anything about the
computer's internal workings or how to accomplish certain tasks.
"Indeed, one should be able
to say to a computer 'Do what I meant, not what I said.' "
What will be some specific benefits of SR and NLP interfaces?
SR and NLP will allow real-time language translation. If a computer can figure out what
words one utters and understand what one actually means, it becomes a trivial task to
translate an idea from one language to another. That is the key: computers with capable
Natural Language Processing abilities will begin to act on the ideas their users have, not
the commands explicitly given to them. Indeed, one should be able to say to a computer,
"Do what I meant, not what I said."
SR and NLP technologies could also conceivably eliminate the need to physically interact
with computers. This means no more having to sit down in front of the computer and
manually manipulate a keyboard or mouse. Instead, we will have the freedom to interact
with a computer from anywhere in relatively close proximity to it. For example, one could
instruct a computer to find a recipe for Chicken Kiev while hanging wallpaper. More
importantly, people with certain types of disabilities may be able to interact with a
computer more effectively. For example, a person with a broken arm would be able to work
with computers easily, whereas now a broken arm would almost certainly impair one's
ability to operate more traditional interface devices such as a keyboard or a mouse.
SR and NLP have the added benefit of being much faster than many other types of
interfaces. Most people can speak much faster than they can type. If a user can convey an
idea in four seconds that would otherwise take 20 or 30 seconds to type, productivity
could be greatly improved. Clearly, this would be highly desirable.
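
The arithmetic behind that comparison is simple. Assuming ballpark rates of roughly 150
words per minute spoken and 25 words per minute for a slower typist (figures supplied here
for illustration, not measurements):

    # Rough throughput comparison; the rates are illustrative ballpark figures.
    words = 10                  # a short sentence or command
    speak_s = words / 150 * 60  # ~4 seconds to say it at 150 words per minute
    type_s = words / 25 * 60    # ~24 seconds to type it at 25 words per minute
    print(f"spoken: {speak_s:.0f}s, typed: {type_s:.0f}s")  # spoken: 4s, typed: 24s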
Perhaps the greatest benefit NLP will yield will not be directly in the field of computer
input and output, but in the ability of a computer to understand a user's desires so
profoundly that it can act autonomously on the user's behalf. For example, one will be
able to tell a computer to search the Internet for information on recreational activities
in the Yellowstone Park area. An electronic, intelligent "agent" could be given a command
from the NLP system instructing it to search for this data. Because the NLP system will
have understood the meaning of what the user requested, the agent could be instructed to
search for a variety of activities: camping, snowmobiling, cross-country skiing, etc.
Again, the key idea is that NLP will be able to transform words and phrases into ideas.
Once a computer can accurately understand human ideas, true artificial intelligence won't
be too far behind.
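
One way to picture the agent step is as a table that expands an understood idea into
several concrete searches; the category table below is, of course, an invented stand-in
for genuine understanding.

    # Toy agent: expand one understood idea into several concrete searches.
    # The expansion table is an invented stand-in for real understanding.
    IDEA_EXPANSIONS = {
        "recreational activities": [
            "camping", "snowmobiling", "cross-country skiing", "hiking",
        ],
    }

    def agent_searches(idea, place):
        """Turn an understood idea plus a place into a list of search queries."""
        topics = IDEA_EXPANSIONS.get(idea, [idea])
        return [f"{topic} near {place}" for topic in topics]

    for query in agent_searches("recreational activities", "Yellowstone"):
        print(query)  # e.g. "camping near Yellowstone"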
Areas Where Speech Recognition and Natural Language Processing May Not Be Beneficial
Of course, there almost certainly are some areas where Speech Recognition and Natural
Language Processing are not the best forms of Human-Computer Interaction. For example,
mathematical equations or other logical languages (such as many programming languages)
are, generally speaking, difficult to communicate efficiently via the spoken word. A
keyboard- or handwriting-recognition-based interaction may be more appropriate.
Furthermore, other forms of Human-Computer Interaction (HCI) offer security advantages. If
a user wishes to convey some sensitive information to a computer, SR and NLP technologies
dictate that the user speak aloud, allowing anyone nearby to overhear what is transpiring.
Again, keyboards, mice, or other silent forms of Human-Computer interface may be more
appropriate in such a situation. What's more, speech can be, in certain situations, a very
slow and inefficient way of interacting with a computer. For instance, if a user wishes to
select several dozen options from a presented list, it would take a great deal of time to
list off each option individually. In such a scenario it might be more appropriate to use
some other form of HCI that lends itself to manipulating multiple elements quickly.
Clearly, Speech Recognition and Natural Language Processing do offer distinct advantages
over other forms of Human-Computer Interaction. The ability to abstract Human-Computer
Interaction would revolutionize the way people interact with their environment.
Unfortunately, implementing such an ambitious scheme presents some very difficult
problems. Speech Recognition and Natural Language Processing require tremendously complex
software that can make sense of human speech (a monumental task in itself). Nevertheless,
initial attempts at SR and NLP technologies have yielded promising results.
This paper was written for a User Interface class I took in the Computer Science
Department at the University of Colorado. For a more detailed explanation of the mechanics
of the systems I've described, visit Microsoft's Persona Project page. For a much more
visionary and interesting discussion of the promise of Speech Recognition and Natural
Language Processing (and other matters relating to the future of Human-Computer
Interaction), read Nicholas Negroponte's Being Digital.