Speech Recognition and Natural Language Processing as a Highly Effective Means of Human-Computer Interaction
As computers have become more pervasive in many parts of society, it has become clear that
most people have great difficulty understanding and communicating with them. Often
users cannot simply state what they want done; instead, they must learn archaic commands or
non-intuitive procedures in order to get anything done. Furthermore, such communication is
often accomplished via slow, difficult-to-use devices such as mice or keyboards. It is
becoming clear that an easier, faster, and more intuitive method of communicating with
computers is needed. One proposed method is the combination of Speech Recognition and
Natural Language Processing software. Speech Recognition (SR) software
detects human speech and parses that speech in order to generate
a string of words, sounds, or phonemes representing what the person said. Natural Language
Processing (NLP) software processes the output of Speech Recognition
software and determines what the user meant. The NLP software can then translate what it
believes to be the user's command into an actual machine command and execute it.
How Speech Recognition and Natural Language Processing Work
Speech Recognition and Natural Language Processing systems are tremendously complex pieces
of software. While there are a variety of algorithms used to implement such systems, there
seems to be something of a standard understanding of the fundamental methods involved.
Speech Recognition works by disassembling sound into atomic units and then piecing them
back together, while Natural Language Processing attempts to translate words into ideas by
examining context, patterns, phrases, and so on.
Speech Recognition works by breaking down the sounds the hardware "hears" into smaller,
non-divisible sounds called phonemes. Phonemes are distinct, atomic units of sound. For
example, the word "those" is made up of three phonemes: the first is the "th" sound, the
second the hard "o" sound, and the final one the "s" sound. A series of phonemes makes up
a syllable, syllables make up words, and words make up sentences, which in turn represent
ideas and commands. Generally, a phoneme can be thought of as the sound made by one or
more letters in sequence with other letters. Once the SR software has broken sounds into
phonemes and syllables, a "best guess" algorithm maps those phonemes and syllables to
actual words.
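
To make the idea concrete, here is a small sketch in Python of how such a "best guess"
mapping might work. The tiny lexicon and the similarity scoring are invented purely for
illustration; real recognizers use much larger pronunciation dictionaries and statistical
acoustic models.

    # Toy "best guess" step: map a phoneme sequence to the most likely word.
    # The lexicon and scoring below are illustrative inventions, not a real
    # recognizer's data or algorithm.
    from difflib import SequenceMatcher

    LEXICON = {
        "those": ["th", "oh", "s"],
        "dose":  ["d", "oh", "s"],
        "toes":  ["t", "oh", "z"],
    }

    def best_guess(phonemes):
        """Return the lexicon word whose phoneme sequence best matches the input."""
        def similarity(word):
            return SequenceMatcher(None, phonemes, LEXICON[word]).ratio()
        return max(LEXICON, key=similarity)

    print(best_guess(["th", "oh", "s"]))  # -> "those"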
Once the Speech Recognition software has translated sound into words, Natural Language
Processing software takes over. NLP software parses strings of words into logical units
based on context, speech patterns, and more "best guess" algorithms. These logical units
of speech are then analyzed and finally translated, using the same principles that
produced them, into actual commands the computer can understand.
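
As a rough sketch of what this translation step might look like, consider the following
Python fragment, which maps recognized word strings onto machine commands with a few
hand-written patterns. The patterns and command names are hypothetical; a real NLP system
would rely on grammars and statistical models rather than a handful of rules.

    # Toy rule-based parser: recognized word strings are matched against
    # hand-written patterns and mapped to (hypothetical) machine commands.
    import re

    RULES = [
        (re.compile(r"open (?:the )?file (\w+)"), "OPEN_FILE {0}"),
        (re.compile(r"delete (?:the )?file (\w+)"), "DELETE_FILE {0}"),
        (re.compile(r"take a memo"), "START_DICTATION"),
    ]

    def parse(words):
        """Translate a recognized word string into a machine command, if any."""
        for pattern, template in RULES:
            match = pattern.search(words.lower())
            if match:
                return template.format(*match.groups())
        return None  # no rule matched; the utterance was not understood

    print(parse("Open the file budget"))  # -> "OPEN_FILE budget"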
Optimally, Speech Recognition and Natural Language Processing software can work with each
other in a non-linear fashion in order to facilitate better comprehension of what the user
says and means. For example, an SR package could ask an NLP package whether it thinks the
"tue" sound means "to", "two", or "too", or whether it is part of a larger word such as
"tutelage". The NLP system could make a suggestion to the SR system by analyzing what
makes the most sense given the context of what the user has previously said. It could work
the other way around as well. For example, an NLP system could query an SR system to see
if the user seemed to emphasize a certain word or phrase in a given sentence. If the NLP
system knows when the user emphasizes certain words, it may be able to determine more
accurately what the user wants (e.g., the sentence "I don't like that!" spoken with
emphasis on "that" differs subtly, yet importantly, from the same sentence spoken with
emphasis on "I"). SR systems may be able to determine which sounds or words were
emphasized by analyzing the volume, tone, and speed of the phonemes spoken by the user and
report that information back to the NLP system.
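
A minimal sketch of one direction of this feedback loop follows, assuming a toy bigram
language model: the SR side proposes candidate words for the ambiguous "tue" sound, and
the NLP side ranks them by how well they fit the preceding word. The counts are invented
for illustration.

    # NLP side: rank the SR system's candidate words by context fit, using
    # invented bigram counts as a stand-in for a real language model.
    BIGRAM_COUNTS = {
        ("want", "to"): 90, ("want", "two"): 2, ("want", "too"): 1,
        ("buy", "two"): 40, ("buy", "to"): 3,   ("buy", "too"): 2,
    }

    def disambiguate(previous_word, candidates):
        """Pick the candidate that best fits the word spoken just before it."""
        return max(candidates,
                   key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))

    # SR side: the "tue" sound could be any of these words.
    candidates = ["to", "two", "too"]
    print(disambiguate("want", candidates))  # -> "to"   ("I want to ...")
    print(disambiguate("buy", candidates))   # -> "two"  ("buy two ...")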
Problems with Speech Recognition and Natural Language Processing
Why, then, isn't the use of Speech Recognition and Natural Language Processing more
widespread? Thus far, it has proven too difficult and too impractical to provide these
services on most systems. SR has been plagued by problems stemming from the difficulties
of understanding different types of voices (e.g., male vs. female voices), parsing sounds
when people have different dialects (e.g., different accents), distinguishing between
background noise and commands issued to the computer, and so on. Moreover, if SR is to
work in real time, the software must have access to a large, fast database of known words
and the ability to add more words. Natural Language Processing software's problems are
even more difficult to overcome than Speech Recognition's. NLP must be able to understand
sentences peppered with verbal artifacts, slang, synonyms, ambiguities, and colloquialisms.
Speech Recognition packages have begun to deal with some of these problems, and the
results have been promising. Historically, SR software has been plagued by problems
stemming from differences in pronunciation, enunciation, and speech patterns. For example,
the way a child with a high-pitched voice and a southern drawl pronounces "gravel" may
differ significantly from how a deep-voiced man from the northeast pronounces the same
word; yet adept SR software should be able to determine that both people are speaking the
same word. This can be accomplished by allowing variable patterns of phonemes to make up a
given word. Of course, doing so increases the size of the database needed to map phonemes
to words. However, this issue is becoming less problematic as computers become faster and
cheaper. Indeed, this problem has become trivial enough that some computerized telephone
services have begun using speech recognition software to gather information from users
(admittedly, the vocabulary of these systems is extremely limited; e.g., a computer will
ask a user some simple questions with "yes" or "no" answers). The problem of
distinguishing speech directed at the computer from background noise has not been dealt
with as successfully. Currently, users of SR packages often must either work in an
environment with minimal background noise or wear a headset with a sampling microphone
inches from their mouths. Furthermore, users often have to "inform" the computer when it
is being spoken to, typically by pressing some button (e.g., a foot pedal) or taking some
similar action. Certainly, this is not the most desirable user interface: it is
inefficient, taxing, and not "user-centered."
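
The pronunciation-variant idea above can be sketched as a lexicon that stores several
phoneme patterns per word, so that different speakers map onto the same entry. The
variants shown are invented; real pronunciation dictionaries use a standard phonetic
alphabet.

    # Toy lexicon with multiple pronunciation variants per word. Storing more
    # variants makes matching more robust but enlarges the database, as the
    # text notes. The phoneme spellings are illustrative inventions.
    PRONUNCIATIONS = {
        "gravel": [
            ["g", "r", "ae", "v", "ax", "l"],  # one speaker's pronunciation
            ["g", "r", "ae", "v", "l"],        # a reduced variant
        ],
    }

    def lookup(phonemes):
        """Return the word matching any of its stored pronunciation variants."""
        for word, variants in PRONUNCIATIONS.items():
            if phonemes in variants:
                return word
        return None

    print(lookup(["g", "r", "ae", "v", "l"]))  # -> "gravel"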
Verbal artifacts must also be dealt with. Verbal artifacts are words or phrases that are
spoken but add little, if any, content to a sentence. For example, the sentence "Golly, I
sure love pudding!" contains two verbal artifacts: "Golly" and "sure". These two words add
little to the meaning the speaker was attempting to convey. NLP software must be able to
identify these types of words for what they are and react appropriately. NLP also needs to
recognize that human beings are capable of conveying a single idea in synonymous ways.
This is not as simple a task as it might initially seem. For instance, an NLP system must
be able to understand that a user saying "Take a memo." is conveying virtually the same
idea as a user saying "Why don't you record this for me?" The first sentence is direct and
rather unambiguous, but the second comes in the form of a question even though it is
really a request. Surely, one can understand where some confusion could arise in such a
situation. Researchers exploring NLP have not yet been able to develop systems robust
enough to handle these dilemmas. Currently, these problems are dealt with by simply
"hard-coding" certain phrases and words that are synonymous. Ultimately, NLP systems will
need to recognize and react to such synonyms based on the context in which they occur, the
user's habits (e.g., does the user normally make requests in the form of a question, or is
the user actually asking a question?), and so on. Moreover, NLP systems must have an
extraordinary understanding of grammatical rules, practices, and structures. Furthermore,
truly adept NLP systems would need to identify and react accordingly to sarcasm, humor,
rhetorical questions, and the like.
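
Two of the techniques described here, stripping verbal artifacts and "hard-coding"
synonymous phrasings, can be sketched as follows. The filler-word list and phrase table
are invented for illustration.

    # Toy sketch: strip filler words, then map the cleaned sentence onto a
    # hard-coded canonical command. All word lists and phrases are invented.
    import re

    ARTIFACTS = {"golly", "sure", "um", "uh", "well"}

    SYNONYMOUS_PHRASES = {
        "take a memo": "START_DICTATION",
        "why don't you record this for me": "START_DICTATION",
    }

    def strip_artifacts(sentence):
        """Drop punctuation and filler words that add little content."""
        words = re.findall(r"[a-z']+", sentence.lower())
        return " ".join(w for w in words if w not in ARTIFACTS)

    def interpret(sentence):
        """Map a cleaned sentence to its hard-coded command, if one exists."""
        return SYNONYMOUS_PHRASES.get(strip_artifacts(sentence))

    print(strip_artifacts("Golly, I sure love pudding!"))  # -> "i love pudding"
    print(interpret("Why don't you record this for me?"))  # -> "START_DICTATION"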
Benefits of Speech Recognition and Natural Language Processing
Why, then, should we try to implement Speech Recognition and Natural Language Processing
if it is so hard to do? Simply put, SR and NLP could revolutionize the entire field of
Human-Computer Interaction like nothing before. SR and NLP can completely abstract
Human-Computer Interaction, eliminating the need to understand anything about the
computer's internal workings or how to accomplish certain tasks.
"Indeed, one should be able
to say to a computer 'Do what I meant, not what I said.' "
What will be some specific benefits of SR and NLP interfaces?
SR and NLP will allow real-time language translation. If a computer can figure out what
words one utters and understand what one actually means, it becomes a trivial task to
translate an idea from one language to another. That is the key: computers with capable
Natural Language Processing abilities will begin to act on the ideas their users have, not
the commands explicitly given to them. Indeed, one should be able to say to a computer,
"Do what I meant, not what I said."
SR and NLP technologies could also conceivably eliminate the need to physically interact
with computers. This means no more having to sit down in front of the computer and
manually manipulate a keyboard or mouse. Instead, we will have the freedom to interact
with a computer from anywhere in relatively close proximity to it. For example, one could
instruct a computer to find a recipe for Chicken Kiev while hanging wallpaper. More
importantly, people with certain types of disabilities may be able to interact with a
computer more effectively. For example, a person with a broken arm would be able to work
with computers easily, whereas now a broken arm would almost certainly impair one's
ability to operate more traditional interface devices such as a keyboard or a mouse.
SR and NLP have the added benefit of being much faster than many other types of
interfaces. Most people can speak much faster than they can type. If a user can convey an
idea in four seconds that would otherwise take 20 or 30 seconds to type, productivity
could be greatly improved. Clearly, this would be highly desirable.
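
The arithmetic behind that comparison is simple. Assuming ballpark rates of roughly 150
words per minute spoken and 25 words per minute for a slower typist (figures supplied here
for illustration, not measurements):

    # Rough throughput comparison; the rates are illustrative ballpark figures.
    words = 10                  # a short sentence or command
    speak_s = words / 150 * 60  # ~4 seconds to say it at 150 words per minute
    type_s = words / 25 * 60    # ~24 seconds to type it at 25 words per minute
    print(f"spoken: {speak_s:.0f}s, typed: {type_s:.0f}s")  # spoken: 4s, typed: 24s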
Perhaps the greatest benefit NLP will yield will not be directly in the field of computer
input and output, but in the ability of a computer to understand a user's desires so
profoundly that it can act autonomously on the user's behalf. For example, one will be
able to tell a computer to search the Internet for information on recreational activities
in the Yellowstone Park area. An electronic, intelligent "agent" could be given a command
from the NLP system instructing it to search for this data. Because the NLP system will
have understood the meaning of what the user requested, the agent could be instructed to
search for a variety of activities: camping, snowmobiling, cross-country skiing, etc.
Again, the key idea is that NLP will be able to transform words and phrases into ideas.
Once a computer can accurately understand human ideas, true artificial intelligence won't
be too far behind.
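
One way to picture the agent step is as a table that expands an understood idea into
several concrete searches; the category table below is, of course, an invented stand-in
for genuine understanding.

    # Toy agent: expand one understood idea into several concrete searches.
    # The expansion table is an invented stand-in for real understanding.
    IDEA_EXPANSIONS = {
        "recreational activities": [
            "camping", "snowmobiling", "cross-country skiing", "hiking",
        ],
    }

    def agent_searches(idea, place):
        """Turn an understood idea plus a place into a list of search queries."""
        topics = IDEA_EXPANSIONS.get(idea, [idea])
        return [f"{topic} near {place}" for topic in topics]

    for query in agent_searches("recreational activities", "Yellowstone"):
        print(query)  # e.g. "camping near Yellowstone"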
Areas Where Speech Recognition and Natural Language Processing May Not Be Beneficial
Of course, there almost certainly are some areas where Speech Recognition and Natural
Language Processing are not the best forms of Human-Computer Interaction. For example,
mathematical equations or other logical languages (such as many programming languages)
are, generally speaking, difficult to communicate efficiently via the spoken word. A
keyboard- or handwriting-recognition-based interaction may be more appropriate.
Furthermore, other forms of Human-Computer Interaction (HCI) offer security advantages. If
a user wishes to convey some sensitive information to a computer, SR and NLP technologies
dictate that the user speak aloud, allowing anyone nearby to overhear what is transpiring.
Again, keyboards, mice, or other silent forms of Human-Computer interface may be more
appropriate in such a situation. What's more, speech can be, in certain situations, a very
slow and inefficient way of interacting with a computer. For instance, if a user wishes to
select several dozen options from a presented list, it would take a great deal of time to
list off each option individually. In such a scenario it might be more appropriate to use
some other form of HCI that lends itself to manipulating multiple elements quickly.
Clearly, Speech Recognition and Natural Language Processing do offer distinct advantages
over other forms of Human-Computer Interaction. The ability to abstract Human-Computer
Interaction would revolutionize the way people interact with their environment.
Unfortunately, implementing such an ambitious scheme presents some very difficult
problems. Speech Recognition and Natural Language Processing require tremendously complex
software that can make sense of human speech (a monumental task in itself). Nevertheless,
initial attempts at SR and NLP technologies have yielded promising results.
This paper was written for a User Interface class I took in the Computer Science
Department at the University of Colorado. For a more detailed explanation of the mechanics
of the systems I've described, visit Microsoft's Persona Project page. For a much more
visionary and interesting discussion of the promise of Speech Recognition and Natural
Language Processing (and other matters relating to the future of Human-Computer
Interaction), read Nicholas Negroponte's Being Digital.