Abstract :
The days when
you had to keep staring at the computer screen and frantically hit a key or click
the mouse for the computer to respond to your commands may soon be a thing of the past.
Today you can stretch out, relax, and tell your computer to do your bidding.
This has been made possible by ASR (Automatic Speech Recognition) technology.
ASR technology
would be particularly welcomed by automated telephone exchange operators, doctors,
and lawyers, besides others who seek freedom from tiresome conventional computer
operation using the keyboard and mouse. It is suitable for applications in which
computers are used to provide routine information and services. ASR’s direct
speech-to-text dictation offers a significant advantage over traditional transcription.
With further refinement of the technology, keying in text may become a thing of the
past. ASR offers a solution to this fatigue-causing procedure by converting speech
into text.
ASR technology
is presently capable of achieving recognition accuracies of 95% to 98%, but only under
ideal conditions. The technology is still far from perfect in the uncontrolled
real world. The roots of this technology
can be traced to 1968, when the term Information Technology hadn’t even been coined.
Americans had only begun to realize the vast potential of computers. The Hollywood
blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL-9000,
which to date is a cult figure in both science fiction and the world
of computing. Even today almost every speech recognition technologist dreams of
designing a HAL-like computer with a clear voice and the ability to understand
normal speech. Though ASR technology is still not as versatile as the imaginary
HAL, it can nevertheless be used to make life easier. New application-specific standard
products, interactive error-recovery techniques, and better voice-activated user
interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone owners
to talk to computers. By offering a natural human interface to computers, ASR
finds applications in telephone call centers (such as airline flight information
systems), learning devices, toys, etc.
HOW DOES THE ASR TECHNOLOGY WORK?
When a person
speaks, compressed air from the lungs is forced through the vocal tract as a sound
wave that varies with changes in lung pressure and in the vocal tract.
This acoustic wave is interpreted as speech when it falls upon a person’s ear. In
any machine that records or transmits the human voice, the sound wave is converted
into an electrical analogue signal using a microphone. When we speak into a telephone
receiver, for instance, its microphone converts the acoustic wave into an electrical
analogue signal that is transmitted through the telephone network. The strength of the
electrical signal from the microphone varies in amplitude over time, and the signal is
referred to as an analogue signal or an analogue waveform. If the signal results from
speech, it is known as a speech waveform. Speech waveforms have the characteristic
of being continuous in both time and amplitude.
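
Because a speech waveform is continuous in both time and amplitude, a computer has to turn it into a finite list of numbers before it can process it. The Python sketch below is only an illustration of that idea, not a description of any particular ASR product: it samples a signal at evenly spaced instants and rounds each amplitude to a whole-number level. The function names, the 200 Hz sine wave standing in for speech, the 8000 Hz sampling rate, and the 8-bit depth are all assumptions chosen for the example.

```python
import math

def sample_waveform(signal, duration_s, sample_rate_hz):
    """Read a continuous-time signal at evenly spaced instants (discrete time)."""
    count = round(duration_s * sample_rate_hz)
    return [signal(i / sample_rate_hz) for i in range(count)]

def quantize(samples, bits):
    """Round amplitudes in [-1, 1] to signed integer levels (discrete amplitude)."""
    max_level = 2 ** (bits - 1) - 1
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return [round(s * max_level) for s in clipped]

def tone(t):
    # Hypothetical stand-in for a speech waveform: a 200 Hz sine wave.
    return math.sin(2 * math.pi * 200 * t)

# Assumed settings: 10 ms of signal, 8000 samples per second, 8 bits per sample.
samples = sample_waveform(tone, duration_s=0.01, sample_rate_hz=8000)
levels = quantize(samples, bits=8)
print(len(levels), levels[:5])
```

A higher sampling rate or more bits per sample keeps the list of numbers closer to the original analogue waveform, at the cost of more data to process.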
THE SPEECH RECOGNITION PROCESS
Any speech recognition
system involves seven major steps:
1. Converting sounds into electrical signals: when we speak into a microphone, it
converts the sound waves into an electrical analogue signal whose strength varies
in amplitude over time. A telephone receiver, for instance, does this before the
signal is transmitted through the telephone network.
2. Background noise removal: the ASR program removes
background noise and retains the words that you have spoken.
3. Breaking up words into phonemes: the words are
broken down into individual sounds, known as phonemes, which are the smallest sound
units discernible. For each short interval of time, a feature value is computed
from the wave. In this way, the wave is divided into small parts, called phonemes.
4. Matching and choosing character combinations:
this is the most complex phase. The program has a big dictionary of popular words
that exist in the language. Each phoneme is matched against the sounds and converted
into the appropriate character group. This is where the problem begins. The program
checks and compares words that sound similar to what it has heard. All these similar
words are collected.
5. Language analysis: here the program checks whether the language
allows a particular syllable to appear after another.
6. Grammar check: after that, a grammar-checking package tries
to find out whether or not the combination of words makes any sense.
7. Finally, the words constituting the speech are output as text. Speech
recognition programs come with their own word processor; some can also work with
other word-processing packages such as MS Word and WordPerfect.
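
Some of the steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any real recognizer is built: `frame_energies` stands in for the "feature value per short interval of time" in step 3 (mean energy is just one possible feature), and `plausible_sentences` stands in for steps 4 to 6 by collecting similar-sounding candidate words and keeping only the combinations that a small table of allowed word pairs accepts. The candidate lists and the word-pair table are invented for the example.

```python
from itertools import product

def frame_energies(samples, frame_len):
    """One feature value (mean energy) for each short slice of the wave."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [sum(x * x for x in samples[s:s + frame_len]) / frame_len
            for s in starts]

# Hypothetical data: similar-sounding candidates collected for each spoken word.
CANDIDATES = [["I"], ["want"], ["to", "two", "too"], ["write", "right"]]

# Hypothetical table of word pairs that the language is assumed to allow.
ALLOWED_PAIRS = {
    ("I", "want"), ("want", "to"), ("want", "two"),
    ("to", "write"), ("too", "right"),
}

def is_plausible(words, allowed_pairs):
    """True when every pair of neighbouring words appears in the table."""
    return all(pair in allowed_pairs for pair in zip(words, words[1:]))

def plausible_sentences(candidates, allowed_pairs):
    """Try every combination of candidates and keep the ones that pass."""
    return [" ".join(words) for words in product(*candidates)
            if is_plausible(words, allowed_pairs)]

print(frame_energies([1, -1, 2, 2, 0, 0, 3], frame_len=2))
print(plausible_sentences(CANDIDATES, ALLOWED_PAIRS))
```

Trying every combination is only workable for very short phrases. It is used here because it makes the idea of collecting similar words first and filtering them afterwards easy to see.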
VARIATIONS IN SPEECH
The speech-recognition
process is complicated because the production of phonemes and the transitions
between them vary from person to person, and even for the same person. Different people
speak differently. Accents, regional dialects, sex, age, speech impediments,
emotional state, and other factors cause people to pronounce the same word in different
ways. Phonemes are added, omitted, or substituted. For example, the word America
is pronounced differently in parts of New England.
The rate of speech also varies from person to person, depending upon a person’s habits
and regional background.
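
One way to put a number on "phonemes are added, omitted, or substituted" is an edit distance between two phoneme sequences. The sketch below uses the standard Levenshtein distance, which is an illustrative choice and not a technique named in this text. The phoneme spellings are informal placeholders invented for the example, not an official phonetic alphabet.

```python
def phoneme_edit_distance(reference, spoken):
    """Minimum number of added, omitted, or substituted phonemes."""
    # previous[j] = distance between the reference prefix seen so far
    # and the first j phonemes of the spoken sequence.
    previous = list(range(len(spoken) + 1))
    for i, ref_phoneme in enumerate(reference, start=1):
        current = [i]
        for j, spoken_phoneme in enumerate(spoken, start=1):
            substitution_cost = 0 if ref_phoneme == spoken_phoneme else 1
            current.append(min(
                previous[j] + 1,                      # phoneme omitted
                current[j - 1] + 1,                   # phoneme added
                previous[j - 1] + substitution_cost,  # phoneme substituted
            ))
        previous = current
    return previous[-1]

# Hypothetical phoneme spellings for two ways of saying "tomato".
tomato_a = ["t", "ah", "m", "ey", "t", "ow"]
tomato_b = ["t", "ah", "m", "aa", "t", "ow"]
print(phoneme_edit_distance(tomato_a, tomato_b))  # one substitution

# Hypothetical careful and casual versions of "probably".
careful = ["p", "r", "aa", "b", "ax", "b", "l", "iy"]
casual = ["p", "r", "aa", "b", "l", "iy"]
print(phoneme_edit_distance(careful, casual))  # two omissions
```

A small distance means the two pronunciations are close, so a recognizer could reasonably treat them as the same word.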
A word or a
phrase spoken by the same individual differs from moment to moment. Illness, tiredness,
stress, or other conditions cause subtle variations in the way a word is spoken at
different times. Also, the voice quality varies in accordance with the position
of the person relative to the microphone, the acoustic nature of the surroundings,
or the quality of the recording devices.
The resulting changes in the waveform can drastically affect the performance of
the recognizer.