Developing Speech Recognition Systems

This document provides points to consider and a step-by-step walk-through for developing Speech Recognition (SR) environments using TeleFlow. The following resources and materials will be required for development of a speech system:

  • TeleFlow Designer & Server – Development Environment
  • TeleFlow Server – Production Environment
  • Nuance / Lumenvox (or other) SR processing environment on both Development & Production
  • TeleFlow SpeechTrainer
  • Post-It notes

Speech Recognition systems are very difficult to create and require careful planning. The following points will help you to plan development and train the application to function correctly over time.

A brand-new SR system may require significant adjustment to the grammars and voice work in order to function effectively. Training to account for the different ways in which people speak and communicate will need to be taken into consideration. No matter how carefully a system is planned, people will always interface with it in different ways than anticipated. The user experience can then be gauged using TeleFlow SpeechTrainer. Feedback viewed and listened to through SpeechTrainer can help finalize the grammars and provide insight into how the application can be modified to help ensure a good experience for most callers.


Planning Development

The following notes provide some thoughts that can be used to determine how the planning and development process should take place. Please note that SR development is no trivial matter, and that the process will require input from many people over several weeks or months.

The Right People

Make sure the right people are involved. These include:

  • Call center operators (who listen to what callers say)
  • Marketing people (who know what the jargon should be, etc.)
  • Managers
  • Technical people

Education about Speech Recognition

Be sure to educate the speech team on the methods that will be employed to build the system. Determine the plan ahead of time, and share it with all involved.

  • Provide examples of what is reasonable for use.
  • Explain that SR is not “Star Trek”.
  • Intelligence is not involved. Anticipation and caller guidance are the key.
  • SR is not Artificial Intelligence. This is a common misconception.
  • Entry of information

If you intend to use speech recognition to gather words (or codes) that callers spell out letter by letter, please take a moment to consider the approaches described in the Speech Recognition and the Alphabet section.

State Requirements

Get all user requirements clearly stated before creating the design.

Careful with Prompts

Be careful with prompts that use the word "or":

  • “or” tends to be answered with a Yes or No, and can be confusing.
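
A prompt script can be screened for this wording before any recording is done. The following is a minimal sketch in Python, assuming the prompts are available as plain text. The function name `prompts_containing_or` and the sample prompts are hypothetical and are not part of TeleFlow.

```python
import re

# Matches "or" only as a whole word, so "floor" and "order" are not flagged.
OR_WORD = re.compile(r"\bor\b", re.IGNORECASE)


def prompts_containing_or(prompts):
    """Return the prompts that contain the standalone word "or"."""
    return [text for text in prompts if OR_WORD.search(text)]


# Hypothetical prompt script, used only for illustration.
sample_prompts = [
    "Would you like fries or salad?",
    "What's your phone number?",
    "Which floor is your order going to?",
]

for text in prompts_containing_or(sample_prompts):
    print("Review wording:", text)
```

A flagged prompt is not necessarily wrong; the list is only a reminder to review how callers are likely to answer it.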

Grammar Alignment

Prompts and Grammars must be aligned:

  • Do not change a prompt partway through development, or the grammars will need to be updated
  • Make certain all prompts have been defined early on

Usability Testing

Do not skip iterative usability testing. Make certain to test various elements of the design independently.

  • Menus
  • Error handling
  • Processes

Usability testing is used to ensure the speech system works with the caller's mental process.

Testing Tasks

Do not shorten testing tasks.

  • Make sure testing tasks are not done in parallel.

Test on Production

Test on Production and not Development.

  • A system should be in production for users to test against
  • Save points should be moved from Development to Production for testing
  • Testing on Development means testing against a moving target

Instructions for Testing

Provide testers with proper instructions for their calls.

  • A specific document with numbers to call, and test accounts etc

Evaluate Usability

Perform evaluative usability testing before the system is ready.

  • Evaluate as the system is being developed

Production Walk Through

The following meetings and steps need to be taken from start to completion.

  • Requirements Analysis Meeting
  • Gather user requirements
  • Ensure the correct people are in the room or on the conference call
  • Use sticky notes on a white board (or an online equivalent)

Define Business Requirements

  • What needs to be achieved
  • What data elements need to be gathered

Identify Application Requirements

  • A functions list should be created by the end of meeting
  • Ensure front-line staff have equal input, or important information will be lost

Design Documentation

Make sure the documentation is created in a format that can be readily shared, reviewed and updated.

Persona and Audio Design

  • How do you want the system to sound
  • Consider tone and use of language
  • Decide on voice talent early

Dialog Design

  • Create scripts around functions list
  • Create questions list
  • Gather list of Grammars that will be required.
  • Run questions by a test group, and gather additional words (answers)

Application Design

  • Database Dictionary describes Data Elements

Provide Final Design Document

  • Scripts
  • Grammars
  • Functions
  • Interface to database
  • Other programs (web / configuration / reports etc)

Client Sign off

Development and Implementation

Persona and audio production

Grammar development

Application development


Application Tests

Recognition and Traversal Tests

Evaluative and Usability Tests

Tuning and Monitoring

Pilot tuning

Post deployment tuning

Post roll-out monitoring

Once the last step is complete, the system will be ready for general use.

Avoid Common Errors

Development of speech recognition systems is a challenge to start with. The following list provides an indication of common errors, and can help you identify frustrations that may impede your development timeline.

General Approach

  • Make sure you have people who know how to write speech systems.
    • This can save time over trial and error development.
  • Would you like “Fries or Salad”?
    • User will say “Yes”
    • Should be: Which would you like... Fries (pause for answer), or Salad?
    • Better yet, stay away from “or”
  • Avoid requests like: “Please say your phone number”
    • Should be: What's your phone number?
    • Consider allowing touch tone entry for prompts like this
  • Choose very different words to ensure best results from the recognizer
  • Keep grammars short, and expand them only when pilot testing shows which other words users speak
  • Use pilot testing to gather words that users speak
  • Users will often speak a different set of vocabulary than expected
  • Use the SpeechTrainer to review the words callers speak.
  • Make sure you design for the callers, and not for the sake of speech.
  • Do testing and development on entirely separate systems
  • Be sure to test on the production system, and not the demo system
  • Develop a test matrix to cover all aspects of testing
  • Be sure to balance testing across the entire system
  • A common error is to test certain sections, and assume the rest will work
    • For this reason, it is a good idea to test each functional section separately
  • Create a document and check off when tested
  • Make sure you have detailed instructions on how callers (participants) should test
    • Test information such as account numbers should be apparent
    • Consider that callers testing have different intentions than your client callers
  • Check system to make sure it covers all original user requirements
  • Loud locations will sometimes trip SR.
    • Test the system from a variety of locations


Cell phones are the scourge of speech recognition. As more and more people use cell phones, often with poor reception, consider the following:

  • Cell phones (and some VoIP services) are not as clear as land lines.
  • Test with a variety of telephone handsets

Other Points

  • Provide routes for entering certain information by touch tone where readily possible
  • Avoid pretending to be human, but keep the system sounding light, by using bridging phrases: "Okay, I heard you say [option]. Is this correct?"
  • The caller should never hear the words “I’m sorry…”
  • The caller should not hear the same “We did not receive a valid response” error
    • Try cycling through five or six messages to keep the annoyance factor low
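
The cycling of error messages described above can be sketched as follows. This is a minimal illustration in Python, not TeleFlow functionality: the class name `RepromptCycler` and the message wordings are assumptions, and in a real system each message would correspond to a recorded prompt.

```python
# Hypothetical reprompt wordings. None of them uses the words "I'm sorry".
REPROMPTS = [
    "I didn't catch that. Please say it one more time.",
    "Let's try that again.",
    "I missed that. What was that again?",
    "One more time, please.",
    "Could you repeat that for me?",
]


class RepromptCycler:
    """Hands out error messages in rotation so that a caller does not
    hear the same wording twice in a row."""

    def __init__(self, messages):
        if not messages:
            raise ValueError("at least one message is required")
        self._messages = list(messages)
        self._index = 0

    def next_message(self):
        message = self._messages[self._index]
        # Wrap around to the first message after the last one is used.
        self._index = (self._index + 1) % len(self._messages)
        return message


cycler = RepromptCycler(REPROMPTS)
print(cycler.next_message())
print(cycler.next_message())
```

One cycler per call keeps the rotation starting from the first message for every caller.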


  • Make sure the callers are interviewed at the end of the calls to ensure they like it
  • Make sure a reasonable amount of time is set apart for the testing
  • Be certain to review each day’s results and tune (SpeechTrainer) each day
  • As your callers train the system, the system will train them over time.
  • Seasoned users learn to speak in a way the system understands
  • Before roll out, ensure a new set of users is able to adopt the system
    • Provide any user documentation in final form, if applicable.

Problems and Inconsistencies

Numbers with a specific number of digits are much easier to recognize than variable-length ones. When we know we are waiting for seven digits (local telephone numbers, for example), the caller can speak the numbers at any pace they want to, and the system will simply wait until it hears all seven... within reason.

Variable length numbers are exceptionally difficult to account for, because timing is everything. If waits between numbers are too extended, then numbers can get cut off when the system thinks the caller may be finished speaking.
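
The cut-off problem can be illustrated with a small simulation. The sketch below is in Python and assumes each recognized digit comes with start and end times in seconds; the function name `collect_digits` and the `end_silence` parameter are hypothetical and do not correspond to any TeleFlow, Nuance or Lumenvox setting.

```python
def collect_digits(timed_digits, end_silence):
    """Return the digits accepted before the system decides the caller
    has finished speaking.

    timed_digits: list of (digit, start_seconds, end_seconds), in time order.
    end_silence:  longest pause (seconds) tolerated between two digits.
    """
    accepted = []
    last_end = None
    for digit, start, end in timed_digits:
        if last_end is not None and start - last_end > end_silence:
            # The pause was too long: the system has already stopped
            # listening, so this digit and all later ones are lost.
            break
        accepted.append(digit)
        last_end = end
    return "".join(accepted)


# A caller says "4", "2", hesitates for 1.5 seconds, then says "7".
spoken = [("4", 0.0, 0.4), ("2", 0.6, 1.0), ("7", 2.5, 2.9)]

print(collect_digits(spoken, end_silence=1.0))  # the "7" is cut off
print(collect_digits(spoken, end_silence=2.0))  # all three digits are kept
```

A longer tolerated pause avoids the cut-off but makes every caller wait longer after the last digit, which is why fixed-length numbers are easier to handle.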

Both variable and fixed-length numbers should be “chunked” where possible to guide the caller on how to speak. North American telephone numbers are generally written as 604-555-1234, and callers will generally speak them as “six”, “oh”, “four”, pause, “five”, “five”, “five”, pause, “one”, “two”, “three”, “four”. The system can be programmed to expect these chunks, which helps with number utterances when callers leave very long pauses between numbers.
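
As an illustration of chunking, the following Python sketch splits a digit string into groups and produces the way a caller is likely to speak it. The helper names `chunk_digits` and `spoken_form` are hypothetical, and saying zero as “oh” follows the telephone number example above.

```python
DIGIT_WORDS = {
    "0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def chunk_digits(digits, sizes):
    """Split a digit string into consecutive chunks of the given sizes."""
    if not digits.isdigit() or sum(sizes) != len(digits):
        raise ValueError("digits must match the total chunk length")
    chunks = []
    position = 0
    for size in sizes:
        chunks.append(digits[position:position + size])
        position += size
    return chunks


def spoken_form(digits, sizes):
    """Render the chunks as words, with a comma marking each pause."""
    return ", ".join(
        " ".join(DIGIT_WORDS[d] for d in chunk)
        for chunk in chunk_digits(digits, sizes)
    )


print(chunk_digits("6045551234", (3, 3, 4)))  # ten-digit number
print(spoken_form("6045551234", (3, 3, 4)))
print(chunk_digits("5551234", (3, 4)))        # seven-digit local number
```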

In cases where numbers are variable length, see if it is possible to manufacture a fixed-length number, or account for numbers by chunking them. For example, “1” through “99999” could become “100-001” through “199-999”. If at all possible, try to create numbers that are chunked in a familiar manner. Seven- and ten-digit numbers are commonly spoken in North America, and accordingly other numbers of the same length tend to be spoken in the same chunks.
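
The mapping in this example amounts to adding 100000 to the original number and writing the result in two chunks of three digits. A minimal Python sketch, with the hypothetical function names `to_fixed` and `from_fixed`:

```python
OFFSET = 100_000  # turns 1..99999 into the six-digit range 100001..199999


def to_fixed(number):
    """Convert a variable-length number (1 to 99999) into a fixed-length,
    chunked code such as "100-001"."""
    if not 1 <= number <= 99_999:
        raise ValueError("number must be between 1 and 99999")
    digits = str(OFFSET + number)
    return f"{digits[:3]}-{digits[3:]}"


def from_fixed(code):
    """Convert a code such as "100-042" back into the original number."""
    digits = code.replace("-", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError("code must contain exactly six digits")
    number = int(digits) - OFFSET
    if not 1 <= number <= 99_999:
        raise ValueError("code is outside the expected range")
    return number


print(to_fixed(1))
print(to_fixed(99999))
print(from_fixed("100-042"))
```

Because every code is six digits long, the recognizer can wait for exactly six digits instead of guessing when the caller has finished.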

Cell phone usage is prevalent today, and requires some additional handling in loud areas. Where possible, provide an opt-out when entering numbers so that Touch Tones can be used. If a grammar is worded correctly

Caller frustration (expletives) is something that you may want to account for in grammars. However, just as some callers realized that pressing “zero, zero, zero” repeatedly in touch tone systems brought them to an operator, the same applies to the spoken voice: some callers may repeat expletives as a way to bypass automation.
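
One possible way to account for this is to count expletive recognitions during a call and route the caller to an operator once a limit is reached. The Python sketch below assumes the grammar returns a tag for each recognition; the tag name `EXPLETIVE`, the class `ExpletiveCounter` and the default limit are all hypothetical, and whether a transfer is the right response is a design decision for each system.

```python
EXPLETIVE_TAG = "EXPLETIVE"  # hypothetical tag returned by the grammar


class ExpletiveCounter:
    """Counts expletive recognitions within a single call."""

    def __init__(self, transfer_after=3):
        if transfer_after < 1:
            raise ValueError("transfer_after must be at least 1")
        self.transfer_after = transfer_after
        self.count = 0

    def record(self, grammar_tag):
        """Record one recognition result and return True once the call
        should be routed to an operator."""
        if grammar_tag == EXPLETIVE_TAG:
            self.count += 1
        return self.count >= self.transfer_after


counter = ExpletiveCounter(transfer_after=2)
for tag in ["BALANCE", "EXPLETIVE", "EXPLETIVE"]:
    if counter.record(tag):
        print("Transfer to operator")
```

A low limit gets frustrated callers to help quickly, but it also makes the bypass easier for callers who simply want to skip the automation.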

Expectations of a Speech Enabled System

Reasonable expectations from an SR system include:

  • Low cognitive load for the caller (Intuitive design helps adoption)
  • Efficiency (No long menus)
  • Graceful error recovery (Sounds humanistic, without pretending to be human)
  • Accuracy
  • Clarity

Unreasonable expectations include:

  • Perfect recognition of spoken numbers and commands
  • An ideal response to every caller’s voice request

IMPORTANT: A common misconception about Speech Recognition is that it will understand what is being spoken. In fact, sounds are matched against a dictionary, which then produces a word. All functionality must be specifically developed by programmers to account for expected situations. A friendly voice can help to guide the caller through.

Delivery of a Speech Enabled System


The choice of voice actor or actress must be well thought out, and must match the general concept of the voice system. You should also expect to have several recording sessions, as the voice can have a strong impact on acceptance of the system by testers and, ultimately, the client.

When the voice actor or actress records, ensure the tempo is at a reasonable pace. This is important as slow recordings may prompt the caller to speak slower for SR. The pace of the recordings can act as a guide for the caller.


From the start, it should be explained that SR is not “Star Trek”, and that the expectations should be in line with the investment of development.


Try not to educate testers too much ahead of time. Provide only the basic information that client callers would have (or might not have), as the case may be.

As your testers use the system, they will learn to use it more effectively as time goes by. Effectively, they are trained by the system over time. Unless this is the purpose of the system, a separate set of test users should be kept aside for late testing, and for interview on the voice work etc. Systems that are intended to be usable by first time callers will need testers who are entirely unaware of the capabilities of the system.


Very soon after the system is tested, a conference call should be arranged and/or a questionnaire filled out. Be careful to keep the conversation to questions only, and listen to feedback. A conference call may be ideal, as testers may help each other remember aspects of the system they had difficulty with. If the testers are employees of the company accepting the system, consider keeping the conference call to just the testers and not their managers.

Ongoing System Review

The following systems are highly beneficial for gauging the user experience with speech recognition.

Call Recording

TeleFlow CallCapture can be a great way to gauge the user experience. Call recording systems record the entire call so that you can hear exactly what the caller heard and said. Leaving the system in place can be a very good way to spot check over time. Please note that if you employ a call recording system to review calls, a message stating “This call may be monitored for quality assurance” may be required by law.

TeleFlow SpeechTrainer

Please refer to TeleFlow SpeechTrainer for information on how to review recordings and make adjustments to the speech grammars.