Using HTML5 Speech Recognition and Text to Speech


Stephen Walther, January 5th, 2015

Wouldn’t it be great if you could interact with websites just like Siri on your iPhone? In other words, you could ask web pages questions out loud and get answers spoken back to you?

Imagine, for example, that you are creating a children’s game. If the child cannot type or read then the most natural way for the child to interact with the game is through speech.

The Web Speech API Specification, a W3C Community Group spec that is usually grouped with HTML5, covers both Speech Recognition and Text to Speech. You can find the spec here:

https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

Currently, browser support for the specification is spotty and buggy (I hope this changes – I write this on Jan 5, 2015). The only browsers that support the speech recognition standard are Google Chrome and Apple Safari. If you want to use Microsoft IE or Mozilla Firefox then you are out of luck.
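Because support is so uneven, it is worth checking for the API before you use it. Here is a minimal sketch of such a check. The detectSpeechSupport() helper is my own name, not part of the spec, and it takes the global object as a parameter only so that it can be tried outside of a browser. I also check for the unprefixed SpeechRecognition name from the spec in case a browser exposes it.

```javascript
// Report which parts of the Web Speech API a global object exposes.
// `root` is normally `window`; detectSpeechSupport is a hypothetical helper.
function detectSpeechSupport(root) {
    root = root || {};
    return {
        // speech synthesis needs both the controller and the utterance type
        synthesis: typeof root.speechSynthesis !== 'undefined' &&
            typeof root.SpeechSynthesisUtterance === 'function',
        // recognition may be exposed with or without the webkit prefix
        recognition: typeof root.SpeechRecognition === 'function' ||
            typeof root.webkitSpeechRecognition === 'function'
    };
}

// Usage in a page: report when recognition is missing.
if (typeof window !== 'undefined') {
    var support = detectSpeechSupport(window);
    if (!support.recognition) {
        console.log('Speech recognition is not available in this browser.');
    }
}
```

A check like this only tells you that the objects exist. It cannot tell you whether the implementation behaves reliably.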

Also, I need to warn you that the implementation of the specification on both Google Chrome and Apple Safari is still buggy. Sometimes, Speech API events are never raised and your app comes to a stop. Frustrating, but keep in mind that this is a very new technology.

Furthermore, right now, Speech Recognition is not very usable when you are not using SSL. If you are not using SSL then you are asked repeatedly to give permission for an app to use Speech Recognition. This gets very irritating very fast.

[Screenshot: Chrome's microphone permissions banner]

So the Web Speech API is not yet stable enough for production apps. However, the potential for the standard is so great that I couldn’t help trying out the standard when writing a simple game.

In this blog post, I explain how you can create a Math Quiz game. The math questions (What is 8 + 2?) are spoken aloud. You answer the math questions by voice using speech recognition.

Before I show you how to create the math game, however, I want to go over the fundamentals of the Speech API.

Using HTML5 Speech Synthesis

You can use the following code to read the message “Jon likes Iced Tea!” out loud:

speak('Jon likes Iced Tea!');

// say a message
function speak(text, callback) {
    var u = new SpeechSynthesisUtterance();
    u.text = text;
    u.lang = 'en-US';

    // called when the utterance has finished being spoken
    u.onend = function () {
        if (callback) {
            callback();
        }
    };

    // called if the utterance cannot be spoken; the error event is passed on
    u.onerror = function (e) {
        if (callback) {
            callback(e);
        }
    };

    speechSynthesis.speak(u);
}

The speak() function creates an instance of the SpeechSynthesisUtterance object which represents the text that you want to read out loud. You can specify a number of characteristics of the utterance such as the pitch, rate, volume, and voice.
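To make the pitch, rate, and volume settings concrete, here is a small sketch that applies them to an utterance. The helper names (clamp, normalizeSettings, speakWith) are hypothetical. The ranges are the ones I understand the spec to define: volume from 0 to 1, rate from 0.1 to 10, and pitch from 0 to 2, each defaulting to 1.

```javascript
// Clamp a number into [min, max]; use `fallback` when the value is
// missing or not a number.
function clamp(value, min, max, fallback) {
    if (typeof value !== 'number' || isNaN(value)) {
        return fallback;
    }
    return Math.min(max, Math.max(min, value));
}

// Normalize optional settings to the ranges described in the spec
// (assumed here: volume 0 to 1, rate 0.1 to 10, pitch 0 to 2).
function normalizeSettings(settings) {
    settings = settings || {};
    return {
        volume: clamp(settings.volume, 0, 1, 1),
        rate: clamp(settings.rate, 0.1, 10, 1),
        pitch: clamp(settings.pitch, 0, 2, 1)
    };
}

// Speak text with the given settings (browser only).
function speakWith(text, settings) {
    var s = normalizeSettings(settings);
    var u = new SpeechSynthesisUtterance(text);
    u.volume = s.volume;
    u.rate = s.rate;
    u.pitch = s.pitch;
    speechSynthesis.speak(u);
}
```

In a browser you could then call speakWith('Jon likes Iced Tea!', { rate: 0.8, pitch: 1.5 }) to get a slower, higher-pitched reading. How much each setting changes the sound depends on the voice.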

In the code above, two event handlers are used. The onend event handler is invoked after the utterance is spoken. The onerror event handler is invoked if anything goes wrong.

The speak() function accepts an optional callback. The callback is called without arguments from the onend handler and with the error event from the onerror handler. That way, you can execute additional code after the computer finishes speaking.

Finally, the speak() function calls the speechSynthesis.speak() method to actually voice the utterance. That’s all there is to it.

Using Different Voices

You can use different voices when using speech synthesis. The available voices depend on your browser and operating system.

For example, Google Chrome on Mac OS X supports 74 different voices including voices with names such as Alice, Google UK English Female, Deranged, Junior, Bubbles, and Princess.

On the other hand, Google Chrome on Windows 8 only supports 11 voices and only one of these voices is intended for United States English.

You can use the following code to get a list of all of the supported voices and use the Deranged voice when uttering the sentence “Jon likes Iced Tea.”:

speechSynthesis.onvoiceschanged = function () {
    // get the voice
    var voices = speechSynthesis.getVoices();
    var derangedVoice = voices.filter(function (voice) {
        return voice.name === 'Deranged';
    })[0];

    // create the utterance
    var u = new SpeechSynthesisUtterance();
    if (derangedVoice) {
        // only set the voice if this browser and OS provide it
        u.voice = derangedVoice;
    }
    u.text = 'Jon likes Iced Tea!';

    // utter the utterance
    speechSynthesis.speak(u);
};

The voices are retrieved by using the speechSynthesis.getVoices() method. This method returns an array of voices that looks like this:

[Screenshot: the array of voices returned by speechSynthesis.getVoices()]

Notice that the speechSynthesis.getVoices() method is called within the speechSynthesis.onvoiceschanged event handler. This is necessary because the voices are retrieved asynchronously. If you attempt to get the voices before the event is raised then, in Chrome, you will typically get an empty array because the list has not been loaded yet.
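If you want code that works whether or not the voice list has already been loaded, you can check first and only wait for the event when the list is empty. The following is a sketch under that assumption. The loadVoices() helper is my own name, and it takes the speechSynthesis object as a parameter so that it can be tried with a stand-in object.

```javascript
// Call `callback` with the voice list as soon as it is available.
// `synth` is normally `window.speechSynthesis`; loadVoices is a
// hypothetical helper, not part of the API.
function loadVoices(synth, callback) {
    var voices = synth.getVoices();
    if (voices.length > 0) {
        // the list is already loaded, so there is nothing to wait for
        callback(voices);
        return;
    }
    synth.onvoiceschanged = function () {
        synth.onvoiceschanged = null; // report the list only once
        callback(synth.getVoices());
    };
}

// Usage in a page: list every voice with its language.
if (typeof window !== 'undefined' && window.speechSynthesis) {
    loadVoices(window.speechSynthesis, function (voices) {
        voices.forEach(function (voice) {
            console.log(voice.name + ' (' + voice.lang + ')');
        });
    });
}
```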

After you get the voices, you can grab the voice that you want to use. In the code above, I select the Deranged voice.

The Deranged voice is assigned to the utterance and then the speechSynthesis.speak() method is used to speak the utterance.
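Because the available voices differ between browsers and operating systems, a named voice such as Deranged may not exist on a given machine. Here is a sketch of one way to handle that: look for the voice by name and fall back to any voice with a matching language. The pickVoice() helper is hypothetical.

```javascript
// Return the voice with the given name, or else the first voice for
// the given language, or null when neither exists.
function pickVoice(voices, name, lang) {
    var i;
    for (i = 0; i < voices.length; i++) {
        if (voices[i].name === name) {
            return voices[i];
        }
    }
    for (i = 0; i < voices.length; i++) {
        if (voices[i].lang === lang) {
            return voices[i];
        }
    }
    return null;
}
```

With a voices array from speechSynthesis.getVoices(), you could write var voice = pickVoice(voices, 'Deranged', 'en-US'); and only assign u.voice when the result is not null. In that case the browser uses its default voice.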

Using HTML5 Speech Recognition

You can use the webkitSpeechRecognition object to perform speech recognition. This object is only supported by Google Chrome and Apple Safari.

If you are not using SSL then each and every time you use the webkitSpeechRecognition object, a permissions banner appears at the top of Google Chrome.

[Screenshot: Chrome's microphone permissions banner]

If you don’t want this banner to appear each and every time you use the object then you need to use an SSL certificate. If you use an SSL certificate then a user only needs to grant permissions once – even if the user returns multiple times to the website.

The following code illustrates how you can use the webkitSpeechRecognition object to ask a user for their favorite color:

ask('What is your favorite color?', function (err, result) {
    if (result && result.transcript === 'blue') {
        speak('Right!');
    } else {
        speak('Wrong!');
    }
});

// ask a question and get an answer
function ask(text, callback) {
    // ask question
    speak(text, function () {
        // get answer
        var recognition = new webkitSpeechRecognition();
        recognition.continuous = false;
        recognition.interimResults = false;

        recognition.onend = function (e) {
            if (callback) {
                callback('no results');
            }
        };

        recognition.onresult = function (e) {
            // cancel onend handler
            recognition.onend = null;
            if (callback) {
                callback(null, {
                    transcript: e.results[0][0].transcript,
                    confidence: e.results[0][0].confidence
                });
            }
        };

        // start listening
        recognition.start();
    });
}

The ask() function first calls the speak() function to ask the question out loud (I discussed the speak() function earlier in this blog post).

Next, after the question is asked, an instance of the webkitSpeechRecognition object is created. Two event handlers are associated with the webkitSpeechRecognition object.

First, the onend handler is called whenever speech recognition ends. There are three reasons that the onend handler might be called:

(1) After an error
(2) After a timeout
(3) After a recognition result is successfully recorded

If you don’t say anything then Chrome times out after about 10 seconds. In that case, the onend handler is called, and it calls any callback passed to the ask() function with the error 'no results'.

If you do say something then the onresult handler is invoked. This handler first disables the onend() handler so the callback is not called twice. Next, the result is retrieved from the event object and passed to the callback.
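Another way to make sure the callback runs only once is to wrap it, instead of clearing the onend handler by hand. This is a sketch of that alternative, not the approach used in the code above. The once() and listen() helpers are my own names, and listen() takes the recognition object as a parameter so that it can be tried with a stand-in.

```javascript
// Wrap a function so that only the first call goes through.
function once(fn) {
    var called = false;
    return function () {
        if (called) {
            return;
        }
        called = true;
        return fn.apply(this, arguments);
    };
}

// Wire up a recognition object so that `callback` is called exactly
// once, whichever of onresult, onerror, or onend happens first.
function listen(recognition, callback) {
    var done = once(callback);

    recognition.onresult = function (e) {
        done(null, {
            transcript: e.results[0][0].transcript,
            confidence: e.results[0][0].confidence
        });
    };

    recognition.onerror = function (e) {
        // the error event carries an error code such as 'no-speech'
        done(e.error || 'error');
    };

    recognition.onend = function () {
        done('no results');
    };

    recognition.start();
}
```

In a browser you would call listen(new webkitSpeechRecognition(), callback). The order of the handlers no longer matters because later calls to done() are ignored.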

You get two pieces of information from the recognition result: the transcript and the confidence. The transcript contains the recognized response. The confidence is a number between 0 and 1 that indicates how confident the recognizer is about the transcript. For example, if the confidence is less than 0.5 then you might want to ignore the response.
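Here is a sketch of how a confidence threshold might be applied. A recognition result can hold more than one alternative (the spec has a maxAlternatives property for requesting several), so this hypothetical bestAlternative() helper picks the most confident alternative and returns null when none reaches the threshold.

```javascript
// Return the alternative with the highest confidence from one result,
// or null if no alternative reaches `minConfidence`.
// `result` is indexable and has a length, like e.results[0].
function bestAlternative(result, minConfidence) {
    var best = null;
    for (var i = 0; i < result.length; i++) {
        var alt = result[i];
        if (alt.confidence >= minConfidence &&
                (best === null || alt.confidence > best.confidence)) {
            best = alt;
        }
    }
    return best;
}
```

Inside an onresult handler you could write var alt = bestAlternative(e.results[0], 0.5); and treat a null result the same way as no answer. The 0.5 threshold is only a starting point that you would tune by trying the app.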

Using Continuous Speech Recognition

When using the webkitSpeechRecognition object, you have the option of taking advantage of continuous speech recognition. In other words, instead of stopping after the first result, the webkitSpeechRecognition object keeps listening and keeps returning results for as long as you keep speaking.

For example, the following code enables you to dictate anything that you say into a textarea:

<textarea id="results" cols="80" rows="5"></textarea>

<script>
    var recognition = new webkitSpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;

    recognition.onresult = function (e) {
        var textarea = document.getElementById('results');
        for (var i = e.resultIndex; i < e.results.length; ++i) {
            if (e.results[i].isFinal) {
                textarea.value += e.results[i][0].transcript;
            }
        }
    };

    // start listening
    recognition.start();
</script>

Notice that the webkitSpeechRecognition continuous and interimResults properties are both set to true. The onresult event handler is used to continuously update the textarea as you speak. It appends each final result to the textarea and skips the interim results.

[Screenshot: dictated text in the textarea]

If you want to stop the voice recognition then you can take advantage of the webkitSpeechRecognition.stop() method.
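If you also want to show the interim results while the user is still speaking, you can separate them from the final results in the onresult handler. The following is a sketch of that idea. The collectTranscripts() and attachDictation() helpers are my own names, and they rely only on the resultIndex, results, and isFinal members of the event.

```javascript
// Split the new results of an onresult event into final text and
// interim (still changing) text.
function collectTranscripts(e) {
    var out = { finalText: '', interimText: '' };
    for (var i = e.resultIndex; i < e.results.length; ++i) {
        var transcript = e.results[i][0].transcript;
        if (e.results[i].isFinal) {
            out.finalText += transcript;
        } else {
            out.interimText += transcript;
        }
    }
    return out;
}

// Show committed text followed by the current interim guess.
// `textarea` is any object with a value property.
function attachDictation(recognition, textarea) {
    var committed = '';
    recognition.onresult = function (e) {
        var t = collectTranscripts(e);
        committed += t.finalText;
        textarea.value = committed + t.interimText;
    };
}
```

In a page you could call attachDictation(recognition, document.getElementById('results')) before recognition.start(), and call recognition.stop() from a button's click handler to end the dictation.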

Building a Math Quiz Game

I’ll show you how you can bring together everything that I discussed in this blog post, both HTML5 speech synthesis and speech recognition, to build a simple math quiz game. This game is intended to be used by children to practice addition.

In the game, you are asked a simple addition problem and you must respond with the right answer. Everything is done by voice so neither reading nor typing is required.

[Screenshot: the Math Quiz game]

Let me start with the HTML:

<html>
<head>
    <title>Math Quiz</title>
    <link href="mathquiz.css" rel="stylesheet" />
</head>
<body>

    <output id="result"></output>
    <a href="mathQuiz.html">Ask Question</a>

    <script src="mathQuiz.js"></script>
</body>
</html>

The HTML page uses an <output> element to display a text message for the result. This result is also spoken out loud.

All of the interesting work happens in the JavaScript:

// startup code
var numberA = getRandomNumber();
var numberB = getRandomNumber();
var text = 'What is ' + numberA + ' + ' + numberB + '?';
var response;

// ask the problem
ask(text, function (err, result) {
    if (err) {
        document.getElementById('result').innerHTML = 'No Answer.';
    } else {
        // pass a radix so the transcript is always parsed as decimal
        var answer = parseInt(result.transcript, 10);
        if (answer === numberA + numberB) {
            response = 'Right! ' + numberA + ' + ' + numberB + ' is ' + answer + '.';
            speak(response);
            document.getElementById('result').innerHTML = response;
        } else {
            response = 'Wrong! ' + numberA + ' + ' + numberB + ' is not ' + answer + '.';
            speak(response);
            document.getElementById('result').innerHTML = response;
        }
    }
});

// get random number between 1 - 10
function getRandomNumber() {
    return Math.floor((Math.random() * 10) + 1);
}

// ask a question and get an answer
function ask(text, callback) {
    // ask question
    speak(text, function () {
        // get answer
        var recognition = new webkitSpeechRecognition();
        recognition.continuous = false;
        recognition.interimResults = false;

        recognition.onend = function (e) {
            if (callback) {
                callback('no results');
            }
        };

        recognition.onresult = function (e) {
            // cancel onend handler
            recognition.onend = null;
            if (callback) {
                callback(null, {
                    transcript: e.results[0][0].transcript,
                    confidence: e.results[0][0].confidence
                });
            }
        };

        // start listening
        recognition.start();
    });
}

// say a message
function speak(text, callback) {
    var u = new SpeechSynthesisUtterance();
    u.text = text;
    u.lang = 'en-US';

    u.onend = function () {
        if (callback) {
            callback();
        }
    };

    u.onerror = function (e) {
        if (callback) {
            callback(e);
        }
    };

    speechSynthesis.speak(u);
}

The JavaScript code above creates a math question by randomly generating two numbers between 1 and 10. The math question is passed to the ask() function which says the question out loud and waits for a response.

When the response is returned, the response is compared against the expected solution to the math question. If the right answer is provided then the app says “Right!”. Otherwise, the app says “Wrong!”.
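One thing to watch for when comparing answers: parseInt() only understands digits, so a transcript such as "ten" would never match. I have not verified how consistently Chrome transcribes spoken numbers as digits, so here is a defensive sketch that accepts both forms. The parseSpokenNumber() helper and the NUMBER_WORDS list are hypothetical, and the list only covers 0 through 20 because that is the range of sums in this game.

```javascript
// English words for 0 through 20; the array index is the value.
var NUMBER_WORDS = [
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
    'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
    'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty'
];

// Turn a transcript such as "10", "ten", or "it is 10" into a number.
// Returns null when no number can be found.
function parseSpokenNumber(transcript) {
    var text = String(transcript).trim().toLowerCase();

    // prefer digits when the recognizer returned them
    var digits = text.match(/\d+/);
    if (digits) {
        return parseInt(digits[0], 10);
    }

    // otherwise look for the first known number word
    var words = text.split(/[^a-z]+/);
    for (var i = 0; i < words.length; i++) {
        var index = NUMBER_WORDS.indexOf(words[i]);
        if (index !== -1) {
            return index;
        }
    }
    return null;
}
```

You could then compare parseSpokenNumber(result.transcript) against the expected sum, and treat null as an answer that was not understood.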

If you don’t host the HTML page on a website with SSL enabled then you will be prompted with the permissions dialog each and every time you are asked a math question. The only way around this irritating interaction is to use SSL.

Even more unfortunately, in my experience, sometimes the onend event handler in the speak() method is never invoked. That means that the callback passed to the speak() method is never called and the speech recognition never starts.

I hope this issue is fixed in the near future:

http://stackoverflow.com/questions/23483990/speechsynthesis-api-onend-callback-not-working
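Until then, one possible way to keep the app from stalling is a watchdog timer that calls the callback if the onend event never arrives. This is only a sketch of that idea and I have not tested how well it holds up in practice. The speakWithWatchdog() name, the env parameter, and the 10 second default are all my own assumptions. The env parameter exists only so that the function can be tried with stand-in objects.

```javascript
// Speak text and call `callback` exactly once: on end, on error, or
// after a timeout if the browser never raises either event.
function speakWithWatchdog(text, callback, env) {
    env = env || {};
    var synth = env.synth || window.speechSynthesis;
    var makeUtterance = env.makeUtterance || function (t) {
        return new SpeechSynthesisUtterance(t);
    };
    var timeoutMs = env.timeoutMs || 10000; // arbitrary default
    var finished = false;

    var u = makeUtterance(text);
    var timer = setTimeout(function () {
        finish('timeout');
    }, timeoutMs);

    function finish(err) {
        if (finished) {
            return; // ignore late or duplicate events
        }
        finished = true;
        clearTimeout(timer);
        if (callback) {
            callback(err);
        }
    }

    u.lang = 'en-US';
    u.onend = function () { finish(); };
    u.onerror = function (e) { finish(e); };
    synth.speak(u);
}
```

The trade-off is that a fixed timeout can fire before a long utterance has finished, in which case the callback runs while the computer is still speaking. You would need to pick a timeout that is longer than your longest question.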

Conclusion

The HTML5 Speech API is not quite ready for production web apps. Browser support is limited to Google Chrome and Apple Safari. Furthermore, even on Google Chrome, the Speech API is flaky (events are not reliably raised).

However, this API has great promise. I can’t wait until I can start navigating games and apps by voice.