[Azure] Using the Bing Speech API: Speech to Text

I recently tried the Bing Speech API on Azure. It is a simple and useful service that converts speech to text and text to speech.

  • Speech to Text: lets users give commands to a program by voice, or serves as a source of input data
  • Text to Speech: lets a program deliver feedback to the user as spoken audio

I started with the speech-to-text service, mainly following Microsoft's "Get Started with Speech Recognition using REST API" guide. Microsoft also provides speech-to-text sample code on GitHub.

Speech to Text is offered in two forms, a REST API and a WebSocket API. They differ as follows:

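| Feature                          | WebSocket API       | REST API            |
|----------------------------------|---------------------|---------------------|
| Speech hypotheses                | Yes                 | No                  |
| Continuous recognition           | Yes                 | No                  |
| Maximum audio input              | 10 minutes of audio | 15 seconds of audio |
| Service detects when speech ends | Yes                 | No                  |
| Subscription key authorization   | Yes                 | No                  |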

For this trial I used the REST API.

Prerequisites

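Accessing Microsoft Cognitive Services requires a Microsoft subscription key, which grants permission to use the Azure service. For development and testing, Microsoft offers free trial keys: go to the Microsoft Subscriptions site, register and sign in, then request a trial key for the service you need. A trial key is valid for only 30 days and is limited in both the number and the rate of calls, but that is more than enough during development.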

Also, prepare a WAV recording of spoken Chinese to use as test input for the API.

Authentication required by the service

To access the REST service you need an OAuth token, which proves that the client is authenticated rather than an arbitrary caller. To obtain it, send the subscription key mentioned above to the token service at https://api.cognitive.microsoft.com/sts/v1.0/issueToken. The token service returns the access token as a JSON Web Token (JWT). The access token is valid for only 10 minutes and must be renewed once it expires. Microsoft provides C# sample code that handles the token lifecycle:

    using System;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    /*
     * This class demonstrates how to get a valid OAuth token.
     */
    public class Authentication
    {
        public static readonly string FetchTokenUri = "https://api.cognitive.microsoft.com/sts/v1.0";
        private string subscriptionKey;
        private string token;
        private Timer accessTokenRenewer;

        //Access token expires every 10 minutes. Renew it every 9 minutes.
        private const int RefreshTokenDuration = 9;

        public Authentication(string subscriptionKey)
        {
            this.subscriptionKey = subscriptionKey;
            this.token = FetchToken(FetchTokenUri, subscriptionKey).Result;

            // renew the token on set duration.
            accessTokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback),
                                           this,
                                           TimeSpan.FromMinutes(RefreshTokenDuration),
                                           TimeSpan.FromMilliseconds(-1));
        }

        public string GetAccessToken()
        {
            return this.token;
        }

        private void RenewAccessToken()
        {
            this.token = FetchToken(FetchTokenUri, this.subscriptionKey).Result;
            Console.WriteLine("Renewed token.");
        }

        private void OnTokenExpiredCallback(object stateInfo)
        {
            try
            {
                RenewAccessToken();
            }
            catch (Exception ex)
            {
                Console.WriteLine(string.Format("Failed renewing access token. Details: {0}", ex.Message));
            }
            finally
            {
                try
                {
                    accessTokenRenewer.Change(TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
                }
                catch (Exception ex)
                {
                    Console.WriteLine(string.Format("Failed to reschedule the timer to renew access token. Details: {0}", ex.Message));
                }
            }
        }

        private async Task<string> FetchToken(string fetchUri, string subscriptionKey)
        {
            using (var client = new HttpClient())
            {
                client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
                UriBuilder uriBuilder = new UriBuilder(fetchUri);
                uriBuilder.Path += "/issueToken";

                var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
                Console.WriteLine("Token Uri: {0}", uriBuilder.Uri.AbsoluteUri);
                return await result.Content.ReadAsStringAsync();
            }
        }
    }

Recognition modes

Microsoft's Speech to Text API offers three recognition modes.

| Mode         | Description |
|--------------|-------------|
| interactive  | "Command and control" recognition for interactive user application scenarios. Users speak short phrases intended as commands to an application. |
| dictation    | Continuous recognition for dictation scenarios. Users speak longer sentences that are displayed as text. Users adopt a more formal speaking style. |
| conversation | Continuous recognition for transcribing conversations between humans. Users adopt a less formal speaking style and may alternate between longer sentences and shorter phrases. |

Each mode has its own URI. When calling the REST API, use the URI (endpoint) for the recognition mode you need and specify the language of the speech.

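| Mode                | Path                                                   |
|---------------------|--------------------------------------------------------|
| Interactive/Command | /speech/recognition/interactive/cognitiveservices/v1   |
| Dictation           | /speech/recognition/dictation/cognitiveservices/v1     |
| Conversation        | /speech/recognition/conversation/cognitiveservices/v1  |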

For example, my test file contains a short passage of spoken Chinese, so I call the service in Interactive mode with the URI https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=zh-TW
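The mode-to-URI mapping can be captured in a small helper. This is only a sketch: the `SpeechEndpoint` class and its `Build` method are hypothetical names introduced for illustration, while the host and paths come from the table above.

```csharp
using System;

public static class SpeechEndpoint
{
    // Host taken from the example URI in this article.
    private const string BaseUri = "https://speech.platform.bing.com";

    // Builds the recognition URI for a mode ("interactive", "dictation",
    // or "conversation") and a language tag such as "zh-TW".
    public static string Build(string mode, string language)
    {
        if (mode == null) throw new ArgumentNullException(nameof(mode));
        if (language == null) throw new ArgumentNullException(nameof(language));

        string normalized = mode.ToLowerInvariant();
        if (normalized != "interactive" &&
            normalized != "dictation" &&
            normalized != "conversation")
        {
            throw new ArgumentException("Unknown recognition mode: " + mode, nameof(mode));
        }

        return string.Format(
            "{0}/speech/recognition/{1}/cognitiveservices/v1?language={2}",
            BaseUri, normalized, Uri.EscapeDataString(language));
    }
}
```

Calling `SpeechEndpoint.Build("interactive", "zh-TW")` yields the same URI used in this article.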

Building the HTTP request

To call the REST API, you can use an HttpWebRequest object. The request needs the following settings:

  • Method: POST
  • Host: speech.platform.bing.com
  • The Authorization request header must be the string "Bearer " + token
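As a minimal sketch, the settings above might be applied like this. `SpeechRequestFactory` is a hypothetical helper name, and the Content-Type value is an assumption that the audio is 16 kHz PCM WAV; adjust it to match your recording.

```csharp
using System.Net;

public static class SpeechRequestFactory
{
    // Creates a POST request for the given recognition URI.
    // No network traffic occurs until the request stream or response is opened.
    public static HttpWebRequest Create(string requestUri, string accessToken)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
        request.Method = "POST";
        request.Host = "speech.platform.bing.com";
        request.Headers["Authorization"] = "Bearer " + accessToken;

        // Stream the body with chunked transfer encoding.
        request.SendChunked = true;

        // Assumption: the audio file is 16 kHz PCM WAV.
        request.ContentType = "audio/wav; codec=audio/pcm; samplerate=16000";
        return request;
    }
}
```

The `accessToken` argument would come from `Authentication.GetAccessToken()` in the sample shown earlier.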

The Bing Speech API supports chunked transfer encoding to improve transfer efficiency. The sample code below sends the audio in 1024-byte chunks:

// Assumes `audioFile` is the path to the WAV file and `request` is the configured HttpWebRequest.
using (FileStream fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
using (Stream requestStream = request.GetRequestStream())
{
    // Read the audio file and write it to the request stream
    // in chunks of at most 1024 bytes.
    byte[] buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
    {
        requestStream.Write(buffer, 0, bytesRead);
    }

    // Flush any buffered data to the request.
    requestStream.Flush();
}

Response

After the request is sent with HttpWebRequest, if all goes well the Bing Speech API returns a JSON response containing the recognized text. The response looks like this:

OK
{"RecognitionStatus":"Success","DisplayText":"測試微軟服務","Offset":11600000,"Duration":36600000}
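To work with the result in code, the JSON can be deserialized into a small class. The sketch below uses `DataContractJsonSerializer` from the .NET base class library; `RecognitionResult` is a hypothetical type whose members mirror the four fields in the response above. It also assumes that `Offset` and `Duration` are expressed in 100-nanosecond units, which would match a .NET tick; check the service documentation before relying on that.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

[DataContract]
public class RecognitionResult
{
    [DataMember(Name = "RecognitionStatus")]
    public string RecognitionStatus { get; set; }

    [DataMember(Name = "DisplayText")]
    public string DisplayText { get; set; }

    // Assumed to be in 100-nanosecond units (one .NET tick).
    [DataMember(Name = "Offset")]
    public long Offset { get; set; }

    [DataMember(Name = "Duration")]
    public long Duration { get; set; }

    // Deserializes the JSON body returned by the service.
    public static RecognitionResult Parse(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(RecognitionResult));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (RecognitionResult)serializer.ReadObject(stream);
        }
    }
}
```

After `RecognitionResult.Parse(json)`, the recognized text is in `DisplayText`. Under the unit assumption above, `TimeSpan.FromTicks(result.Duration)` gives the length of the recognized speech.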