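I recently tried out the Bing Speech API on Azure. It is a simple and useful service that can convert speech to text and text to speech.
- Speech to Text: lets users give commands to a program by voice, or serves as a source of data input
- Text to Speech: lets a program give feedback to users as spoken audio
I started with the speech-to-text service, mainly following Microsoft's "Get Started with Speech Recognition using REST API" guide. Microsoft also provides speech-to-text sample code on GitHub.
The Speech to Text service comes in two forms, a REST API and a WebSocket API. They differ as follows: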
Feature | WebSocket API | REST API |
---|---|---|
Speech hypotheses | Yes | No |
Continuous recognition | Yes | No |
Maximum audio input | 10 minutes of audio | 15 seconds of audio |
Service detects when speech ends | Yes | No |
Subscription key authorization | Yes | No |
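The one I tried is the REST API.
Prerequisites
Accessing Microsoft Cognitive Services requires a Microsoft subscription key, which grants permission to use the Azure service. For development and testing, Microsoft offers free trial keys: go to the Microsoft Subscriptions site, register and sign in, and you can get a trial key for the service you need. A trial key is valid for only 30 days and is limited in both the number and the frequency of calls, but that is more than enough for the development stage.
Also, record a WAV file containing spoken Chinese to use as test input for the API.
Authentication required by the service
Accessing the REST service requires an OAuth token, which proves that the calling client has been authenticated, so that not just any client can call the service. To get the token, send the subscription key mentioned above to the token service at this URI: https://api.cognitive.microsoft.com/sts/v1.0/issueToken
The token service returns the access token as a JSON Web Token (JWT). The access token is valid for only 10 minutes and has to be renewed after it expires. Microsoft provides C# sample code that handles the token lifecycle.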
```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

/*
 * This class demonstrates how to get a valid O-auth token.
 */
public class Authentication
{
    public static readonly string FetchTokenUri = "https://api.cognitive.microsoft.com/sts/v1.0";
    private string subscriptionKey;
    private string token;
    private Timer accessTokenRenewer;

    // Access token expires every 10 minutes. Renew it every 9 minutes.
    private const int RefreshTokenDuration = 9;

    public Authentication(string subscriptionKey)
    {
        this.subscriptionKey = subscriptionKey;
        // Blocks until the first token has been fetched.
        this.token = FetchToken(FetchTokenUri, subscriptionKey).Result;

        // Renew the token on set duration (one-shot timer, re-armed in the callback).
        accessTokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback),
                                       this,
                                       TimeSpan.FromMinutes(RefreshTokenDuration),
                                       TimeSpan.FromMilliseconds(-1));
    }

    public string GetAccessToken()
    {
        return this.token;
    }

    private void RenewAccessToken()
    {
        this.token = FetchToken(FetchTokenUri, this.subscriptionKey).Result;
        Console.WriteLine("Renewed token.");
    }

    private void OnTokenExpiredCallback(object stateInfo)
    {
        try
        {
            RenewAccessToken();
        }
        catch (Exception ex)
        {
            Console.WriteLine(string.Format("Failed renewing access token. Details: {0}", ex.Message));
        }
        finally
        {
            try
            {
                // Schedule the next renewal whether or not this one succeeded.
                accessTokenRenewer.Change(TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
            }
            catch (Exception ex)
            {
                Console.WriteLine(string.Format("Failed to reschedule the timer to renew access token. Details: {0}", ex.Message));
            }
        }
    }

    private async Task<string> FetchToken(string fetchUri, string subscriptionKey)
    {
        using (var client = new HttpClient())
        {
            // The subscription key is sent as a request header; the POST body is empty.
            client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
            UriBuilder uriBuilder = new UriBuilder(fetchUri);
            uriBuilder.Path += "/issueToken";

            var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
            Console.WriteLine("Token Uri: {0}", uriBuilder.Uri.AbsoluteUri);
            // The response body is the access token itself.
            return await result.Content.ReadAsStringAsync();
        }
    }
}
```
Recognition modes
Microsoft's Speech to Text API offers the following three recognition modes.
Mode | Description |
---|---|
interactive | "Command and control" recognition for interactive user application scenarios. Users speak short phrases intended as commands to an application. |
dictation | Continuous recognition for dictation scenarios. Users speak longer sentences that are displayed as text. Users adopt a more formal speaking style. |
conversation | Continuous recognition for transcribing conversations between humans. Users adopt a less formal speaking style and may alternate between longer sentences and shorter phrases. |
Each mode has its own URI. When calling the REST API, use the URI (endpoint) that matches the recognition mode you need, and specify the language of the audio.
Mode | Path |
---|---|
Interactive/Command | /speech/recognition/interactive/cognitiveservices/v1 |
Dictation | /speech/recognition/dictation/cognitiveservices/v1 |
Conversation | /speech/recognition/conversation/cognitiveservices/v1 |
For example, my test file contains a short passage of spoken Chinese, so I call the service in Interactive mode. The URI I use is therefore https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=zh-TW
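As a small illustration of how the mode and language combine into the endpoint, here is a minimal sketch. The `SpeechEndpoint` class and its `Build` method are hypothetical names introduced only for this example; the host and path pattern are taken from the table and the URI above.

```csharp
using System;

// Hypothetical helper (not part of Microsoft's sample): builds the REST endpoint
// from a recognition mode and a language code.
public static class SpeechEndpoint
{
    // Host as it appears in the URI quoted in this post.
    private const string Host = "speech.platform.bing.com";

    public static string Build(string mode, string language)
    {
        // Only the three modes listed in the table are accepted.
        switch (mode)
        {
            case "interactive":
            case "dictation":
            case "conversation":
                break;
            default:
                throw new ArgumentException("Unknown recognition mode: " + mode, nameof(mode));
        }

        return "https://" + Host + "/speech/recognition/" + mode
             + "/cognitiveservices/v1?language=" + Uri.EscapeDataString(language);
    }
}
```

Calling `SpeechEndpoint.Build("interactive", "zh-TW")` yields the URI used in my test.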
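Building the HTTP request
To call the REST API, you can use an HttpWebRequest object. The request needs the following settings:
- Method is POST
- Host is speech.platform.bing.com
- The Authorization request header must be the string "Bearer " + token
The Bing Speech API supports chunked transfer encoding, which improves transfer efficiency. The following sample code sends the audio in 1024-byte chunks: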
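The settings above might be applied roughly as follows. This is only a sketch: `SpeechRequestFactory` and `Create` are hypothetical names, and the `ContentType` value and `SendChunked` flag are my assumptions based on Microsoft's documentation and sample for 16 kHz PCM WAV input, not something stated in the list above. Note also that `WebRequest.Create` is marked obsolete on recent .NET versions, so this compiles there with a warning.

```csharp
using System;
using System.Net;

// Hypothetical helper: prepares an HttpWebRequest for the speech REST endpoint.
// Nothing is sent over the network until the request stream is opened.
public static class SpeechRequestFactory
{
    public static HttpWebRequest Create(string requestUri, string token)
    {
        var request = (HttpWebRequest)WebRequest.Create(requestUri);

        // The three settings listed above.
        request.Method = "POST";
        request.Host = "speech.platform.bing.com";
        request.Headers["Authorization"] = "Bearer " + token;

        // Assumptions: content type for 16 kHz PCM WAV audio, and chunked upload.
        request.ContentType = "audio/wav; codec=audio/pcm; samplerate=16000";
        request.SendChunked = true;

        return request;
    }
}
```

The `token` argument would be whatever `Authentication.GetAccessToken()` returns at the time of the call.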
```csharp
using (fs = new FileStream(audioFile, FileMode.Open, FileAccess.Read))
{
    /*
     * Open a request stream and write 1024 byte chunks in the stream one at a time.
     */
    byte[] buffer = null;
    int bytesRead = 0;
    using (Stream requestStream = request.GetRequestStream())
    {
        /*
         * Read 1024 raw bytes from the input audio file.
         */
        buffer = new Byte[checked((uint)Math.Min(1024, (int)fs.Length))];
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
        {
            requestStream.Write(buffer, 0, bytesRead);
        }

        // Flush
        requestStream.Flush();
    }
}
```
Response
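After the request is sent through HttpWebRequest, and provided nothing went wrong, the Bing Speech API returns a JSON response containing the speech-to-text result. The output looks like this: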
OK
{"RecognitionStatus":"Success","DisplayText":"測試微軟服務","Offset":11600000,"Duration":36600000}
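To use the result in code, the JSON can be parsed into a small object. The sketch below uses System.Text.Json, which assumes a modern .NET runtime; Microsoft's sample does not do this, and `RecognitionResult` is a hypothetical type introduced for illustration. It also assumes, as I understand the Bing Speech documentation, that Offset and Duration are expressed in 100-nanosecond units, which happen to match .NET ticks.

```csharp
using System;
using System.Text.Json;

// Hypothetical type: holds the fields of the simple-format recognition response.
public sealed class RecognitionResult
{
    public string Status { get; private set; }
    public string DisplayText { get; private set; }
    public TimeSpan Offset { get; private set; }
    public TimeSpan Duration { get; private set; }

    public static RecognitionResult Parse(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            JsonElement value;

            var result = new RecognitionResult();
            result.Status = root.GetProperty("RecognitionStatus").GetString();

            // Assumption: the remaining fields may be missing when recognition fails,
            // so they are read defensively.
            result.DisplayText = root.TryGetProperty("DisplayText", out value) ? value.GetString() : null;

            // Assumption: values are in 100-nanosecond units, i.e. TimeSpan ticks.
            result.Offset = TimeSpan.FromTicks(root.TryGetProperty("Offset", out value) ? value.GetInt64() : 0);
            result.Duration = TimeSpan.FromTicks(root.TryGetProperty("Duration", out value) ? value.GetInt64() : 0);

            return result;
        }
    }
}
```

Under that unit assumption, the Duration of 36600000 in the response above corresponds to 3.66 seconds of recognized speech.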