Auto-detecting the character encoding of a web document: Stream and charset=

2014-05-22

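Downloading content from the web using different encodings [1] describes two ways to auto-detect the character encoding of a web document. The second reads the document into a Stream as ASCII, converts it to a string, and searches the result for the keyword charset=. It then finds the first double quote after the keyword and extracts the text between the keyword and that double quote, which is the encoding name.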
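The steps above can be sketched roughly as follows. This is only an illustration in Python, not the code from the referenced article; the function name sniff_charset_naive is hypothetical, and lowercasing the text before searching is an added assumption so that CHARSET= is also matched.

```python
from typing import Optional


def sniff_charset_naive(raw: bytes) -> Optional[str]:
    """Return the text between 'charset=' and the next double quote."""
    # Decode as ASCII; non-ASCII bytes become U+FFFD so decoding never fails.
    text = raw.decode("ascii", errors="replace")
    key = "charset="
    # Assumption: search case-insensitively (lower() keeps the length here).
    start = text.lower().find(key)
    if start == -1:
        return None
    start += len(key)
    # Find the first double quote after the keyword.
    end = text.find('"', start)
    if end == -1:
        return None
    return text[start:end]
```

Called on an HTML4-style meta tag, this sketch returns the encoding name; on an HTML5-style tag it returns an empty string, because the first double quote sits directly after the keyword.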

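This approach works, but only for HTML4. The catch is that in HTML5 a double quote follows charset= immediately, so the search yields only an empty string. The method therefore needs a small change before it can be applied to HTML5 documents.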

HTML4
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

HTML5
<meta charset="UTF-8">
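One possible adjustment, sketched in Python under stated assumptions: skip a quote that immediately follows charset=, then read up to the next quote or delimiter. The function name sniff_charset and the exact set of stop characters are hypothetical choices for this illustration, not something taken from the referenced article.

```python
from typing import Optional


def sniff_charset(raw: bytes) -> Optional[str]:
    """Extract the charset value from HTML4- or HTML5-style meta tags."""
    # Decode as ASCII; non-ASCII bytes become U+FFFD so decoding never fails.
    text = raw.decode("ascii", errors="replace")
    key = "charset="
    pos = text.lower().find(key)
    if pos == -1:
        return None
    pos += len(key)
    # HTML5 form: the value is wrapped in quotes, so skip the opening one.
    if pos < len(text) and text[pos] in "\"'":
        pos += 1
    # Read until a closing quote or another delimiter (assumed stop set).
    end = pos
    while end < len(text) and text[end] not in "\"' >;/":
        end += 1
    return text[pos:end] or None
```

With this change both sample tags above yield UTF-8. The sketch stops at the first charset= it finds and does not validate that the value is a known encoding name.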

References:

[1]Downloading content from the web using different encodings
http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx

[2]HTML meta http-equiv attribute
http://www.wibibi.com/info.php?tid=416
