Processing web page elements with HtmlAgilityPack

  • C#
  • 2018-12-03

HtmlAgility, HTML

Html Agility Pack (NuGet package name: HtmlAgilityPack) is a very handy package for parsing HTML elements.

https://html-agility-pack.net/?z=codeplex

In Visual Studio, search for HtmlAgilityPack in the NuGet package manager to install it.

Use it when you need to read the value of an element in a web page.

Example: reading the value attribute of this element

<input type="hidden" name="dse_processorId" value="AKISDOHAERGVEVABBLFBDFDFEFE">

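using HtmlAgilityPack;

HtmlWeb webClient = new HtmlWeb();
HtmlDocument doc = webClient.Load("http://www.w3.org/");

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//input[@name='dse_processorId']");
// SelectNodes returns null by default when nothing matches, so guard before reading the attribute.
var process = nodes?[0].GetAttributeValue("value", string.Empty) ?? string.Empty;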

The argument passed to SelectNodes is an XPath expression (https://zh.wikipedia.org/wiki/XPath).

The URL in the example is arbitrary; point it at the page you actually want to parse.
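To make the XPath part more concrete, here is a small sketch that runs a few expressions against an in-memory document. The markup (apart from the dse_processorId input), the `XPathDemo` class, and the `AttributeValues` helper are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class XPathDemo
{
    // Hypothetical helper: collects one attribute from every node the XPath matches.
    public static List<string> AttributeValues(HtmlDocument doc, string xpath, string attribute)
    {
        var values = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return values; // by default, no match yields null, not an empty collection
        foreach (HtmlNode node in nodes)
        {
            values.Add(node.GetAttributeValue(attribute, string.Empty));
        }
        return values;
    }

    static void Main()
    {
        // Illustrative markup; only the first <input> comes from the example above.
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<div id=""login"">
  <input type=""hidden"" name=""dse_processorId"" value=""AKISDOHAERGVEVABBLFBDFDFEFE"">
  <input type=""text"" name=""user"" value=""guest"">
</div>");

        // Attribute predicate: inputs whose name attribute matches exactly.
        Console.WriteLine(string.Join(",", AttributeValues(doc, "//input[@name='dse_processorId']", "value")));
        // Descendant axis: every <input> under the element with id 'login'.
        Console.WriteLine(string.Join(",", AttributeValues(doc, "//div[@id='login']//input", "name")));
        // No <textarea> exists, so the helper returns an empty list.
        Console.WriteLine(AttributeValues(doc, "//textarea", "value").Count);
    }
}
```

Wrapping the null check in one helper keeps the calling code free of repeated guards.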

 

Alternatively, if you fetched the HTML yourself with a WebClient GET request, pass the string to LoadHtml and parse it from there.

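using System.Net;
using HtmlAgilityPack;

WebClient wc = new WebClient();
string htmlCode = wc.DownloadString("http://example.com");

HtmlDocument doc = new HtmlDocument();
// HtmlDecode unescapes entities before parsing; drop it if the page contains escaped markup such as &lt;div&gt;.
doc.LoadHtml(WebUtility.HtmlDecode(htmlCode));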
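Once LoadHtml has parsed the string, the same XPath lookup works as with HtmlWeb. A minimal sketch, assuming the HTML is already in a string: the `HiddenFieldReader` class and `ExtractInputValue` helper are hypothetical names, and the sample markup stands in for whatever DownloadString would return.

```csharp
using System;
using HtmlAgilityPack;

static class HiddenFieldReader
{
    // Hypothetical helper: parses an HTML string and returns the value attribute
    // of the first <input> with the given name, or null if there is no such input.
    public static string ExtractInputValue(string html, string inputName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // inputName is inserted into the XPath as-is, so it must not contain a single quote.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//input[@name='" + inputName + "']");
        return node?.GetAttributeValue("value", string.Empty);
    }

    static void Main()
    {
        // In real use this string would come from WebClient.DownloadString.
        string html = "<html><body><input type=\"hidden\" name=\"dse_processorId\" value=\"AKISDOHAERGVEVABBLFBDFDFEFE\"></body></html>";
        Console.WriteLine(ExtractInputValue(html, "dse_processorId"));
    }
}
```

Keeping the download and the parsing separate also makes the parsing step easy to try against a saved copy of the page.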

 

Addendum: parsing a table

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}

Reference:

https://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
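The table loop above prints cells as it finds them. If you would rather collect them for later use, something like the following sketch could work. The `TableReader` class and `ReadTable` helper are hypothetical names, and the null checks assume the default behavior where SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableReader
{
    // Hypothetical helper: returns each row of the first table matched by
    // tableXPath as a list of trimmed cell texts.
    public static List<List<string>> ReadTable(string html, string tableXPath)
    {
        var rows = new List<List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode(tableXPath);
        if (table == null) return rows;

        // ".//tr" also finds rows inside <thead>/<tbody>, unlike the child-only "tr".
        // It would also pick up rows of nested tables, if there are any.
        HtmlNodeCollection trs = table.SelectNodes(".//tr");
        if (trs == null) return rows;

        foreach (HtmlNode tr in trs)
        {
            HtmlNodeCollection cells = tr.SelectNodes("th|td");
            if (cells == null) continue; // skip rows without cells
            rows.Add(cells.Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim()).ToList());
        }
        return rows;
    }

    static void Main()
    {
        string html = @"<html><body><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>";
        foreach (List<string> row in ReadTable(html, "//table[@id='foo']"))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```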