[C#] Parsing and modifying crawled HTML pages with CSS selectors, using AngleSharp

2017-10-27

.Net Framework
2019-11-25

Parse HTML with jQuery-like CSS selectors in C#

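Preface

Around 2010, if you wanted to parse an HTML page in .NET, Html Agility Pack was the only choice.

Because HtmlAgilityPack treats HTML as XML, you extract tag data from a page with XPath plus LINQ,

which is hard to pick up for anyone used to writing front-end jQuery.

Years later, third-party .NET packages for parsing HTML are everywhere, and browsing them on the NuGet site makes my head spin XD

Quite a few of them now parse pages with CSS selectors, so the back end can finally fetch tag data with jQuery-style CSS selectors.

The one I am introducing today is AngleSharp, which claims parsing performance good enough to rival HtmlAgilityPack.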

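Implementation

AngleSharp can be installed from NuGet; it is the first result when you search for the package name.

The project must target .NET Framework 4 or later; on 3.5 or earlier, the NuGet installation fails.

The example below grabs the movie poster images from Yahoo! Kimo Movies.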

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
/* Import the AngleSharp namespaces */
using AngleSharp;
using AngleSharp.Dom;

namespace ConsoleApp2Test
{
    class Program
    {
        static void Main(string[] args)
        { 
            IConfiguration config = Configuration.Default.WithDefaultLoader();
            string url = "https://tw.movies.yahoo.com"; 
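            // Blocking on .Result is acceptable in a simple console app
            IDocument doc = BrowsingContext.New(config).OpenAsync(url).Result;

            /* CSS selector syntax */
            // Select the first-child <img> inside each div.movie_foto
            IHtmlCollection<IElement> imgs = doc.QuerySelectorAll("div.movie_foto img:first-child");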
            foreach (IElement img in imgs)
            {
                Console.WriteLine(img.GetAttribute("src"));
            }
            Console.ReadKey();
        }
    }
}

Result:

Personally, I find this much more concise and readable than the HtmlAgilityPack way of writing it.
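
As a variation on the example above, here is a minimal sketch that awaits the page load instead of blocking on .Result. It assumes the project compiles with C# 7.1 or later (needed for an async Main), which the original post does not state; the URL and selector are taken from the example above.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

namespace ConsoleApp2Test
{
    class Program
    {
        // Assumption: C# 7.1+ so that Main can be declared async
        static async Task Main(string[] args)
        {
            IConfiguration config = Configuration.Default.WithDefaultLoader();
            IBrowsingContext context = BrowsingContext.New(config);
            string url = "https://tw.movies.yahoo.com";

            // Await the download and parse instead of blocking the thread
            IDocument doc = await context.OpenAsync(url);

            foreach (IElement img in doc.QuerySelectorAll("div.movie_foto img:first-child"))
            {
                // GetAttribute returns null when the attribute is absent
                string src = img.GetAttribute("src");
                if (!string.IsNullOrEmpty(src))
                {
                    Console.WriteLine(src);
                }
            }
        }
    }
}
```

In a console app the blocking version works as well; awaiting mainly matters once the same code moves into a UI or ASP.NET context, where blocking on .Result can deadlock.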

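Looking through the other official AngleSharp examples, it also supports a JavaScript engine.

All in all it is a really nice package, and I am optimistic about where it goes next.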

Update, 2019-10-20

If you need to modify the HTML you have read, the following can serve as a reference.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
/* Import the AngleSharp namespaces */
using AngleSharp;
using AngleSharp.Dom;
namespace ConsoleApp1_Selector
{
    class Program
    {
        static void Main(string[] args)
        {
            // HTML source
            var source = @"
                            <!DOCTYPE html>
                            <html>
                              <meta charset=utf-8>
                              <meta name=viewport content=""initial-scale=1, width=device-width"">
                              <title>Test Page</title>
                              <style>
                                *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px} 
                              </style>
                            <div>
                                   <!-- The first image has no alt attribute -->
                                   <img src=""Content/1.jpg"" />
                                   <br/>
                                   <!-- The second image has alt with no value and no equals sign -->
                                   <img alt src=""Content/2.jpg"" /> 
                                   <br/>
                                    <img alt="""" src=""Content/3.jpg"" /> 
                                   <br/>
                                    <img alt='' src='Content/4.jpg' /> 
                                      <br/>
                                    <img alt=  src=Content/5.jpg /> 
                                    <br/>
                                    <img alt=test src=Content/6.jpg > 
                            </div>";
             
            IDocument document = BrowsingContext.New(Configuration.Default.WithDefaultLoader())
                                .OpenAsync(req => req.Content(source)).Result;

            // Get every <img> element
            IEnumerable<IElement> imgs = document.QuerySelectorAll("img");
            int i = 1;
            foreach (IElement img in imgs)
            {
                // Set (or overwrite) the alt attribute of the img
                img.SetAttribute("alt", "alt_" + i);
                i++;
            }
            // Print the modified HTML
            Console.WriteLine(document.ToHtml());
            Console.ReadKey();
        }
    }
}

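※ Note that the final output can come from document.ToHtml() or document.DocumentElement.OuterHtml; the two differ slightly (ToHtml() on the document also serializes the doctype, while OuterHtml covers only the <html> element), so try both yourself~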
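
To make that difference easy to see, here is a small sketch that prints both forms for the same document. The class name SerializeDemo and the shortened HTML string are made up for illustration; the AngleSharp calls are the same ones used in the example above.

```csharp
using System;
using AngleSharp;
using AngleSharp.Dom;

class SerializeDemo
{
    static void Main()
    {
        // Hypothetical, shortened page used only for this comparison
        var source = "<!DOCTYPE html><html><head><title>Test Page</title></head>"
                   + "<body><img src=\"Content/1.jpg\"></body></html>";

        // No loader is needed because the content is supplied directly
        IDocument document = BrowsingContext.New(Configuration.Default)
                             .OpenAsync(req => req.Content(source)).Result;

        // Serializes every child node of the document, including the doctype
        Console.WriteLine(document.ToHtml());

        // Serializes only the root <html> element and its descendants
        Console.WriteLine(document.DocumentElement.OuterHtml);
    }
}
```

If the output is meant to be saved as a complete page, the ToHtml() form is probably the one you want, since it keeps the doctype.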

See "Getting Single Elements" at https://anglesharp.github.io/docs/Examples.html
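
For a single element, QuerySelector can be used in place of QuerySelectorAll. The sketch below is my own illustration rather than a copy of the linked documentation; the class name SingleElementDemo and the HTML string are hypothetical.

```csharp
using System;
using AngleSharp;
using AngleSharp.Dom;

class SingleElementDemo
{
    static void Main()
    {
        // Hypothetical fragment: one image without alt, one with alt
        var source = "<div><img src=\"Content/1.jpg\"><img alt=\"test\" src=\"Content/6.jpg\"></div>";

        IDocument document = BrowsingContext.New(Configuration.Default)
                             .OpenAsync(req => req.Content(source)).Result;

        // QuerySelector returns the first match in document order,
        // or null when nothing matches the selector
        IElement firstWithAlt = document.QuerySelector("img[alt]");
        if (firstWithAlt != null)
        {
            Console.WriteLine(firstWithAlt.GetAttribute("src"));
        }

        // No <video> element exists here, so the result should be null
        IElement missing = document.QuerySelector("video");
        Console.WriteLine(missing == null);
    }
}
```

Checking for null before reading attributes avoids a NullReferenceException when the page layout changes and the selector no longer matches.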

Result