C# Exercise (11)

Exercise (11): The input is an HTML table. Remove all tags and put the data in a comma- or tab-separated file.

This exercise uses the Regex, Match, and Group classes, which handle regular expressions in the .NET Framework, to extract the cell data from the HTML.
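As a rough sketch of how these three classes work together on a single line of HTML: Regex holds the pattern, Match represents one hit, and Group exposes the captured cell text. The class name `CellDemo` and the method name `ExtractCells` are made up for illustration and are not part of the exercise.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class CellDemo
{
    // Returns the inner text of every <td> element found in one line of HTML.
    public static List<string> ExtractCells(string line)
    {
        // Group 1 lazily captures everything between <td ...> and </td>.
        var pattern = new Regex(@"<td\b[^>]*>(.*?)</td>", RegexOptions.IgnoreCase);
        var cells = new List<string>();

        // Match() finds the first hit; NextMatch() continues after it.
        for (Match m = pattern.Match(line); m.Success; m = m.NextMatch())
        {
            Group g = m.Groups[1];
            cells.Add(g.Value);
        }
        return cells;
    }
}
```

Calling `CellDemo.ExtractCells("<td align=\"left\">a</td><TD>b</TD>")` should return the two strings `a` and `b`. A regular expression like this only suits simple, well-behaved input; nested tables or a cell split across several lines would need a real HTML parser.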

Here is the content of the HTML file I used as input:

    <table width="100" border="1" >
    <td align="center" colspan="2" > >1.5 </td>
    <td align="right" > 3<</td>
    </tr>
    <td align="left" > <4</td>
    <td align="center" > 5> </td>
    <td align="right" > <>6 </td>
    </tr>
    <td align="left" > 7<> </td>
    <td colspan="2" align="center" > <8.5> </td>
    </tr>
    </table>

It renders as shown below:

>1.5 3<
<4 5> <>6
7<> <8.5>

Code:

    using System.IO;
    using System.Text.RegularExpressions;

    namespace TableParser
    {
        internal class Program
        {
            private static void Main(string[] args)
            {
                // Build the pattern once; group 1 captures the text between <td ...> and </td>.
                Regex cell = new Regex(@"<td\b[^>]*>(.*?)</td>", RegexOptions.IgnoreCase);

                // The using blocks flush and close both files, even if an exception is thrown.
                using (StreamReader sr = new StreamReader(@"in.html"))
                using (StreamWriter sw = new StreamWriter("out.txt"))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        if (Regex.IsMatch(s, @"</tr>", RegexOptions.IgnoreCase))
                        {
                            // A closing </tr> ends the current output row.
                            sw.WriteLine();
                        }
                        else
                        {
                            // Write every cell found on this line, each followed by a tab.
                            Match m = cell.Match(s);
                            while (m.Success)
                            {
                                Group g = m.Groups[1];
                                sw.Write(g.Value + "\t");
                                m = m.NextMatch();
                            }
                        }
                    }
                }
            }
        }
    }
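
The program above produces only the tab-separated form. For the comma-separated form that the exercise also mentions, the following is one possible sketch of joining a row's cells. It assumes the common CSV convention of wrapping a field in double quotes when it contains a comma, a quote, or a line break, and it also trims surrounding whitespace, which the program above does not do. The class name `CsvRow` and its methods are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only.
public static class CsvRow
{
    // Quotes a field only when it contains a character that would break the row.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return field;
        }
        // Embedded double quotes are doubled inside a quoted field.
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Trims each cell, escapes it, and joins the row with commas.
    public static string Join(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(c => Escape(c.Trim())));
    }
}
```

With this helper, the cells of one row could be collected into a list and written with `sw.WriteLine(CsvRow.Join(cells))` when `</tr>` is reached, which would also avoid the trailing separator at the end of each row.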