How to use mshtml.
There are many free HTML parsers available for download on the internet,
and you can find plenty of documentation on the topic.
But I haven’t seen anyone talk about “mshtml”, which is provided by Microsoft.
So this article shows how to use “mshtml” to parse HTML.
1. Right-click your project and choose “Add Reference”.
2. Select the “.NET” tab, find “Microsoft.mshtml”, and click OK.
3. An item named “Microsoft.mshtml” will appear under your project’s references. Don’t forget to add “using mshtml;” to your source code.
4. Here is a simple example.
using System.IO;
using System.Net;
using mshtml;

// The methods below belong inside one of your project's classes.

/// <summary>
/// main process
/// </summary>
private void MainProcess()
{
    string szTestURL_ = @"http://tw.dictionary.yahoo.com/dictionary?p=concentrate";
    string szHtmlContent_ = this.DownloadWeb(szTestURL_);
    string szAfterParser_ = this.ParserHtml(szHtmlContent_, "TD");
}

/// <summary>
/// Parse html by mshtml
/// </summary>
/// <param name="szHtmlContent">html source to parse</param>
/// <param name="szFilterTag">tag name to keep, for example "TD"</param>
/// <returns>innerHTML of every matching element, concatenated</returns>
private string ParserHtml(string szHtmlContent, string szFilterTag)
{
    // Load the html content into an HTMLDocumentClass.
    IHTMLDocument2 docHtml_ = new HTMLDocumentClass();
    docHtml_.write(new object[] { szHtmlContent });
    docHtml_.close();
    IHTMLElementCollection col_1_ = (IHTMLElementCollection)docHtml_.body.all;

    // Keep only the tags we need.
    IHTMLElementCollection col_2_ = (IHTMLElementCollection)col_1_.tags(szFilterTag);
    string szResult_ = "";
    IHTMLElement elem_ = null;

    // The length property gives the number of elements in the collection.
    int iCollectionLength = col_2_.length;
    for (int i = 0; i < iCollectionLength; i++)
    {
        elem_ = (IHTMLElement)col_2_.item(i, null);
        if (elem_.innerHTML == null) continue;
        szResult_ += elem_.innerHTML;
    }
    return szResult_;
}

/// <summary>
/// Downloads the html of a page; it does no parsing
/// </summary>
/// <param name="szURL">address of the page to download</param>
/// <returns>the html source as a string</returns>
private string DownloadWeb(string szURL)
{
    HttpWebRequest reqHttp_ = (HttpWebRequest)WebRequest.Create(szURL);
    reqHttp_.Timeout = 30000;

    // The using blocks release the response and the reader even if reading throws.
    using (HttpWebResponse respHttp_ = (HttpWebResponse)reqHttp_.GetResponse())
    using (StreamReader readerHtml = new StreamReader(respHttp_.GetResponseStream()))
    {
        return readerHtml.ReadToEnd();
    }
}
5. Here are some reference documents you may find useful.
IHTMLDocument2:
Gets information about the document, and examines and modifies the HTML elements and text in the document.
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
IHTMLElement:
This interface provides the ability to programmatically access the properties and methods that are common to all element objects.
http://msdn.microsoft.com/en-us/library/aa752279(VS.85).aspx
IHTMLElementCollection:
Provides access to a collection of element objects.
http://msdn.microsoft.com/en-us/library/aa703928(VS.85).aspx
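To show how the three interfaces fit together, here is a minimal sketch that touches each of them once. The class name InterfaceTour and the method Describe are hypothetical names chosen for illustration; the sketch assumes the “Microsoft.mshtml” reference from step 2 and creates the document with HTMLDocumentClass in the same way as the example in step 4.

```csharp
using System;
using mshtml;

// Hypothetical helper class, not part of mshtml.
static class InterfaceTour
{
    // Returns the document title followed by one line per matching element.
    public static string Describe(string html, string tagName)
    {
        // IHTMLDocument2: the document itself (write, close, title, all, body).
        IHTMLDocument2 doc = new HTMLDocumentClass();
        doc.write(new object[] { html });
        doc.close();
        string result = "title=" + doc.title;

        // IHTMLElementCollection: a set of elements, filtered here by tag name.
        IHTMLElementCollection all = (IHTMLElementCollection)doc.all;
        IHTMLElementCollection matches = (IHTMLElementCollection)all.tags(tagName);

        for (int i = 0; i < matches.length; i++)
        {
            // IHTMLElement: members common to every element (tagName, innerText, innerHTML).
            IHTMLElement elem = (IHTMLElement)matches.item(i, null);
            result += Environment.NewLine + elem.tagName + ": " + elem.innerText;
        }
        return result;
    }
}
```

Calling InterfaceTour.Describe(html, "P") on a small page should give the title line plus one line for each paragraph, which is a quick way to check that the reference is set up correctly.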
6. A problem you may run into.
Question:
Why are there 4 options to choose from and which one is the correct one to use?
IHTMLElement or IHTMLElement2 or IHTMLElement3 or IHTMLElement4
Answer:
IHTMLElement is the original interface, but as more functionality was added
the 2, 3 and 4 interfaces were created. All are valid, so use the one that
exposes the methods and/or properties that you require.
http://bytes.com/topic/visual-basic-net/answers/386946-ihtmlelement-question
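In practice, picking the interface usually comes down to a cast. The sketch below is one possible illustration, not the only approach: it obtains the body as IHTMLElement and then casts the same object to IHTMLElement2 to reach getElementsByTagName, which the original interface does not expose. The class name ElementInterfaces and the method CountDescendants are hypothetical.

```csharp
using mshtml;

// Hypothetical helper class, not part of mshtml.
static class ElementInterfaces
{
    // Counts the elements with the given tag name below the document body.
    public static int CountDescendants(string html, string tagName)
    {
        IHTMLDocument2 doc = new HTMLDocumentClass();
        doc.write(new object[] { html });
        doc.close();

        // The original interface: innerHTML, tagName, getAttribute and so on.
        IHTMLElement body = doc.body;

        // Same underlying object viewed through the newer interface,
        // which adds members such as getElementsByTagName.
        IHTMLElement2 body2 = (IHTMLElement2)body;
        IHTMLElementCollection found = body2.getElementsByTagName(tagName);
        return found.length;
    }
}
```

If you only need members such as innerHTML or tagName, IHTMLElement alone is enough and no cast is required.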