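I want to extract all the buttons and all the DIVs that have an id from a web page. Below is the code I wrote earlier. Its problem is that it converts the page to XML in order to analyze it, and that has two drawbacks: first, doc.LoadXml(html); is very slow; second, the HTML has to be strictly well-formed, and even slightly irregular HTML cannot be parsed as XML. Is there any way to do this other than converting to XML? Thanks!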
// Requires: using System.IO; using System.Net; using System.Xml;

// Download the page; the using blocks close the reader, stream and response.
string html;
WebRequest request = WebRequest.Create(FormUrl);
using (WebResponse response = request.GetResponse())
using (Stream resStream = response.GetResponseStream())
using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.UTF8))
{
    html = sr.ReadToEnd();
}
html = html.Replace("\r", "").Replace("\n", "").Replace(@"\", "");

// Slow, and throws an XmlException unless the HTML is well-formed XML.
XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
M_list.Clear();
ParseHtml(doc.DocumentElement);

public void ParseHtml(XmlNode node)
{
    // Enumerating an XmlNode yields its child nodes.
    foreach (XmlNode Currentnode in node)
    {
        WFFormParse fp = new WFFormParse();
        if (Currentnode.ChildNodes.Count > 0)
        {
            ParseHtml(Currentnode); // recursion
        }
        // Text and comment nodes have no attribute collection; elements without an id are skipped.
        if (Currentnode.Attributes == null || Currentnode.Attributes["id"] == null)
        {
            continue;
        }
        string id = Currentnode.Attributes["id"].InnerText;
        XmlAttribute type = Currentnode.Attributes["type"];
        XmlAttribute value = Currentnode.Attributes["value"];
        // The original dereferenced Attributes["type"] without a null check.
        if (Currentnode.Name == "input" && type != null && type.InnerText == "submit")
        {
            fp.Type = ButtonType.Button;
            fp.ID = id;
            // Fall back to the id when the button has no value attribute.
            fp.DisplayName = value != null ? value.InnerText : id;
        }
        else if (Currentnode.Name == "div")
        {
            fp.Type = ButtonType.Div;
            fp.ID = id;
            fp.DisplayName = id;
        }
        if (fp.ID != null)
        {
            M_list.Add(fp);
        }
    }
}
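What I mainly want to know now is how to analyze the HTML to get the buttons and the DIVs that have an id.
I also need to get information such as each DIV's id.
How can I do that?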
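One possible direction, given only as a rough sketch and not as a definitive answer: since only the opening `<input>` and `<div>` tags and their attributes are needed, the tags can be scanned with regular expressions from `System.Text.RegularExpressions` instead of loading the whole page into an `XmlDocument`. This does not require well-formed markup. The names `FoundElement` and `TagScanner` below are hypothetical; `FoundElement` only stands in for the `WFFormParse` class from the question. The patterns are deliberately simple and are assumptions about the input: they will misread a tag whose attribute value contains `>`, and they do not skip tags inside comments or scripts. A tolerant HTML parser library such as HtmlAgilityPack is probably the more robust choice for real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the WFFormParse class used in the question.
public class FoundElement
{
    public string Kind;        // "Button" or "Div"
    public string ID;
    public string DisplayName;
}

public static class TagScanner
{
    // Matches an opening <input ...> or <div ...> tag and captures its attribute text.
    private static readonly Regex TagRx = new Regex(
        @"<(?<name>input|div)(?=[\s/>])(?<attrs>[^>]*)>",
        RegexOptions.IgnoreCase);

    // Matches name="value", name='value' or name=value inside a tag.
    private static readonly Regex AttrRx = new Regex(
        @"(?<key>[\w-]+)\s*=\s*(?:""(?<val>[^""]*)""|'(?<val>[^']*)'|(?<val>[^\s""'>]+))");

    public static List<FoundElement> Scan(string html)
    {
        var found = new List<FoundElement>();
        foreach (Match tag in TagRx.Matches(html))
        {
            // Collect this tag's attributes; attribute names compare case-insensitively.
            var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match a in AttrRx.Matches(tag.Groups["attrs"].Value))
            {
                attrs[a.Groups["key"].Value] = a.Groups["val"].Value;
            }

            // Only elements that carry an id are of interest.
            string id;
            if (!attrs.TryGetValue("id", out id))
            {
                continue;
            }

            bool isInput = string.Equals(
                tag.Groups["name"].Value, "input", StringComparison.OrdinalIgnoreCase);
            if (isInput)
            {
                // Keep only submit buttons, as in the original code.
                string type;
                if (!attrs.TryGetValue("type", out type) ||
                    !string.Equals(type, "submit", StringComparison.OrdinalIgnoreCase))
                {
                    continue;
                }
                // Fall back to the id when there is no value attribute.
                string value;
                string display = attrs.TryGetValue("value", out value) ? value : id;
                found.Add(new FoundElement { Kind = "Button", ID = id, DisplayName = display });
            }
            else
            {
                found.Add(new FoundElement { Kind = "Div", ID = id, DisplayName = id });
            }
        }
        return found;
    }
}
```

Calling `TagScanner.Scan(html)` on the downloaded string returns one entry per matching tag, in document order, so the results could be copied into `M_list` in place of the recursive `ParseHtml` walk. The `id` of each DIV is available as the `ID` field of its entry.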