Scraping web page content and inserting it into a database

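I want to scrape a web page and, for each a tag, insert its href link and the content inside the tag into separate columns of a database table. The href must be an actual URL.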

There are plenty of examples online. A scraper ("thief" program) usually uses WebClient / HttpWebRequest + Regex.
http://down.chinaz.com/class/196_1.htm
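A minimal sketch of the WebClient + Regex approach mentioned above, assuming the page is UTF-8 and that "the href must be a URL" means an absolute http or https address. The class name LinkScraper and its method names are made up for illustration, and a regex is only a rough substitute for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class LinkScraper
{
    // Matches <a ... href="...">inner</a>. Good enough for simple pages only.
    private static readonly Regex AnchorRegex = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""'](?<href>[^""']*)[""'][^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Used to strip nested tags (e.g. <b>, <img>) from the link text.
    private static readonly Regex TagRegex = new Regex(@"<[^>]+>");

    // Downloads the page source; UTF-8 is an assumption about the target site.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8;
            return client.DownloadString(url);
        }
    }

    // Returns (href, text) pairs, keeping only absolute http/https links.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorRegex.Matches(html))
        {
            string href = WebUtility.HtmlDecode(m.Groups["href"].Value.Trim());

            Uri uri;
            bool isWebUrl = Uri.TryCreate(href, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
            if (!isWebUrl)
            {
                continue; // skips relative paths, javascript: links, anchors, etc.
            }

            string inner = TagRegex.Replace(m.Groups["text"].Value, string.Empty);
            string text = WebUtility.HtmlDecode(inner).Trim();
            links.Add(new KeyValuePair<string, string>(href, text));
        }
        return links;
    }
}
```

Typical use would be `var links = LinkScraper.ExtractLinks(LinkScraper.Download(pageUrl));`, where pageUrl is the address of the page to scrape. Relative links are dropped here; if they should be kept, they would first need to be resolved against the page URL.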
/// <summary>
/// Filters the page source for nodes that match a tag name and the value of one attribute.
/// </summary>
/// <param name="nodeName">Tag name to match, e.g. "a".</param>
/// <param name="attributeName">Name of the attribute to test.</param>
/// <param name="attributeValue">Value the attribute must have.</param>
/// <param name="parser">Parser created from the HTML source.</param>
/// <returns>The matching nodes, or null if parsing failed.</returns>
public NodeList getNodeListByAttribute(string nodeName, string attributeName, string attributeValue, Parser parser)
{
    NodeList nodeList = null;
    try
    {
        // Match on tag name AND attribute value.
        NodeFilter nodeFilter = new TagNameFilter(nodeName);
        NodeFilter nameFilter = new HasAttributeFilter(attributeName, attributeValue);
        AndFilter andFilter = new AndFilter(nodeFilter, nameFilter);
        nodeList = parser.ExtractAllNodesThatMatch(andFilter);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
    return nodeList;
}

// Usage (html holds the page source):
NodeList nodeList = getNodeListByAttribute("a", "attribute name", "attribute value", Parser.CreateParser(html, "utf-8"));
Addendum:
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Filters;
using Winista.Text.HtmlParser.Util;
Add a reference to Winista.HtmlParser.dll.
I forgot, one more piece is still needed:
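string html = string.Empty;
// StreamReader opens local files only; for a URL, download the page first (for example with WebClient.DownloadString).
using (StreamReader reader = new StreamReader("path to the HTML file", Encoding.GetEncoding("utf-8")))
{
    html = reader.ReadToEnd();
    // No explicit Close() is needed: the using block disposes the reader.
}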
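wdywqc's approach is basically OK.
Search Baidu for: C# 爬虫 (C# crawler)
1. Read the HTML with WebClient
2. Filter it with regular expressions
3. Store the results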
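Step 3 (storage) could look roughly like the sketch below. It assumes a table named Links with columns Href and LinkText, and a provider that accepts @name parameter markers (SQL Server style). The table, the column names, and the LinkStore class are all hypothetical, since the thread does not describe the actual schema.

```csharp
using System.Collections.Generic;
using System.Data;

public static class LinkStore
{
    // Inserts one row per (href, text) pair inside a single transaction
    // and returns the number of rows the database reports as inserted.
    public static int SaveLinks(IDbConnection connection,
                                IEnumerable<KeyValuePair<string, string>> links)
    {
        int inserted = 0;
        if (connection.State != ConnectionState.Open)
        {
            connection.Open();
        }

        using (IDbTransaction tx = connection.BeginTransaction())
        {
            foreach (KeyValuePair<string, string> link in links)
            {
                using (IDbCommand cmd = connection.CreateCommand())
                {
                    cmd.Transaction = tx;
                    // Hypothetical table and column names.
                    cmd.CommandText =
                        "INSERT INTO Links (Href, LinkText) VALUES (@href, @text)";
                    AddParameter(cmd, "@href", link.Key);
                    AddParameter(cmd, "@text", link.Value);
                    inserted += cmd.ExecuteNonQuery();
                }
            }
            tx.Commit();
        }
        return inserted;
    }

    // Parameters keep scraped text out of the SQL string (no injection, no quoting bugs).
    private static void AddParameter(IDbCommand cmd, string name, object value)
    {
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }
}
```

Writing against IDbConnection keeps the sketch independent of the database in use; any ADO.NET connection object can be passed in, for example a SqlConnection. Scraped link text is untrusted input, which is why the values go through parameters rather than string concatenation.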