This is part of the HTML source of a website. What I want to do is scrape the data in the rows below:
<tr align="center" bgcolor="#EFEFEF">
<td width="8%" nowrap bgcolor="#CFDBE8" title="搜寻引擎收录情况">收录情况</td> <td width="7%"><a href='http://www.baidu.com/s?wd=site%3Awww.baidu.com&cl=3' target=_blank title='2140' rel=nofollow class=LN>2140</a></td> <td width="7%"><a href='http://www.google.cn/search?hl=zh-CN&q=site%3Awww.baidu.com' target=_blank title='36700' rel=nofollow class=LN>36700</a></td> <td width="8%"><a href='http://sitemap.cn.yahoo.com/search?bwm=p&p=www.baidu.com' target=_blank title='927' rel=nofollow class=LN>927</a></td> <td width="7%"><a href='http://www.sogou.com/web?query=site%3Awww.baidu.com' target=_blank title='870048' rel=nofollow class=LN>870048</a></td> <td width="7%"><a href='http://www.soso.com/q?w=site%3Awww.baidu.com&sc=web&ch=w.ptl&lr=chs' target=_blank title='25900' rel=nofollow class=LN>25900</a></td> </tr>
<tr align="center" bgcolor="#EFEFEF">
<td nowrap bgcolor="#CFDBE8" title="外部网站链接到你的网站">反向链接</td> <td><a href='http://www.baidu.com/s?wd=domain%3Awww.baidu.com&cl=3' target=_blank title='1900000' rel=nofollow class=LN>1900000</a></td> <td><a href='http://www.google.cn/search?hl=zh-CN&q=link%3Awww.baidu.com' target=_blank title='0' rel=nofollow class=LN>0</a></td> <td><a href='http://sitemap.cn.yahoo.com/search?p=www.baidu.com&bwm=i' target=_blank title='5175919' rel=nofollow class=LN>5175919</a></td> <td><a href='http://www.sogou.com/web?query=link%3Awww.baidu.com&num=10' target=_blank title='2939831' rel=nofollow class=LN>2939831</a></td> <td><a href='http://www.soso.com/q?w=link%3Awww.baidu.com&sc=web&ch=w.ptl&lr=chs' target=_blank title='4130' rel=nofollow class=LN>4130</a></td>
From these rows I only need the numeric values; the other attributes can be ignored. Any help is appreciated.
Solutions »
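// Wrq is a WebRequest for the target page, created before this snippet (not shown in the post).
WebResponse Wrs = Wrq.GetResponse();
Stream strm = Wrs.GetResponseStream();
StreamReader sr = new StreamReader(strm, System.Text.Encoding.GetEncoding("UTF-8"));
// Read the whole page into one string.
string allstrm = sr.ReadToEnd();
string strPattern = @"regex for the content you want";
string result = String.Empty;
MatchCollection Matches = Regex.Matches(allstrm, strPattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
foreach (Match NextMatch in Matches)
{
    // Groups[0] is the whole match. result is overwritten on each pass, so it ends up holding the last match.
    result = NextMatch.Groups[0].Value.Trim();
}
All you have to do is replace the placeholder text in strPattern with the regex you need.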
<td width="7%"><a href='http://www.baidu.com/s?wd=site%3Awww.baidu.com&cl=3' target=_blank title='2140' rel=nofollow class=LN>2140</a></td>
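I just want to extract the 2140 from this cell.
More precisely, I want to extract all the data in the two <TR> rows labeled "收录情况" (indexed pages) and "反向链接" (backlinks).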
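// The title attribute and the link text carry the same number; group 1 captures it.
Regex reg = new Regex(@"<a.*?title='(\d+)' rel=nofollow class=LN>(\1)</a>");
Match mat = reg.Match(html); // html holds the page source
while (mat.Success)
{
    // Write only the captured number. Stripping non-digits from mat.Value would also keep the digits in the href.
    Response.Write(mat.Groups[1].Value);
    mat = mat.NextMatch();
}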
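Since the goal is all the numbers in both rows, one option is to match each row as a label cell followed by its run of link cells, then pull the numbers out of that run. The following is only a minimal sketch under the assumption that the page keeps the layout shown in the question; RowNumbers and Extract are hypothetical names that do not come from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: groups the numbers by the label of their row.
public static class RowNumbers
{
    // One label cell (plain text) followed by one or more <td><a ...>digits</a></td> cells.
    private static readonly Regex RowPattern = new Regex(
        @"<td[^>]*>(?<label>[^<]+)</td>(?<cells>(?:\s*<td[^>]*><a[^>]*>\d+</a></td>)+)",
        RegexOptions.IgnoreCase);

    // The link text is the run of digits right before </a>.
    private static readonly Regex NumberPattern = new Regex(@">(\d+)</a>", RegexOptions.IgnoreCase);

    public static Dictionary<string, List<long>> Extract(string html)
    {
        var result = new Dictionary<string, List<long>>();
        foreach (Match row in RowPattern.Matches(html))
        {
            var numbers = new List<long>();
            foreach (Match n in NumberPattern.Matches(row.Groups["cells"].Value))
            {
                numbers.Add(long.Parse(n.Groups[1].Value));
            }
            // Key each list by the row label, e.g. "收录情况" or "反向链接".
            result[row.Groups["label"].Value.Trim()] = numbers;
        }
        return result;
    }
}
```

Calling RowNumbers.Extract(html) on the page source should give one list per row, in column order. If the site changes its markup, both patterns would need adjusting.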
\<td[^\>]*\>\<a[^\>]*\>.*?(?<V1>\d+).*?\</a\>\</td\>\s*
\<td[^\>]*\>\<a[^\>]*\>.*?(?<V2>\d+).*?\</a\>\</td\>\s*
\<td[^\>]*\>\<a[^\>]*\>.*?(?<V3>\d+).*?\</a\>\</td\>\s*
\<td[^\>]*\>\<a[^\>]*\>.*?(?<V4>\d+).*?\</a\>\</td\>\s*
\<td[^\>]*\>\<a[^\>]*\>.*?(?<V5>\d+).*?\</a\>\</td\>\s*
Remove the line breaks so the five lines form a single pattern; the named groups V1...V5 then hold the five numbers.
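As a rough illustration of how the five-cell pattern could be applied, the sketch below builds it by repeating the single-cell fragment and reads V1 to V5 from every match. FiveCells, BuildPattern and ReadRows are hypothetical names used only for this example, and the escaped angle brackets are written unescaped because they need no escaping in a .NET regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper around the five-cell pattern from the answer above.
public static class FiveCells
{
    // Repeats the one-cell fragment five times, naming the groups V1..V5.
    public static Regex BuildPattern()
    {
        var sb = new StringBuilder();
        for (int i = 1; i <= 5; i++)
        {
            sb.Append(@"<td[^>]*><a[^>]*>.*?(?<V" + i + @">\d+).*?</a></td>\s*");
        }
        return new Regex(sb.ToString(), RegexOptions.IgnoreCase);
    }

    // Returns one string[5] per matched run of five cells.
    public static List<string[]> ReadRows(string html)
    {
        var rows = new List<string[]>();
        foreach (Match m in BuildPattern().Matches(html))
        {
            var values = new string[5];
            for (int i = 1; i <= 5; i++)
            {
                values[i - 1] = m.Groups["V" + i].Value;
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

Each element of the returned list corresponds to one table row with five link cells, so the sample in the question should yield two entries.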
// Strips scripts, frames, images and all remaining tags from an HTML string.
// Requires: using System.Text.RegularExpressions;
public string checkStr(string html)
{
    // [\s\S] matches any character including newlines; lazy quantifiers keep each match to one element.
    Regex regex1 = new Regex(@"<script[\s\S]+?</script *>", RegexOptions.IgnoreCase);
    Regex regex2 = new Regex(@" href *= *[\s\S]*?script *:", RegexOptions.IgnoreCase);
    Regex regex3 = new Regex(@" on[\s\S]*?=", RegexOptions.IgnoreCase);
    Regex regex4 = new Regex(@"<iframe[\s\S]+?</iframe *>", RegexOptions.IgnoreCase);
    Regex regex5 = new Regex(@"<frameset[\s\S]+?</frameset *>", RegexOptions.IgnoreCase);
    Regex regex6 = new Regex(@"<img[^>]+>", RegexOptions.IgnoreCase);
    Regex regex7 = new Regex(@"</p>", RegexOptions.IgnoreCase);
    Regex regex8 = new Regex(@"<p>", RegexOptions.IgnoreCase);
    Regex regex9 = new Regex(@"<[^>]*>", RegexOptions.IgnoreCase);
    html = regex1.Replace(html, ""); // remove <script></script> blocks
    html = regex2.Replace(html, ""); // remove href=javascript: attributes (<A>)
    html = regex3.Replace(html, " _disibledevent="); // neutralize on... event handlers of other controls
    html = regex4.Replace(html, ""); // remove iframe
    html = regex5.Replace(html, ""); // remove frameset
    html = regex6.Replace(html, ""); // remove <img> tags
    html = regex7.Replace(html, ""); // remove </p>
    html = regex8.Replace(html, ""); // remove <p>
    html = regex9.Replace(html, ""); // remove any remaining tags
    html = html.Replace(" ", "");
    html = html.Replace("</strong>", "");
    html = html.Replace("<strong>", "");
    return html;
}