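C#: extracting QQ Weibo (Tencent Weibo) posts, another regex question... If anyone knows how, please help. I have already downloaded the page source; all that is left is the extraction.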

Last edited by sunzh1wei on 2011-08-11 11:08:18

Solutions »


(?is)(?<=<div class="msgCnt">).*?(?=</div>)
Why does this give the error: The name "msgCnt" does not exist in the current context? Could you check whether I wrote it correctly:
            System.IO.StreamReader getReader = new System.IO.StreamReader(this.webBrowser1.DocumentStream, System.Text.Encoding.GetEncoding("utf-8"));
            textBox2.Text= getReader.ReadToEnd();
            Regex reg = new Regex(@"(?is)(?<=<div class="msgCnt">).*?(?=</div>)");
            foreach (Match m in reg.Matches(textBox2.Text))
                textBox1.Text= m.Value;
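The compiler error most likely comes from the unescaped quotes: inside a verbatim `@"..."` string a literal double quote has to be written as `""`, otherwise the string ends at `class=` and `msgCnt` is parsed as an identifier. Below is a minimal sketch of the same extraction with the quotes doubled. The class and method names (`MsgCntExtractor`, `ExtractAll`) are made up for illustration, and the sketch collects every match instead of overwriting `textBox1.Text` on each loop iteration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class MsgCntExtractor
{
    public static List<string> ExtractAll(string html)
    {
        // In a verbatim string, "" stands for one literal double quote.
        var reg = new Regex(@"(?is)(?<=<div class=""msgCnt"">).*?(?=</div>)");

        var results = new List<string>();
        foreach (Match m in reg.Matches(html))
        {
            results.Add(m.Value); // keep every match, not only the last one
        }
        return results;
    }
}
```

In the form, this could be used as `textBox1.Text = string.Join(Environment.NewLine, MsgCntExtractor.ExtractAll(textBox2.Text));`.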
public static string DelHTML(string Htmlstring)
        {
            // Collapse line breaks followed by whitespace, then remove scripts
            Htmlstring = Regex.Replace(Htmlstring, @"([\r\n])[\s]+", "", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"<script[^>]*?>.*?</script>", "", RegexOptions.IgnoreCase);
            // Remove HTML tags
            Htmlstring = Regex.Replace(Htmlstring, @"<(.[^>]*)>", "", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"([\r\n])[\s]+", "", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"-->", "", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"<!--.*", "", RegexOptions.IgnoreCase);
            //Htmlstring = Regex.Replace(Htmlstring,@"<A>.*</A>","");
            //Htmlstring = Regex.Replace(Htmlstring,@"<[a-zA-Z]*=\.[a-zA-Z]*\?[a-zA-Z]+=\d&\w=%[a-zA-Z]*|[A-Z0-9]","");
            Htmlstring = Regex.Replace(Htmlstring, @"&(quot|#34);", "\"", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(amp|#38);", "&", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(lt|#60);", "<", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(gt|#62);", ">", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(nbsp|#160);", " ", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(iexcl|#161);", "\xa1", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(cent|#162);", "\xa2", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(pound|#163);", "\xa3", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&(copy|#169);", "\xa9", RegexOptions.IgnoreCase);
            Htmlstring = Regex.Replace(Htmlstring, @"&#(\d+);", "", RegexOptions.IgnoreCase);
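            // Strings are immutable: Replace returns a new string, so assign the result back
            Htmlstring = Htmlstring.Replace("<", "");
            Htmlstring = Htmlstring.Replace(">", "");
            Htmlstring = Htmlstring.Replace("\r\n", "");
            //Htmlstring=HttpContext.Current.Server.HtmlEncode(Htmlstring).Trim();
            return Htmlstring;
        }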
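As a possible alternative to the chain of entity replacements in `DelHTML`, the framework's `System.Net.WebUtility.HtmlDecode` (available from .NET 4) decodes named and numeric entities in one call. This is a swapped-in technique rather than what the poster wrote, and it behaves differently in one respect: numeric entities such as `&#169;` are decoded instead of deleted. The wrapper name `EntityDecoder` is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical wrapper, shown only to illustrate the framework call.
public static class EntityDecoder
{
    public static string Decode(string text)
    {
        // Handles &lt; &gt; &amp; &quot; and numeric entities such as &#169;.
        return WebUtility.HtmlDecode(text);
    }
}
```

Calling it after the tags have been stripped avoids reintroducing `<` and `>` characters that a later tag-removal step would then delete.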
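Regex re = new Regex("(?is)(?<=<div class=\"msgCnt\">).*?(?=</div>)", RegexOptions.None);
MatchCollection mc = re.Matches("text"); // "text" is a placeholder; pass the page source here
foreach (Match ma in mc)
{
    // ma.Value is the inner content of one msgCnt div
}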
That doesn't work. What gets extracted is: ||/aiq燕/aiq :||哼、谁管你  :||銨瀞鍀瘋籽   :||876052426  :太狠了笑的我心情好多了奶奶的搞笑【笑死我了。连着看了8遍】<a target="_blank" href="http://url.cn/32n2Gs" class="ico_video" shorturl="32n2Gs"  reltitle="笑死我了。连着看了8遍">http://url.cn/32n2Gs</a>
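If part of the problem is the leftover `<a ...>` markup inside the extracted text, one option is to run a tag-stripping replace over each extracted fragment, in the same spirit as the tag removal in `DelHTML`. A rough sketch, with the hypothetical name `TagStripper`; it assumes attribute values contain no `>` characters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for cleaning one extracted fragment.
public static class TagStripper
{
    public static string StripTags(string fragment)
    {
        // Remove anything that looks like <...>; the text between tags is kept.
        return Regex.Replace(fragment, @"<[^>]+>", "");
    }
}
```

Applied to the tail of the output above, this keeps the post text and the visible URL but drops the anchor element around it.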
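Use Microsoft.mshtml. It is Microsoft's own library, and it is fast and convenient.
Working with it feels similar to JavaScript. The main class is HTMLDocumentClass, which resembles the HTML document object.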
            string path = @"E:\t.txt";
            string str = File.ReadAllText(path, Encoding.GetEncoding("gb2312")); // replace with your own string
            Regex reg = new Regex(@"(?<=<div[^>]*?class=""msgCnt"">)(?:(?!</?div).)*");
            Console.WriteLine(reg.Match(str).Value);
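// Why is extracting text from a web page so hard?
The problem is that as soon as I post something different, the extraction stops working. What I mainly want is the text I post to the microblog from my phone, so the content changes every time. I tried again a little while after posting and the result was wrong, so the regex still isn't right. Every time I post from my phone, the post is followed by the label "刚刚" ("just now"); a few minutes later "刚刚" turns into "几分钟之前" ("a few minutes ago"). It looks like this:
title="孙志炜(@szw520yun)">孙志炜</a>:</strong></div><div class="msgCnt">提取好难过啊</div><div class="pubInfo">      <span class="left">        <a class="time" target="_blank" href="http://t.qq.com/p/t/50553073201528" from="3">刚刚</a> <a href="http://t.qq.com" class="f" target="_blank">来自腾讯微博</a>
Could the extraction check for "刚刚" each time and only pull out the text of the post that is followed by "刚刚"?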
// Give this a try; replace str with your own string
            string str = File.ReadAllText(@"E:\t.txt", Encoding.GetEncoding("gb2312"));
            Regex reg = new Regex(@"(?<=<div[^>]*?class=""msgCnt"">)((?:(?!</?div).)*)</div>.*?<a[^>]*?>\s*刚刚.*?</a>");
            Console.WriteLine(reg.Match(str).Groups[1].Value);
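One caveat with the pattern above: the `.*?` between `</div>` and the "刚刚" link can run past the current post, so when an older post precedes the "刚刚" one on the same line, the older post's text may be captured instead. A possible variation, sketched here under the assumption that every post's timestamp link carries `class="time"` as in the sample markup, captures each post together with its own time label and filters afterwards. `JustNowExtractor` and `ExtractByTimeLabel` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes the markup shape shown in the sample above.
public static class JustNowExtractor
{
    private static readonly Regex PostPattern = new Regex(
        // Group 1: text of the msgCnt div (no nested div allowed).
        @"<div[^>]*?class=""msgCnt"">((?:(?!</?div).)*)</div>" +
        // Skip ahead, but never across the start of another post.
        @"(?:(?!class=""msgCnt"").)*?" +
        // Group 2: the label inside this post's time link.
        @"<a[^>]*?class=""time""[^>]*>\s*([^<]*?)\s*</a>",
        RegexOptions.Singleline);

    public static List<string> ExtractByTimeLabel(string html, string label)
    {
        var results = new List<string>();
        foreach (Match m in PostPattern.Matches(html))
        {
            // Keep only the posts whose time label equals the requested one.
            if (m.Groups[2].Value == label)
            {
                results.Add(m.Groups[1].Value);
            }
        }
        return results;
    }
}
```

For example, `JustNowExtractor.ExtractByTimeLabel(str, "刚刚")` returns only the posts still labelled "刚刚", and passing a different label selects older posts the same way.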