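textBox2.Text = getReader.ReadToEnd();
// Inside a verbatim string (@"..."), a literal double quote must be written as "".
Regex reg = new Regex(@"(?is)(?<=<div class=""msgCnt"">).*?(?=</div>)");
textBox1.Clear();
foreach (Match m in reg.Matches(textBox2.Text))
    textBox1.AppendText(m.Value + Environment.NewLine); // append each match instead of overwriting the previous one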
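The snippet above trips over two things: quote escaping inside a verbatim string, and keeping every match rather than only the last one. A minimal console-friendly sketch of the same idea follows. The class name `MsgExtractor` and method name `ExtractAll` are made up for illustration; the pattern is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MsgExtractor
{
    // In a verbatim string (@"..."), a literal double quote is written as "".
    private static readonly Regex MsgCnt = new Regex(
        @"(?is)(?<=<div class=""msgCnt"">).*?(?=</div>)");

    // Returns the inner content of every <div class="msgCnt"> in document order.
    public static List<string> ExtractAll(string html)
    {
        var results = new List<string>();
        foreach (Match m in MsgCnt.Matches(html))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

In a WinForms handler this could be used as `textBox1.Text = string.Join(Environment.NewLine, MsgExtractor.ExtractAll(textBox2.Text));`, assuming .NET 4 or later for that `string.Join` overload.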
{ // Remove scripts
Htmlstring = Regex.Replace(Htmlstring, @"([\r\n])[\s]+", "", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"<script[^>]*?>.*?</script>", "", RegexOptions.IgnoreCase);
// Remove HTML tags
Htmlstring = Regex.Replace(Htmlstring, @"<(.[^>]*)>", "", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"([\r\n])[\s]+", "", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"-->", "", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"<!--.*", "", RegexOptions.IgnoreCase);
//Htmlstring = Regex.Replace(Htmlstring,@"<A>.*</A>","");
//Htmlstring = Regex.Replace(Htmlstring,@"<[a-zA-Z]*=\.[a-zA-Z]*\?[a-zA-Z]+=\d&\w=%[a-zA-Z]*|[A-Z0-9]","");
Htmlstring = Regex.Replace(Htmlstring, @"&(quot|#34);", "\"", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(amp|#38);", "&", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(lt|#60);", "<", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(gt|#62);", ">", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(nbsp|#160);", " ", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(iexcl|#161);", "\xa1", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(cent|#162);", "\xa2", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(pound|#163);", "\xa3", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&(copy|#169);", "\xa9", RegexOptions.IgnoreCase);
Htmlstring = Regex.Replace(Htmlstring, @"&#(\d+);", "", RegexOptions.IgnoreCase);
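// String.Replace returns a new string, so the result has to be assigned back.
Htmlstring = Htmlstring.Replace("<", "");
Htmlstring = Htmlstring.Replace(">", "");
Htmlstring = Htmlstring.Replace("\r\n", "");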
//Htmlstring=HttpContext.Current.Server.HtmlEncode(Htmlstring).Trim();
return Htmlstring; }
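For comparison, here is a shorter sketch of the same tag-stripping idea. It is not the poster's method: it swaps the hand-written entity table for `System.Net.WebUtility.HtmlDecode` (available from .NET Framework 4), so numeric entities are decoded rather than deleted. The names `HtmlText` and `Strip` are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string Strip(string html)
    {
        // Remove script blocks with their contents; Singleline lets '.' cross line breaks.
        string s = Regex.Replace(html, @"<script[^>]*>.*?</script>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Remove HTML comments.
        s = Regex.Replace(s, @"<!--.*?-->", "", RegexOptions.Singleline);
        // Remove all remaining tags.
        s = Regex.Replace(s, @"<[^>]+>", "");
        // Decode entities such as &amp; and &quot; in one pass.
        s = WebUtility.HtmlDecode(s);
        // Collapse runs of whitespace into a single space.
        s = Regex.Replace(s, @"\s+", " ");
        return s.Trim();
    }
}
```

Regex-based stripping like this is only a rough filter; malformed markup or a `>` inside an attribute value can still confuse it.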
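MatchCollection mc = re.Matches("text"); // re is your Regex; pass the page source instead of "text"
foreach (Match ma in mc)
{
    // ma.Value holds the text of each match
}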
That doesn't work. What gets extracted is: ||/aiq燕/aiq :||哼、谁管你 :||銨瀞鍀瘋籽 :||876052426 :太狠了 笑的我心情好多了奶奶的搞笑【笑死我了。连着看了8遍】<a target="_blank" href="http://url.cn/32n2Gs" class="ico_video" shorturl="32n2Gs" reltitle="笑死我了。连着看了8遍">http://url.cn/32n2Gs</a>
It works much like JavaScript. The main class is HTMLDocumentClass, which is similar to the HTML document object.
string path = @"E:\t.txt";
string str = File.ReadAllText(path, Encoding.GetEncoding("gb2312")); // replace with your own string
Regex reg = new Regex(@"(?<=<div[^>]*?class=""msgCnt"">)(?:(?!</?div).)*");
Console.WriteLine(reg.Match(str).Value);
// Output: 提取网页文字咋这么男
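If the post contains anything else, nothing gets extracted. What I mainly want is the messages posted to the microblog from a phone, so the content keeps changing. I tried again after a while and the extracted result was wrong, so the regex still doesn't work. Every time I post from my phone, the post is followed by "刚刚" ("just now"); after a few minutes, "刚刚" changes to "几分钟之前" ("a few minutes ago"). It looks like this:
title="孙志炜(@szw520yun)">孙志炜</a>:</strong></div><div class="msgCnt">提取好难过啊</div><div class="pubInfo"> <span class="left"> <a class="time" target="_blank" href="http://t.qq.com/p/t/50553073201528" from="3">刚刚</a> <a href="http://t.qq.com" class="f" target="_blank">来自腾讯微博</a>
Could each extraction check for "刚刚" and extract only the message text that is followed by "刚刚"?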
string str = File.ReadAllText(@"E:\t.txt", Encoding.GetEncoding("gb2312"));
Regex reg = new Regex(@"(?<=<div[^>]*?class=""msgCnt"">)((?:(?!</?div).)*)</div>.*?<a[^>]*?>\s*刚刚.*?</a>");
Console.WriteLine(reg.Match(str).Groups[1].Value);
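Because the time label changes from "刚刚" to other text as the post ages, one option is to capture each message together with its label and filter afterwards. The sketch below assumes the markup always looks like the sample in this thread (a `msgCnt` div followed by an `<a class="time">` link); the names `WeiboPosts`, `Extract`, and `WithLabel` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WeiboPosts
{
    // Group "msg" is the message body; group "time" is the label of the
    // first time link that follows it, before the next msgCnt div starts.
    private static readonly Regex Post = new Regex(
        @"<div[^>]*?class=""msgCnt"">(?<msg>(?:(?!</?div).)*)</div>" +
        @"(?:(?!class=""msgCnt"").)*?" +
        @"<a[^>]*?class=""time""[^>]*>\s*(?<time>[^<]*?)\s*</a>",
        RegexOptions.Singleline);

    // Returns (message, time label) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var posts = new List<KeyValuePair<string, string>>();
        foreach (Match m in Post.Matches(html))
        {
            posts.Add(new KeyValuePair<string, string>(
                m.Groups["msg"].Value, m.Groups["time"].Value));
        }
        return posts;
    }

    // Returns only the messages whose time label equals the given text, e.g. "刚刚".
    public static List<string> WithLabel(string html, string label)
    {
        var messages = new List<string>();
        foreach (var post in Extract(html))
        {
            if (post.Value == label) messages.Add(post.Key);
        }
        return messages;
    }
}
```

Calling `WeiboPosts.WithLabel(str, "刚刚")` would then give only the posts still marked "just now", while `Extract` keeps older posts available with their own labels.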