Looking for a regular expression to extract plain text from HTML - 调试易

Looking for a regular expression to extract plain text from HTML

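This is very hard to do.
And not just ordinarily hard: if the content itself contains something like <script> and you strip it out, the extracted content is incomplete.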

Solutions »


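There is no need to handle cases that complicated.
It is enough if the extracted result is roughly what you get by pressing Ctrl+C on the page in a browser.

Many websites in China now generate their pages dynamically with JavaScript. Without a JavaScript engine, much of that content is never rendered at all, so there is nothing to Ctrl+C in the first place.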
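// Requires: using System.Text.RegularExpressions;
public static String GetNormalText(String tempValue) {
    String t = tempValue;
    // Remove <style> blocks and any stray <style> tags
    String pattern = @"(<style.*?>[\s\S]*?</style>)|(<style.*?>)";
    t = ReplaceContent(pattern, t);
    // Remove <script> blocks and any stray <script> tags
    pattern = @"(<script.*?>[\s\S]*?</script>)|(<script[\s\S]*?>)";
    t = ReplaceContent(pattern, t);
    // Remove inline event handlers embedded in the markup
    pattern = @"(onclick|onload)\s*=\s*(""[^""]*""|'[^']*')";
    t = ReplaceContent(pattern, t);
    // Remove comments. The pattern was empty in the posted code (probably lost
    // when the page was rendered); an HTML comment pattern is assumed here.
    pattern = @"<!--[\s\S]*?-->";
    t = ReplaceContent(pattern, t);
    return t;
}

public static String ReplaceContent(String pattern, String tempValue) {
    Regex reg = new Regex(pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase);
    return reg.Replace(tempValue, "");
}

I wrote this while helping someone build a scraper; it strips out JS and the like.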
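
GetNormalText removes scripts, styles, inline handlers and comments, but it leaves the remaining tags in the string. To get closer to the "roughly what Ctrl+C gives you" goal, the rest of the job might look like the sketch below. The class name HtmlText and the method ToPlainText are made up for illustration and are not from the post above. It is regex-based, so it assumes reasonably well-formed markup and will not see anything a page generates with JavaScript.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (not from the original post): HTML in, rough plain text out.
    public static string ToPlainText(string html)
    {
        // Singleline lets '.' match newlines, so multi-line blocks are caught.
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Drop script/style elements and comments together with their contents.
        string t = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", opts);
        t = Regex.Replace(t, @"<!--.*?-->", "", opts);

        // Turn <br> and the closing tags of common block elements into line breaks.
        t = Regex.Replace(t, @"<br\s*/?>|</(p|div|li|tr|h[1-6])\s*>", "\n", opts);

        // Remove every remaining tag, then decode entities such as &amp; and &nbsp;.
        t = Regex.Replace(t, @"<[^>]+>", "", opts);
        t = WebUtility.HtmlDecode(t);

        // Collapse runs of spaces/tabs, then collapse whitespace around line breaks.
        t = Regex.Replace(t, @"[ \t\u00A0]+", " ");
        t = Regex.Replace(t, @"\s*\n\s*", "\n");
        return t.Trim();
    }
}
```

For example, HtmlText.ToPlainText("<p>Hello &amp; <b>world</b></p>") gives "Hello & world". The list of block elements that become line breaks is a judgment call; extend it if the pages you scrape rely on other tags for layout.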