Last edited by ceirel01 on 2010-12-05 23:28:27

Solutions »

  1.   

    1. Extraction regex: (?<=<div\s+class=\\"propDescValue_p\\">)(?:(?!</div>)[\s\S])+
       string pattern = "(?<=<div\\s+class=\\\\\"propDescValue_p\\\\\">)(?:(?!</div>)[\\s\\S])+";
    2. You can then filter out all the HTML tags, i.e. replace every "<xxx>" with the empty string "".
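A minimal sketch of both steps. It assumes the HTML arrives inside an escaped string, so the quotes around the attribute value are preceded by a literal backslash (which is what the `\\"` in the regex matches). The sample input and the helper names `ExtractValue` and `StripTags` are made up for illustration, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from step 1, written as a regular C# string literal.
const string Pattern =
    "(?<=<div\\s+class=\\\\\"propDescValue_p\\\\\">)(?:(?!</div>)[\\s\\S])+";

// Hypothetical input: the attribute quotes appear as \" in the text itself.
string sample = "<div class=\\\"propDescValue_p\\\"><b>Red</b>, <i>large</i></div>";

// Step 1: take everything between the opening tag and the first </div>.
string ExtractValue(string html) => Regex.Match(html, Pattern).Value;

// Step 2: replace every "<xxx>" with the empty string.
string StripTags(string html) => Regex.Replace(html, "<[^>]+>", "");

string inner = ExtractValue(sample);
Console.WriteLine(inner);            // <b>Red</b>, <i>large</i>
Console.WriteLine(StripTags(inner)); // Red, large
```

The tag-stripping pattern `<[^>]+>` is a deliberately naive choice: it will also eat a literal `<` ... `>` pair inside text, so treat it as a starting point rather than a general HTML cleaner.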
      

  2.   

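    I don't know whose alt account the poster above is,
    but at the very least it is someone who has read 过客's blog very carefully.
    One small suggestion:
    since the goal here is to capture text, lookaround is not very efficient, and class will not necessarily be the first attribute, so you might as well write
    <div[^>]+?class=\W*propDescValue_p\b[^>]*>([\s\S]+?)</div>
    and take group 1:
    string result = Regex.Match(html,@"<div[^>]+?class=\W*propDescValue_p\b[^>]*>([\s\S]+?)</div>").Groups[1].Value;
    The
    (?!</div>)
    idiom is a nice one. I also first saw it on 过客's blog. It should only be used in a small number of situations, such as the balancing groups in his example. It is not efficient; I have tested it, and here is why.
    Every character you match is split into two parts: a "position" that is not followed by </div>, and the character actually being matched. To match one character the engine first runs the lookahead, whose result is a position, and only after it has that position does it match the expression that follows, [\s\S].
    That is less direct than .+?. .+? matches as little as possible, and after each match the engine tries the expression that follows. If what follows is </div>, then in the common case, where the next character is not <, only one comparison is attempted; when it fails, the engine backtracks into .+? and matches one more character, and this continues until a complete </div> is matched. .+? is the more reasonable choice in this situation.
    (PS: [\s\S] is tedious to write, so I am assuming single-line mode here, where . matches every character including the newline.)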
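The capture-group version can be sketched as follows. The sample HTML is invented for illustration (class is deliberately not the first attribute, and the content spans two lines), and the code assumes C# 9 top-level statements. It also shows the single-line-mode variant mentioned in the PS, using `RegexOptions.Singleline`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: class is the second attribute, content contains a newline.
string html =
    "<div id=\"p1\" class=\"propDescValue_p\">line one\nline two</div><div>other</div>";

// Match the whole element and read the content from group 1.
string viaCharClass = Regex.Match(
    html,
    @"<div[^>]+?class=\W*propDescValue_p\b[^>]*>([\s\S]+?)</div>").Groups[1].Value;

// Same pattern with '.', which only matches '\n' when Singleline is set.
string viaDot = Regex.Match(
    html,
    @"<div[^>]+?class=\W*propDescValue_p\b[^>]*>(.+?)</div>",
    RegexOptions.Singleline).Groups[1].Value;

Console.WriteLine(viaCharClass == viaDot); // True
Console.WriteLine(viaCharClass);
```

Without `RegexOptions.Singleline` the second pattern would find no match on this input, because `.` stops at the newline.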
      

  3.   

    After posting the reply above I thought it over, and it was not very rigorous. Read literally, (?:(?!</div>)[\s\S])+ and .+?</div> seem to go through the same matching process in the same order: both try </div> before consuming the next character, and for the majority of characters, which are not <, both fail on the very first comparison against <. Even so, the efficiency really does differ, so I ran a simple test.
    Test string: a single <div> whose content is the sentence
    $GPGGA,031910.00,2308.3445,N,11320.1198,E,1,11,0.8,62.4,M,-6.5,M,,*4F
    repeated back to back several dozen times, closed by </div>.
    Results, total time: 18637 ms (18.637 s) [10000 iterations]
    Expression 1: <div>(?:(?!</div>)[\s\S])+</div>
    Execution time: 11536 ms (11.536 s), 61.90% of the total
    Expression 2: <div>[\s\S]+?</div>
    Execution time: 7101 ms (7.101 s), 38.10% of the total
    I have not actually looked at how the .NET regex engine is implemented, so this is only a guess: lookaround follows the same matching order as the version without it, but the implementation probably adds overhead. So use lookaround sparingly; match the whole thing instead and return the part you want through a group.
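A rough way to repeat this kind of comparison is sketched below. It is not the original test harness: the helper `TimeMatches`, the repeat count (50) and the iteration count (1000) are arbitrary choices for this sketch, and absolute timings will vary with machine and .NET version. The code assumes C# 9 top-level statements.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

const string Sentence =
    "$GPGGA,031910.00,2308.3445,N,11320.1198,E,1,11,0.8,62.4,M,-6.5,M,,*4F";

// Build the test input: one <div> wrapping the sentence repeated 50 times.
string input = "<div>" + string.Concat(Enumerable.Repeat(Sentence, 50)) + "</div>";

var lookahead = new Regex(@"<div>(?:(?!</div>)[\s\S])+</div>");
var lazy = new Regex(@"<div>[\s\S]+?</div>");

// Runs the same match repeatedly and returns the elapsed milliseconds.
long TimeMatches(Regex regex, string text, int iterations)
{
    var watch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        regex.Match(text);
    }
    watch.Stop();
    return watch.ElapsedMilliseconds;
}

// One untimed run of each pattern; also confirms they match the same text.
bool sameResult = lookahead.Match(input).Value == lazy.Match(input).Value;

Console.WriteLine($"same match: {sameResult}");
Console.WriteLine($"lookahead:  {TimeMatches(lookahead, input, 1000)} ms");
Console.WriteLine($"lazy:       {TimeMatches(lazy, input, 1000)} ms");
```

Checking `sameResult` first matters: a timing difference only means something if both expressions return the same match.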
      

  4.   

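    The poster three replies up (me) is not an alt account. I only discovered 过客's blog last night, and have just read the first article, the one on delegates in .NET regex.
    I used lookaround because one day, in some thread on the forum, I saw an expert's approach to "contains neither xxx, nor xxx, nor xxx" and applied what I had just learned. That approach used lookaround; before that I had always used non-greedy matching.
    And in place of [\s\S], I had always used (?:.|\n).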
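For reference, the three ways of writing "any character, newline included" that come up in this thread can be checked against each other with a small sketch like this one. The input string is made up, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input whose content contains a newline.
string text = "<div>a\nb</div>";

// Character class that covers whitespace and non-whitespace.
string charClass = Regex.Match(text, @"<div>([\s\S]+?)</div>").Groups[1].Value;

// Alternation: '.' (anything but '\n') or an explicit '\n'.
string alternation = Regex.Match(text, @"<div>((?:.|\n)+?)</div>").Groups[1].Value;

// Plain '.' with single-line mode switched on.
string singleline = Regex.Match(
    text, @"<div>(.+?)</div>", RegexOptions.Singleline).Groups[1].Value;

Console.WriteLine(charClass == alternation && alternation == singleline); // True
```

All three capture the same text here; the alternation form asks the engine to try two branches per character, which is one reason the character class or single-line mode is usually preferred.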