High-points bounty: extract specific addresses from a web page, with no duplicate addresses - 调试易

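Extract all blog addresses from a given web page. The address format is: http://blog.sina.com.cn/XXXXXXXX
(For example, http://blog.sina.com.cn/abc is allowed, but http://blog.sina.com.cn/abc/13132 is not.)
The extracted address list must not contain duplicates. How can this be matched with a regular expression? Please post code; points will be awarded as soon as it is verified!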

Solutions »

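A regular expression will certainly work, but here is another approach: find every "http://" substring in the page text, record its start position, and push it onto a queue.
For each substring found, scan forward for a '/' character; if a space or line break comes first, stop and remove that entry from the queue.
For each substring where a '/' was found, keep scanning forward; if another '/' comes up, stop and remove that entry from the queue; if a space or line break comes up, record the end position and finish.
Whatever is still in the queue is the full set of addresses found.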
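The scanning idea above can be sketched without any regular expression. This is only a rough sketch under a few assumptions that are not in the description: it searches for the full http://blog.sina.com.cn/ prefix instead of a bare "http://", it treats quotes and angle brackets as URL terminators in addition to whitespace (since the input is HTML), and it drops duplicates as it goes. The names BlogUrlScanner, Extract and IsTerminator are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class BlogUrlScanner
{
    // Assumption: only addresses under this fixed prefix are of interest.
    const string Prefix = "http://blog.sina.com.cn/";

    // Assumption: these characters end a URL in HTML or plain text.
    static bool IsTerminator(char c)
    {
        return char.IsWhiteSpace(c) || c == '"' || c == '\'' || c == '<' || c == '>';
    }

    public static List<string> Extract(string text)
    {
        List<string> found = new List<string>();
        int pos = 0;
        while ((pos = text.IndexOf(Prefix, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int start = pos + Prefix.Length;
            int end = start;
            bool hasSlash = false;

            // Walk forward to the end of the URL, noting any further '/'.
            while (end < text.Length && !IsTerminator(text[end]))
            {
                if (text[end] == '/') hasSlash = true;
                end++;
            }

            // Keep only a non-empty, single-segment path, and skip duplicates.
            if (!hasSlash && end > start)
            {
                string url = text.Substring(pos, end - pos);
                if (!found.Contains(url)) found.Add(url);
            }

            // Resume scanning after the candidate just examined.
            pos = end;
        }
        return found;
    }
}
```

Calling BlogUrlScanner.Extract(pageText) returns the kept addresses in the order they first appear. A candidate with a trailing slash or a deeper path is discarded, which matches the "abc is allowed, abc/13132 is not" rule from the question.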
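Let me explain in more detail. For example:
When visiting the page http://search.blog.sina.com.cn/blog/search?q=%C8%AB%B9%FA%B0%A7%B5%BF%C8%D5&tag=u&s=1&t=keyword all of the blog addresses on that page need to be extracted, such as: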
http://blog.sina.com.cn/happyyanghongjie
http://blog.sina.com.cn/gsdxcxt
http://blog.sina.com.cn/xiangshizhongyi1
http://blog.sina.com.cn/gsdxcxt
http://blog.sina.com.cn/priscillaxu
and so on, while other non-blog addresses and blog article addresses are not wanted, such as:
http://blog.sina.com.cn/lm/z/dizhen512/index.html
http://blog.sina.com.cn/s/blog_4949a5580100999f.html
http://blog.sina.com.cn/s/blog_5149d5db01009j1h.html
http://blog.sina.com.cn/s/blog_525316fd010096fw.html
and so on. Thanks, everyone, for your answers!
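To make the wanted/unwanted distinction concrete, here is a small sketch that checks a single, already isolated URL against the rule "exactly one path segment after the host". The class and method names (BlogUrlRule, IsBlogHome) are invented for this example, and the pattern is my reading of the requirement rather than anything confirmed by the poster.

```csharp
using System.Text.RegularExpressions;

static class BlogUrlRule
{
    // One non-empty path segment after the host: no further '/', whitespace or quote.
    static readonly Regex HomePage = new Regex(
        @"^http://blog\.sina\.com\.cn/[^/\s""]+$",
        RegexOptions.IgnoreCase);

    public static bool IsBlogHome(string url)
    {
        return HomePage.IsMatch(url);
    }
}
```

With this rule, http://blog.sina.com.cn/happyyanghongjie passes, while http://blog.sina.com.cn/s/blog_4949a5580100999f.html and http://blog.sina.com.cn/lm/z/dizhen512/index.html are rejected because of the extra '/' in the path.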
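On the page OP linked, I could not find a single one of the addresses OP wants to extract, but I did find the ones OP does not want. I can't tell where the wanted ones came from.
Even with this "detailed explanation" it is still confusing. I wrote one based on my guess; OP, give it a try:
// Match blog addresses inside href="..." that have exactly one path segment.
MatchCollection mc = Regex.Matches(str, @"(?<=href="")http://blog\.sina\.com\.cn/[^""/]+(?="")", RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
    richTextBox1.Text += m.Value + "\n";
}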
OP, open that link and search the page source for the addresses below before going any further:
http://blog.sina.com.cn/happyyanghongjie
http://blog.sina.com.cn/gsdxcxt
http://blog.sina.com.cn/xiangshizhongyi1
http://blog.sina.com.cn/gsdxcxt
http://blog.sina.com.cn/priscillaxu
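Is the http://blog.sina.com.cn/ prefix fixed? Are there any markers before and after the content to extract, for example is it always inside href=""? Does the code given above meet your requirements? If not, explain where it falls short.
When asking a question, don't take for granted that everyone understands what you mean; describe the requirements in as much detail as you can.
Sorry, found them. The code above does extract them; it just doesn't filter out duplicates, so I'll revise it. It will also pick up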
http://blog.sina.com.cn/help
Unless you know about it in advance, there is no way to filter out a non-blog address like this one.
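If such reserved paths are known in advance, one possible way to drop them is a small exclusion list applied after extraction. The sketch below is hypothetical: the only reserved name mentioned in this thread is "help", so the list contains just that entry, and the names ReservedPathFilter and RemoveReserved are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

static class ReservedPathFilter
{
    const string Prefix = "http://blog.sina.com.cn/";

    // Hypothetical exclusion list; "help" is the only entry known from the thread.
    static readonly HashSet<string> Reserved =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "help" };

    public static List<string> RemoveReserved(IEnumerable<string> urls)
    {
        List<string> kept = new List<string>();
        foreach (string url in urls)
        {
            // Compare only the part after the fixed prefix.
            string name = url.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase)
                ? url.Substring(Prefix.Length)
                : url;
            if (!Reserved.Contains(name)) kept.Add(url);
        }
        return kept;
    }
}
```

Any other non-blog paths that turn up would have to be added to the list by hand, which is exactly the limitation described above.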
// Collect each matched address once, in the order it first appears.
List<string> list = new List<string>();
MatchCollection mc = Regex.Matches(yourStr, @"(?<=href="")http://blog\.sina\.com\.cn/[^""/]+(?="")", RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
    // Skip addresses that were already added.
    if (!list.Contains(m.Value))
        list.Add(m.Value);
}
foreach (string s in list)
    richTextBox2.Text += s + "\n";
Output for the link OP gave:
http://blog.sina.com.cn/help
http://blog.sina.com.cn/happyyanghongjie
http://blog.sina.com.cn/lingchen1109
http://blog.sina.com.cn/yinqiu1986
http://blog.sina.com.cn/xiaozhuaimei
http://blog.sina.com.cn/lingdingyang
http://blog.sina.com.cn/xiaoquchen
http://blog.sina.com.cn/internationalgirl
http://blog.sina.com.cn/beiyingshaonv
http://blog.sina.com.cn/depend0925
http://blog.sina.com.cn/letoucai
http://blog.sina.com.cn/yinghongtian
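For longer pages, list.Contains rescans the whole list on every match. A possible variant, sketched here with the same regular expression, tracks seen addresses in a HashSet<string> while still keeping first-seen order. The wrapper names UniqueBlogLinks and Extract are hypothetical, and HashSet<T> assumes .NET 3.5 or later.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UniqueBlogLinks
{
    public static List<string> Extract(string html)
    {
        MatchCollection mc = Regex.Matches(
            html,
            @"(?<=href="")http://blog\.sina\.com\.cn/[^""/]+(?="")",
            RegexOptions.IgnoreCase);

        HashSet<string> seen = new HashSet<string>();
        List<string> ordered = new List<string>();
        foreach (Match m in mc)
        {
            // HashSet<T>.Add returns false when the value is already present.
            if (seen.Add(m.Value)) ordered.Add(m.Value);
        }
        return ordered;
    }
}
```

The result is the same as with the List-based loop; only the duplicate check changes.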
作者:<a href="http://blog.sina.com.cn/nmxudongyu" target="_blank">许冬雨</a>
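(<a href="http://blog.sina.com.cn/nmxudongyu" target="_blank" class="c6ul">许冬雨的BLOG</a>)
root_: if you restrict the match with "作者" (author), you get the addresses without duplicates.
I don't know how to write regular expressions.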
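OK, once there is a unique marker it becomes easy:
// Only match the link that directly follows the "作者:" (author) label.
MatchCollection mc = Regex.Matches(str, @"(?<=作者:<a\s+href="")http://blog\.sina\.com\.cn/[^""/]+(?="")", RegexOptions.IgnoreCase);
foreach (Match m in mc)
{
    richTextBox2.Text += m.Value + "\n";
}
Output for the link OP gave: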
http://blog.sina.com.cn/happyyanghongjie
http://blog.sina.com.cn/jinggegewang
http://blog.sina.com.cn/JLD2005
http://blog.sina.com.cn/haidipiaochen
http://blog.sina.com.cn/zhuwenglei
http://blog.sina.com.cn/ChenAizhi
http://blog.sina.com.cn/vickyhaohao
http://blog.sina.com.cn/76jian
http://blog.sina.com.cn/xiaopiblog
http://blog.sina.com.cn/huangsexiangdai
http://blog.sina.com.cn/x85257902
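As a variation on the author-anchored pattern, here is a sketch that uses a named capture group instead of a lookbehind and runs the results through Distinct(), in case the same author shows up in more than one search result. Whether that ever happens on the real page is an assumption on my part, and the names AuthorLinks and Extract are made up for the example. It needs System.Linq (.NET 3.5 or later).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AuthorLinks
{
    // The link right after the "作者:" label is captured in the group "url".
    static readonly Regex AuthorLink = new Regex(
        @"作者:\s*<a\s+href=""(?<url>http://blog\.sina\.com\.cn/[^""/]+)""",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        return AuthorLink.Matches(html)
            .Cast<Match>()                        // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups["url"].Value)
            .Distinct()                           // guard against an author listed more than once
            .ToList();
    }
}
```

On the two sample lines quoted earlier, only the link after "作者:" is captured; the second link inside the parentheses is ignored because it has no author label in front of it.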