How do I extract the hyperlinks from fetched web page content?

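What I want to do is open every hyperlink in a web page in turn and save each one as a file.
So far I have fetched the page content. Here is part of it: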
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /><title>新浪首页</title><meta name="description" content="新浪网为全球用户24小时提供全面及时的中文资讯，内容覆盖国内外突发新闻事件、体坛赛事、娱乐时尚、产业资讯、实用信息等，设有新闻、体育、娱乐、财经、科技、房产、汽车等30多个内容频道，同时开设博客、视频、论坛等自由互动交流空间。"><meta name="stencil" content="6HtwmypggdgP1NLw7NOuQBI2TW8+CfkYCoyeB8IDbn8=" /><script type="text/javascript" src="http://i3.sinaimg.cn/home/sinaflash.js"></script><script language="javascript" type="text/javascript" src="http://d2.sina.com.cn/d1images/button/rotator.js"></script><style type="text/css">/* 全局样式 */body,ul,ol,li,p,h1,h2,h3,h4,h5,h6,form,fieldset,table,td,img,div{margin:0;padding:0;border:0;}body{background:#fff;color:#333;font-size:12px; margin-top:5px;font-family:"宋体";}ul,ol{list-style-type:none;}select,input,img,select{vertical-align:middle;}a{text-decoration:underline;}a:link{color:#009;}a:visited{color:#800080;}a:hover,a:active,a:focus{color:#c00;}.clearit{clear:both;}/* page */#page{width:950px; overflow: visible; _display:inline-block; margin:0 auto;}/* 顶部 top */.top{height:27px; position:relative; z-index:99; padding:1px; border:1px #fdd26c solid; border-bottom:1px #e1a841 solid; color:#000; background:url(http://i1.sinaimg.cn/home/deco/2008/0329/sinahome_0803_ws_001.gif) repeat-x 0 0 #fff;}.top a,.top a:visited{color:#000; text-decoration:none;}.top a:hover,.top a:active{color:#000; text-decoration:underline;}.topBlk{height:27px; overflow:hidden; _display:inline-block; background:url(http://i1.sinaimg.cn/home/deco/2008/0329/sinahome_0803_ws_001.gif) repeat-x 0 -50px
But I don't know how to pick the hyperlinks out of it!
Any help would be appreciated!

Solutions »

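Scan the text line by line: find the start of a link ("http"), find the closing double quote (") that ends it, and take everything in between.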
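The scan described above could look roughly like the following Python sketch. The function name `extract_http_links` and the `TERMINATORS` set are hypothetical. Two things go beyond the reply and are assumptions: a hit on "http" only counts when it is followed by "://" or "s://" (so `http-equiv` is skipped), and the cut is made at the first quote, angle bracket, closing parenthesis, or whitespace rather than at a double quote only, so that unquoted URLs such as those in CSS `url(...)` are not over-captured.

```python
# Characters assumed to end a URL inside HTML or CSS text (an assumption,
# wider than the single double quote mentioned in the reply).
TERMINATORS = set('"\'<>)') | set(" \t\r\n")


def extract_http_links(html):
    """Return every http:// or https:// URL found by a plain left-to-right scan."""
    links = []
    pos = 0
    while True:
        start = html.find("http", pos)
        if start == -1:
            break
        # Skip hits such as "http-equiv" that are not the start of a URL.
        if not html.startswith(("http://", "https://"), start):
            pos = start + 4
            continue
        # Walk forward until a terminating character or the end of the text.
        end = start
        while end < len(html) and html[end] not in TERMINATORS:
            end += 1
        links.append(html[start:end])
        pos = end
    return links
```

Note that this collects every URL in the text, including script and image addresses, not only the targets of `<a>` tags, so the result would still need filtering before the links are opened.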
There are two approaches:
1. If you are programming in a browser environment, you can get all the <a> tags with this method:
HRESULT getElementsByTagName(
    BSTR v,
    IHTMLElementCollection **pelColl
);
2. If you would rather treat the page as plain text, use a regular expression; it matches the URLs in one pass:
{
\b
# Match the leading part (proto://hostname, or just hostname)
(
# http://, or https:// leading part
(https?)://[-\w]+(\.\w[-\w]*)+
|
# or, try to find a hostname with more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\b
| edu\b
| biz\b
| gov\b
| in(?:t|fo)\b # .int or .info
| mil\b
| net\b
| org\b
| [a-z][a-z]\.[a-z][a-z]\b # two-letter country code
)
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with /
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"\'<>()\[\]\{\}\s\x7F-\xFF]*
(
[.!,?]+ [^.!,?;"\'<>()\[\]\{\}\s\x7F-\xFF]+
)*
)?
}ix
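
Both approaches in this reply can be sketched in Python instead of MSHTML and Perl. This is only an illustration under stated assumptions: `AnchorCollector`, `hrefs_from_anchors`, `URL_RE`, and `urls_from_text` are hypothetical names; the standard-library `html.parser` module stands in for `getElementsByTagName`; and the pattern is a port of the expression above, with the damaged character classes restored and the groups made non-capturing so that each match is returned whole.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Analogue of approach 1: visit every <a> tag and keep its href value."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_from_anchors(html):
    """Return the href of every <a> tag, in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Approach 2: Perl's {...}ix becomes re.VERBOSE | re.IGNORECASE.
URL_RE = re.compile(r"""
    \b
    (?:
        # http:// or https:// leading part
        https?://[-\w]+(?:\.\w[-\w]*)+
      |
        # or a bare hostname: sub domains, then a lowercase ending
        (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+
        (?-i: com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b
            | mil\b | net\b | org\b
            | [a-z][a-z]\.[a-z][a-z]\b )   # copied as written in the reply
    )
    (?: : \d+ )?                           # optional port number
    (?:                                    # optional path, begins with /
        /
        [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
        (?: [.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+ )*
    )?
""", re.VERBOSE | re.IGNORECASE)


def urls_from_text(text):
    """Return every substring of text that the URL pattern matches."""
    return [m.group(0) for m in URL_RE.finditer(text)]
```

The two functions answer slightly different questions: `hrefs_from_anchors` returns only link targets (including relative ones such as `/sports/`), while `urls_from_text` returns anything that looks like a URL, whether or not it sits in an `<a>` tag.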