Need source code that grabs all the URLs contained in a given URL

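I'm really poor at the moment: earning points, closing the thread. Come take a look at my forum:
http://www.envanet.com
It's my own forum. Does saying so make me look like I'm touting for it?
Still, I really have put a lot of material for the senior programmer exam on it.
If you like it, give me a few points in support.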

Solutions

Recently I decided I wanted to build an Internet spider. There was no client paying for it, no boss asking for it. It’s just something I’ve always wanted to do. Thanks to the .NET Framework classes System.Net.HttpWebRequest and System.Net.HttpWebResponse, this is relatively easy. The hardest part I ran into while writing the code was scanning for HREFs on a page. I thought this might be the easiest part when I started, but in terms of lines of code per hour, this was by far the most difficult portion of my spider.

In this scenario a regular expression is needed to scan the HTML the spider has downloaded for more HREFs so it can continue crawling the web. A quick search on Google reveals numerous regular expressions for this explicit purpose, but they are all lacking. Some require the value to be in double quotes (") to return a match, others return the entire HTML tag, and none would handle all six ways to write an HREF. The six ways to write an HREF link, with it still working in a browser, are:

<a href=http://www.yahoo.com target=_blank>link</a>
<a href=http://www.yahoo.com>link</a>
<a href="http://www.yahoo.com">link</a>
<a href='http://www.yahoo.com'>link</a>
<a href="http://www.yahoo.com" >link</a>
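<a href='http://www.yahoo.com' >link</a>

Microsoft’s example regular expression, found on MSDN, only works reliably for links with double quotes: its fallback branch (\S+) runs on past a single quote or a closing > until the next white space. The pattern string (shown as it appears inside a .NET string literal, where "" stands for one double quote) looks like this:

href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))

Not very helpful. Of course we ended up having to write our own from scratch. Here is what we came up with (all on one line):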
(?:[hH][rR][eE][fF]\s*=)(?:[\s""']*)(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)(.*?)(?:[\s>""'])
I’ll break this into chunks of sub expressions.

(?:[hH][rR][eE][fF]\s*=)
The “(?:” opens a non-capturing group: the text it matches is consumed as part of the overall match, but it is not returned as a numbered group. (It is not a look-ahead; a positive look-ahead is written “(?=”.) This group looks for ‘href’ in any mix of upper and lower case, followed by any number of white space characters (‘\s*’) and an equals sign.

(?:[\s""']*)
This is another non-capturing group. Here we’re looking for any characters from the character set, zero or more times (*). A character set is defined by brackets ‘[]’. The items in this set are ‘\s’ (white space), the double quote and the single quote. The star following the closing bracket means these characters can occur zero or more times, which takes care of HREFs with nothing between the ‘=’ and the value.

(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)
This is a negative look-ahead sub expression. It verifies that whatever is in the sub expression does NOT match at the current position, without consuming any characters. ‘(?!’ is the beginning syntax of this expression, and the pipes ‘|’ essentially mean ‘OR’. Since this is a spider, and I only want to follow links that really take me someplace, I don’t want any JavaScript, or CSS files, or mailto links. This portion of the regular expression was thrown in at the last minute and probably could use a rethinking: ‘.*css’ and ‘.*this\.’ scan to the end of the line, so they also reject a good link when ‘css’ or ‘this.’ appears anywhere later on the same line. But for now it gets rid of 90% of the junk tags.

(.*?)
This captures the actual value of the HREF link and is the only group that is returned (group 1). ‘.*’ means any character any number of times, and the ‘?’ following it makes the quantifier lazy, so it matches as few characters as possible and stops at the first terminator.

(?:[\s>""'])
And all good things must come to an end. This final non-capturing group matches the character that terminates the value: a white space character, a greater than sign, a double quote or a single quote.

With this regular expression I’ve been able to match and return just about all the usable links on any particular web page. You are, of course, welcome to respond below with improvements of your own.

Example .NET Code

Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Private Sub DoIt(ByVal mStrURL As String)
    Dim sResult As String
    Dim sReturn As String = ""
    Dim sRdr As StreamReader
    Dim sPattern As String
    sPattern = "(?:[hH][rR][eE][fF]\s*=)" & _
               "(?:[\s""']*)" & _
               "(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)" & _
               "(.*?)(?:[\s>""'])"
    Try
        'Download the page and read it into one string
        sRdr = GrabHTML(mStrURL)
        sResult = sRdr.ReadToEnd()
        sRdr.Close()
        Dim MyregEx As New Regex(sPattern, RegexOptions.IgnoreCase)
        'The pattern is already in the Regex object, so Match only takes the input
        Dim m As Match = MyregEx.Match(sResult.Replace("'", """"))
        While m.Success
            If Len(m.Groups(1).Value) > 2 Then
                sReturn &= m.Groups(1).Value & vbCrLf
            End If
            m = m.NextMatch()
        End While
        'Do something with the
        'variable 'sReturn', which contains the list of HREF's
    Catch
        'Do something with your error
    End Try
End Sub

Function GrabHTML(ByVal sURL As String) As StreamReader
    Try
        Dim oReq As HttpWebRequest
        Dim oResp As HttpWebResponse
        oReq = CType(WebRequest.Create(sURL), HttpWebRequest)
        oResp = CType(oReq.GetResponse(), HttpWebResponse)
        Return New StreamReader(oResp.GetResponseStream())
    Catch
        'Do something with your error, then rethrow so the caller's Catch sees it
        Throw
    End Try
End Function

The MSDN regular expression syntax page can be found at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/jscript7/html/jsjsgrpRegExpSyntax.asp
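
For anyone who wants the same idea in C#, here is a minimal sketch. It does not use the article's pattern: it swaps in a simpler alternation with one branch per quoting style (double quoted, single quoted, bare), and filters out #, mailto: and javascript: values in code afterwards. The names HrefScanner and Extract are made up for this example, and the filter list is only an assumption about what counts as a junk link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefScanner
{
    // One alternative per quoting style: "value", 'value', or a bare value.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>""']+))",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Exactly one of the three groups takes part in each match.
            string value = m.Groups[1].Success ? m.Groups[1].Value
                         : m.Groups[2].Success ? m.Groups[2].Value
                         : m.Groups[3].Value;
            value = value.Trim();

            // Skip values a spider cannot follow.
            if (value.Length == 0) continue;
            if (value.StartsWith("#", StringComparison.Ordinal)) continue;
            if (value.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase)) continue;
            if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) continue;

            links.Add(value);
        }
        return links;
    }

    public static void Main()
    {
        // The six forms from the article, plus two links that should be skipped.
        string html = string.Join("\n", new[]
        {
            @"<a href=http://www.yahoo.com target=_blank>link</a>",
            @"<a href=http://www.yahoo.com>link</a>",
            @"<a href=""http://www.yahoo.com"">link</a>",
            @"<a href='http://www.yahoo.com'>link</a>",
            @"<a href=""http://www.yahoo.com"" >link</a>",
            @"<a href='http://www.yahoo.com' >link</a>",
            @"<a href='mailto:someone@example.com'>mail</a>",
            @"<a href=""#top"">top</a>"
        });

        // Prints http://www.yahoo.com six times.
        foreach (string link in Extract(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Filtering in code instead of inside the pattern keeps the regular expression short and makes the skip rules easy to change, at the cost of a few extra lines.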
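Do you want to display the running aspx file together with its full URL?
If so, use the following code.
VB.NET:
Dim str As String = "http://" & HttpContext.Current.Request.Url.Host & HttpContext.Current.Request.RawUrl
C#:
string str = "http://" + HttpContext.Current.Request.Url.Host + HttpContext.Current.Request.RawUrl;
This str is the address; display it with a Label or with Response.Write.
As for the result, suppose the page you are running is http://localhost/web/aspweb/testurl.aspx
Then that is the address that gets displayed.
Because the code goes through the HttpContext object, it can also be placed in a class library.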
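
One caveat with the concatenation above: it hard-codes "http://", and Url.Host carries no port number, so a page served over https or on a non-default port would be shown with the wrong address. Here is a rough sketch of an alternative. It uses a plain System.Uri in place of Request.Url (which is also a System.Uri) so that it runs outside ASP.NET; UrlParts and Rebuild are names invented for the example, and PathAndQuery is used where the original uses RawUrl, which may differ when URL rewriting is involved.

```csharp
using System;

public static class UrlParts
{
    // Rebuilds the absolute URL from scheme + host + port and path + query,
    // so https and non-default ports are preserved.
    public static string Rebuild(Uri url)
    {
        return url.GetLeftPart(UriPartial.Authority) + url.PathAndQuery;
    }

    public static void Main()
    {
        var url = new Uri("https://localhost:8080/web/aspweb/testurl.aspx?id=1");

        // The concatenation from the post: scheme and port are lost.
        Console.WriteLine("http://" + url.Host + url.PathAndQuery);
        // http://localhost/web/aspweb/testurl.aspx?id=1

        // Built from the Uri itself.
        Console.WriteLine(Rebuild(url));
        // https://localhost:8080/web/aspweb/testurl.aspx?id=1
    }
}
```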
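I think what the OP wants is to take an HTML page, find all the links on it, and display them. So the page should first be downloaded with the HttpWebRequest and HttpWebResponse classes, and the HTML then filtered with string parsing. If that really is what the OP has in mind, I can provide sample code.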
Replying to zhangbat(jim.), who thinks the goal is to find and display all the links on an HTML page:
Yes, yes.
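Not the links on this page itself.
The idea is to enter a URL on this page and, after confirming, have the page display all the URLs contained in the URL that was entered.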
:)
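
A side note for that use case: links pulled out of a downloaded page are often relative (list.asp?id=3, ../doc/a.htm), so showing them as full URLs takes one more step. A possible sketch using System.Uri is below; LinkResolver and Resolve are made-up names, and example.com stands in for whatever URL the user enters.

```csharp
using System;

public static class LinkResolver
{
    // Returns the absolute form of href relative to pageUrl,
    // or null when either value cannot be parsed.
    public static string Resolve(string pageUrl, string href)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
        {
            return null;
        }

        Uri result;
        return Uri.TryCreate(baseUri, href, out result) ? result.AbsoluteUri : null;
    }

    public static void Main()
    {
        string page = "http://example.com/bbs/index.asp";

        Console.WriteLine(Resolve(page, "list.asp?id=3")); // http://example.com/bbs/list.asp?id=3
        Console.WriteLine(Resolve(page, "../doc/a.htm"));  // http://example.com/doc/a.htm
        Console.WriteLine(Resolve(page, "/top.htm"));      // http://example.com/top.htm
    }
}
```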
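Here is a simple piece of code for downloading a page. Code for parsing the downloaded string is all over the web, and working it out yourself isn't hard either. This sample is taken from a download-tool exercise of mine, so adapt it to your own needs.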
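/// Sample function: downloads a page to a local file.
/// For simplicity it uses only the HttpWebRequest class, not sockets.
/// i, lvItems, datas, proxy and intDone are not defined in this excerpt;
/// they come from the original exercise.
private void fDownLoad()
{
    // Creating the proxy object is only needed for domain users behind a
    // Microsoft ISA Server proxy. In my exercise this data came from global
    // variables and is hard to read, so it is left out here.

    // Download the current task

    // Streams used for reading and writing
    Stream inStream = null;
    FileStream fileStream = null;
    WebResponse response = null;

    try
    {
        // Build the web request
        HttpWebRequest request;
        string strOldUri, strResponseUri;

        strOldUri = lvItems.Items[i].SubItems[3].Text;   // build your own URL here
        do
        {
            // This loop finds the final address to download
            // (some URLs go through several redirects)
            request = (HttpWebRequest)WebRequest.Create(strOldUri);
            if (datas.Proxy)
            {
                request.Proxy = proxy;
            }

            // Send the request and get the response
            response = request.GetResponse();

            // Read the final URI from the response
            strResponseUri = response.ResponseUri.ToString().Trim();
            if (strResponseUri == strOldUri)
            {
                break;
            }

            // The URL was redirected: release this response, then loop
            // to rebuild the request and response for the new address
            response.Close();
            strOldUri = strResponseUri;
        } while (true);

        // Length of the file
        long longLenOfFile = response.ContentLength;

        // Start downloading the file
        // Get the response stream
        inStream = response.GetResponseStream();

        // Create the file stream ("filename" is a placeholder);
        // FileMode.Create truncates an existing file
        fileStream = new FileStream("filename",
            FileMode.Create, FileAccess.Write);

        // Read buffer and its length
        int length = 1024;
        byte[] buffer = new byte[length];

        // Number of bytes read
        int bytesread = 0;
        long longtotal = 0;

        // Download the file
        while ((bytesread = inStream.Read(buffer, 0, length)) > 0)
        {
            // Has the operation been cancelled?
            if (lvItems.Items[i].ImageIndex != 6)
            {
                // Raise an exception to abort
                throw new Exception("Operation cancelled by user");
            }

            // Write the data to the file
            fileStream.Write(buffer, 0, bytesread);
            longtotal += bytesread;
        }
        // Download finished.
        // Once the file is downloaded, the URLs still have to be extracted from it
    }
    catch (Exception exp)
    {
        // An error occurred
        lvItems.Items[i].ImageIndex = 3;
        lvItems.Items[i].SubItems[5].Text = exp.Message;
    }
    finally
    {
        // Close the streams and the response
        if (inStream != null) { inStream.Close(); }
        if (fileStream != null) { fileStream.Close(); }
        if (response != null) { response.Close(); }
    }

    intDone++;
}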
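
For readers who just want the download step without the ListView and proxy parts, here is a cut-down sketch of the same read/write loop. It relies on the fact that HttpWebRequest follows redirects on its own by default (AllowAutoRedirect), so no manual redirect loop is written. PageDownloader, CopyStream and DownloadToFile are names invented for this example. Main only exercises the copy loop with in-memory streams, since DownloadToFile needs network access; on newer .NET versions WebRequest is marked obsolete in favor of HttpClient, so treat this as a sketch in the style of the post rather than a recommendation.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageDownloader
{
    // Copies everything from input to output in 1 KB chunks
    // and returns the number of bytes copied.
    public static long CopyStream(Stream input, Stream output)
    {
        byte[] buffer = new byte[1024];
        long total = 0;
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Downloads url into the file at path. The using blocks close the
    // response and both streams even when an exception is thrown.
    public static long DownloadToFile(string url, string path)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream inStream = response.GetResponseStream())
        using (var file = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            return CopyStream(inStream, file);
        }
    }

    public static void Main()
    {
        // Offline check of the copy loop: 3000 bytes take three reads.
        var source = new MemoryStream(new byte[3000]);
        var target = new MemoryStream();
        Console.WriteLine(CopyStream(source, target)); // 3000
    }
}
```

Once the page is on disk (or read into a string), the URL extraction mentioned in the code comments can be run over its contents.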
http://xml.sz.luohuedu.net/Content.asp