Has anyone used htmlparser? Please help

Use regular expressions: java.util.regex.*;

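This package has a rich API. Here is one idea:
Page p=new Page();
p.setBaseUrl("http://localhost");
String url=p.getAbsoluteURL("index.html"); // result: http://localhost/index.html
If the URL in the source is already a complete URL (http://...), you will need to handle the string a little differently. Good luck!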

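Here is the situation: I want to first convert every link on the page to an absolute path, and then prepend a string to that absolute path.
For example, the HTML source of the page http://siteURL/index.html contains
<a href="d/hello.jsp">hello</a>
First, replace the relative path with its absolute path:
<a href="http://siteURL/d/hello.jsp">hello</a>
Then prepend a string to the absolute path, for example "http://localhost/test?url=". The result is
<a href="http://localhost/test?url=http://siteURL/d/hello.jsp">
In this way, every link in the original index.html is replaced, and then the page is output.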
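The two steps described in this post (resolve to an absolute URL, then prepend a string) can be sketched with the JDK alone, independent of htmlparser. This is only an illustration: the class name LinkRewriter and the method rewrite are hypothetical, and java.net.URI is used here in place of any htmlparser call. The prefix is concatenated without URL-encoding, to match the result shown in the post.

```java
import java.net.URI;

// Hypothetical helper, not part of htmlparser.
final class LinkRewriter {
    /**
     * Resolves link against the URL of the page it appears on,
     * then prepends prefix to the resulting absolute URL.
     * A link that is already absolute comes back from resolve() unchanged.
     */
    static String rewrite(String pageUrl, String link, String prefix) {
        String absolute = URI.create(pageUrl).resolve(link).toString();
        return prefix + absolute;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://siteURL/index.html",
                "d/hello.jsp", "http://localhost/test?url="));
    }
}
```

Note that URI.create throws IllegalArgumentException for links containing characters that are illegal in a URI (an unescaped space, for instance), so real pages would probably need some cleanup first.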

In the method you are using, get a reference to the Page object and add:
p.setBaseUrl("full URL of the current page");
String url=p.getAbsoluteURL(linkUrl.getLink());
p.setUrl("http://localhost/test?url="+url); // this is the key step
Alternatively, you can use the set/get methods of the Attribute object.

wssg(^_^) is right.
That is what I had in mind too: you can use a link filter to collect the links and then process them all in one place.

to wssg(^_^): The Page object has no setBaseUrl method... The htmlparser version I am using is htmlparser1_4_20040727.

How can that be?
If you are working with a reference to the page, it also works without p.setBaseUrl("full URL of the current page");.
Try removing it.

to wssg(^_^):
The Page object has neither a setBaseUrl method nor a getAbsoluteURL method. Also, in the line String url=p.getAbsoluteURL(linkUrl.getLink()); what kind of object is linkUrl?

import org.htmlparser.lexer.Page;

to wssg(^_^): import org.htmlparser.lexer.Page; is correct. Please take a look at the doc. I list its methods below, and the two you mentioned are not among them.
org.htmlparser.lexer
Class Page
Method Summary
int column(Cursor cursor)
          Get the column number for a cursor.
int column(int position)
          Get the column number for a cursor.
String findCharset(String name, String _default)
          Lookup a character set name.
char getCharacter(Cursor cursor)
          Read the character at the cursor position.
String getCharset(String content)
          Get a CharacterSet name corresponding to a charset parameter.
URLConnection getConnection()
          Get the connection, if any.
String getContentType()
          Try and extract the content type from the HTTP header.
String getEncoding()
          Get the current encoding being used.
String getLine(Cursor cursor)
          Get the text line the position of the cursor lies on.
String getLine(int position)
          Get the text line the position of the cursor lies on.
LinkProcessor getLinkProcessor()
          Get the link processor associated with this page.
Source getSource()
          Get the source this page is reading from.
String getText()
          Get all text read so far from the source.
String getText(int start, int end)
          Get the text identified by the given limits.
void getText(StringBuffer buffer)
          Put all text read so far from the source into the given buffer.
void getText(StringBuffer buffer, int start, int end)
          Put the text identified by the given limits into the given buffer.
String getUrl()
          Get the URL for this page.
void reset()
          Reset the page by resetting the source of characters.
int row(Cursor cursor)
          Get the line number for a cursor.
int row(int position)
          Get the line number for a cursor.
void setConnection(URLConnection connection)
          Set the URLConnection to be used by this page.
void setEncoding(String character_set)
          Begins reading from the source with the given character set.
void setLinkProcessor(LinkProcessor processor)
          Set the link processor associated with this page.
void setUrl(String url)
          Set the URL for this page.
String toString()
          Display some of this page as a string.

to wssg(^_^): So it is a version difference. I will go and try it now.

Download it again, then try my approach. The newer version's Page has these methods:
Method Summary
int column(Cursor cursor)
          Get the column number for a cursor.
int column(int position)
          Get the column number for a cursor.
URL constructUrl(String link, String base)
          Build a URL from the link and base provided.
String findCharset(String name, String _default)
          Lookup a character set name.
String getAbsoluteURL(String link)
          Create an absolute URL from a relative link.
String getBaseUrl()
          Gets the baseUrl.
char getCharacter(Cursor cursor)
          Read the character at the cursor position.
String getCharset(String content)
          Get a CharacterSet name corresponding to a charset parameter.
URLConnection getConnection()
          Get the connection, if any.
String getContentType()
          Try and extract the content type from the HTTP header.
String getEncoding()
          Get the current encoding being used.
String getLine(Cursor cursor)
          Get the text line the position of the cursor lies on.
String getLine(int position)
          Get the text line the position of the cursor lies on.
Source getSource()
          Get the source this page is reading from.
String getText()
          Get all text read so far from the source.
String getText(int start, int end)
          Get the text identified by the given limits.
void getText(StringBuffer buffer)
          Put all text read so far from the source into the given buffer.
void getText(StringBuffer buffer, int start, int end)
          Put the text identified by the given limits into the given buffer.
String getUrl()
          Get the URL for this page.
void reset()
          Reset the page by resetting the source of characters.
int row(Cursor cursor)
          Get the line number for a cursor.
int row(int position)
          Get the line number for a cursor.
void setBaseUrl(String url)
          Sets the baseUrl.
void setConnection(URLConnection connection)
          Set the URLConnection to be used by this page.
void setEncoding(String character_set)
          Begins reading from the source with the given character set.
void setUrl(String url)
          Set the URL for this page.
String toString()
          Display some of this page as a string.

No need to go to that much trouble; this is easy to do with regular expressions. Assuming your HTM file is on D:

import java.io.*;
import java.util.regex.*;

public class Test {
    public static void main(String[] args) throws Exception {
        // Group 1: tag and attribute up to the opening quote; group 3: the URL; group 4: the closing quote.
        Pattern reg = Pattern.compile("(<(a\\s+href|img\\s+src)\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);
        // BufferedReader replaces the deprecated DataInputStream.readLine().
        try (BufferedReader in = new BufferedReader(new FileReader("d:\\tmp.htm"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Prefix every matched URL on the line and keep the rest of the line intact.
                Matcher findUrl = reg.matcher(line);
                line = findUrl.replaceAll("$1http://localhost/$3$4");
                System.out.println(line);
            }
        }
    }
}
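The regex answer above prepends a fixed string but does not resolve relative paths first. A possible variation that does both, as the original poster asked, is sketched below. It relies only on the JDK; the class name RegexLinkRewriter, the method rewriteAll, and the exact pattern are assumptions made for illustration, and a regex like this will not cope with every form of real-world HTML (single-quoted or unquoted attributes, for example).

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
final class RegexLinkRewriter {
    // Group 1: the tag and attribute up to the opening quote; group 2: the URL.
    private static final Pattern LINK = Pattern.compile(
            "(<(?:a|img)\\b[^>]*?\\b(?:href|src)\\s*=\\s*\")([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Resolves each double-quoted href/src of a and img tags against pageUrl, then prepends prefix. */
    static String rewriteAll(String html, String pageUrl, String prefix) {
        URI base = URI.create(pageUrl);
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String absolute = base.resolve(m.group(2)).toString();
            // quoteReplacement stops '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + prefix + absolute + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling RegexLinkRewriter.rewriteAll(html, "http://siteURL/index.html", "http://localhost/test?url=") on the page source would then rewrite the links in one pass over the whole string rather than line by line.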

to wssg(^_^): Your approach sets the absolute URL of the page itself, not the URLs of all the links contained in the page's source, right?

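p.setUrl("http://localhost/test?url="+url)
That is wrong; I made a mistake.
Get a reference to the Tag object as well. After filtering out the tags you need, change p.setUrl("http://localhost/test?url="+url) to:
tag.setAttribute("href","http://localhost/test?url="+url);
For an IMG tag, change "href" to "src", and so on for other tags.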
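The "and so on" could be captured in a small lookup table from tag name to the attribute that carries its URL. The sketch below is an assumption about how one might organize that, not something taken from htmlparser; the class name UrlAttributes and the method forTag are hypothetical, and the table lists only a handful of common HTML tags.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of htmlparser.
final class UrlAttributes {
    // Common HTML tags and the attribute that holds their URL.
    private static final Map<String, String> URL_ATTRIBUTE = Map.of(
            "a", "href", "area", "href", "link", "href",
            "img", "src", "script", "src", "frame", "src", "iframe", "src",
            "form", "action");

    /** Returns the URL-carrying attribute for a tag name, or null if the tag is not in the table. */
    static String forTag(String tagName) {
        return URL_ATTRIBUTE.get(tagName.toLowerCase(Locale.ROOT));
    }
}
```

With such a table, the attribute name passed to setAttribute can be looked up from the tag name instead of being hard-coded per tag type.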

