to wssg(^_^): import org.htmlparser.lexer.Page; is correct. Please check the docs. I have listed the class's methods below, and the two you mentioned are not among them.

org.htmlparser.lexer
Class Page
Method Summary
int column(Cursor cursor)
    Get the column number for a cursor.
int column(int position)
    Get the column number for a cursor.
String findCharset(String name, String _default)
    Lookup a character set name.
char getCharacter(Cursor cursor)
    Read the character at the cursor position.
String getCharset(String content)
    Get a CharacterSet name corresponding to a charset parameter.
URLConnection getConnection()
    Get the connection, if any.
String getContentType()
    Try and extract the content type from the HTTP header.
String getEncoding()
    Get the current encoding being used.
String getLine(Cursor cursor)
    Get the text line the position of the cursor lies on.
String getLine(int position)
    Get the text line the position of the cursor lies on.
LinkProcessor getLinkProcessor()
    Get the link processor associated with this page.
Source getSource()
    Get the source this page is reading from.
String getText()
    Get all text read so far from the source.
String getText(int start, int end)
    Get the text identified by the given limits.
void getText(StringBuffer buffer)
    Put all text read so far from the source into the given buffer.
void getText(StringBuffer buffer, int start, int end)
    Put the text identified by the given limits into the given buffer.
String getUrl()
    Get the URL for this page.
void reset()
    Reset the page by resetting the source of characters.
int row(Cursor cursor)
    Get the line number for a cursor.
int row(int position)
    Get the line number for a cursor.
void setConnection(URLConnection connection)
    Set the URLConnection to be used by this page.
void setEncoding(String character_set)
    Begins reading from the source with the given character set.
void setLinkProcessor(LinkProcessor processor)
    Set the link processor associated with this page.
void setUrl(String url)
    Set the URL for this page.
String toString()
    Display some of this page as a string.
to wssg(^_^): So it was a version difference. I will go and try it now.
Download a fresh copy, then try it my way.

Method Summary
int column(Cursor cursor)
    Get the column number for a cursor.
int column(int position)
    Get the column number for a cursor.
URL constructUrl(String link, String base)
    Build a URL from the link and base provided.
String findCharset(String name, String _default)
    Lookup a character set name.
String getAbsoluteURL(String link)
    Create an absolute URL from a relative link.
String getBaseUrl()
    Gets the baseUrl.
char getCharacter(Cursor cursor)
    Read the character at the cursor position.
String getCharset(String content)
    Get a CharacterSet name corresponding to a charset parameter.
URLConnection getConnection()
    Get the connection, if any.
String getContentType()
    Try and extract the content type from the HTTP header.
String getEncoding()
    Get the current encoding being used.
String getLine(Cursor cursor)
    Get the text line the position of the cursor lies on.
String getLine(int position)
    Get the text line the position of the cursor lies on.
Source getSource()
    Get the source this page is reading from.
String getText()
    Get all text read so far from the source.
String getText(int start, int end)
    Get the text identified by the given limits.
void getText(StringBuffer buffer)
    Put all text read so far from the source into the given buffer.
void getText(StringBuffer buffer, int start, int end)
    Put the text identified by the given limits into the given buffer.
String getUrl()
    Get the URL for this page.
void reset()
    Reset the page by resetting the source of characters.
int row(Cursor cursor)
    Get the line number for a cursor.
int row(int position)
    Get the line number for a cursor.
void setBaseUrl(String url)
    Sets the baseUrl.
void setConnection(URLConnection connection)
    Set the URLConnection to be used by this page.
void setEncoding(String character_set)
    Begins reading from the source with the given character set.
void setUrl(String url)
    Set the URL for this page.
String toString()
    Display some of this page as a string.
There is no need to go to that much trouble; this is easy to do with a regular expression. Assuming your HTM file is on D:

import java.io.*;
import java.util.regex.*;

public class Test {
    public static void main(String[] args) throws Exception {
        // Group 1: tag opening up to the quote, group 3: the URL, group 4: closing quote.
        // [^"]* stops at the closing quote, so several links on one line are handled.
        Pattern reg = Pattern.compile("(<(a href|img src)\\s*=\\s*\")([^\"]*)(\")");
        // BufferedReader replaces the deprecated DataInputStream.readLine().
        BufferedReader in = new BufferedReader(new FileReader("d:\\tmp.htm"));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher findUrl = reg.matcher(line);
            // Prefix every matched URL on this line with http://localhost/
            line = findUrl.replaceAll("$1http://localhost/$3$4");
            System.out.println(line);
        }
        in.close();
    }
}
to wssg(^_^): Your approach sets the absolute URL of the page itself, rather than the URLs of all the links contained in the page's source, doesn't it?
Page p = new Page();
p.setBaseUrl("http://localhost");
String url = p.getAbsoluteURL("index.html"); // result: http://localhost/index.html
If a URL in the source is already a complete URL (http://...), you will need to handle the string a little differently. Good luck!
String url = p.getAbsoluteURL(linkUrl.getLink());
p.setUrl("http://localhost/test?url=" + url); // this is the key step
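For anyone who wants to check the resolve-then-wrap idea without HTMLParser on the classpath, here is a minimal sketch using only java.net.URI from the JDK. The class and method names (LinkRewriteSketch, toAbsolute, toProxied) are made up for illustration, and URL-encoding the target before appending it to http://localhost/test?url= is my own assumption, not something the posts above do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

class LinkRewriteSketch {
    // Hypothetical helper: resolve a link from the page against the page's URL.
    // A link that is already absolute (http://...) comes back unchanged.
    // The page URL should include a path (at least "/").
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Hypothetical helper: wrap the absolute link in the proxy prefix from the post.
    // Encoding the target is an assumption added here so it survives as a query value.
    static String toProxied(String pageUrl, String link) {
        String absolute = toAbsolute(pageUrl, link);
        try {
            return "http://localhost/test?url=" + URLEncoder.encode(absolute, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String page = "http://localhost/dir/page.html";
        System.out.println(toAbsolute(page, "index.html"));
        System.out.println(toAbsolute(page, "http://example.com/a.gif"));
        System.out.println(toProxied(page, "index.html"));
    }
}
```

Because URI.resolve leaves complete URLs alone, the extra string handling for http://... links is not needed in this variant. Links containing spaces or other illegal characters would make URI.create throw, so real pages may need cleaning first.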
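Alternatively, you can use the set and get methods of the Attribute object.
That's right.
That is what I had in mind too: use a link filter to collect the links, then process them all in one pass.
If you are working through a reference, it also works without "p.setBaseUrl("the full URL of the current page");".
Try removing it.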
The Page object has neither a setBaseUrl method nor a getAbsoluteURL method. And in the line String url=p.getAbsoluteURL(linkUrl.getLink()); what kind of object is linkUrl?
No, I got that wrong.
Use the Tag object as well: after filtering out the tags you need, change "p.setUrl("http://localhost/test?url="+url)" to:
tag.setAttribute("href","http://localhost/test?url="+url);
For IMG, change "href" to "src"; and so on for other tags.
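The "and so on" part can be written down as a small lookup from tag name to the attribute that carries the URL. The sketch below is JDK-only so it runs standalone; it uses a regex on a single tag string instead of HTMLParser's Tag object, the class name TagUrlAttributeSketch is hypothetical, and the link and script entries in the map are my own assumptions beyond the a/href and img/src pairs mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagUrlAttributeSketch {
    // Which attribute holds the URL for each tag name.
    static final Map<String, String> URL_ATTRIBUTE = new HashMap<>();
    static {
        URL_ATTRIBUTE.put("a", "href");
        URL_ATTRIBUTE.put("img", "src");
        URL_ATTRIBUTE.put("link", "href");  // assumption, not from the thread
        URL_ATTRIBUTE.put("script", "src"); // assumption, not from the thread
    }

    static final Pattern TAG_NAME = Pattern.compile("^<\\s*(\\w+)");

    // Prefix the URL attribute of one opening tag; other tags are returned unchanged.
    static String rewriteTag(String tagHtml, String prefix) {
        Matcher name = TAG_NAME.matcher(tagHtml);
        if (!name.find()) return tagHtml;
        String attr = URL_ATTRIBUTE.get(name.group(1).toLowerCase(Locale.ROOT));
        if (attr == null) return tagHtml;
        Pattern value = Pattern.compile(
                "(\\b" + attr + "\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);
        return value.matcher(tagHtml)
                .replaceFirst("$1" + Matcher.quoteReplacement(prefix) + "$2$3");
    }

    public static void main(String[] args) {
        String prefix = "http://localhost/test?url=";
        System.out.println(rewriteTag("<a href=\"x.html\">", prefix));
        System.out.println(rewriteTag("<img src=\"a.png\">", prefix));
        System.out.println(rewriteTag("<p class=\"x\">", prefix));
    }
}
```

With HTMLParser itself, the same lookup would presumably just pick the attribute name to pass to tag.setAttribute, so the regex part here is only a stand-in.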