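I wrote a small Java program that fetches a web page's source and then uses regular expressions to pick out the information I need. I ran into a problem: part of the fetched source is tag markup, and my regular expression only matches the tag names, not the content. How can I get the content of a tag?
try {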
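    // Open the page and read it line by line into webBuffer.
    url = new URL(addr);
    isr = new InputStreamReader(url.openStream());
    br = new BufferedReader(isr);
    while ((webLine = br.readLine()) != null) {
        // readLine() strips the line terminator, so the lines are joined with no separator.
        webBuffer.append(webLine);
    }
    // Closing the BufferedReader also closes the InputStreamReader it wraps.
    br.close();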
}catch (MalformedURLException e) {
System.out.println("Address Error!");
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
This is the part that fetches the page source. The fetched source contains:
<p class="mobile_info">品牌 - <span id="mobile_type"></span>
网络模式 - <span id="comm_type"></span>
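</p>
(品牌 is the brand label and 网络模式 is the network mode label.) With a regular expression I can only match "mobile_type" and "comm_type", but what I want is the content of "mobile_type" and "comm_type", which is MOTOROLA and GSM/CDMA2000.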
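One thing worth checking first: in the source quoted above both spans are empty (`<span id="mobile_type"></span>`), so MOTOROLA and GSM/CDMA2000 are probably written in by a script after the page loads, and no pattern can pull them out of the static source. When the text really is present in the HTML, a capture group is enough. A minimal sketch, assuming `id` is the first attribute of the span and is double-quoted; the class `SpanText` and the method `textOfSpan` are made-up names for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanText {

    /** Returns the text inside the span with the given id, or null if there is no such span. */
    static String textOfSpan(String html, String id) {
        // Group 1 is whatever sits between the opening tag and the closing </span>.
        Pattern p = Pattern.compile(
                "<span\\s+id=\"" + Pattern.quote(id) + "\"[^>]*>(.*?)</span>",
                Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical page where the values are already in the markup.
        String filled = "<p class=\"mobile_info\">brand - <span id=\"mobile_type\">MOTOROLA</span>"
                + " network - <span id=\"comm_type\">GSM/CDMA2000</span></p>";
        System.out.println(textOfSpan(filled, "mobile_type"));
        System.out.println(textOfSpan(filled, "comm_type"));

        // The shape quoted in the question: the span exists but holds no text.
        String empty = "<span id=\"mobile_type\"></span>";
        System.out.println("[" + textOfSpan(empty, "mobile_type") + "]");
    }
}
```

If the last call prints an empty pair of brackets for the real page, the regular expression is working and the data simply is not in the fetched source.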
Solutions
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLTest {

    /**
     * Fetches a page and prints the first piece of tag text found on each line.
     *
     * @param args unused
     * @throws Exception if the URL is malformed or the page cannot be read
     */
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.ascii-code.com/");
        InputStreamReader reader = new InputStreamReader(url.openStream());
        BufferedReader br = new BufferedReader(reader);
        String s = null;
        while ((s = br.readLine()) != null) {
            s = GetContent(s);
            if (s != null) {
                System.out.println(s);
            }
        }
        // Closing the BufferedReader also closes the reader it wraps.
        br.close();
    }

    /** Returns the first run of text found between '>' and '<' on the line, or null if there is none. */
    public static String GetContent(String html) {
        //String html = "<ul><li>1.hehe</li><li>2.hi</li><li>3.hei</li></ul>";
        String ss = ">[^<]+<";
        Pattern pa = Pattern.compile(ss);
        Matcher ma = pa.matcher(html);
        if (ma.find()) {
            // The pattern always matches a leading '>' and a trailing '<'; strip both.
            String temp = ma.group();
            return temp.substring(1, temp.length() - 1);
        }
        return null;
    }
}
// Revised GetContent: collects every run of text on the line, joined by "____".
public static String GetContent(String html) {
    //String html = "<ul><li>1.hehe</li><li>2.hi</li><li>3.hei</li></ul>";
    String ss = ">[^<]+<";
    Pattern pa = Pattern.compile(ss);
    Matcher ma = pa.matcher(html);
    String result = null;
    while (ma.find()) {
        // The pattern always matches a leading '>' and a trailing '<'; strip both.
        String temp = ma.group();
        temp = temp.substring(1, temp.length() - 1);
        if (result == null) {
            result = temp;
        } else {
            result += "____" + temp;
        }
    }
    return result;
}
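The same idea can return the pieces as a list rather than one string glued together with "____", which avoids having to split the result again later. A minimal sketch of that variant, not the poster's code: it uses a capture group so the '>' and '<' never need trimming, and it also skips whitespace-only runs, which the versions above keep. The class `TagText` and the method `allText` are made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagText {

    // Group 1 is the text between a closing '>' and the next '<'.
    private static final Pattern TEXT = Pattern.compile(">([^<]+)<");

    /** Returns every non-blank run of text that sits between two tags. */
    static List<String> allText(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<ul><li>1.hehe</li><li>2.hi</li><li>3.hei</li></ul>";
        for (String text : allText(html)) {
            System.out.println(text);
        }
    }
}
```

Like the original, this only sees text that starts and ends on the same input string, so it should be fed the whole page rather than one line at a time if tags span lines.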
The following ASCII table contains both ASCII control characters, ASCII printable characters and the extended ASCII character set ISO 8859-1, also called ISO Latin1
ASCII Code
HTML Symbol
HTML Color Names
HTTP status codes
ASCII Code - The extended ASCII table
ASCII____ stands for American Standard Code for Information Interchange. It's a 7-bit character code where every single bit represents a unique character. On this webpage you will find 8 bits, 256 characters, according to ISO 8859-1 and Microsoft?Windows Latin-1 increased characters, which is available in certain programs such as Microsoft Word.
ASCII control characters (character code 0-31)
DEC
OCT
HEX
BIN
Symbol
HTML Number
HTML Name
Description
0____000____00____00000000____NUL____&#000;____ ____Null char
1____001____01____00000001____SOH____&#001;____ ____Start of Heading
2____002____02____00000010____STX____&#002;____ ____Start of Text
3____003____03____00000011____ETX____&#003;____ ____End of Text
4____004____04____00000100____EOT____&#004;____ ____End of Transmission
5____005____05____00000101____ENQ____&#005;____ ____Enquiry
6____006____06____00000110____ACK____&#006;____ ____Acknowledgment
7____007____07____00000111____BEL____&#007;____ ____Bell
8____010____08____00001000____ BS____&#008;____ ____Back Space
9____011____09____00001001____ HT____&#009;____ ____Horizontal Tab
10____012____0A____00001010____ LF____&#010;____ ____Line Feed
11____013____0B____00001011____ VT____&#011;____ ____Vertical Tab
12____014____0C____00001100____ FF____&#012;____ ____Form Feed
13____015____0D____00001101____ CR____&#013;____ ____Carriage Return
14____016____0E____00001110____ SO____&#014;____ ____Shift Out / X-On
15____017____0F____00001111____ SI____&#015;____ ____Shift In / X-Off
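The `Microsoft?Windows` in the output above may be a sign that the page was decoded with the wrong charset: `new InputStreamReader(stream)` uses the platform default, which need not match the page. A minimal sketch of reading with an explicit charset and try-with-resources; which charset a given page actually uses is an assumption the caller has to check, and `PageReader`, `readAll` and `decode` are made-up names:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageReader {

    /** Reads the whole stream as text in the given charset, keeping line breaks as '\n'. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even if reading fails.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper for in-memory bytes; rethrows IOException unchecked. */
    static String decode(byte[] raw, Charset charset) {
        try {
            return readAll(new ByteArrayInputStream(raw), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The registered-trademark sign encoded as a single ISO 8859-1 byte.
        byte[] page = "Microsoft\u00AE Windows".getBytes(StandardCharsets.ISO_8859_1);
        // Decoded with the matching charset the sign survives.
        System.out.print(decode(page, StandardCharsets.ISO_8859_1));
        // Decoded as UTF-8 the lone byte is malformed and becomes a replacement character.
        System.out.print(decode(page, StandardCharsets.UTF_8));
    }
}
```

For a real page the call would look like `readAll(new URL(addr).openStream(), StandardCharsets.ISO_8859_1)`, with the charset taken from the page's own headers or meta tag.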
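http://www.qidian.com/BookReader/2489034,41971709.aspx is the page, and I want the text of the chapter. My code can fetch the page source, but the chapter text is not in it. This is what it contains instead: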
<script language="javascript" type="text/javascript">
document.domain = "qidian.com"; //跨域引用
var readChapterData = {
bookId: "2489034",
bookName:"\u8d85\u7ea7\u7279\u79cd\u5175\u7cfb\u7edf",
authorId: '2956759',
authorName: '\u60e8\u53eb\u8fde\u8fde',
chapterId: '41971709',
chapterName: '第四十六章 女朋友的义务',
bookType: '1',
IsHaveMonthTicket: 0,
IsHaveDaShang: 1,
IsHaveUpdateTicket:1,
EnableRemoteGetChapterId: 'False',
enableSendAutoBookMark: 'True',
autoBookRequestUrl: 'http://afav.if.qidian.com/ajax.ashx',
monthIframe: "<iframe id='monthIframe' src='/BookReader/Tips/MonthVote.aspx' style='width:355px;height:222px;' scrolling='no' frameborder='0'></iframe>",
vipMonthIframe: "<iframe id='VipMonthIframe' src='' style='width:355px;height:200px;' scrolling='no' frameborder='0'></iframe>",
cpLogoImageUrl: 'http://file1.qidian.com/',
postReviewUrl: 'http://c.pingba.qidian.com/Pop/PostReview.aspx?BookId=2489034&ChapterId=41971709',
prevChapterId:'41964052',
nextChapterId:'0',
ForumUrl2:'http://Forum.qidian.com',
PersonalUrl:'http://me.qidian.com/financial/update',
IsCoopSign: 'False',
bookSignType: '2',
CategoryId: '4',
IsVip: '0',
IsBig5:'0',
CategoryName:'\u90fd\u5e02'
};
(function(win){
var t={dt:null,c:0,ms:0,ls:0}
win.bc_Data = t;
bc_Data.dt= new Date();
bc_Data.c= readChapterData.chapterId;
var scrollFunc=function(e){e=e || window.event;if((e.wheelDelta || e.detail) != 0)bc_Data.ms=1;}
if(document.addEventListener){document.addEventListener('DOMMouseScroll',scrollFunc,false);}else{win.onmousewheel=document.onmousewheel=scrollFunc;}
eventBind(window, 'scroll', function(){bc_Data.ls=1});
})(window); var isOpenAuthorRecommend = 'false';
</script>
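The script above carries only metadata (book name, chapter id and so on); the chapter text itself is not in it and is presumably loaded by a separate request that this thread does not show. What can be recovered from the fetched source is the metadata. A minimal sketch, assuming each value is a quoted string on one line as in the script above; `ScriptFields`, `field` and `decodeEscapes` are made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFields {

    // A backslash, the letter u, then four hex digits, as written in the script source.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Returns the quoted value that follows the given key and a colon, or null if absent. */
    static String field(String script, String key) {
        // Group 1 is the opening quote; the backreference demands the same closing quote.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(key) + "\\s*:\\s*([\"'])(.*?)\\1");
        Matcher m = p.matcher(script);
        return m.find() ? m.group(2) : null;
    }

    /** Turns JavaScript-style four-digit hex escapes into the characters they stand for. */
    static String decodeEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three lines copied from the script in the question.
        String script = "bookId: \"2489034\",\n"
                + "bookName:\"\\u8d85\\u7ea7\\u7279\\u79cd\\u5175\\u7cfb\\u7edf\",\n"
                + "chapterId: '41971709',\n";
        System.out.println(field(script, "chapterId"));
        System.out.println(decodeEscapes(field(script, "bookName")));
    }
}
```

Getting the chapter text would mean finding the request the page's scripts make (for example with the browser's network inspector) and fetching that URL directly, or driving a real browser engine; neither is covered by this sketch.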