How to extract the information you need from a web page

Just treat the page as XML and parse it; that's all there is to it.

----- Include the point.js script
<script language="JavaScript" src="../../jscss/point.js"></script>
----- Call the function defined in the script
<input type="button" name="button" value="Extract" id="Submit" onClick="doPrint();" />
----- Put the content you want to extract inside the <div>
<div id="doctitle">
Your content
</div>
----- Contents of point.js
function doPrint() {
    // Open a blank popup window that will receive the extracted markup.
    var newWin = window.open('about:blank', '1111', 'height=350,width=800,top=230,left=280,toolbar=no,menubar=yes,scrollbars=yes,resizable=no,location=no,status=no');
    // Copy the inner HTML of the element with id "doctitle" into the popup.
    var titleHTML = document.getElementById("doctitle").innerHTML;
    newWin.document.write(titleHTML);
    newWin.document.location.reload();
    //newWin.portrait = false;
    //newWin.leftMargin = 1.0;
    //newWin.topMargin = 1.0;
    //newWin.rightMargin = 1.0;
    //newWin.bottomMargin = 1.0;
    //newWin.print();
    //newWin.close();
}
-------------
Tested, and it works.
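On the back end? Why not just read the database directly?
If the HTML is reasonably standard, you can simply read it as XML.
It's probably a crawler or something similar: it analyzes data on someone else's site. You can't read their database, so all you can do is scrape their pages, analyze what you get, and insert it into your own database.
For this kind of job people usually seem to use regular expressions, or else parse the page as XML.
Yes, that's exactly what I mean. Right now I want to convert the HTML I fetched into XML first, and then use regular expressions to strip out the parts I don't need.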
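1. Use org.apache.oro.text.regex
import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.MatchResult;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.PatternCompiler;
import org.apache.oro.text.regex.PatternMatcher;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

/**
 * Returns the capture group at the given index from the first regex match.
 *
 * @param content text to search
 * @param regx    Perl5-style regular expression
 * @param index   capture group index (0 is the whole match)
 * @return the text of that group, or "" if nothing matches
 * @throws MalformedPatternException if regx is not a valid pattern
 */
public static String getMatchString(String content, String regx, int index) throws MalformedPatternException {
    PatternCompiler orocom = new Perl5Compiler();
    Pattern pattern1 = orocom.compile(regx);
    PatternMatcher matcher = new Perl5Matcher();
    String sentence = "";
    // contains() looks for the first match anywhere in the content.
    if (matcher.contains(content, pattern1)) {
        MatchResult result = matcher.getMatch();
        sentence = result.group(index);
    }
    return sentence;
}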
Usage: getMatchString(htmlContent, "<table  width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\" class=\"table01\" >\\s*([\\s\\S]*?)</table>", index)
2. Use an HTML parser.
3. XML
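Right, XML parsing is what I want. Anything else is too much trouble.
If you parse it as XML, watch out for well-formedness problems.
HTML is very forgiving, but an XML parser is not: a single missing closing tag somewhere can make your life miserable, heh.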
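To make the warning above concrete, here is a minimal sketch using the JDK's built-in XML parser. The class name `StrictXmlCheck` and the sample markup are made up for illustration; the point is only that markup a browser would happily render is rejected outright once a closing tag is missing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library mentioned in the thread.
class StrictXmlCheck {
    /** Returns true only if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for any well-formedness violation, e.g. a mismatched tag.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not about the markup itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<table><tr><td>a</td></tr></table>")); // true
        System.out.println(isWellFormed("<table><tr><td>a</tr></table>"));      // false: <td> is never closed
    }
}
```

Real-world pages often fail this check, which is why the replies here lean toward regular expressions or a dedicated HTML parser.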
A regular expression would work better here.
<td[^>]*>企业类型[^<]*</td><td[^>]*>([^<]*)</td>
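This regular expression extracts the value of the 企业类型 (enterprise type) field; in Java you read it with matcher.group(1).
The other fields can be handled the same way.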
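As a rough sketch of how that pattern could be generalized to other fields, the snippet below uses the JDK's `java.util.regex` rather than ORO. The class `FieldExtractor`, the method `cellAfterLabel`, and the sample row are assumptions for illustration; the `\s*` between the two cells is also an addition, to tolerate whitespace between tags. For the page discussed in this thread the label would be 企业类型.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and sample data are illustrative only.
class FieldExtractor {
    /**
     * Returns the text of the cell that follows the cell starting with
     * the given label, or "" if no such pair of cells is found.
     */
    static String cellAfterLabel(String html, String label) {
        // Pattern.quote keeps regex metacharacters in the label literal.
        Pattern p = Pattern.compile(
            "<td[^>]*>" + Pattern.quote(label) + "[^<]*</td>\\s*<td[^>]*>([^<]*)</td>");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        // Made-up row in the same shape as the one the pattern targets.
        String row = "<tr><td class=\"k\">Company type:</td><td>Limited liability</td></tr>";
        System.out.println(cellAfterLabel(row, "Company type")); // Limited liability
    }
}
```

Because the value is captured with `[^<]*`, this only works when the value cell contains plain text and no nested tags.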
public static String getMatchString(String content, String regx, int index) throws MalformedPatternException {}
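What does int index stand for?
Probably which piece of the match to return; parentheses () split an expression into several parts.
Take the following expression:
abc(1)de(2)f
When it runs, abc1de2f is matched; group()/group(0) is the whole match, group(1) is 1, and group(2) is 2,
and so on.
That index is simply the argument passed to group(index).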
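The group numbering described above can be checked with a few lines of Java. This sketch uses `java.util.regex` from the JDK instead of ORO (group indices work the same way in both); `GroupDemo` and `firstGroup` are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mirrors getMatchString, but with the JDK regex classes.
class GroupDemo {
    /** Returns group(index) of the first match, or "" when nothing matches. */
    static String firstGroup(String content, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        String regex = "abc(1)de(2)f";
        String input = "xx abc1de2f yy";
        System.out.println(firstGroup(input, regex, 0)); // abc1de2f (the whole match)
        System.out.println(firstGroup(input, regex, 1)); // 1
        System.out.println(firstGroup(input, regex, 2)); // 2
    }
}
```

Asking for an index larger than the number of parenthesized groups makes `Matcher.group` throw an `IndexOutOfBoundsException`, so the index has to agree with the pattern.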
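Oh, I see. One more thing: how should I write a regular expression that matches the content inside a <table></table> tag? I've tried several and none of them work.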
<table  width="100%" border="0" cellpadding="0" cellspacing="0" class="table01" ></table>
How do I write a regular expression that matches the contents of this tag?
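One possible answer, sketched with `java.util.regex` under a few assumptions: the table is identified by `class="table01"`, the other attributes may vary, and the table does not contain nested tables. A common reason such patterns fail is that `.` does not match line breaks by default, so the sketch turns on `Pattern.DOTALL`. The class name `TableBody` is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the target table carries class="table01".
class TableBody {
    // DOTALL lets '.' match line breaks; the reluctant .*? stops at the first </table>.
    private static final Pattern TABLE01 = Pattern.compile(
        "<table\\b[^>]*\\bclass=\"table01\"[^>]*>(.*?)</table>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the markup between the opening and closing table tags, or "" if absent. */
    static String innerHtml(String html) {
        Matcher m = TABLE01.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String page = "<table  width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\" class=\"table01\" >\n"
            + "<tr><td>cell</td></tr>\n"
            + "</table>";
        System.out.println(innerHtml(page).trim()); // <tr><td>cell</td></tr>
    }
}
```

If the table can contain nested tables, the reluctant match would stop at the inner `</table>`; in that case an HTML parser is the safer choice.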