First, take a look at a very simple piece of HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<base href="http://localhost:80/myjsp/">
<title>My JSP 'index.jsp' starting page</title>
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
<meta http-equiv="description" content="This is my page">
<!--
<link rel="stylesheet" type="text/css" href="styles.css">
-->
</head>
<body>
<input type="hidden" value="7">
This is my JSP page. 哈哈<br>
for Index,index.<br>
<a href="index.jsp?currentPage=2">1</a>
</body>
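</html>

I used the regular expression <[^<|^>]*> to try to strip the HTML tags, but because of the commented-out CSS link, the result still contained <!-- -->. So I looked for another regex that would remove both the HTML elements and the CSS element, and tried <.*>. The code is as follows:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.heaton.bot.HTTPSocket;

public class Experiment {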
    public static void main(String args[]) {
        try {
            HTTPSocket http = new HTTPSocket();
            http.send(args[0], null);
            System.out.println(http.getBody());
            String output = getTxtWithoutHTMLElement(http.getBody());
            System.out.println(output);
        } catch (Exception e) {
        }
    }

    public static String getTxtWithoutHTMLElement(String original)
    {
        if (original == null || "".equals(original.trim()))
        {
            return original;
        }
        Pattern pattern = Pattern.compile("<.*>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(original);
        StringBuffer strbuffer = new StringBuffer();
        while (matcher.find())
        {
            matcher.appendReplacement(strbuffer, "");
        }
        matcher.appendTail(strbuffer);
        return strbuffer.toString();
    }
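}

Strangely, though, the output is completely empty. My reasoning was that wherever a <> pair occurs, it gets removed no matter what is inside it, so shouldn't that strip both the HTML tags and the CSS element?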
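<.*>
matches everything between a pair of angle brackets. The * quantifier is greedy and Pattern.DOTALL lets . match line breaks, and the first and last characters of your HTML document happen to be < and >, so together they form one big <.*>.
That is why your entire document is matched and removed.
You can change it to
<[^>]*>
which means: only content that sits between < and > and contains no further > counts as a match.
That keeps one <...> from swallowing the others.

Also, the StringBuffer handling and the while loop at the end can be replaced with matcher.replaceAll; there is no need to write such a laborious loop.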
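To make the difference between the two patterns concrete, here is a small sketch. The class name, method names, and sample string are made up for illustration and are not taken from the question.

```java
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical sample input, not the page from the question.
    public static final String SAMPLE = "<p>Hello,\n<b>world</b></p>";

    public static String stripGreedy(String s) {
        // Greedy .* with DOTALL runs from the first '<' to the last '>'.
        return Pattern.compile("<.*>", Pattern.DOTALL).matcher(s).replaceAll("");
    }

    public static String stripPerTag(String s) {
        // [^>]* stops at the first '>', so each tag is matched on its own.
        return Pattern.compile("<[^>]*>").matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // The greedy version leaves nothing between the brackets.
        System.out.println("[" + stripGreedy(SAMPLE) + "]");
        // The per-tag version keeps the text "Hello," and "world".
        System.out.println("[" + stripPerTag(SAMPLE) + "]");
    }
}
```

Because the sample starts with < and ends with >, the greedy pattern consumes the whole string, which is the same thing that happens to the full page.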
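I didn't check the logic of your appendReplacement loop closely, so I can't say whether it has problems. My version is:

public static String getTxtWithoutHTMLElement(String original)
{
    if (original == null || "".equals(original.trim()))
    {
        return original;
    }
    // Match one tag at a time: '<', then anything except '>', then '>'.
    Pattern pattern = Pattern.compile("<[^>]*>", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(original);
    // replaceAll runs the find-and-replace loop internally.
    return matcher.replaceAll("");
}

Note that the old return strbuffer.toString(); line has to go: once replaceAll returns the result, nothing may follow it (a statement after return will not compile). I trust the original poster can follow that :P
The Pattern.DOTALL flag can also be dropped. It only changes what . matches, and this pattern contains no . at all; the negated class [^>] already matches line breaks.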
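One caveat with the page from the question: the commented-out <link> contains a > before the comment ends, so <[^>]*> stops there and the trailing --> stays in the output. A minimal sketch, assuming it is acceptable to drop comments entirely, removes them first with a reluctant <!--.*?--> and only then strips the remaining tags. The class and method names are hypothetical, and the input is a shortened stand-in for the original page.

```java
import java.util.regex.Pattern;

public class StripMarkup {
    // Reluctant .*? stops at the first "-->"; DOTALL lets it cross line breaks.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // One tag at a time: '<', anything except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String stripTagsOnly(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static String stripCommentsThenTags(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<head>\n<!--\n<link rel=\"stylesheet\" href=\"styles.css\">\n-->\n</head>\n"
                + "<body>text</body>";
        // Tags only: the comment's closing "-->" is left behind.
        System.out.println(stripTagsOnly(html).contains("-->"));
        // Comments first, then tags: no "-->" remains.
        System.out.println(stripCommentsThenTags(html).contains("-->"));
    }
}
```

This is still only a regex approximation: a > inside an attribute value, or the contents of script and style elements, would not be handled correctly, so an HTML parser is the safer choice for arbitrary pages.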