Here are 9 sentences:
C1: Human machine interface for Lab ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees.
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey

Extract the words from these sentences that satisfy both of the following conditions:
1. The word appears in at least 2 of the sentences (the words shown in red in the original post).
2. The word is not a stop word (English stop-word lists, or "stoplists", are easy to find online).

From the extracted words (in this example there should be 12 of them), build a matrix A (represented as a two-dimensional array), A = (a_ij), where a_ij is the weight of word i in sentence j. The weight is computed as w_ij = tf * idf.

The result should be a matrix with roughly this structure:

A =
           C1  C2  C3  C4  C5  M1  M2  M3  M4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
eps         0   0   1   1   0   0   0   0   0
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

How can a program like this be implemented? Could someone help me out?
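Not part of the original post, but the extraction criterion described above (keep words whose document frequency is at least 2, minus stop words) can be sketched roughly like this. The stop list here is a small hand-picked set just for illustration, not the full stoplist the post refers to:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TermExtractor {
    // Return the words that occur in at least two sentences and are not stop words.
    public static List<String> extract(List<String> sentences, Set<String> stopList) {
        // Document frequency: in how many sentences does each word appear?
        Map<String, Integer> df = new LinkedHashMap<>();
        for (String sentence : sentences) {
            Set<String> seen = new HashSet<>(); // count each word once per sentence
            for (String w : sentence.toLowerCase().split("[^a-z]+")) {
                if (!w.isEmpty() && seen.add(w)) {
                    df.merge(w, 1, Integer::sum);
                }
            }
        }
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= 2 && !stopList.contains(e.getKey())) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }
}
```

Run against the 9 sentences above with the stop words a, and, for, in, of, the, to, this yields exactly the 12 words of the example matrix.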
A =
           C1  C2  C3  C4  C5  M1  M2  M3  M4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
eps         0   0   1   1   0   0   0   0   0
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

In this matrix the rows are the extracted words and the columns are the sentences; the value a_ij is the weight of word i in sentence j. For example, a[0][0] is the weight of the word "human" in sentence C1.
The code is pretty ugly, though, and I haven't gotten around to the tf-idf weighting yet; it only counts how many times each word appears in each sentence. I'm posting it anyway, so try not to laugh too hard.
(In the current directory there is a corpus folder containing 9 txt files, one for each of the 9 sentences above, and a stoplist folder containing stop_list.txt with the English stop words.)
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.swing.JOptionPane;

public class LatentSemanticAnalysis {

    private ArrayList<String> termsList = new ArrayList<String>();
    private ArrayList<String> terms = new ArrayList<String>();
    private ArrayList<String> stopList = new ArrayList<String>();
    private final HashMap<String, Integer> wordFrequence = new HashMap<String, Integer>();
    private String stoplistFilePath = null;
    private String stringReg = "\\b[A-Za-z]+\\b";

    // Set the path of the stop-word list file.
    public void setStoplistFilePath(String filePath) {
        this.stoplistFilePath = filePath;
    }

    // Get the path of the stop-word list file.
    public String getStopListFilePath() {
        return stoplistFilePath;
    }

    // Read the stop-word list file and return all stop words it contains.
    public ArrayList<String> stoplistReader(String filePath, String reg) throws IOException {
        ArrayList<String> arrlit = new ArrayList<String>();
        BufferedReader bfr = new BufferedReader(new FileReader(filePath));
        Pattern p = Pattern.compile(reg);
        for (String str = bfr.readLine(); str != null; str = bfr.readLine()) {
            if (str.length() == 0)
                continue; // skip blank lines
            Matcher m = p.matcher(str.toLowerCase());
            while (m.find())
                arrlit.add(m.group());
        }
        bfr.close();
        return arrlit;
    }

    public static void main(String[] args) throws IOException {
        String stoplistFilePath = ".\\stoplist\\stop_list.txt";
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
        lsa.setStoplistFilePath(stoplistFilePath);
        String corpusPath = JOptionPane.showInputDialog("Please choose the directory! ");
        new LatentSemanticAnalysis(corpusPath);
    }

    public LatentSemanticAnalysis() {
    }

    public LatentSemanticAnalysis(String path) throws IOException {
        File file = new File(path);
        File[] files = file.listFiles();
        int[] term_frequence = new int[files.length];

        // Read the stop-word list.
        stopList = stoplistReader(".\\stoplist\\stop_list.txt", stringReg);

        // First pass: collect every distinct non-stop word in the corpus.
        for (int i = 0; i < files.length; i++) {
            BufferedReader in = new BufferedReader(new FileReader(
                    path + "\\" + files[i].getName()));
            Pattern p = Pattern.compile(stringReg);
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                if (line.length() == 0)
                    continue;
                Matcher m = p.matcher(line.toLowerCase());
                while (m.find()) {
                    String word = m.group();
                    if (!termsList.contains(word) && !stopList.contains(word)) {
                        termsList.add(word);
                    }
                }
            }
            in.close();
        }

        // Second pass: keep only the words that appear in at least two files.
        // Each file holds one sentence, so this enforces "appears in >= 2 sentences".
        for (int i = 0; i < termsList.size(); i++) {
            String str = termsList.get(i);
            int num = 0; // number of files that contain the word
            for (int j = 0; j < files.length; j++) {
                wordFrequence.clear();
                BufferedReader in1 = new BufferedReader(new FileReader(
                        path + "\\" + files[j].getName()));
                Pattern p = Pattern.compile(stringReg);
                for (String line = in1.readLine(); line != null; line = in1.readLine()) {
                    if (line.length() == 0)
                        continue;
                    Matcher m = p.matcher(line.toLowerCase());
                    while (m.find()) {
                        String word = m.group();
                        wordFrequence.put(word,
                                wordFrequence.containsKey(word) ? wordFrequence.get(word) + 1 : 1);
                    }
                }
                if (wordFrequence.containsKey(str)) {
                    num++;
                }
                in1.close();
            }
            if (num >= 2) {
                terms.add(str);
            }
        }
        System.out.println("\n Terms:\n" + terms);

        // Third pass: for each surviving term, count its occurrences in every file
        // and print the raw term-document count matrix.
        System.out.println("\n The original Term_Document Matrix is :\n");
        for (int t = 0; t < terms.size(); t++) {
            String str = terms.get(t);
            for (int j = 0; j < files.length; j++) {
                int num = 0;
                BufferedReader in1 = new BufferedReader(new FileReader(
                        path + "\\" + files[j].getName()));
                Pattern p = Pattern.compile(stringReg);
                for (String line = in1.readLine(); line != null; line = in1.readLine()) {
                    if (line.length() == 0)
                        continue;
                    Matcher m = p.matcher(line.toLowerCase());
                    while (m.find())
                        if (str.equals(m.group()))
                            num++;
                }
                term_frequence[j] = num; // occurrences of the target word in file j
                in1.close();
            }
            for (int i = 0; i < term_frequence.length; i++) {
                System.out.print(term_frequence[i] + "\t");
            }
            System.out.println();
        }
    }
}
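Since the post says the tf-idf weighting is still missing, here is a rough sketch of how it could be bolted on afterwards. It takes the raw count matrix the code above prints and applies one common variant of the formula, w_ij = tf_ij * log(N / df_i), where N is the number of sentences and df_i is the number of sentences containing term i. This is one illustrative choice of tf-idf, not the only definition:

```java
public class TfIdf {
    // Convert a raw term-by-document count matrix into tf-idf weights:
    // w[i][j] = count[i][j] * log(N / df_i), where N is the number of documents
    // (columns) and df_i is the number of documents in which term i occurs.
    public static double[][] weight(int[][] counts) {
        int nDocs = counts[0].length;
        double[][] w = new double[counts.length][nDocs];
        for (int i = 0; i < counts.length; i++) {
            int df = 0;
            for (int j = 0; j < nDocs; j++) {
                if (counts[i][j] > 0) df++;
            }
            // Every row of the matrix has df >= 2 by construction, so no division by zero.
            double idf = Math.log((double) nDocs / df);
            for (int j = 0; j < nDocs; j++) {
                w[i][j] = counts[i][j] * idf;
            }
        }
        return w;
    }
}
```

For example, the "system" row {0,1,1,2,0,0,0,0,0} has df = 3 over N = 9 sentences, so its entry for C4 becomes 2 * log(9/3) = 2 * log(3).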