How do I implement English word-frequency statistics?

Given the following nine sentences:
C1: Human machine interface for Lab ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees.
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey

Extract the words from these sentences. A word is kept only if:
1 it appears in >= 2 sentences (the words marked in red in the original post);
2 it is not a stop word (English stop-word lists, "stoplist", can be found online).
From the extracted words (in this example the 12 words marked in red), build a matrix A, stored as a two-dimensional array.
A = (aij), where aij is the weight of the i-th word in the j-th sentence. The weight is computed as wij = tf*idf.
The final matrix A should have a structure like this:
          C1 C2 C3 C4 C5 M1 M2 M3 M4
human      1  0  0  1  0  0  0  0  0
interface  1  0  1  0  0  0  0  0  0
computer   1  1  0  0  0  0  0  0  0
survey     0  1  0  0  0  0  0  0  1
user       0  1  1  0  1  0  0  0  0
system     0  1  1  2  0  0  0  0  0
response   0  1  0  0  1  0  0  0  0
time       0  1  0  0  1  0  0  0  0
eps        0  0  1  1  0  0  0  0  0
trees      0  0  0  0  0  1  1  1  0
graph      0  0  0  0  0  0  1  1  1
minors     0  0  0  0  0  0  0  1  1
How can a program like this be implemented? Could someone help me out?
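
The two selection rules in the question (at least two sentences, not a stop word) can be sketched in a few lines of Java. This is only an illustration under assumptions: the class name `TermSelection`, the method `selectTerms`, and the tiny in-memory stop list are all made up for this example, and a real stop list would be much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermSelection {
    // Assumed stop list: just the stop words that occur in the nine sentences.
    static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "and", "for", "in", "of", "the", "to"));
    static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    static final List<String> SENTENCES = Arrays.asList(
            "Human machine interface for Lab ABC computer applications",
            "A survey of user opinion of computer system response time",
            "The EPS user interface management system",
            "System and human system engineering testing of EPS",
            "Relation of user-perceived response time to error measurement",
            "The generation of random, binary, unordered trees",
            "The intersection graph of paths in trees.",
            "Graph minors IV: Widths of trees and well-quasi-ordering",
            "Graph minors: A survey");

    // Returns the non-stop-words that occur in at least minDocs sentences,
    // in order of first appearance.
    static List<String> selectTerms(List<String> sentences, int minDocs) {
        Map<String, Integer> docFreq = new LinkedHashMap<String, Integer>();
        for (String sentence : sentences) {
            Set<String> seen = new HashSet<String>(); // count a word once per sentence
            Matcher m = WORD.matcher(sentence.toLowerCase());
            while (m.find()) {
                String word = m.group();
                if (!STOP_WORDS.contains(word) && seen.add(word)) {
                    Integer n = docFreq.get(word);
                    docFreq.put(word, n == null ? 1 : n + 1);
                }
            }
        }
        List<String> terms = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDocs) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(selectTerms(SENTENCES, 2));
    }
}
```

Counting each word at most once per sentence is what makes the threshold mean "appears in at least two sentences" rather than "appears at least twice overall". For the nine sentences this yields the twelve row labels of the matrix, in the same order.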

Solutions »

What I want is a matrix A like the one in the question.
In this matrix the rows are the extracted words, the columns are the sentences, and the value aij is the weight of word i in sentence j. For example, a[0][0] is the weight of the word human in sentence C1.
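
The thread gives the weight only as tf*idf and does not define either factor. One common definition, taken here as an assumption (raw count for tf, natural log of N over document frequency for idf), is:

```latex
w_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i,
\qquad
\mathrm{idf}_i = \ln\frac{N}{\mathrm{df}_i}
```

Here tf_ij is the number of times word i occurs in sentence j, df_i is the number of sentences that contain word i, and N = 9 is the number of sentences. Under that assumption, human occurs once in C1 and in 2 of the 9 sentences, so a[0][0] = 1 · ln(9/2) ≈ 1.504, while system occurs twice in C4 and in 3 sentences, giving 2 · ln(9/3) ≈ 2.197. Other variants (log base 2 or 10, smoothed idf, length-normalized tf) would give different numbers.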
An earlier program of mine already produces the raw-count matrix shown in the question.
However, the code is badly written, and it does not compute the tf-idf weights yet; it only counts how many times each word occurs in each sentence. I'm posting it anyway so everyone can have a look. Try not to laugh too hard.
(The current path contains a corpus folder with 9 txt files, one for each of the nine sentences above, and a stoplist folder containing stop_list.txt, the file name the code expects, which holds the English stop words.)
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.swing.JOptionPane;

public class LatentSemanticAnalysis {
    // Every non-stop-word seen in the corpus, in order of first appearance.
    private final ArrayList<String> termsList = new ArrayList<String>();
    // Terms kept for the matrix: those that appear in at least two documents.
    private final ArrayList<String> terms = new ArrayList<String>();
    private ArrayList<String> stopList = new ArrayList<String>();
    // Number of documents in which each word appears.
    private final HashMap<String, Integer> documentFrequence = new HashMap<String, Integer>();
    // terms_document_matrix[i][j] = occurrences of term i in document j.
    private int[][] terms_document_matrix;
    private String stoplistFilePath = "stoplist" + File.separator + "stop_list.txt";
    private final String stringReg = "\\b[A-Za-z]+\\b";

    public LatentSemanticAnalysis() {
    }

    public LatentSemanticAnalysis(String path) throws IOException {
        analyze(path);
    }

    // Set the path of the stop-word list file.
    public void setStoplistFilePath(String filePath) {
        this.stoplistFilePath = filePath;
    }

    // Get the path of the stop-word list file.
    public String getStopListFilePath() {
        return stoplistFilePath;
    }

    // Read a text file and return every word that matches reg, lower-cased.
    // Used for the stop-word list and for the corpus files alike.
    public ArrayList<String> stoplistReader(String filePath, String reg) throws IOException {
        ArrayList<String> words = new ArrayList<String>();
        Pattern p = Pattern.compile(reg);
        try (BufferedReader bfr = new BufferedReader(new FileReader(filePath))) {
            for (String str = bfr.readLine(); str != null; str = bfr.readLine()) {
                Matcher m = p.matcher(str.toLowerCase());
                while (m.find()) {
                    words.add(m.group());
                }
            }
        }
        return words;
    }

    // Build and print the term-document matrix for the files in the corpus directory.
    public void analyze(String path) throws IOException {
        File[] files = new File(path).listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + path);
        }
        Arrays.sort(files); // listFiles() does not guarantee any order

        // Read the stop-word list.
        stopList = stoplistReader(stoplistFilePath, stringReg);

        // Read each document once and count, per word, the documents that contain it.
        ArrayList<ArrayList<String>> documents = new ArrayList<ArrayList<String>>();
        for (File file : files) {
            ArrayList<String> words = stoplistReader(file.getPath(), stringReg);
            documents.add(words);
            for (String word : new LinkedHashSet<String>(words)) {
                if (stopList.contains(word)) {
                    continue;
                }
                if (!documentFrequence.containsKey(word)) {
                    termsList.add(word);
                }
                documentFrequence.put(word, documentFrequence.containsKey(word)
                        ? documentFrequence.get(word) + 1 : 1);
            }
        }

        // Keep only the terms that appear in at least two documents.
        for (String word : termsList) {
            if (documentFrequence.get(word) >= 2) {
                terms.add(word);
            }
        }
        System.out.println("\n Terms:\n" + terms);
        System.out.println("\n The original Term_Document Matrix is :\n");

        terms_document_matrix = new int[terms.size()][files.length];
        for (int t = 0; t < terms.size(); t++) {
            for (int j = 0; j < files.length; j++) {
                // Number of occurrences of the term in one document.
                terms_document_matrix[t][j] = Collections.frequency(documents.get(j), terms.get(t));
                System.out.print(terms_document_matrix[t][j] + "\t");
            }
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
        lsa.setStoplistFilePath("stoplist" + File.separator + "stop_list.txt");
        String corpusPath = JOptionPane.showInputDialog("Please choose the directory! ");
        if (corpusPath != null) { // null means the dialog was cancelled
            lsa.analyze(corpusPath);
        }
    }
}
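
The program above stops at raw counts, which is exactly the part the poster says is missing. A minimal sketch of the remaining step, turning a count matrix into tf-idf weights, might look like the following. The class `TfIdf` and the method `weigh` are hypothetical names, and the formula (raw count times the natural log of N/df) is one common choice that the thread does not confirm.

```java
class TfIdf {
    // counts[i][j] = occurrences of term i in document j.
    // Returns weights[i][j] = counts[i][j] * ln(N / df_i), where N is the
    // number of documents and df_i the number of documents containing term i.
    static double[][] weigh(int[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[][] weights = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            int df = 0;
            for (int j = 0; j < docs; j++) {
                if (counts[i][j] > 0) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // term never occurs: leave the row at zero
            }
            double idf = Math.log((double) docs / df);
            for (int j = 0; j < docs; j++) {
                weights[i][j] = counts[i][j] * idf;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        int[][] counts = {
            {1, 0, 0, 1, 0, 0, 0, 0, 0}, // human
            {0, 1, 1, 2, 0, 0, 0, 0, 0}, // system
        };
        for (double[] row : weigh(counts)) {
            StringBuilder line = new StringBuilder();
            for (double w : row) {
                line.append(String.format("%.3f", w)).append('\t');
            }
            System.out.println(line);
        }
    }
}
```

With this variant a word that occurs in every document gets weight 0, since ln(N/N) = 0. Passing the matrix computed by the original program to `weigh` would give the weighted matrix A the question asks for, under the stated assumption about the formula.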
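Some experts may look down on this question, but I would still like to ask, because we are working on the same assignment and this is not my major, so I really don't understand it. Where exactly should the files go when you say "in the current path"? Inside the "LatentSemanticAnalysis" folder?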
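
On the "current path" question: a relative path such as the stoplist path in the code is resolved by Java against the JVM's working directory (the `user.dir` system property), which is the directory the program was started from, not necessarily the folder that holds the source file. When running from an IDE this is often the project root, but that depends on the run configuration. A small sketch for checking where the folders are expected (the class `WorkingDirectory` and method `resolve` are illustrative names only):

```java
import java.io.File;

class WorkingDirectory {
    // Turns a relative path into the absolute location Java will actually use.
    static File resolve(String relativePath) {
        return new File(relativePath).getAbsoluteFile();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("corpus   -> " + resolve("corpus"));
        System.out.println("stoplist -> "
                + resolve("stoplist" + File.separator + "stop_list.txt"));
    }
}
```

Running this from the same place as the main program prints the directories where the corpus and stoplist folders need to be.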