Looking for a free English word-segmentation (tokenization) component or code

It is mainly for use in a search engine.
For example, the input: you are a nice girl
should be segmented as: you|are|a|nice girl|
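One way to read the example above is whitespace tokenization followed by merging known multi-word phrases. Here is a minimal sketch of that idea, not a finished component: the class name `PhraseTokenizer` and the one-entry phrase dictionary `PHRASES` are assumptions made up for illustration, and a real search engine would load its phrase list from data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseTokenizer {
    // Hypothetical phrase dictionary (assumption): two-word phrases to keep together.
    private static final Set<String> PHRASES =
            new HashSet<>(Arrays.asList("nice girl"));

    /** Splits on whitespace, then merges adjacent word pairs found in PHRASES. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            if (i + 1 < words.length) {
                String pair = words[i] + " " + words[i + 1];
                if (PHRASES.contains(pair.toLowerCase())) {
                    tokens.add(pair); // keep the phrase as one token
                    i += 2;
                    continue;
                }
            }
            tokens.add(words[i]);
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: you|are|a|nice girl
        System.out.println(String.join("|", tokenize("you are a nice girl")));
    }
}
```

Only two-word phrases are handled here; longer phrases would need a longest-match lookup instead of the fixed pair check.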

Solutions »


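The first task I took over in the lab was to write an English word-segmentation program. It has to turn a string such as: Books in tuneBoxes are for Chinese-Children! into: Book in tune Box are for Chinese child. In other words, it converts plurals to singular, splits run-together words at capitalized initial letters, and so on. I think the plural-to-singular conversion is fairly thorough and covers the vast majority of cases. The splitting on capital letters is less well thought out: a word like NEC should obviously be kept whole, but here it gets split into three words. I am trying to improve that.

// Imports needed by the method below (the method itself belongs inside a class)
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.Vector;

/**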
 * Word segmentation.
 *
 * @param source
 *            the string to segment
 * @return String[] the resulting tokens
 */
public String[] fenci(String source) {
    /* Set of delimiter characters */
    String delimiters = " \t\n\r\f~!@#$%^&*()_+|`1234567890-=\\{}[]:\";'<>?,./'";
    /* Split on the delimiters */
    StringTokenizer stringTokenizer = new StringTokenizer(source, delimiters);
    Vector vector = new Vector();
    /* Split on capitalized initial letters */
    while (stringTokenizer.hasMoreTokens()) {
        String token = stringTokenizer.nextToken();
        int index = 0;
        flag1: while (index < token.length()) {
            /* Skip the first character, then stop at the next non-lowercase one */
            flag2: while (true) {
                index++;
                if ((index == token.length())
                        || !Character.isLowerCase(token.charAt(index))) {
                    break flag2;
                }
            }
            vector.addElement(token.substring(0, index));
            // System.out.println("Recognized: " + token.substring(0, index));
            token = token.substring(index);
            // System.out.println("Remaining: " + token);
            index = 0;
            continue flag1;
        }
    }
    /*
     * The plural-to-singular rules follow this document:
     * http://ftp.haie.edu.cn/Resource/GZ/GZYY/DCYFWF/NJSYYY/421b0061ZW_0015.htm
     */
    for (int i = 0; i < vector.size(); i++) {
        String token = (String) vector.elementAt(i);
        /* Irregular plurals first, then suffix rules */
        if (token.equalsIgnoreCase("feet")) {
            token = "foot";
        } else if (token.equalsIgnoreCase("geese")) {
            token = "goose";
        } else if (token.equalsIgnoreCase("lice")) {
            token = "louse";
        } else if (token.equalsIgnoreCase("mice")) {
            token = "mouse";
        } else if (token.equalsIgnoreCase("teeth")) {
            token = "tooth";
        } else if (token.equalsIgnoreCase("oxen")) {
            token = "ox";
        } else if (token.equalsIgnoreCase("children")) {
            token = "child";
        } else if (token.endsWith("men")) {
            token = token.substring(0, token.length() - 3) + "man";
        } else if (token.endsWith("ies")) {
            token = token.substring(0, token.length() - 3) + "y";
        } else if (token.endsWith("ves")) {
            if (token.equalsIgnoreCase("knives")
                    || token.equalsIgnoreCase("wives")
                    || token.equalsIgnoreCase("lives")) {
                token = token.substring(0, token.length() - 3) + "fe";
            } else {
                token = token.substring(0, token.length() - 3) + "f";
            }
        } else if (token.endsWith("oes") || token.endsWith("ches")
                || token.endsWith("shes") || token.endsWith("ses")
                || token.endsWith("xes")) {
            token = token.substring(0, token.length() - 2);
        } else if (token.endsWith("s")) {
            token = token.substring(0, token.length() - 1);
        }
        /* Done with this token */
        vector.setElementAt(token, i);
    }
    /* Convert to an array */
    String[] array = new String[vector.size()];
    Enumeration enumeration = vector.elements();
    int index = 0;
    while (enumeration.hasMoreElements()) {
        array[index] = (String) enumeration.nextElement();
        index++;
    }
    /* Print the tokens */
    for (int i = 0; i < array.length; i++) {
        System.out.println(array[i]);
    }
    /* Return the result */
    return array;
}
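For the NEC problem mentioned above, one possible improvement is to split only where a lowercase letter is followed by an uppercase one, or where a run of capitals meets a capitalized word. This is a sketch under that assumption, not the poster's method; the class name `CaseSplitter` and the `BOUNDARY` pattern are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CaseSplitter {
    // Zero-width boundaries (assumed rule set):
    //   lowercase followed by uppercase:            tune|Boxes
    //   uppercase run followed by Capitalized word: NEC|Boxes
    private static final String BOUNDARY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Splits a token at case boundaries while keeping all-caps runs intact. */
    public static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        for (String part : token.split(BOUNDARY)) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("NEC"));       // prints [NEC]
        System.out.println(split("tuneBoxes")); // prints [tune, Boxes]
        System.out.println(split("NECBoxes"));  // prints [NEC, Boxes]
    }
}
```

A call like `CaseSplitter.split(token)` could stand in for the labeled loops in `fenci`. Note that a single capital in front of a capitalized word (for example AChild) is still split off as its own token.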
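Can't TTS do word segmentation already?
Take a look at the Speech API; there is now a wrapper for it under .NET as well.
Same request here. I urgently need a component like this for a project, preferably a .NET version.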