
java - Removing stop words from a string in Java

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 04:04:51

I have a string containing many words, and a text file listing stop words that I need to remove from that string. Suppose I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

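I have this mostly working, but the problem is that whenever the string contains adjacent stop words, only the first one is removed, and the result I get is: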

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here is my stop-word list .txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
    FileReader fr=new FileReader("F:\\stopwordslist.txt");
    BufferedReader br= new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null){
        stopwords[k]=sCurrentLine;
        k++;
    }
    String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words){
        wordsList.add(word);
    }
    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList){
        System.out.print(str+" ");
    }
}catch(Exception ex){
    System.out.println(ex);
}
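
A likely explanation for the symptom described above, offered as a sketch rather than a confirmed diagnosis: wordsList.remove(ii) shifts every later element one position to the left, and the outer loop then increments ii, so the word immediately after a removed stop word is never checked. Separately, stopwords[jj].contains(...) tests whether the word is a substring of the stop word rather than equal to it. The version below avoids both issues. The class name StopWordFilter, the method removeStopWords, and the inline stop-word set are hypothetical stand-ins, since the contents of the actual stop-word file are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {

    // Removes every whitespace-separated token that exactly matches
    // a stop word, ignoring case.
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // removeIf tests each element exactly once, so adjacent
        // stop words are not skipped the way they are with remove(ii).
        words.removeIf(w -> stopWords.contains(w.toLowerCase(Locale.ROOT)));
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Assumption: a small hand-written set standing in for the file contents.
        Set<String> stopWords = new HashSet<>(Arrays.asList(
                "i", "this", "its", "and", "there's", "so", "new",
                "things", "with", "of", "i've", "seen", "some"));
        String s = "I love this phone, its super fast and there's so much new "
                + "and cool things with jelly bean....but of recently I've seen some bugs.";
        // prints: love phone, super fast much cool jelly bean....but recently bugs.
        System.out.println(removeStopWords(s, stopWords));
    }
}
```

Storing the stop words in a HashSet also replaces the inner loop with a single lookup. If the index-based loop is kept instead, iterating from the end of the list toward the start, or decrementing ii after each removal, should avoid the skipping as well.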

Best answer

Here is a more elegant solution (IMHO) that uses only a regular expression:

// requires java.util.regex.Pattern and java.util.regex.Matcher
// instead of the ".....", add all your stopwords, separated by "|"
// "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
// the "\\s?" is to suppress optional trailing white space
Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
String s = m.replaceAll("");
System.out.println(s);
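
Because the alternation above has to list every stop word, it may be easier to build the pattern from a collection than to type it by hand. The following is a minimal sketch of that idea, not part of the original answer. The names StopWordRegex, buildPattern, and removeStopWords are hypothetical. Case-insensitive matching is an assumption carried over from the question's use of toLowerCase(), and sorting longer words first is an added design choice so that "I've" is tried before "I".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordRegex {

    // Builds \b(?:w1|w2|...)\b\s? from the given stop words.
    static Pattern buildPattern(List<String> stopWords) {
        List<String> sorted = new ArrayList<>(stopWords);
        // Longest first, so a longer stop word wins over one that is its prefix.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        String alternation = sorted.stream()
                .map(Pattern::quote) // treat each stop word as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    static String removeStopWords(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text; // an empty alternation would match at every word boundary
        }
        return buildPattern(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumption: a short sample list; a real list would be read from the file.
        List<String> stopWords = Arrays.asList("I", "this", "its");
        // prints: love phone, super fast
        System.out.println(removeStopWords("I love this phone, its super fast", stopWords));
    }
}
```

One caveat with word boundaries: an apostrophe counts as a boundary, so if "there" is in the list but "there's" is not, the match stops before the apostrophe and leaves "'s" behind.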

Regarding java - Removing stop words from a string in Java, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27685839/
