java - Dictionary-based search optimization in Java

Repost. Author: 行者123. Updated: 2023-12-01 18:18:45

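I have a Sentences class. Each instance of this class represents one sentence from a text file.

I read each sentence from the file and turn it into an instance of my Sentences class. For each sentence, I need to check how many stop words/function words it contains.

I have a text file (stopwords.txt) containing English stop words.

How should I design my program so that I don't have to read stopwords.txt again and again for every sentence? Instead, I should "somehow" keep the contents of this file (the stop words) in memory and then check which words in my sentence are stop words.

I have a lot of sentences, and I need this program to be as fast as possible.

Edit:

I created a StopWords class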

public class StopWords

I read the stopwords.txt file in this class and store the words in a HashSet.

....    
while ((entries = br.readLine()) != null){
stopWordSet.add(entries.toLowerCase());
...

Then I create an instance of the StopWords class in the Sentences class:

public class Sentences {
...
private static StopWords stopList = new StopWords("languageresources/stopword.txt");
...
}

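I read the sentences from the file and create instances of the Sentences class. The words of each sentence are stored in an ArrayList named wordList, which is passed to the dealStopWord() method of the StopWords class to check which words are stop words. Finally, I get the number of stop words with the getStopWordCount() method: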

stopList.dealStopWord(wordList);
this.totalFunctionWords = stopList.getStopWordCount();

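Edit: If I make the stopList variable a local variable in the Sentences class, the constructor is called for every sentence (that is, stopwords.txt is read once per sentence), yet this is much faster than when the stopList variable is static (that is, when stopwords.txt is read only once).

Edit

The StopWords.java class: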

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

public class StopWords {

    // Instance variables
    private String stopWordFile = ""; // name of the stopword file
    private Set<String> stopWordSet;
    private int count = 0; // number of stopwords found in a given sentence
    private String[] sortedStopWords;
    private ArrayList<String> noStopWordArray = new ArrayList<String>();

    // Constructor: takes the file containing stopwords
    public StopWords(String fileName) {
        System.out.println("Stoplist constructor called");
        this.stopWordFile = fileName;
        FileReader stopWordFile = null;
        try {
            stopWordFile = new FileReader(this.stopWordFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        BufferedReader br = new BufferedReader(stopWordFile);
        String entries;
        stopWordSet = new TreeSet<String>();
        try {
            while ((entries = br.readLine()) != null) {
                stopWordSet.add(entries.toLowerCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Copy the sorted set into an array for binary search
        sortedStopWords = new String[stopWordSet.size()];
        int i = 0;
        Iterator<String> itr = stopWordSet.iterator();
        while (itr.hasNext()) {
            sortedStopWords[i++] = itr.next();
        }// end while

    }// public StopWords (String fileName)

    // Count the stopwords in a sentence (the sentence comes in as an ArrayList of words);
    // the result is read afterwards with getStopWordCount()
    public void dealStopWord(ArrayList<String> wordArray) {

        this.count = 0;
        String temp = "";
        int size = wordArray.size();
        for (int i = 0; i < size; i++) {
            temp = wordArray.get(i).toLowerCase();
            int found = Arrays.binarySearch(sortedStopWords, temp);
            if (found >= 0) {
                this.count++;
            }// end if
            else {
                this.noStopWordArray.add(wordArray.get(i));
            }

        }// end for

    }

    public ArrayList<String> getNoStopWordArray() {

        return this.noStopWordArray;

    }// public ArrayList <String> getNoStopWordArray()

    public int getStopWordCount() {

        return this.count;

    }// public int getStopWordCount()

}// public class StopWords
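
A side observation on the class above: noStopWordArray is an instance field that dealStopWord() appends to and never clears, so a single static StopWords instance accumulates the non-stop words of every sentence processed, while a new instance per sentence starts with an empty list. Whether that accounts for the timing difference reported above has not been verified. The following is a minimal sketch, not the author's code, of a variant that returns per-sentence results instead of storing them in fields. The names StatelessStopWords and Result are hypothetical, and file loading is left out: the stop words are passed in as a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stateless variant: per-sentence results are returned, not kept in fields.
final class StatelessStopWords {

    // Hypothetical holder for the result of processing one sentence.
    static final class Result {
        final int stopWordCount;
        final List<String> nonStopWords;

        Result(int stopWordCount, List<String> nonStopWords) {
            this.stopWordCount = stopWordCount;
            this.nonStopWords = nonStopWords;
        }
    }

    private final String[] sortedStopWords;

    // Assumption: the caller has already read the stop words (for example, once from a file).
    StatelessStopWords(List<String> stopWords) {
        String[] copy = new String[stopWords.size()];
        for (int i = 0; i < copy.length; i++) {
            copy[i] = stopWords.get(i).toLowerCase();
        }
        Arrays.sort(copy); // Arrays.binarySearch requires a sorted array
        this.sortedStopWords = copy;
    }

    // Counts stop words and collects the remaining words for one sentence.
    Result dealStopWord(List<String> words) {
        int count = 0;
        List<String> nonStopWords = new ArrayList<>(); // fresh list on every call
        for (String word : words) {
            if (Arrays.binarySearch(sortedStopWords, word.toLowerCase()) >= 0) {
                count++;
            } else {
                nonStopWords.add(word);
            }
        }
        return new Result(count, nonStopWords);
    }
}
```

Because no per-sentence state lives in the object, one shared instance can serve every sentence without its memory use growing with the number of sentences processed.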

Part of the Sentences.java class:

public class Sentences {
    static StopWords stopList = new StopWords("languageresources/stopword.txt");

    public void setFunctionAndContentWords() {
        // If I make the stopList variable local here, the code is much faster
        stopList.dealStopWord(this.wordList); // at this point, the # of stop words and the sentence without stop words are generated
        this.totalFunctionWords = stopList.getStopWordCount(); // setting the feature here.
        // ...set up done.
    }// end method
}

This is how I create instances of the Sentences class:

Sentences[] s = new Sentences[totalSentences]; //sentence object..
for (int i = 0; i < totalSentences; i++){

System.out.println("Processing sentence # " + (i+1));


s[i].setFunctionAndContentWords();
}

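Best answer

Maybe you could use a HashSet. Before you start reading the sentences, put all the stop words into a HashSet. Then, for each word, check whether it is a stop word using: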

stopWordsHashSet.contains(word);
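
To make the suggestion concrete, here is a minimal sketch of loading the stop words once and counting them per sentence with HashSet.contains(). It is an illustration under stated assumptions rather than the asker's code: the class name StopWordCounter and its methods are hypothetical, the file is assumed to hold one stop word per line in UTF-8, and blank lines and surrounding whitespace are skipped (the original code does not do this).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: holds the stop words in a HashSet and keeps no per-sentence state.
final class StopWordCounter {

    private final Set<String> stopWords;

    StopWordCounter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Reads the stop word file once. Assumed format: one word per line, UTF-8.
    static StopWordCounter fromFile(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim().toLowerCase();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return new StopWordCounter(words);
    }

    // Counts the stop words in one sentence, given as a list of words.
    int countStopWords(List<String> sentenceWords) {
        int count = 0;
        for (String word : sentenceWords) {
            if (stopWords.contains(word.toLowerCase())) {
                count++;
            }
        }
        return count;
    }
}
```

The counter would be created once, before the loop over the sentences, and countStopWords() called for each sentence's word list. A HashSet lookup takes constant time on average, whereas a binary search over a sorted array takes time logarithmic in the number of stop words.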

Regarding java - Dictionary-based search optimization in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28154799/
