
java - Stop word removal goes wrong

Reposted. Author: 行者123. Updated: 2023-12-01 22:21:43

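For some IR purposes, I want to extract some text snippets and remove the stop words before analysis. To do that, I created a txt file containing the stop words and then used the following code to try to remove those useless words: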

private static void stopWordRemowal() throws FileNotFoundException, IOException {

    Set<String> stopWords = new LinkedHashSet<String>();
    BufferedReader br = new BufferedReader(new FileReader("StopWord.txt"));
    for (String line; (line = br.readLine()) != null;)
        stopWords.add(line.trim());

    BufferedReader br2 = new BufferedReader(new FileReader("text"));
    FileOutputStream theNewWords = new FileOutputStream(temp);

    for (String readReady; (readReady = br2.readLine()) != null;)
    {
        StringTokenizer tokenizer = new StringTokenizer(readReady);
        String temp = tokenizer.nextToken();
        if (!stopWords.equals(temp))
        {
            theNewWords.write(temp.getBytes());
            theNewWords.write(System.getProperty("line.separator").getBytes());
        }
    }

}

But in practice it does not work well. Consider the following sample text snippet:

Text summarization is the process of extracting salient information from the source text and to present that 
information to the user in the form of summary

The output is as follows:

Text
summarization
is
the
process
of
extracting
salient
information
from
the
source
text
and
to
present
that
information
to
the
user
in
the
form
of
summary

It is almost as if the code had no effect at all, and I don't know why.

Best answer

You should use the Set's contains method rather than equals, like this:

if(!stopWords.contains(temp)) // does the set contain my string temp?

instead of

if(!stopWords.equals(temp)) // is the set equal to a string? never true
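
To make the difference concrete, here is a minimal sketch of the filtering step using contains. The class name StopWordFilter, the method removeStopWords, and the in-memory stop-word list are hypothetical stand-ins for the asker's file-based code, not part of the original post. It also loops over every token on a line with hasMoreTokens, which is an assumption about what the full code was meant to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class, used only for illustration.
class StopWordFilter {

    // Returns the whitespace-separated tokens of `line` that are not in
    // `stopWords`, preserving their original order.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // contains() tests membership. stopWords.equals(token) would compare
            // the whole Set with a String, which is always false, so every
            // token would be kept.
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stop-word list standing in for the contents of StopWord.txt.
        Set<String> stopWords = new HashSet<>(
                Arrays.asList("is", "the", "of", "from", "and", "to", "that", "in"));
        String text = "Text summarization is the process of extracting salient information";
        // Prints: [Text, summarization, process, extracting, salient, information]
        System.out.println(removeStopWords(text, stopWords));
    }
}
```

Note that contains is case-sensitive for strings, so a token such as "The" would only be removed if the stop-word list contains that exact spelling.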

Regarding "java - Stop word removal goes wrong", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29598470/
