gpt4 book ai didi

c# - 如何在 Lucene.NET 中删除复数?

转载 作者:太空宇宙 更新时间:2023-11-03 11:01:56 25 4
gpt4 key购买 nike

我正在尝试从文本中提取一些关键字。它工作得很好,但我需要删除复数。

由于我已经将 Lucene 用于搜索目的,因此我尝试使用它从索引术语中提取关键字。

首先,我在 RAMDirectory 索引中索引文档,

RAMDirectory idx = new RAMDirectory();
using (IndexWriter writer =
new IndexWriter(
idx,
new CustomStandardAnalyzer(StopWords.Get(this.Language),
Lucene.Net.Util.Version.LUCENE_30, this.Language),
IndexWriter.MaxFieldLength.LIMITED))
{
writer.AddDocument(createDocument(this._text));
writer.Optimize();
}

然后,我提取关键字:

var list = new List<KeyValuePair<int, string>>();
using (var reader = IndexReader.Open(directory, true))
{
var tv = reader.GetTermFreqVector(0, "text");
if (tv != null)
{
string[] terms = tv.GetTerms();
int[] freq = tv.GetTermFrequencies();

for (int i = 0; i < terms.Length; i++)
list.Add(new KeyValuePair<int, string>(freq[i], terms[i]));
}
}

在术语列表中我可以有像“总统”和“总统”这样的术语
我怎样才能删除它?
我的 CustomStandardAnalyzer 使用这个:

public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
//create the tokenizer
TokenStream result = new StandardTokenizer(this.version, reader);

//add in filters
result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, this.getStemmer());
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
result = new StopFilter(true, result, this.stopWords ?? StopWords.English);

return result;
}

所以我已经使用了 SnowballFilter(带有正确的语言特定词干分析器)。我怎样才能删除复数?

最佳答案

以下程序的输出是:

text:and
text:presid
text:some
text:text
text:with
class Program
{
private class CustomStandardAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
//create the tokenizer
TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
//add in filters
result = new Lucene.Net.Analysis.Snowball.SnowballFilter(result, new EnglishStemmer());
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
result = new StopFilter(true, result, new HashSet<string>());
return result;
}
}

private static Document createDocument(string text)
{
Document d = new Document();
Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
f.SetValue(text);
d.Add(f);
return d;
}

static void Main(string[] args)
{

RAMDirectory idx = new RAMDirectory();
using (IndexWriter writer =
new IndexWriter(
idx,
new CustomStandardAnalyzer(),
IndexWriter.MaxFieldLength.LIMITED))
{
writer.AddDocument(createDocument("some text with president and presidents"));
writer.Commit();
}

using (var reader = IndexReader.Open(idx, true))
{
var terms = reader.Terms(new Term("text", ""));
if (terms.Term != null)
do
Console.WriteLine(terms.Term);
while (terms.Next());
}
Console.ReadLine();

}
}

关于c# - 如何在 Lucene.NET 中删除复数?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17390690/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com