
java - How to extract document term vectors in Lucene 3.5.0

Reposted. Author: 搜寻专家. Updated: 2023-11-01 01:37:24

I am using Lucene 3.5.0 and I want to output the term vector of each document. For example, I want to know how frequently a term occurs across all documents and within each individual document. My indexing code is:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
        }

        String indexDir = args[0];
        String dataDir = args[1];
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                (filter == null || filter.accept(f))) {
                // Open the file itself, not a path built from f.getName():
                // the bare name drops the data-dir prefix and resolves
                // against the working directory instead.
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}

Can anyone help me write a program that does this? Thanks.

Best Answer

First, you don't need to store term vectors just to find out how often a term occurs in each document. Lucene already stores these numbers for its TF-IDF calculations. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
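As a minimal sketch of that approach against the Lucene 3.5 API: the snippet below opens the index, asks for the document frequency of a term, then walks the TermDocs enumeration to print the per-document frequency. The field name "contents" matches the indexing code above; the word "lucene" and the index path in args[0] are arbitrary placeholders.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermFreqPrinter {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            Term term = new Term("contents", "lucene");
            // Number of documents that contain the term at least once.
            System.out.println("docFreq: " + reader.docFreq(term));
            // Iterate over the postings: one (doc, freq) pair per document.
            TermDocs termDocs = reader.termDocs(term);
            try {
                while (termDocs.next()) {
                    System.out.println("doc " + termDocs.doc()
                            + ": freq " + termDocs.freq());
                }
            } finally {
                termDocs.close();
            }
        } finally {
            reader.close();
        }
    }
}
```

Note that the terms in the postings are the analyzed tokens, so with StandardAnalyzer you should look up the lowercased form.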

If you have some other purpose in mind and you really do need access to the term vectors, then you need to tell Lucene to store them by passing Field.TermVector.YES as the last argument of the Field constructor. Then you can retrieve the vectors, for example with IndexReader.getTermFreqVector().
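Sketching that second route: the indexing code above would change its contents field to `new Field("contents", new FileReader(f), Field.TermVector.YES)`, and reading the stored vectors back could then look like the snippet below. It walks every live document, fetches the vector for the "contents" field (the field name is taken from the question's code; args[0] is again the index directory), and prints each term with its in-document frequency.

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;

public class TermVectorPrinter {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                if (reader.isDeleted(docId)) {
                    continue;
                }
                // Null if no term vector was stored for this doc/field.
                TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
                if (vector == null) {
                    continue;
                }
                String[] terms = vector.getTerms();
                int[] freqs = vector.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    System.out.println("doc " + docId + ": "
                            + terms[i] + " -> " + freqs[i]);
                }
            }
        } finally {
            reader.close();
        }
    }
}
```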

Regarding "java - How to extract document term vectors in Lucene 3.5.0", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8776794/
