
java - Lucene IndexWriter OutOfMemory exception

Reposted. Author: 行者123. Last updated: 2023-12-01 11:10:23

I have two large files (about 200 MB each) in one directory and want to build an index over them. Here is my code:

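import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Note: FSUtils and logger are the asker's own helpers; their definitions are not shown in the question.
public class LuceneUtils {

    // Adds one file to the index as a single document.
    private void indexDoc(IndexWriter indexWriter, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            Document document = new Document();

            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            document.add(pathField);
            document.add(new LongField("modified", lastModified, Field.Store.NO));
            // The contents are tokenized from a Reader, so the file text is not stored in the index.
            document.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

            // The original compared against CREATE_OR_APPEND, which is the configured mode,
            // so the update branch could never run. Only CREATE guarantees a new index.
            if (indexWriter.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
                // new index: no older copy of this document can exist
                indexWriter.addDocument(document);
            } else {
                // existing index: replace any older document with the same path
                indexWriter.updateDocument(new Term("path", file.toString()), document);
            }
        }
    }

    // Indexes a single file, or every file under a directory using a 3-thread pool.
    private void indexDocs(final IndexWriter indexWriter, Path path) throws ExecutionException, InterruptedException, IOException {
        if (Files.isDirectory(path)) {
            ForkJoinPool pool = new ForkJoinPool(3);
            List<Path> files = FSUtils.findAllFiles(path.toString());

            try {
                // Submitting the parallel stream from inside the pool makes it run on the pool's threads.
                pool.submit(() -> files.parallelStream().forEach(t -> {
                    try {
                        indexDoc(indexWriter, t, FSUtils.getFileBasicAttribute(t).lastModifiedTime().toMillis());
                    } catch (Exception e) {
                        logger.error(e.getMessage(), e);
                    }
                })).get();
            } finally {
                pool.shutdown();
            }
            // Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
            //     @Override
            //     public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            //         try {
            //             indexDoc(indexWriter, file, attrs.lastModifiedTime().toMillis());
            //         } catch (IOException ex) {
            //             ex.printStackTrace();
            //         }
            //         return FileVisitResult.CONTINUE;
            //     }
            // });
        } else {
            indexDoc(indexWriter, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    public void buildIndex(String pathToDocsDir, String pathToIndexDir) throws ExecutionException, InterruptedException, IOException {
        Path docPath = Paths.get(pathToDocsDir);
        Path indexPath = Paths.get(pathToIndexDir);
        long start = System.currentTimeMillis();

        try (Directory dir = FSDirectory.open(indexPath.toFile());
             Analyzer analyzer = new StandardAnalyzer()) {

            IndexWriterConfig iwc = new IndexWriterConfig(Version.LATEST, analyzer);
            iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            try (IndexWriter indexWriter = new IndexWriter(dir, iwc)) {
                indexDocs(indexWriter, docPath);
            }
        }
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException, IOException {
        LuceneUtils luceneUtils = new LuceneUtils();

        String docPath = "/home/TestFolder";
        String indexPath = "/home/IndexFolder";
        try {
            luceneUtils.buildIndex(docPath, indexPath);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}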

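As you can see from the code, I use a single IndexWriter object for both files and try to build the index in parallel. A few minutes after the program starts, I get the following exception: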

Exception in thread "main" java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError
    at java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1006)
    at com.service.utils.LuceneUtils.indexDocs(LuceneUtils.java:70)
    at com.service.utils.LuceneUtils.buildIndex(LuceneUtils.java:100)
    at com.service.utils.LuceneUtils.main(LuceneUtils.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.OutOfMemoryError
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
    at java.util.concurrent.ForkJoinTask.get(ForkJoinTask.java:1005)

Is it possible to use one IndexWriter in parallel mode? How can I solve my problem?

Best answer

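Lucene has a useful feature for parallel indexing. If you have already indexed your files into separate RAMDirectory or FSDirectory instances, you can merge them into a single index: use addIndexes to bring the partial indexes into the target index and forceMerge to complete the merge. So you can split your files into separate parts, index the parts in parallel, and merge them at the end.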
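The split, index, and merge steps described in the answer could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from the answer: it targets the same Lucene 4.10.x API the question uses (`FSDirectory.open(File)`, `IndexWriterConfig(Version, Analyzer)`), and the class name `ParallelIndexSketch`, the `part-N` directory naming, and the parameters `threads`, `workDir`, and `finalDir` are hypothetical. On Lucene 5 and later, `FSDirectory.open` takes a `Path` and the `Version` argument is dropped.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical helper class; assumes the Lucene 4.10.x API used in the question.
public class ParallelIndexSketch {

    // Indexes one group of files into its own directory with its own IndexWriter.
    static void indexPart(List<Path> files, File partDir, Analyzer analyzer) throws IOException {
        // A fresh IndexWriterConfig per writer: 4.x rejects a config shared between writers.
        try (Directory dir = FSDirectory.open(partDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LATEST, analyzer))) {
            for (Path file : files) {
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    Document doc = new Document();
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    doc.add(new TextField("contents", reader));
                    writer.addDocument(doc); // the reader is consumed here
                }
            }
        }
    }

    // Indexes each part in parallel, then merges the partial indexes into finalDir.
    static void buildMerged(List<List<Path>> parts, int threads, File workDir, File finalDir) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (Analyzer analyzer = new StandardAnalyzer()) {
            List<Future<File>> futures = new ArrayList<>();
            for (int i = 0; i < parts.size(); i++) {
                final List<Path> part = parts.get(i);
                final File partDir = new File(workDir, "part-" + i);
                futures.add(pool.submit(() -> {
                    indexPart(part, partDir, analyzer);
                    return partDir;
                }));
            }
            try (Directory target = FSDirectory.open(finalDir);
                 IndexWriter writer = new IndexWriter(target, new IndexWriterConfig(Version.LATEST, analyzer))) {
                for (Future<File> future : futures) {
                    // get() waits for the part to finish, so its writer is already closed.
                    try (Directory partIndex = FSDirectory.open(future.get())) {
                        writer.addIndexes(partIndex);
                    }
                }
                writer.forceMerge(1); // collapse the merged index into a single segment
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Each part gets its own writer and directory, so no writer state is shared between threads until the final single-threaded merge. Whether this layout avoids the OutOfMemoryError from the question depends on the JVM heap size and the writer's buffer settings, which the answer does not discuss.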

Regarding java - Lucene IndexWriter OutOfMemory exception, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32446908/
