
java - Elasticsearch: cannot fetch more than 10 documents with a Java API query

Reposted. Author: 行者123. Updated: 2023-12-02 11:01:23

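I am reading file paths from an index named documents, then using Java code to read each file and index its contents into another index named documents_attachment.

In the first step I cannot fetch more than 10 records at a time: the search returns only 10 documents from the documents index, which holds more than 100,000 records.

How can I fetch all 100,000 records at once?

I tried searchSourceBuilder.size(10000); with that, indexing proceeds up to 10,000 records but no further, and this approach does not let me pass a size larger than 10000.

Please find the Java code I am using below.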

public class DocumentIndex {

private final static String INDEX = "documents";
private final static String ATTACHMENT = "document_attachment";
private final static String TYPE = "doc";
private static final Logger logger = Logger.getLogger(Thread.currentThread().getStackTrace()[0].getClassName());

public static void main(String args[]) throws IOException {


RestHighLevelClient restHighLevelClient = null;
Document doc=new Document();

logger.info("Started Indexing the Document.....");

try {
restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
new HttpHost("localhost", 9201, "http")));
} catch (Exception e) {
System.out.println(e.getMessage());
}


//Fetching Id, FilePath & FileName from Document Index.
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder qb = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(qb);
//searchSourceBuilder.size(10000);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = null;
try {
searchResponse = restHighLevelClient.search(searchRequest);
} catch (IOException e) {
e.getLocalizedMessage();
}

SearchHit[] searchHits = searchResponse.getHits().getHits();
long totalHits=searchResponse.getHits().totalHits;
logger.info("Total Hits --->"+totalHits);


File all_files_path = new File("d:\\All_Files_Path.txt");
File available_files = new File("d:\\Available_Files.txt");
File missing_files = new File("d:\\Missing_Files.txt");
all_files_path.deleteOnExit();
available_files.deleteOnExit();
missing_files.deleteOnExit();
all_files_path.createNewFile();
available_files.createNewFile();
missing_files.createNewFile();

int totalFilePath=1;
int totalAvailableFile=1;
int missingFilecount=1;

Map<String, Object> jsonMap ;
for (SearchHit hit : searchHits) {

String encodedfile = null;
File file=null;

Map<String, Object> sourceAsMap = hit.getSourceAsMap();


if(sourceAsMap != null) {
doc.setId((int) sourceAsMap.get("id"));
doc.setApp_language(String.valueOf(sourceAsMap.get("app_language")));
}

String filepath=doc.getPath().concat(doc.getFilename());



try(PrintWriter out = new PrintWriter(new FileOutputStream(all_files_path, true)) ){
out.println("FilePath Count ---"+totalFilePath+":::::::ID---> "+doc.getId()+"File Path --->"+filepath);
}

file = new File(filepath);
if(file.exists() && !file.isDirectory()) {
try {
try(PrintWriter out = new PrintWriter(new FileOutputStream(available_files, true)) ){
out.println("Available File Count --->"+totalAvailableFile+":::::::ID---> "+doc.getId()+"File Path --->"+filepath);
totalAvailableFile++;
}
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fileInputStreamReader.read(bytes);
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
fileInputStreamReader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
else
{
PrintWriter out = new PrintWriter(new FileOutputStream(missing_files, true));
out.close();
missingFilecount++;
}

jsonMap = new HashMap<>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", encodedfile);

String id=Long.toString(doc.getId());

IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
.source(jsonMap)
.setPipeline(ATTACHMENT);

PrintStream printStream = new PrintStream(new File("d:\\exception.txt"));
try {
IndexResponse response = restHighLevelClient.index(request);

} catch(ElasticsearchException e) {
if (e.status() == RestStatus.CONFLICT) {
}
e.printStackTrace(printStream);
}

totalFilePath++;


}

logger.info("Indexing done.....");
}

}

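Best answer

If you have enough memory, increase the index setting index.max_result_window from 10000 to the number you need.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings

Note, however, that this does not scale indefinitely. A search request takes heap memory and time proportional to from + size. The setting exists to bound that memory; if you set it too high, you will run out of memory.

The simplest way to change the setting is through the REST API: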
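To make the from + size cost concrete, here is a rough back-of-the-envelope sketch. It relies on the fact that each shard has to collect its own top from + size hits before the coordinating node merges them. The class name, method names, and the shard count of 5 are assumptions for illustration only; check the actual number of primary shards of your index.

```java
public class DeepPagingCost {

    // Each shard must build a priority queue of its top (from + size) hits.
    static long hitsPerShard(long from, long size) {
        return Math.addExact(from, size);
    }

    // The coordinating node receives that many entries from every shard
    // and has to sort them all before discarding everything outside the page.
    static long hitsAtCoordinator(long from, long size, int shards) {
        return Math.multiplyExact(hitsPerShard(from, size), (long) shards);
    }

    public static void main(String[] args) {
        int shards = 5; // hypothetical shard count, not taken from the question

        // Default request: from = 0, size = 10.
        System.out.println(hitsAtCoordinator(0, 10, shards));      // prints 50

        // One request for 150000 hits, as allowed by the raised window.
        System.out.println(hitsAtCoordinator(0, 150000, shards));  // prints 750000
    }
}
```

The numbers count queue entries, not bytes, so they only show how the work grows with the window, not how much heap a particular cluster will need.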

PUT /my-index/_settings
{
"index" : {
"max_result_window" : 150000
}
}
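If you would rather apply the setting from Java, the same request can be sent with nothing but the JDK. The following is a minimal sketch, not the method used by the original poster: the class name, the method names, and the base URL are assumptions, and it presumes an unsecured cluster reachable over plain HTTP.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class MaxResultWindowSetting {

    // Builds the same JSON body as the REST call above.
    static String settingsBody(int maxResultWindow) {
        if (maxResultWindow <= 0) {
            throw new IllegalArgumentException("maxResultWindow must be positive");
        }
        return "{\"index\":{\"max_result_window\":" + maxResultWindow + "}}";
    }

    // Sends PUT <baseUrl>/<index>/_settings and returns the HTTP status code.
    // baseUrl is a hypothetical value such as "http://localhost:9200".
    static int apply(String baseUrl, String index, int maxResultWindow) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create(baseUrl + "/" + index + "/_settings").toURL().openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(settingsBody(maxResultWindow).getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) {
        // Only prints the request body; call apply(...) against a running cluster to send it.
        System.out.println(settingsBody(150000));
    }
}
```

Raising the window only lifts the upper bound. The search request still returns 10 hits by default, so the query in the question also needs an explicit searchSourceBuilder.size(...) no larger than the new window.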

For more on java - Elasticsearch: cannot fetch more than 10 documents with a Java API query, see the similar question found on Stack Overflow: https://stackoverflow.com/questions/51305094/
