
python - Elasticsearch scroll upper limit - Python API

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:16:48

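When scrolling in blocks of a given size, is there a way to set an upper limit on the number of documents retrieved through the Python API? Say I want at most 100K documents, scrolled in blocks of 2K, from an index that holds more than 10 million documents.

I have implemented a counter-like object, but I would like to know whether there is a more natural solution.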

from elasticsearch import Elasticsearch

es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX",
    doc_type="DOC_TYPE",
    body=es_query,
    size=2000,
    scroll="1m")

data = []
for hit in result["hits"]["hits"]:
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
do_something(*args)


scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]

i = 0
while scroll_size > 0:
    if i % 10000 == 0:
        print("Scrolling ({})...".format(i))

    result = es.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = result["_scroll_id"]
    scroll_size = len(result['hits']['hits'])

    data = []
    for hit in result["hits"]["hits"]:
        for d in hit["_source"]["attributes"]["data_of_interest"]:
            data.append(d)
    do_something(*args)

    i += 1
    if i == 100000:
        break

Best answer

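In my opinion, if you only want the first 100K documents, you should narrow the query first. That will speed up the whole process. For example, you could add a date filter.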
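As a rough illustration of narrowing the query, the sketch below wraps the original function_score query around a range filter. The field name "timestamp", the value "now-7d/d" and the helper with_date_filter are assumptions made for this example; the question does not say which date field the index has.

```python
import copy


def with_date_filter(query_body, field, gte):
    """Return a copy of a function_score query restricted by a range filter.

    `field` and `gte` are placeholders: use the date field your mapping
    actually defines.
    """
    narrowed = copy.deepcopy(query_body)
    # function_score accepts an inner "query"; only documents matching it are scored.
    narrowed["query"]["function_score"]["query"] = {
        "bool": {"filter": [{"range": {field: {"gte": gte}}}]}
    }
    return narrowed


es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}

# Hypothetical field name and date math expression, for illustration only.
narrowed_query = with_date_filter(es_query, "timestamp", "now-7d/d")
```

The narrowed body can be passed to es.search in place of es_query; the random scoring is unchanged and only applies to documents that pass the filter.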

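As for the code, I don't know of any approach other than a counter. I just fixed the indentation and removed the if statement to improve readability.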

from elasticsearch import Elasticsearch

MAX_DOCS = 100000  # upper limit on the number of documents to retrieve

es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)


result = es.search(
    index="INDEX",
    doc_type="DOC_TYPE",
    body=es_query,
    size=2000,
    scroll="1m")

data = []
for hit in result["hits"]["hits"]:
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
do_something(*args)

scroll_id = result['_scroll_id']
# Size of the current block; the loop stops once a block comes back empty.
scroll_size = len(result["hits"]["hits"])

# i counts the documents retrieved so far, including the first block.
i = scroll_size
while scroll_size > 0 and i < MAX_DOCS:

    print("Scrolling ({})...".format(i))

    result = es.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = result["_scroll_id"]
    scroll_size = len(result['hits']['hits'])

    # data = []  # why redefine the list?
    for hit in result["hits"]["hits"]:
        for d in hit["_source"]["attributes"]["data_of_interest"]:
            data.append(d)
    do_something(*args)
    i += scroll_size
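One way to keep the counter out of the loop body is to wrap the scroll calls in a generator and cap it with itertools.islice. The following is a minimal sketch, not the answer author's method: the helper names iter_scroll_hits and first_n_hits are made up for this example, doc_type is left out for brevity, and the client is assumed to expose search, scroll and clear_scroll with the keyword arguments used in the code above.

```python
from itertools import islice


def iter_scroll_hits(client, index, body, size=2000, scroll="1m"):
    """Yield hits one at a time, requesting the next scroll block lazily."""
    result = client.search(index=index, body=body, size=size, scroll=scroll)
    scroll_id = result.get("_scroll_id")
    try:
        while result["hits"]["hits"]:
            for hit in result["hits"]["hits"]:
                yield hit
            result = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = result.get("_scroll_id", scroll_id)
    finally:
        # Free the server-side scroll context, also when the caller stops early.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)


def first_n_hits(client, index, body, limit, size=2000, scroll="1m"):
    """Return at most `limit` hits, fetched in blocks of `size`."""
    hits = iter_scroll_hits(client, index, body, size=size, scroll=scroll)
    try:
        return list(islice(hits, limit))
    finally:
        hits.close()  # triggers the generator's cleanup if it is still suspended
```

With the names from the question this would be called as first_n_hits(es, "INDEX", es_query, limit=100000). Because the generator is lazy, no further scroll request is sent once the limit is reached. The client library also ships elasticsearch.helpers.scan, which is itself a generator over scroll hits and could probably be capped with islice in the same way; note that by default it does not preserve the query's sort order.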

Regarding python - Elasticsearch scroll upper limit - Python API, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48196940/
