
performance - Optimizing an aggregation query over multiple filters

Reposted · Author: 行者123 · Updated: 2023-12-03 01:52:13

I have 13,000 web pages whose body text has been indexed. The goal is to get the top 200 phrase frequencies for one-word, two-word, three-word, and so on up to eight-word phrases.

There are more than 150 million words across these pages to tokenize.

The problem is that the query runs for about 15 minutes, then exhausts heap space and fails to complete.

I'm testing this on an Ubuntu server with 4 CPU cores, 8 GB of RAM, and an SSD. 6 GB of RAM is allocated to the heap. Swap is disabled.

Right now I can get this done by splitting it across 8 separate indices; the query/settings/mapping combination works when limited to single-word phrases. That is, I can run this separately for one-word phrases, two-word phrases, and so on, and get the results I'm after (although each run still takes around 5 minutes). I'd like to know whether there is a way to tune this complete aggregation so that it works with a single index and a single query on my hardware.

Settings and mappings:

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "analyzer_shingle_2": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_2"]
          },
          "analyzer_shingle_3": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_3"]
          },
          "analyzer_shingle_4": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_4"]
          },
          "analyzer_shingle_5": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_5"]
          },
          "analyzer_shingle_6": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_6"]
          },
          "analyzer_shingle_7": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_7"]
          },
          "analyzer_shingle_8": {
            "tokenizer": "standard",
            "filter": ["standard", "lowercase", "filter_shingle_8"]
          }
        },
        "filter": {
          "filter_shingle_2": {
            "type": "shingle",
            "max_shingle_size": 2,
            "min_shingle_size": 2,
            "output_unigrams": "false"
          },
          "filter_shingle_3": {
            "type": "shingle",
            "max_shingle_size": 3,
            "min_shingle_size": 3,
            "output_unigrams": "false"
          },
          "filter_shingle_4": {
            "type": "shingle",
            "max_shingle_size": 4,
            "min_shingle_size": 4,
            "output_unigrams": "false"
          },
          "filter_shingle_5": {
            "type": "shingle",
            "max_shingle_size": 5,
            "min_shingle_size": 5,
            "output_unigrams": "false"
          },
          "filter_shingle_6": {
            "type": "shingle",
            "max_shingle_size": 6,
            "min_shingle_size": 6,
            "output_unigrams": "false"
          },
          "filter_shingle_7": {
            "type": "shingle",
            "max_shingle_size": 7,
            "min_shingle_size": 7,
            "output_unigrams": "false"
          },
          "filter_shingle_8": {
            "type": "shingle",
            "max_shingle_size": 8,
            "min_shingle_size": 8,
            "output_unigrams": "false"
          }
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "multi_field",
          "fields": {
            "two-word-phrases": {
              "analyzer": "analyzer_shingle_2",
              "type": "string"
            },
            "three-word-phrases": {
              "analyzer": "analyzer_shingle_3",
              "type": "string"
            },
            "four-word-phrases": {
              "analyzer": "analyzer_shingle_4",
              "type": "string"
            },
            "five-word-phrases": {
              "analyzer": "analyzer_shingle_5",
              "type": "string"
            },
            "six-word-phrases": {
              "analyzer": "analyzer_shingle_6",
              "type": "string"
            },
            "seven-word-phrases": {
              "analyzer": "analyzer_shingle_7",
              "type": "string"
            },
            "eight-word-phrases": {
              "analyzer": "analyzer_shingle_8",
              "type": "string"
            }
          }
        }
      }
    }
  }
}
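For reference, each filter_shingle_N above emits only exact N-token sequences, because min_shingle_size and max_shingle_size are equal and unigram output is disabled. A minimal Python sketch of that behavior (using whitespace splitting in place of the standard tokenizer):

```python
def shingles(text, n):
    """Mimic a shingle filter with min_shingle_size == max_shingle_size == n
    and output_unigrams == false: every run of n consecutive lowercase tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `shingles("The quick brown fox", 2)` yields `["the quick", "quick brown", "brown fox"]`, and a text shorter than n tokens yields nothing.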

Query:
{
  "size": 0,
  "aggs": {
    "one-word-phrases": {
      "terms": { "field": "body", "size": 200 }
    },
    "two-word-phrases": {
      "terms": { "field": "body.two-word-phrases", "size": 200 }
    },
    "three-word-phrases": {
      "terms": { "field": "body.three-word-phrases", "size": 200 }
    },
    "four-word-phrases": {
      "terms": { "field": "body.four-word-phrases", "size": 200 }
    },
    "five-word-phrases": {
      "terms": { "field": "body.five-word-phrases", "size": 200 }
    },
    "six-word-phrases": {
      "terms": { "field": "body.six-word-phrases", "size": 200 }
    },
    "seven-word-phrases": {
      "terms": { "field": "body.seven-word-phrases", "size": 200 }
    },
    "eight-word-phrases": {
      "terms": { "field": "body.eight-word-phrases", "size": 200 }
    }
  }
}
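Since the question notes that running each phrase length separately does complete, one practical fallback is to issue one aggregation per request rather than all eight at once, so only one phrase-length field's terms are held in heap at a time. A hypothetical helper that builds such a single-aggregation body (aggregation and field names taken from the mapping above):

```python
def phrase_agg_request(agg_name, field, size=200):
    """Build a request body containing a single terms aggregation,
    e.g. phrase_agg_request("two-word-phrases", "body.two-word-phrases")."""
    return {
        "size": 0,  # no hits, aggregation results only
        "aggs": {agg_name: {"terms": {"field": field, "size": size}}},
    }
```

Each body can then be POSTed to the index's `_search` endpoint in turn.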

Best Answer

Do you really need to hold the entire collection in memory? Your analysis could be rewritten as a batch pipeline with modest resource requirements:

  • Parse each crawled site and write its shingles out to a set of flat files (see "n-grams in python, four, five, six grams?")
  • Sort the shingle output files
  • Parse the sorted shingle files and write shingle-count files
  • Parse all the shingle-count files and write a master aggregated shingle-count file
  • Sort it in descending order

  • (This kind of work is typically done, and parallelized, with UNIX pipelines.)
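The steps above can be sketched in a few lines of Python. This is an illustrative in-memory version using a Counter; the flat-file sort/merge steps in the pipeline matter once the shingle set no longer fits in RAM:

```python
from collections import Counter
import heapq

def top_phrases(documents, n, limit=200):
    """Batch alternative to the in-memory aggregation: stream documents,
    count n-word shingles, keep only the top `limit` by frequency."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return heapq.nlargest(limit, counts.items(), key=lambda kv: kv[1])
```

Run once per phrase length (n = 1..8); because each pass streams the corpus and keeps only counts, memory stays bounded by the number of distinct shingles rather than the token count.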

    Alternatively, you could simply run it with more memory.

    Regarding "performance - Optimizing an aggregation query over multiple filters", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39419784/
