
python - Fastest way to extract values from an array?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:55:56

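I have a set of 11 million documents in Elasticsearch, and each document has an array of identifiers. Each identifier is a dictionary containing a type, a value, and a date. Here is a sample record: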

{
    "name": "Bob",
    "identifiers": [
        {
            "date": "2019-01-01",
            "type": "a",
            "value": "abcd"
        },
        {
            "date": "2019-01-01",
            "type": "b",
            "value": "efgh"
        }
    ]
}

I need to transfer these records to a Parquet data store every night, keeping only the identifier values in the array. Like this:

{
    "name": "Bob",
    "identifiers": ["abcd", "efgh"]
}

I do this by looping over all the records and flattening the identifiers. Here is my flattening transformer:

    def _transform_identifier_values(self, identifiers: List[dict]):
        ret = [
            identifier['value']
            for identifier in identifiers
        ]
        return ret

This works, but it is slow. Is there a faster way to do this? Perhaps a native implementation I could take advantage of?

Edit:

I tried Sunny's suggestions. To my surprise, the original version actually performed best. I had assumed itemgetter would be faster.

Here is how I tested:

import time
from functools import partial
from operator import itemgetter


def main():

    docs = []
    for i in range(10_000_000):
        docs.append({
            'name': 'Bob',
            'identifiers': [
                {
                    'date': '2019-01-01',
                    'type': 'a',
                    'value': 'abcd'
                },
                {
                    'date': '2019-01-01',
                    'type': 'b',
                    'value': 'efgh'
                }
            ]
        })

    start = time.time()
    for doc in docs:
        _transform_identifier_values_original(doc['identifiers'])
    end = time.time()

    print(f'Original took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_getter(doc['identifiers'])
    end = time.time()

    print(f'Item getter took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial_lambda(doc['identifiers'])
    end = time.time()

    print(f'Lambda partial took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial(doc['identifiers'])
    end = time.time()

    print(f'Partial took {end-start} seconds')


def _transform_identifier_values_original(identifiers):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret


def _transform_identifier_values_getter(identifiers):
    return list(map(itemgetter('value'), identifiers))


def _transform_identifier_values_partial_lambda(identifiers):
    flatten_ids = partial(lambda o: list(map(itemgetter('value'), o)))
    return flatten_ids(identifiers)


def _transform_identifier_values_partial(identifiers):
    flatten = partial(map, itemgetter('value'))
    return list(flatten(identifiers))


if __name__ == '__main__':
    main()

Results:

Original took 4.6204328536987305 seconds

Item getter took 7.186180114746094 seconds

Lambda partial took 10.534514904022217 seconds

Partial took 9.07079291343689 seconds
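
One thing the test above does not isolate: the itemgetter and partial variants build a new itemgetter('value') (and a new partial) on every call, so with only two identifiers per record that setup cost is paid 10 million times. A minimal sketch of a comparison that creates the getter once and times each variant with timeit follows. The names get_value, flatten_comprehension, flatten_getter_hoisted, make_docs and bench are hypothetical, the record count is reduced to keep the run short, and no particular outcome is claimed; the numbers will depend on the interpreter and the data.

```python
import timeit
from operator import itemgetter

# Hypothetical module-level getter: built once instead of on every call.
get_value = itemgetter('value')


def flatten_comprehension(identifiers):
    """The original approach: a plain list comprehension."""
    return [identifier['value'] for identifier in identifiers]


def flatten_getter_hoisted(identifiers):
    """map() with a pre-built itemgetter."""
    return list(map(get_value, identifiers))


def make_docs(n):
    """Build n sample records shaped like the one in the question."""
    return [
        {
            'name': 'Bob',
            'identifiers': [
                {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
                {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
            ],
        }
        for _ in range(n)
    ]


def bench(func, docs, repeat=3):
    """Return the best wall-clock time over `repeat` full passes."""
    def run():
        for doc in docs:
            func(doc['identifiers'])
    return min(timeit.repeat(run, number=1, repeat=repeat))


if __name__ == '__main__':
    docs = make_docs(100_000)  # assumption: smaller than the 10M used above
    for func in (flatten_comprehension, flatten_getter_hoisted):
        print(f'{func.__name__}: {bench(func, docs):.4f} seconds')
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it does not change what is being measured.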

Best answer

Here is the solution I came up with:

def changeJSON(dictionary):
    new_dict = {'name': dictionary['name'], 'identifiers': []}
    for i in dictionary['identifiers']:
        new_dict['identifiers'].append(i['value'])
    return new_dict

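This function takes a single dictionary and returns a dictionary in the new format you want. You can then pass a list of the converted dictionaries to the json.dumps() function from the built-in json library, which serializes them to a JSON string; to write straight to a file, use json.dump() with an open file object instead.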
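
As a rough usage sketch of the step described above, the following converts a list of records and serializes the result. The changeJSON body here is rewritten as a comprehension with the same behavior as the answer's loop, and the records, flattened and as_text names are hypothetical.

```python
import json


def changeJSON(dictionary):
    # Same result as the loop-based version, written as a comprehension.
    return {
        'name': dictionary['name'],
        'identifiers': [i['value'] for i in dictionary['identifiers']],
    }


# Hypothetical input: one record shaped like the sample in the question.
records = [
    {
        'name': 'Bob',
        'identifiers': [
            {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
            {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
        ],
    },
]

flattened = [changeJSON(record) for record in records]

# json.dumps returns a string; json.dump(flattened, file_obj) would
# write to an already-open file object instead.
as_text = json.dumps(flattened)
print(as_text)  # [{"name": "Bob", "identifiers": ["abcd", "efgh"]}]
```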

Regarding "python - Fastest way to extract values from an array?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59534709/
