
python - Merging JSON objects from multiple files into one file, line by line

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 21:07:25

I have a directory full of JSON files, like this:

json/
checkpoint_01.json
checkpoint_02.json
...
checkpoint_100.json

where each file contains thousands of JSON objects, dumped one per line.

{"playlist_id": "37i9dQZF1DZ06evO2dqn7O", "user_id": "spotify", "sentence": ["Lil Wayne", "Wiz Khalifa", "Imagine Dragons", "Logic", "Ty Dolla $ign", "X Ambassadors", "Machine Gun Kelly", "X Ambassadors", "Bebe Rexha", "X Ambassadors", "Jamie N Commons", "X Ambassadors", "Eminem", "X Ambassadors", "Jamie N Commons", "Skylar Grey", "X Ambassadors", "Zedd", "Logic", "X Ambassadors", "Imagine Dragons", "X Ambassadors", "Jamie N Commons", "A$AP Ferg", "X Ambassadors", "Tom Morello", "X Ambassadors", "The Knocks", "X Ambassadors"]}
{"playlist_id": "37i9dQZF1DZ06evO1A0kr6", "user_id": "spotify", "sentence": ["RY X", "ODESZA", "RY X", "Thomas Jack", "RY X", "Rhye", "RY X"]}
(...)
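Each line is a self-contained JSON object, so any single record can be decoded on its own without touching the rest of the file. A minimal sketch, using an abbreviated version of the sample record above:

```python
import json

# One line from a checkpoint file (abbreviated sample record)
line = '{"playlist_id": "37i9dQZF1DZ06evO1A0kr6", "user_id": "spotify", "sentence": ["RY X", "ODESZA"]}'

record = json.loads(line)       # decode one line independently
print(record["playlist_id"])    # → 37i9dQZF1DZ06evO1A0kr6
print(record["sentence"])       # → ['RY X', 'ODESZA']
```

This per-line independence is what makes a streaming merge possible later.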

I know I can merge all the files into one like this:

import glob

def combine():
    read_files = glob.glob("*.json")
    # Note: this reads every file into memory at once and wraps the
    # concatenated objects in a JSON array, not one object per line.
    with open("merged_playlists.json", "w") as outfile:
        outfile.write('[{}]'.format(
            ','.join([open(f).read() for f in read_files])))

But in the end I need to parse one big JSON file with the following script:

parser.py

"""
Passes extraction output into `word2vec`
and prints results as JSON.
"""
from __future__ import absolute_import, unicode_literals
import json
import click
from numpy import array as np_array
import gensim

class LineGenerator(object):
    """Reads a sentence file, yields numpy array-wrapped sentences
    """

    def __init__(self, fh):
        self.fh = fh

    def __iter__(self):
        # iterate the file handle directly instead of readlines(),
        # so lines are streamed rather than loaded all at once
        for line in self.fh:
            yield np_array(json.loads(line)['sentence'])


def serialize_rankings(rankings):
    """Returns a JSON-encoded object representing word2vec's
    similarity output.
    """

    return json.dumps([
        {'artist': artist, 'rel': rel}
        for (artist, rel)
        in rankings
    ])

@click.command()
@click.option('-i', 'input_file', type=click.File('r', encoding='utf-8'),
              required=True)
@click.option('-t', 'term', required=True)
@click.option('--min-count', type=click.INT, default=5)
@click.option('-w', 'workers', type=click.INT, default=4)
def cli(input_file, term, min_count, workers):
    # create word2vec
    model = gensim.models.Word2Vec(min_count=min_count, workers=workers)
    model.build_vocab(LineGenerator(input_file))

    try:
        similar = model.most_similar(term)
        click.echo(serialize_rankings(similar))
    except KeyError as exc:
        # really wish this was a more descriptive error
        exit('Could not parse input: {}'.format(exc))

if __name__ == '__main__':
    cli()

Question:

So: how do I merge all the JSON objects in the json/ folder into a single file, ending up with one JSON object per line?

Note: memory is a concern here, since the files add up to 4 GB in total.

Best answer

If memory is a concern, you will most likely want to use generators to load each line on demand. The following solution assumes Python 3:

import json

# Get a list of file paths; you can do this via os.listdir or glob.glob... however you want.
my_filenames = [...]

def stream_lines(filenames):
    for name in filenames:
        with open(name) as f:
            yield from f

lines = stream_lines(my_filenames)

def stream_json_objects_while_ignoring_errors(lines):
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:
            print("ignoring invalid JSON")

json_objects = stream_json_objects_while_ignoring_errors(lines)

for obj in json_objects:
    # Now you can loop over the JSON objects without reading
    # all the files into memory at once. Example:
    print(obj["sentence"])

Note that, for simplicity, I have left out details such as error handling, skipping empty lines, and dealing with files that fail to open.
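To produce the single merged file the question asks for, the same streaming idea can write each line out as soon as it is read, so only one line is ever held in memory. A sketch under the question's assumptions: the inputs are the `json/checkpoint_*.json` files, and `merged.json` is a hypothetical output name.

```python
import glob
import json

def merge_json_lines(pattern, out_path):
    """Stream every line from the matching files into one output file,
    keeping only valid JSON objects, one per line."""
    with open(out_path, "w") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        json.loads(line)  # validate without keeping the object
                    except ValueError:
                        continue  # ignore invalid JSON
                    out.write(line + "\n")

merge_json_lines("json/checkpoint_*.json", "merged.json")
```

The output then has exactly one JSON object per line, which is the format `LineGenerator` in parser.py already expects.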

Regarding "python - Merging JSON objects from multiple files into one file, line by line", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55310418/
