
Python: Merging two CSV files into multi-level JSON

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:47:23

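I am new to Python/JSON, so please bear with me. I could do this in R, but we need to use Python so that it can be carried over to Python/Spark/MongoDB. Also, I am only posting a minimal subset. I have a few more file types, so if someone can help me with this, I can build on it to integrate more files and file types.

Getting back to my problem:

I have two tsv input files that I need to merge and convert to JSON. Both files have gene and sample columns plus some additional columns. However, the gene and sample values may or may not overlap, as shown here: f2.tsv has all the genes in f1.tsv but also has an extra gene, g3. Similarly, both files have overlapping as well as non-overlapping values in the sample column.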

# f1.tsv – has gene, sample and additional column other1

$ cat f1.tsv
gene sample other1
g1 s1 a1
g1 s2 b1
g1 s3a c1
g2 s4 d1

# f2.tsv – has gene, sample and additional columns other21, other22

$ cat f2.tsv
gene sample other21 other22
g1 s1 a21 a22
g1 s2 b21 b22
g1 s3b c21 c22
g2 s4 d21 d22
g3 s5 f21 f22

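The genes form the top level, each gene has multiple samples which form the second level, and the additional columns form the extras, which are the third level. The extras are split in two because one file has other1 and the second file has other21 and other22. Other files that I will include later will have other fields such as other31, other32 and so on, but they will still have the gene and sample columns.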

# expected output – JSON by combining both tsv files. 
$ cat output.json
[{
  "gene":"g1",
  "samples":[
    {
      "sample":"s2",
      "extras":[
        {
          "other1":"b1"
        },
        {
          "other21":"b21",
          "other22":"b22"
        }
      ]
    },
    {
      "sample":"s1",
      "extras":[
        {
          "other1":"a1"
        },
        {
          "other21":"a21",
          "other22":"a22"
        }
      ]
    },
    {
      "sample":"s3b",
      "extras":[
        {
          "other21":"c21",
          "other22":"c22"
        }
      ]
    },
    {
      "sample":"s3a",
      "extras":[
        {
          "other1":"c1"
        }
      ]
    }
  ]
},{
  "gene":"g2",
  "samples":[
    {
      "sample":"s4",
      "extras":[
        {
          "other1":"d1"
        },
        {
          "other21":"d21",
          "other22":"d22"
        }
      ]
    }
  ]
},{
  "gene":"g3",
  "samples":[
    {
      "sample":"s5",
      "extras":[
        {
          "other21":"f21",
          "other22":"f22"
        }
      ]
    }
  ]
}]

How do I convert two CSV files into a single multi-level JSON based on the two common columns?

I would really appreciate any help with this.

Thanks!

Best answer

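Here is another option. I tried to make it easy to manage as you start adding more files. You can run it on the command line and pass arguments, one for each file you want to add. The gene and sample names are stored in dictionaries for efficiency. The formatting of your desired JSON object is done in each class's format() method. Hope this helps.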
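Before the full script, the dictionary idea can be illustrated on its own. This is only a minimal sketch, not the answer's code: it assumes the rows have already been read into dicts, and `merge_rows`, `f1_rows` and `f2_rows` are hypothetical names introduced here. For every row it looks up the gene, then the sample, creating each entry if it is missing, and appends the remaining columns as one extras dict.

```python
# Illustrative subset of the rows from the question's f1.tsv and f2.tsv.
f1_rows = [
    {"gene": "g1", "sample": "s1", "other1": "a1"},
    {"gene": "g1", "sample": "s3a", "other1": "c1"},
]
f2_rows = [
    {"gene": "g1", "sample": "s1", "other21": "a21", "other22": "a22"},
    {"gene": "g3", "sample": "s5", "other21": "f21", "other22": "f22"},
]


def merge_rows(*row_sets):
    """Group rows by gene, then by sample, collecting the other columns."""
    index = {}  # gene name -> {sample name -> list of extras dicts}
    for rows in row_sets:
        for row in rows:
            # Everything except the two key columns counts as "extras".
            extras = {k: v for k, v in row.items() if k not in ("gene", "sample")}
            samples = index.setdefault(row["gene"], {})
            samples.setdefault(row["sample"], []).append(extras)
    return index


index = merge_rows(f1_rows, f2_rows)
print(index["g1"]["s1"])
# prints [{'other1': 'a1'}, {'other21': 'a21', 'other22': 'a22'}]
```

Because both lookups are dictionary accesses, a row from a later file finds its existing gene and sample without scanning a list. The script that follows wraps the same create-or-add step in classes.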

import csv
import json
import sys

class Sample(object):
    def __init__(self, name, extras):
        self.name = name
        # one extras dict per input file that mentions this sample
        self.extras = [extras]

    def format(self):
        formatted = {}
        formatted['sample'] = self.name
        formatted['extras'] = self.extras
        return formatted

    def add_extras(self, extras):
        #edit 8/20
        #always just add the new extras to the list
        for extra in extras:
            self.extras.append(extra)

class Gene(object):
    def __init__(self, name, samples):
        self.name = name
        # dict mapping sample name -> Sample
        self.samples = samples

    def format(self):
        formatted = {}
        formatted['gene'] = self.name
        formatted['samples'] = sorted([self.samples[sample_key].format() for sample_key in self.samples], key=lambda sample: sample['sample'])
        return formatted

    def create_or_add_samples(self, new_samples):
        # loop through new samples, seeing if they already exist in the gene object
        for sample_name in new_samples:
            sample = new_samples[sample_name]
            if sample.name in self.samples:
                self.samples[sample.name].add_extras(sample.extras)
            else:
                self.samples[sample.name] = sample

class Genes(object):
    def __init__(self):
        # dict mapping gene name -> Gene
        self.genes = {}

    def format(self):
        return sorted([self.genes[gene_name].format() for gene_name in self.genes], key=lambda gene: gene['gene'])

    def create_or_add_gene(self, gene):
        if gene.name not in self.genes:
            self.genes[gene.name] = gene
        else:
            self.genes[gene.name].create_or_add_samples(gene.samples)

def row_to_gene(headers, row):
    gene_name = ""
    sample_name = ""
    extras = {}
    # pair each cell with its column header by position
    for index, value in enumerate(row):
        if headers[index] == "gene":
            gene_name = value
        elif headers[index] == "sample":
            sample_name = value
        else:
            extras[headers[index]] = value
    sample_dict = {}
    sample_dict[sample_name] = Sample(sample_name, extras)
    return Gene(gene_name, sample_dict)

if __name__ == '__main__':
    delim = "\t"
    genes = Genes()
    files = sys.argv[1:]

    for file in files:
        print("Reading " + str(file))
        with open(file, 'r', newline='') as f1:
            reader = csv.reader(f1, delimiter=delim)
            headers = []
            for row in reader:
                if len(headers) == 0:
                    # the first row holds the column names
                    headers = row
                else:
                    genes.create_or_add_gene(row_to_gene(headers, row))

    result = json.dumps(genes.format(), indent=4)
    print(result)
    with open('json_output.txt', 'w') as output:
        output.write(result)
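The script above takes its file names directly from sys.argv and hard-codes the tab delimiter and the output file name. If those need to vary as more file types are added, one possible approach is argparse from the standard library. This is only a sketch under assumptions: the `parse_args` helper and the `--delimiter` and `--output` options are hypothetical and not part of the answer.

```python
import argparse


def parse_args(argv=None):
    """Parse the input files plus two hypothetical options (not in the answer)."""
    parser = argparse.ArgumentParser(
        description="Merge delimited files into nested JSON.")
    parser.add_argument("files", nargs="+",
                        help="input files that have gene and sample columns")
    parser.add_argument("--delimiter", default="\t",
                        help="field separator (default: tab)")
    parser.add_argument("--output", default="json_output.txt",
                        help="path of the JSON file to write")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try without a real command line.
args = parse_args(["f1.tsv", "f2.tsv", "--delimiter", ","])
print(args.files, repr(args.delimiter), args.output)
# prints ['f1.tsv', 'f2.tsv'] ',' json_output.txt
```

In the main block, `args.files`, `args.delimiter` and `args.output` would then stand in for `sys.argv[1:]`, `delim` and the literal 'json_output.txt'.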

Regarding "Python: Merging two CSV files into multi-level JSON", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39043323/
