
python - Combining two Python data-processing scripts into a single workflow

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:19:12

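I am currently working on a data-processing task.

I have two Python scripts, each of which does one separate job, but they operate on the same data. I think they could be combined into a single workflow, but I can't work out the most logical way to do it.

The data file is here; it is JSON, but it has two distinct parts.

The first part looks like this: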

{
  "links": {
    "self": "http://localhost:2510/api/v2/jobs?skills=data%20science"
  },
  "data": [
    {
      "id": 121,
      "type": "job",
      "attributes": {
        "title": "Data Scientist",
        "date": "2014-01-22T15:25:00.000Z",
        "description": "Data scientists are in increasingly high demand amongst tech companies in London. Generally a combination of business acumen and technical skills are sought. Big data experience ..."
      },
      "relationships": {
        "location": {
          "links": {
            "self": "http://localhost:2510/api/v2/jobs/121/location"
          },
          "data": {
            "type": "location",
            "id": 3
          }
        },
        "country": {
          "links": {
            "self": "http://localhost:2510/api/v2/jobs/121/country"
          },
          "data": {
            "type": "country",
            "id": 1
          }
        },

It is processed by the first Python script, shown here:

import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

# count how many job records reference each location id
locations = defaultdict(int)

for item in data['data']:
    location = item['relationships']['location']['data']['id']
    locations[location] += 1

pprint(locations)

which renders the data in this form:

1: 6,
2: 20,
3: 2673,
4: 126,
5: 459,
6: 346,
8: 11,
9: 68,
10: 82,

These are the location "id" values and the number of records assigned to each location.

The other part of the JSON object looks like this:

"included": [
  {
    "id": 3,
    "type": "location",
    "attributes": {
      "name": "Victoria",
      "coord": [
        51.503378,
        -0.139134
      ]
    }
  },

and it is processed by this Python file:

import json
from collections import defaultdict
from pprint import pprint

with open('data-science.txt') as data_file:
    data = json.load(data_file)

locations = defaultdict(int)

# print the id, name and coordinates of every record in the 'included' block
for record in data['included']:
    id = record.get('id', None)
    name = record.get('attributes', {}).get('name', None)
    coord = record.get('attributes', {}).get('coord', None)
    print(id, name, coord)

It outputs the data in this format:

3 Victoria [51.503378, -0.139134]
1 United Kingdom None
71 data science None
32 None None
3 Victoria [51.503378, -0.139134]
1 United Kingdom None
1 data mining None
22 data analysis None
33 sdlc None
38 artificial intelligence None
39 machine learning None
40 software development None
71 data science None
93 devops None
63 None None
52 Cubitt Town [51.505199, -0.018848]

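What I really want is for the final output to look like this:

3, Victoria, [51.503378, -0.139134], 2673

where 2673 is the job count from the first script.

If a record has no coordinates (such as [51.503378, -0.139134]), I can throw it away.

I'm sure these scripts can be combined to produce that output, but I'm not much of a big-picture thinker and I can't work out how to do it.

All the real project files live here.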

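Best answer

Using functions is one way to combine the two scripts; after all, they process the same data. You should create one function for each block of processing logic and then merge the results at the end: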

import json
from collections import defaultdict
from pprint import pprint

def process_locations_data(data):
    # processes the 'data' block: counts job records per location id
    locations = defaultdict(int)
    for item in data['data']:
        location = item['relationships']['location']['data']['id']
        locations[location] += 1
    return locations

def process_locations_included(data):
    # processes the 'included' block
    return_list = []
    for record in data['included']:
        id = record.get('id', None)
        name = record.get('attributes', {}).get('name', None)
        coord = record.get('attributes', {}).get('coord', None)
        return_list.append((id, name, coord))
    return return_list  # return list of tuples

# load the data from file once
with open('data-science.txt') as data_file:
    data = json.load(data_file)

# use the two functions on same data
locations = process_locations_data(data)
records = process_locations_included(data)

# combine the data for printing
for record in records:
    id, name, coord = record
    references = locations[id]  # lookup the references in the dict
    print(id, name, coord, references)

The functions could have better names, but this should achieve the unification you are looking for.
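One gap worth noting: the question asks for records without coordinates to be dropped, and the sample output shows both repeated records and ids shared across different kinds of record (id 1 appears as "United Kingdom" and as "data mining"). Below is a minimal sketch of how the combined step might handle that. It assumes every entry in 'included' carries a 'type' field, as the location sample does; summarize_locations and the inline sample data are hypothetical and not part of the original scripts.

```python
from collections import Counter

def summarize_locations(data):
    """Return (id, name, coord, job_count) rows for locations that have coordinates."""
    # Tally jobs per location id from the 'data' block.
    counts = Counter(
        item['relationships']['location']['data']['id']
        for item in data['data']
    )
    rows = []
    seen = set()
    for record in data['included']:
        # Assumption: non-location records are marked by a different 'type',
        # so filtering on it avoids matching a skill or country with the same id.
        if record.get('type') != 'location':
            continue
        loc_id = record.get('id')
        attributes = record.get('attributes', {})
        coord = attributes.get('coord')
        # Drop locations without coordinates and repeats of an id already emitted.
        if not coord or loc_id in seen:
            continue
        seen.add(loc_id)
        rows.append((loc_id, attributes.get('name'), coord, counts[loc_id]))
    return rows

# Hypothetical sample shaped like the two JSON fragments in the question.
sample = {
    'data': [
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
        {'relationships': {'location': {'data': {'type': 'location', 'id': 3}}}},
    ],
    'included': [
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
        {'id': 1, 'type': 'country', 'attributes': {'name': 'United Kingdom'}},
        {'id': 3, 'type': 'location',
         'attributes': {'name': 'Victoria', 'coord': [51.503378, -0.139134]}},
    ],
}

for loc_id, name, coord, count in summarize_locations(sample):
    print(f"{loc_id}, {name}, {coord}, {count}")
```

Using Counter instead of defaultdict(int) means that looking up an id with no jobs returns 0 without adding a new key to the tally. To run this against the real file, load it once with json.load as in the answer and pass the result to summarize_locations.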

Regarding "python - Combining two Python data-processing scripts into a single workflow", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40896330/
