
python - Converting JSON to CSV in Azure Data Lake Store with U-SQL and Python

Reposted. Author: 行者123. Updated: 2023-12-03 04:26:20

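We need to convert some large files stored in Azure Data Lake Store from nested JSON to CSV. Since Azure Data Lake Analytics supports the Python modules pandas and numpy in addition to the standard modules, I believe this should be achievable with Python. Does anyone have Python code that does this?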

Source format:

{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}

Expected output:

TDM,location,location/80:7a:bf:d4:d6:50,974851970,1490004475,OP_UPDATE,151002334,xxxxxxx,gHq/1NZQ,977.7259,638.8827,490,1,n5THo6IINuOSVZ/cTidNVA==,7hY/jVh9NRqqxF6gbqT7Jw==,LV/ZiQRQMS2wwKiKTvYNBQ==,H5rrAD/jg1Fnkmo1Zmquau/Qn1U=,ALGORITHM_ESTIMATION,FEET

Best answer

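From your description, I understand that your key need is how to convert data stored in Azure Data Lake Store from JSON to CSV in Python using the pandas/numpy packages. I looked at your source data, assumed that the JSON contains no array types, and wrote the code below to convert the sample data.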

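Here is my sample code for a JSON object string. I added comments to explain my thinking. The key piece is the `flattern` function, which converts the structure {"A": 0, "B": {"C": 1}} into the structure [["A", "B.C"], [0, 1]].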

import json
import pandas as pd

# Source Data string
json_raw = '''{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}'''

# Load source data string to a Python dict
json_data = json.loads(json_raw)

# The key function `flattern` converts a nested `dict` into a 2D list: [keys, values]
def flattern(data, key=None):
    keys = []
    values = []
    for subkey in data:
        # Build the dotted column name, e.g. "location.staEthMac.addr"
        path = subkey if key is None else key + "." + subkey
        if isinstance(data[subkey], dict):
            # Recurse once and reuse both halves of the result
            sub_keys, sub_values = flattern(data[subkey], path)
            keys.extend(sub_keys)
            values.extend(sub_values)
        else:
            keys.append(path)
            values.append(data[subkey])
    return [keys, values]

list2D = flattern(json_data, None)
df = pd.DataFrame([list2D[1],], columns=list2D[0])

# To extract `Loc`, `Topic`, or nested items such as `location.staEthMac.addr`, list their column names.
selected = ["Loc", "Topic"]
# Select the listed columns (`.ix` is no longer available in current pandas; `.loc` replaces it).
result = df.loc[:, selected]
# Convert the DataFrame to a CSV string without header or index.
# `to_csv` also handles non-string values, which `",".join(...)` would reject.
# Note that the returned string ends with a newline.
csv_raw = result.to_csv(header=False, index=False)
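
The code above converts a single JSON object. For a large file, one possible approach is to process it record by record using only the standard library. This is a rough sketch, not part of the original answer. It assumes the file holds one JSON object per line, which the question does not state. The names `flatten`, `json_lines_to_csv`, and the shortened two-record `sample` are made up for illustration, and reading from Azure Data Lake Store is left out.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in record.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def json_lines_to_csv(lines, columns, out):
    """Write one CSV row per non-empty JSON line, keeping only `columns`."""
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore",
                            restval="", lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(flatten(json.loads(line)))


# Hypothetical input: two records shortened from the sample data (assumption)
sample = [
    '{"Loc": "TDM", "timestamp": 1488986751, '
    '"location": {"staEthMac": {"addr": "/xxxxx"}, "unit": "FEET"}}',
    '{"Loc": "TDM", "timestamp": 1488986752, "location": {"unit": "FEET"}}',
]
columns = ["Loc", "timestamp", "location.staEthMac.addr", "location.unit"]

buffer = io.StringIO()
json_lines_to_csv(sample, columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because `lines` can be any iterable, an open file object can be passed in directly so that only one record is held in memory at a time. Columns missing from a record are written as empty fields, and the `csv` module takes care of quoting values that contain commas.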

Hope this helps.

For "python - Converting JSON to CSV in Azure Data Lake Store with U-SQL and Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43016928/
