json - Parsing complex nested JSON in Pig

Reposted. Author: 可可西里. Updated: 2023-11-01 15:58:43

I want to load the billionaires JSON dataset into Pig. The JSON file can be found here.

Each entry looks like this:

{
  "wealth": {
    "worth in billions": 1.2,
    "how": {
      "category": "Resource Related",
      "from emerging": true,
      "industry": "Mining and metals",
      "was political": false,
      "inherited": true,
      "was founder": true
    },
    "type": "privatized and resources"
  },
  "company": {
    "sector": "aluminum",
    "founded": 1993,
    "type": "privatization",
    "name": "Guangdong Dongyangguang Aluminum",
    "relationship": "owner"
  },
  "rank": 1372,
  "location": {
    "gdp": 0.0,
    "region": "East Asia",
    "citizenship": "China",
    "country code": "CHN"
  },
  "year": 2014,
  "demographics": {
    "gender": "male",
    "age": 50
  },
  "name": "Zhang Zhongneng"
}

Attempt 1

I tried loading this data in grunt with the following command:

billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth: (worth in billions:double, how: (category:chararray, from emerging:chararray, industry:chararray, was political:chararray, inherited:chararray, was founder:chararray), type:chararray), company: (sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country code:chararray), year:int, demographics: (gender:chararray,age:int), name:chararray');

However, this gave me the error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN

Attempt 2

Next, I tried the loader from Twitter's elephantbird project, com.twitter.elephantbird.pig.load.JsonLoader. Here is the code for this UDF. This is what I did:

billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;

Now it runs, and I get no errors! But nothing is displayed. I get output like this:

Input(s): Successfully read 0 records (1445335 bytes) from: "hdfs://localhost:9000/user/purak/billionaires.json"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1478889184960_0005

What am I doing wrong here?

Best answer

This may not be the best way, but here is what I ended up doing:

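  1. Remove spaces from the field names: in the JSON dataset I replaced fields such as "worth in billions" and "from emerging" with "worth_in_billions", "from_emerging", and so on. (A simple find and replace was enough for this.)

  2. Convert comma-separated JSON to newline-delimited JSON: the JSON file I had was formatted as [{"_comment":"first entry" ...},{"_comment":"second entry" ...}]. But JsonLoader in Pig treats each line as a new entry. To make the JSON file newline-delimited instead of comma-separated, I used jq, a command-line JSON processor. Install it with sudo apt-get install jq and run cat billionaires.json | jq -c ".[]" > newBillionaires.json

  3. The newBillionaires.json file now has each entry on its own line. Now load this file into Pig with the following commands: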

    copyFromLocal /home/purak/Desktop/newBillionaires.json /user/purak

billionaires = LOAD 'newBillionaires.json' USING JsonLoader('name:chararray, demographics: (age:int,gender:chararray),year:int,location:(country_code:chararray,citizenship:chararray,region:chararray,gdp:double),rank:int,company: (relationship:chararray,name:chararray,type:chararray,founded:int,sector:chararray), wealth:(type:chararray,how:(was_founder:chararray,inherited:chararray,was_political:chararray,industry:chararray, from_emerging:chararray,category:chararray),worth_in_billions:double)');

Note: using jq reversed the order of the fields in each entry. So in this load command all fields appear in the reverse order compared with the load command in the question.

  4. You can now unnest each tuple:

billionairesFinal = foreach billionaires generate name, demographics.age as age, demographics.gender as gender, year, location.country_code as countryCode, location.citizenship as citizenship, location.region as region, location.gdp as gdp, rank, company.relationship as companyRelationship, company.name as companyName, company.type as companyType, company.founded as companyFounded, company.sector as companySector, wealth.type as wealthType, wealth.how.was_founder as wasFounder, wealth.how.inherited as inherited, wealth.how.was_political as wasPolitical, wealth.how.industry as industry, wealth.how.from_emerging as fromEmerging, wealth.how.category as category, wealth.worth_in_billions as worthInBillions;

  5. Check the structure once with describe billionairesFinal;:

billionairesFinal: {name: chararray,age: int,gender: chararray,year: int,countryCode: chararray,citizenship: chararray,region: chararray,gdp: double,rank: int,companyRelationship: chararray,companyName: chararray,companyType: chararray,companyFounded: int,companySector: chararray,wealthType: chararray,wasFounder: chararray,inherited: chararray,wasPolitical: chararray,industry: chararray,fromEmerging: chararray,category: chararray,worthInBillions: double}

This is the data structure I wanted in Pig! Now I can go ahead and analyze the dataset :)
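
The two preprocessing steps (removing spaces from field names and writing one entry per line) could also be done in a single pass with a short script instead of find-and-replace plus a command-line tool. The following is a minimal Python sketch, assuming the input file is a single top-level JSON array as described. The function names rename_keys, to_json_lines and convert are hypothetical helpers made up for this illustration, not part of Pig or any library.

```python
import json


def rename_keys(value):
    """Recursively replace spaces in dict keys with underscores.

    Only keys are touched; string values such as "East Asia" stay as they are.
    """
    if isinstance(value, dict):
        return {key.replace(" ", "_"): rename_keys(val) for key, val in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value


def to_json_lines(records):
    """Serialize each record as one compact JSON object per line."""
    lines = (json.dumps(rename_keys(rec), separators=(",", ":")) for rec in records)
    return "\n".join(lines) + "\n"


def convert(src_path, dst_path):
    """Read a top-level JSON array and write newline-delimited JSON.

    Assumption: src_path holds [{...}, {...}, ...] as described in step 2.
    """
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_json_lines(records))
```

Calling convert("billionaires.json", "newBillionaires.json") would produce the file loaded in step 3. One difference to keep in mind: json.dumps keeps keys in the order they were read, so the fields would stay in the original file's order rather than the reversed order noted above, and the schema passed to JsonLoader would need to be written to match.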

Regarding json - Parsing complex nested JSON in Pig, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40566949/
