
json - Parsing Amazon Electronics reviews in Apache Pig

Reposted. Author: 行者123. Last updated: 2023-12-02 21:09:35

I have loaded the Amazon Electronics Reviews dataset (http://jmcauley.ucsd.edu/data/amazon/), 5-core (1,689,188 reviews), into Apache Pig on my Cloudera VM.

I followed another question that had already been asked:

Apache Pig error while dumping Json data



Sample review
{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
grunt> reviews = LOAD 'amazon/amazon-pro/reviews.json' USING org.apache.pig.builtin.JsonLoader('id:chararray,asin:int,reviewerName:chararray,helpful:(int),reviewText:chararray,overall:float,summary:chararray,time:int,reviewTime:chararray');

grunt> viewReview = LIMIT reviews 1;

grunt> DUMP viewReview;

I get the following error:

2016-11-17 08:05:33,797 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2016-11-17 08:05:35,897 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2016-11-17 08:05:36,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2
2016-11-17 08:05:36,532 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2
2016-11-17 08:05:37,577 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2016-11-17 08:05:38,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-11-17 08:05:38,225 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2016-11-17 08:05:38,230 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job974442700781595171.jar
2016-11-17 08:05:57,665 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job974442700781595171.jar created
2016-11-17 08:05:57,754 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2016-11-17 08:05:58,090 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2016-11-17 08:05:58,347 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2016-11-17 08:05:58,614 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.df.interval is deprecated. Instead, use fs.df.interval
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.max.objects is deprecated. Instead, use dfs.namenode.max.objects
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - hadoop.native.lib is deprecated. Instead, use io.native.lib.available
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.data.dir is deprecated. Instead, use dfs.datanode.data.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir is deprecated. Instead, use dfs.namenode.name.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.dir is deprecated. Instead, use dfs.namenode.checkpoint.dir
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.block.size is deprecated. Instead, use dfs.blocksize
2016-11-17 08:06:00,041 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.access.time.precision is deprecated. Instead, use dfs.namenode.accesstime.precision
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.min is deprecated. Instead, use dfs.namenode.replication.min
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.edits.dir is deprecated. Instead, use dfs.namenode.edits.dir
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.considerLoad is deprecated. Instead, use dfs.namenode.replication.considerLoad
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.balance.bandwidthPerSec is deprecated. Instead, use dfs.datanode.balance.bandwidthPerSec
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.threshold.pct is deprecated. Instead, use dfs.namenode.safemode.threshold-pct
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.http.address is deprecated. Instead, use dfs.namenode.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.name.dir.restore is deprecated. Instead, use dfs.namenode.name.dir.restore
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.client.keystore.resource is deprecated. Instead, use dfs.client.https.keystore.resource
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.address is deprecated. Instead, use dfs.namenode.backup.address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.backup.http.address is deprecated. Instead, use dfs.namenode.backup.http-address
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions is deprecated. Instead, use dfs.permissions.enabled
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.safemode.extension is deprecated. Instead, use dfs.namenode.safemode.extension
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.datanode.max.xcievers is deprecated. Instead, use dfs.datanode.max.transfer.threads
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.need.client.auth is deprecated. Instead, use dfs.client.https.need-auth
2016-11-17 08:06:00,042 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.replication.interval is deprecated. Instead, use dfs.namenode.replication.interval
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.edits.dir is deprecated. Instead, use dfs.namenode.checkpoint.edits.dir
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.write.packet.size is deprecated. Instead, use dfs.client-write-packet-size
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.permissions.supergroup is deprecated. Instead, use dfs.permissions.superusergroup
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.script.number.args is deprecated. Instead, use net.topology.script.number.args
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
2016-11-17 08:06:00,043 [JobControl] WARN org.apache.hadoop.conf.Configuration - dfs.secondary.http.address is deprecated. Instead, use dfs.namenode.secondary.http-address
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - fs.checkpoint.period is deprecated. Instead, use dfs.namenode.checkpoint.period
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - topology.node.switch.mapping.impl is deprecated. Instead, use net.topology.node.switch.mapping.impl
2016-11-17 08:06:00,045 [JobControl] WARN org.apache.hadoop.conf.Configuration - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-11-17 08:06:00,217 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-11-17 08:06:00,270 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 11
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201611170800_0001
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases r,reviews
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: reviews[1,10],r[2,4] C: R:
2016-11-17 08:06:01,755 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201611170800_0001
2016-11-17 08:09:30,985 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-11-17 08:09:31,500 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201611170800_0001 has failed! Stop running all dependent jobs
2016-11-17 08:09:31,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-11-17 08:09:31,596 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
 at [Source: java.io.ByteArrayInputStream@67de0c09; line: 1, column: 43]
 at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
 at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
 at org.codehaus.jackson.impl.JsonNumericParserBase._parseNumericValue(JsonNumericParserBase.java:399)
 at org.codehaus.jackson.impl.JsonNumericParserBase.getIntValue(JsonNumericParserBase.java:254)
 at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:189)
 at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
 at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
 at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
 at org.apache.hadoop.map
2016-11-17 08:09:31,597 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-11-17 08:09:31,602 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.7.0 0.11.0-cdh4.7.0 cloudera 2016-11-17 08:05:37 2016-11-17 08:09:31 LIMIT

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201611170800_0001 r,reviews Message: Job failed!

Input(s):
Failed to read data from "hdfs://localhost.localdomain:8020/user/cloudera/amazon/amazon-pro/reviews.json"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201611170800_0001 -> null,
null

2016-11-17 08:09:31,602 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-11-17 08:09:31,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias r
Details at logfile: /home/cloudera/pig_1479349681179.log

Best answer

I think the problem is in the schema definition for helpful. As in this other answer, it should look something like this:

..., helpful:{t:(score:int)}, ...
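
One way to check a schema before launching a job is to compare the declared Pig types against a sample record offline. The sketch below is Python rather than Pig and is only a diagnostic aid. It assumes that JsonLoader matches fields by position, that a Pig tuple corresponds to a JSON object, and that a bag corresponds to a JSON array. The names `find_mismatches`, `PIG_TO_JSON_KIND`, and `DECLARED_SCHEMA` are hypothetical, and the `reviewText` value is shortened from the sample in the question. Note that the stack trace shows `getIntValue` failing on a `VALUE_STRING` token, which suggests that some field declared as `int` holds a JSON string; in the sample record `asin` is such a field, so it may need changing in addition to `helpful`.

```python
import json

# Assumed correspondence between Pig types and JSON value kinds.
PIG_TO_JSON_KIND = {
    "chararray": "string",
    "int": "number",
    "long": "number",
    "float": "number",
    "double": "number",
    "tuple": "object",
    "bag": "array",
}

# Schema from the question, in declaration order; helpful:(int) is a tuple.
DECLARED_SCHEMA = [
    ("id", "chararray"), ("asin", "int"), ("reviewerName", "chararray"),
    ("helpful", "tuple"), ("reviewText", "chararray"), ("overall", "float"),
    ("summary", "chararray"), ("time", "int"), ("reviewTime", "chararray"),
]

# Sample record from the question (reviewText shortened for readability).
SAMPLE_LINE = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"reviewerName": "J. McDonald", "helpful": [2, 3], '
    '"reviewText": "I bought this for my husband who plays the piano.", '
    '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
    '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}'
)


def json_kind(value):
    """Name the JSON kind of a parsed Python value."""
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def find_mismatches(line, schema):
    """Pair schema entries with JSON fields by position and list conflicts."""
    record = json.loads(line)  # dicts keep the key order of the input
    problems = []
    for (_, pig_type), (json_name, value) in zip(schema, record.items()):
        actual = json_kind(value)
        if PIG_TO_JSON_KIND[pig_type] != actual:
            problems.append((json_name, pig_type, actual))
    return problems


if __name__ == "__main__":
    for name, declared, actual in find_mismatches(SAMPLE_LINE, DECLARED_SCHEMA):
        print(f"{name}: declared {declared}, found JSON {actual}")
```

Under these assumptions the check flags `asin` (declared `int`, found a string) and `helpful` (declared as a tuple, found an array). Whether the built-in JsonLoader in this Pig version accepts an array of bare integers as a bag is not something this sketch can confirm.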

Regarding json - Parsing Amazon Electronics reviews in Apache Pig, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40709098/
