java - Apache Pig UDF and outputSchema customization

Reposted. Author: 太空宇宙. Updated: 2023-11-04 13:07:40

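I am trying to implement a UDF that can process a variety of source/input files. The input files differ in their number of columns. My goal is a single, generic UDF. Each run of the Pig script processes one type of input file (records that all have the same number of fields, delimited by "|").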

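The UDF should read every input record, split on the delimiter (|), and, based on some condition, produce a bag containing two tuples. For example, for the input (1,2,3,4,5,6) the output could be a) {(1,3), (2,4,5,6)} or b) {(2,3,4), (1,5,6)}.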
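The split described above can be sketched in plain Java, independent of Pig. This is only an illustration of the intended behaviour: the class `RecordSplitter`, the method `split`, and the idea of selecting fields by 1-based position are assumptions made for the example, since the question does not say what the actual condition is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: splits one record into two groups of fields.
class RecordSplitter {

    // Returns two lists: the fields whose 1-based position is in `selected`,
    // followed by all remaining fields, each in original order.
    static List<List<String>> split(String line, Set<Integer> selected) {
        // The pipe is escaped because String.split takes a regular expression;
        // the -1 limit keeps trailing empty fields.
        String[] fields = line.split("\\|", -1);
        List<String> first = new ArrayList<>();
        List<String> second = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (selected.contains(i + 1)) {
                first.add(fields[i]);
            } else {
                second.add(fields[i]);
            }
        }
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        // Case a) from the question: positions 1 and 3 go into the first tuple.
        System.out.println(split("1|2|3|4|5|6", new HashSet<>(Arrays.asList(1, 3))));
        // Case b) from the question: positions 2, 3 and 4 go into the first tuple.
        System.out.println(split("1|2|3|4|5|6", new HashSet<>(Arrays.asList(2, 3, 4))));
    }
}
```

In a real UDF the two lists would be turned into Pig tuples and added to a bag; that part is left out here so the sketch stays runnable without the Pig jar.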

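I cannot extend the outputSchema method to handle the creation of tuples of different sizes. There is no way to pass additional parameters to outputSchema. Using a temporary variable defined as a member of the EvalFunc class does not work either, because its value is null on every run.

Any hints? Thank you.

Update:

I run the command below in Grunt; the input schema is supplied as you can see after "AS":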

sourceData = foreach sourceData generate com.pig.Data('test.json', *) as (t:(s:(VIN: chararray,Birthdate: chararray), n:(name: chararray,customerId: chararray,Mileage: chararray,Fuel_Consumption: chararray)));

The UDF code is here...

public Schema outputSchema(Schema input) {

(line 233) System.out.println("------------------------"+ input.getFields().size());

Error:

Pig Stack Trace
---------------
ERROR 1200: java.lang.NullPointerException

Failed to parse: java.lang.NullPointerException
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:201)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1082)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:565)
at org.apache.pig.Main.main(Main.java:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at com.mortardata.pig.DataSpliter.outputSchema(DataSpliter.java:306)
at org.apache.pig.newplan.logical.expression.UserFuncExpression.getFieldSchema(UserFuncExpression.java:244)
at org.apache.pig.newplan.logical.optimizer.FieldSchemaResetter.execute(SchemaResetter.java:264)
at org.apache.pig.newplan.logical.expression.AllSameExpressionVisitor.visit(AllSameExpressionVisitor.java:143)
at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:113)
at org.apache.pig.newplan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:70)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visitAll(SchemaResetter.java:67)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:122)
at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
at org.apache.pig.parser.LogicalPlanBuilder.buildForeachOp(LogicalPlanBuilder.java:1055)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15896)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)

at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
... 16 more
Caused by: java.lang.NullPointerException
at com.mortardata.pig.DataSpliter.outputSchema(DataSpliter.java:233)
... 34 more
================================================================================

Update 2:

OK, the input schema is propagated from the previous Pig command...

sourceData = load 'test.csv' using PigStorage(',') as (VIN: chararray,Birthdate: chararray,name: chararray,customerId: chararray,Mileage: chararray,Fuel_Consumption: chararray);

sourceData = foreach sourceData generate com.pig.Data('test_data_desc.json', *) as (t:(s:(VIN: chararray,Birthdate: chararray), n:(name: chararray,customerId: chararray,Mileage: chararray,Fuel_Consumption: chararray)));

That did not help :-( because it is not possible to propagate any additional attributes, or to build any more complex logic inside the outputSchema method ;-(

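Best answer

Inside the outputSchema function you have access to the input schema, and you can use that information to generate the output schema dynamically (provided the input somehow reflects the expected output). Example: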

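public Schema outputSchema(Schema input) {
    // input can be null, for example when the preceding relation
    // has no declared schema; returning null leaves the output schema unknown.
    if (input == null) {
        return null;
    }
    Schema mySchema = new Schema();
    // Choose the output layout from the number of input fields.
    if (input.getFields().size() == 3) {
        mySchema.add(new Schema.FieldSchema("data1", DataType.DOUBLE));
        mySchema.add(new Schema.FieldSchema("data2", DataType.DOUBLE));
        mySchema.add(new Schema.FieldSchema("data3", DataType.DOUBLE));
    } else {
        mySchema.add(new Schema.FieldSchema("data", DataType.CHARARRAY));
    }
    return mySchema;
}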

I hope this helps.
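For the two-tuple layout in the question, one way to derive the output schema from the input is to build the schema text from the input field names. The sketch below does only that string-building step in plain Java, so it runs without Pig. The class `SchemaStringBuilder`, the method `build`, the fixed `chararray` type, and the selection by 1-based position are all assumptions for illustration. Inside outputSchema the names would come from the input schema's fields, and the resulting string would still have to be parsed into a Schema object (Pig ships a schema-string parser in its Utils class, as far as I recall; check the exact call against your Pig version).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Pig: builds a schema string of the form
// t:(s:(...),n:(...)) from a list of input field names.
class SchemaStringBuilder {

    // Fields whose 1-based position is in `selected` go into tuple s,
    // all others into tuple n. Every field is typed chararray (an assumption).
    static String build(List<String> names, Set<Integer> selected) {
        List<String> s = new ArrayList<>();
        List<String> n = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            String field = names.get(i) + ":chararray";
            if (selected.contains(i + 1)) {
                s.add(field);
            } else {
                n.add(field);
            }
        }
        return "t:(s:(" + String.join(",", s) + "),n:(" + String.join(",", n) + "))";
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "VIN", "Birthdate", "name", "customerId", "Mileage", "Fuel_Consumption");
        // The first two fields form tuple s, as in the question's AS clause.
        System.out.println(build(names, new HashSet<>(Arrays.asList(1, 2))));
    }
}
```

Because the layout is computed from whatever field names arrive, the same code covers input files with different column counts, which is the generic behaviour the question asks for.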

Regarding java - Apache Pig UDF and outputSchema customization, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34276494/