
hadoop - Pig: How to send all Tuples to a UDF to be Processed without Grouping them? Or, how to turn tuples into a bag without grouping?

Reposted. Author: 可可西里. Updated: 2023-11-01 14:34:09

This is what I'm trying to do:

A = LOAD '...' USING PigStorage(',') AS (
col1:int
,col2:chararray
);
B = ORDER A by col2;
C = CUSTOM_UDF(A);

CUSTOM_UDF iterates over tuples that must be in order. The UDF emits one aggregated tuple for every few input tuples; that is, it does not return tuples 1:1.
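In other words, the UDF performs an N:1 reduction over an ordered stream. A minimal plain-Java sketch of that contract, with a hypothetical `Row` record matching the `(col1:int, col2:chararray)` schema, a `List` standing in for Pig's DataBag, and a made-up aggregation (concatenating `col2` values over runs of equal `col1`) in place of the real one:

```java
import java.util.ArrayList;
import java.util.List;

public class AggregateSketch {
    // Hypothetical record matching the (col1:int, col2:chararray) schema.
    record Row(int col1, String col2) {}

    // Walk ordered rows and emit one combined value per run of equal col1
    // -- the "one output tuple per several input tuples" behavior described
    // above. The concatenation is a stand-in for the real aggregation.
    static List<String> aggregate(List<Row> ordered) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        int currentKey = 0;
        for (Row r : ordered) {
            if (current != null && r.col1() == currentKey) {
                current.append('+').append(r.col2());             // extend current aggregate
            } else {
                if (current != null) out.add(current.toString()); // emit finished aggregate
                current = new StringBuilder(r.col2());            // start a new one
                currentKey = r.col1();
            }
        }
        if (current != null) out.add(current.toString());         // emit the last run
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1, "a"), new Row(1, "b"), new Row(2, "c"));
        System.out.println(aggregate(rows)); // 3 input tuples -> 2 outputs
    }
}
```

Note that the sketch only works because the input is already sorted by the run key; that is exactly the role the ORDER BY plays for the UDF.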

Essentially:

public class CustomUdf extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        Aggregate aggregatedOutput = null;

        DataBag values = (DataBag)input.get(0);
        for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
            Tuple tuple = iterator.next();
            ....
            if (some condition regarding current input tuple) {
                //do something to aggregatedOutput with information from input tuple
            } else {
                //Because input tuple does not apply to current aggregateOutput
                //return current aggregateOutput and apply input tuple
                //to new aggregateOutput
                Tuple returnTuple = aggregatedOutput.getTuple();
                aggregatedOutput = new Aggregate(tuple);
                return returnTuple;
            }
        }
    }

    // Establish the output Schema as a tuple
    public Schema outputSchema(Schema input) {
        Schema tupleSchema = new Schema();
        ...
        return new Schema(
            new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                tupleSchema,
                DataType.TUPLE));
    }

    /** This inner class is simply a wrapper for the output tuple **/
    class Aggregate {
        //member variables

        public Aggregate(Tuple input) {
            //set member variables to value of input's fields
        }

        public Tuple getTuple() {
            Tuple output = TupleFactory.getInstance().newTuple(5);
            //set tuple's fields to values of member variables
            return output;
        }
    }
}

I have been able to do something similar with:

A = LOAD '...' USING PigStorage(',') AS (
col1:int
,col2:chararray
);
B = ORDER A by col2;
C = GROUP B BY col1;
D = FOREACH C {
    GENERATE CUSTOM_UDF(B);
}

However, this does not appear to preserve the ORDER BY, and I cannot figure out how to order D, since I keep getting an invalid field projection error.

Also, I do not actually need the GROUP BY (it just happens to work for this use case); I simply want to send the alias B to CUSTOM_UDF as a bag of tuples.

How can I accomplish this?

Best Answer

I think the problem is in how CustomUdf is written. Based on your description, it sounds like it should be an EvalFunc<DataBag>, not an EvalFunc<Tuple>. Then, in the implementation, as you iterate through all the tuples of the input bag, you append each completed aggregate tuple to a DataBag that you return at the end of the method.

Your Pig code would look like the following. I don't believe ORDER BY preserves the ordering when it is a separate statement, as you have it; however, it does preserve the ordering inside a nested FOREACH, as shown below.

A = LOAD '...' USING PigStorage(',') AS (
col1:int
,col2:chararray
);
B = FOREACH (GROUP A ALL) {
    A_ordered = ORDER A BY col2;
    GENERATE FLATTEN(CUSTOM_UDF(A_ordered));
}

The exec method would look something like the modified version below. Note the changes I made.

public DataBag exec(Tuple input) throws IOException { // different return type
    Aggregate aggregatedOutput = null;

    DataBag result = BagFactory.getInstance().newDefaultBag(); // change here
    DataBag values = (DataBag)input.get(0);
    for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
        Tuple tuple = iterator.next();
        ....
        if (some condition regarding current input tuple) {
            //do something to aggregatedOutput with information from input tuple
        } else {
            //Because input tuple does not apply to current aggregateOutput,
            //add current aggregateOutput to the result bag and apply input
            //tuple to a new aggregateOutput
            Tuple returnTuple = aggregatedOutput.getTuple();
            aggregatedOutput = new Aggregate(tuple);
            result.add(returnTuple); // change here
        }
    }
    return result; // change here
}
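One caveat worth noting: as written, the loop only adds an aggregate to the bag when a *later* tuple fails the condition, so the final in-progress aggregate is never emitted unless it is added after the loop. A plain-Java sketch of the fixed shape, with `List<Object[]>` standing in for Pig's DataBag and a simple key-equality check standing in for the real condition:

```java
import java.util.ArrayList;
import java.util.List;

public class UdfSketch {
    // Mirrors the modified exec(): consume an ordered bag, accumulate a
    // running aggregate, and append each finished aggregate to the result
    // bag. Object[] pairs of {key, sum} stand in for Pig tuples, and
    // "same key" stands in for the real aggregation condition.
    public static List<Object[]> exec(List<Object[]> values) {
        List<Object[]> result = new ArrayList<>();   // ~ BagFactory...newDefaultBag()
        Object[] aggregated = null;                  // ~ the Aggregate wrapper
        for (Object[] tuple : values) {
            if (aggregated != null && aggregated[0].equals(tuple[0])) {
                aggregated[1] = (int) aggregated[1] + (int) tuple[1]; // fold into aggregate
            } else {
                if (aggregated != null) result.add(aggregated); // ~ result.add(returnTuple)
                aggregated = new Object[]{tuple[0], tuple[1]};  // ~ new Aggregate(tuple)
            }
        }
        // The loop only emits when a later tuple breaks the condition, so the
        // last in-progress aggregate must be added here or it is dropped.
        if (aggregated != null) result.add(aggregated);
        return result;
    }

    public static void main(String[] args) {
        List<Object[]> bag = List.of(
            new Object[]{"a", 1}, new Object[]{"a", 2}, new Object[]{"b", 5});
        for (Object[] t : exec(bag)) {
            System.out.println(t[0] + "," + t[1]);
        }
    }
}
```

The same tail-emit line would translate to a `result.add(aggregatedOutput.getTuple());` just before the `return result;` in the real UDF.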

Regarding "hadoop - Pig: How to send all Tuples to a UDF to be Processed without Grouping them? Or, how to turn tuples into a bag without grouping?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21445730/
