
hadoop - What happens when an Apache Crunch pipeline calls read twice, on two different sources?

Reposted. Author: 行者123. Updated: 2023-12-02 20:37:58

When making the following calls:

    PCollection<KeyValue> data1 = pipeline.read(source1);
    PCollection<KeyValue> data2 = pipeline.read(source2);
    PCollection<KeyValue> data3 = data1.union(data2);

According to the Apache Crunch documentation, will this pipeline read the data from both sources and then combine it into a single collection?

Best answer

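An Apache Crunch Pipeline can read from as many sources as you need. Once they are read, you can apply whatever transformations your data requires: for example, taking the union of PCollections, or passing the sources through a DoFn or MapFn to compose Documents objects using MapReduce.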

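One thing to keep in mind is that, like Apache Spark, Apache Crunch uses a lazy execution model: no data transformation is triggered until an action is performed. Below is a short excerpt from the Crunch documentation.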

Crunch uses a lazy execution model. No jobs are run or outputs created until the user explicitly invokes one of the methods on the Pipeline interface that controls job planning and execution. The simplest of these methods is the PipelineResult run() method, which analyzes the current graph of PCollections and Target outputs and comes up with a plan to ensure that each of the outputs is created and then executes it, returning only when the jobs are completed. The PipelineResult returned by the run method contains information about what was run, including the number of jobs that were executed during the pipeline run and the values of the Hadoop Counters for each of those stages via the StageResult component classes.
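
The lazy behaviour described in the quote can be illustrated with a small toy model in plain Java. This is only a sketch and does not use Crunch: `ToyPipeline`, `ToyCollection`, and the `sourceReads` counter are hypothetical names invented for illustration, and in-memory lists stand in for real sources. It mirrors the shape of the question's code: `read` and `union` only record a plan, and the sources are touched only when `run()` is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class LazyPipelineSketch {

    /** Toy stand-in for a pipeline. NOT Crunch's API; names are hypothetical. */
    static final class ToyPipeline {
        private final List<ToyCollection> outputs = new ArrayList<>();
        int sourceReads = 0; // counts how many times a source was actually read

        /** Registers a source. Nothing is read at this point. */
        ToyCollection read(List<String> source) {
            return new ToyCollection(this, () -> {
                sourceReads++;
                return new ArrayList<>(source);
            });
        }

        /** Marks a collection as an output that run() must produce. */
        void write(ToyCollection collection) {
            outputs.add(collection);
        }

        /** Executes the recorded plan; only now are the sources read. */
        List<List<String>> run() {
            List<List<String>> results = new ArrayList<>();
            for (ToyCollection output : outputs) {
                results.add(output.plan.get());
            }
            return results;
        }
    }

    /** Toy stand-in for a collection: just a deferred computation. */
    static final class ToyCollection {
        final ToyPipeline pipeline;
        final Supplier<List<String>> plan;

        ToyCollection(ToyPipeline pipeline, Supplier<List<String>> plan) {
            this.pipeline = pipeline;
            this.plan = plan;
        }

        /** Records a union of this collection and another; still lazy. */
        ToyCollection union(ToyCollection other) {
            return new ToyCollection(pipeline, () -> {
                List<String> all = new ArrayList<>(plan.get());
                all.addAll(other.plan.get());
                return all;
            });
        }
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        ToyCollection data1 = pipeline.read(Arrays.asList("a", "b"));
        ToyCollection data2 = pipeline.read(Arrays.asList("c"));
        ToyCollection data3 = data1.union(data2);

        System.out.println("reads before run: " + pipeline.sourceReads); // 0

        pipeline.write(data3);
        List<List<String>> results = pipeline.run();

        System.out.println("reads after run: " + pipeline.sourceReads);  // 2
        System.out.println("union: " + results.get(0));                  // [a, b, c]
    }
}
```

In this toy model a single `run()` reads both sources and produces their union, which is the behaviour the answer describes for one pipeline. Real Crunch plans MapReduce jobs rather than evaluating suppliers, so treat this only as a mental model of the laziness.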



To answer your question: yes, the same pipeline will read both sources.

Side note: you will probably want just one pipeline for your data transformations.

This question (hadoop - What happens when an Apache Crunch pipeline calls read twice, on two different sources?) corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/50502748/
