
java - Iterating multiple times through Text input values in the Reducer of a MapReduce job

Reposted. Author: 行者123. Updated: 2023-12-02 21:45:38

I have two very large datasets (tables) on HDFS. I want to join them on some columns, then group by some columns, and then perform some group functions on some columns.

My steps are:

1- Create two jobs.

2- In the first job, in the mapper, read the rows of each dataset as the map input value and emit the join columns' values as the map output key and the remaining columns' values as the map output value.

After mapping, the MapReduce framework performs shuffling and groups all the map output values according to the map output keys.

Then, in the reducer, it reads each map output key and its values, which may include many rows from both datasets.

What I want is to iterate through the reduce input values multiple times so that I can perform a Cartesian product.

To illustrate:

Let's say for a join key x, I have 100 matches from one dataset and 200 matches from the other. Joining them on join key x produces 100*200 = 20000 combinations. I want to emit NullWritable as the reduce output key and each Cartesian-product row as the reduce output value.

An example output might be:

for join key x:

From (nullWritable),(first(1),second(1))

Over (nullWritable),(first(1),second(200))

To (nullWritable),(first(100),second(200))

How can I do that?

I can iterate only once, and I cannot cache the values because they don't fit into memory.

3- If I do that, I will start the second job, which takes the first job's result file as its input file. In the mapper, I emit the group columns' values as the map output key and the remaining columns' values as the map output value. Then, in the reducer, by iterating through each key's values, I perform some functions on some columns, like sum, avg, max, min.
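The aggregation described in step 3 can be sketched in plain Java, outside Hadoop, for a single group key's values; the `aggregate` helper and the use of `Double` for column values are hypothetical simplifications, not the actual job code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupAggregate {

    // Computes sum, avg, max, and min over the numeric column values
    // that the second job's reducer would see for one group key.
    public static Map<String, Double> aggregate(List<Double> values) {
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            sum += v;
            max = Math.max(max, v);
            min = Math.min(min, v);
        }
        Map<String, Double> stats = new LinkedHashMap<>();
        stats.put("sum", sum);
        stats.put("avg", sum / values.size());
        stats.put("max", max);
        stats.put("min", min);
        return stats;
    }
}
```

In the real reducer, the same loop would run directly over the `Iterable` of values for the key, so only the running totals (not the rows) need to be held in memory.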


Thanks a lot.

Best Answer

Since your first MR job uses the join key as the map output key, your first reducer receives (K join_key, List values) for each reduce call. What you can do is separate the values into two lists, one per data source, and then perform the Cartesian product with nested for loops.
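This approach (split the reducer's values into two per-source lists, then nested loops) can be sketched in plain Java outside Hadoop. The `L:`/`R:` source tags and the `crossProduct` helper are hypothetical: the mapper would have to prefix each value with a tag so the reducer can tell the two datasets apart, and this assumes both lists for one join key fit in memory:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CartesianJoin {

    // Splits the values for one join key by their source tag, then emits
    // one joined row per (left, right) combination via nested loops.
    public static List<String> crossProduct(Iterable<String> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("L:")) {
                left.add(v.substring(2));   // row from the first dataset
            } else {
                right.add(v.substring(2));  // row from the second dataset
            }
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "," + r);       // one Cartesian-product row
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values =
            Arrays.asList("L:a1", "L:a2", "R:b1", "R:b2", "R:b3");
        // 2 left rows x 3 right rows -> 6 combinations
        System.out.println(crossProduct(values));
    }
}
```

In the actual reducer, the output list would be replaced by `context.write(NullWritable.get(), ...)` inside the inner loop; note this does not address the case the asker raised where one key's matches do not fit in memory.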

Regarding "java - Iterating multiple times through Text input values in the Reducer of a MapReduce job", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25587811/
