hadoop - Movie dataset analysis using Pig

Reposted. Author: 行者123. Updated: 2023-12-02 20:57:50

I have the following data for a movie database:

Ratings: UserID, MovieID, Rating
Movies: MovieID, Title
Users: UserID, Gender, Age

Now I have to join these three datasets and find which movie is rated highest among women and lowest among men, and vice versa.
I have already done the JOIN:

-- n1 and n2 are placeholder fields; they presumably absorb the empty strings
-- left when a '::'-delimited line is split on a single ':'.
myusers = LOAD '/user/cloudera/movies/input/users.dat'
    USING PigStorage(':')
    AS (user:int, n1, gender:chararray, n2, age:int);

ratings = LOAD '/user/cloudera/movies/input/ratings.dat'
    USING PigStorage(':')
    AS (user:int, n1, movie:int, n2, rating:int);

movies = LOAD '/user/cloudera/movies/input/movies.dat'
    USING PigStorage(':')
    AS (movie:int, n1, title:chararray);

-- Join ratings to users, then attach the movie titles.
data = JOIN ratings BY user, myusers BY user;
data2 = JOIN data BY ratings::movie, movies BY movie;

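But after this, when I try to print columns from data2, I run into many problems, such as "ERROR 0: Scalar has more than one row in the output". Any ideas that could help me complete this task?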

Best answer

After the following step:

data = JOIN ratings BY user, myusers BY user;

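Filter on gender to create two datasets, one for males and one for females. Then order each dataset by rating and take the first row in each direction to get the maximum and minimum for both.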
male = FILTER data BY gender == 'M'; -- Use the gender value for male
female = FILTER data BY gender == 'F';
m_max = LIMIT (ORDER male BY rating DESC) 1;
f_max = LIMIT (ORDER female BY rating DESC) 1;
m_min = LIMIT (ORDER male BY rating ASC) 1;
f_min = LIMIT (ORDER female BY rating ASC) 1;
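
Sorting individual ratings returns a single rating row, not a per-movie figure, so a per-movie aggregate may be closer to what the question asks. The sketch below is one possible reading, not the answer author's method: it averages ratings per movie and gender, then ranks movies by the gap between the female and male averages. The paths and schemas are copied from the question; the relation names (rated, avg_ratings, gaps, and so on) and the "largest gap" interpretation are assumptions made for illustration. It also addresses fields with the `::` prefix after each JOIN. The "Scalar has more than one row in the output" error is commonly raised when a field is written as relation.field (for example data2.title) inside a FOREACH over a different relation, which Pig treats as a scalar lookup.

```pig
-- Load the three files as in the question (n1/n2 are placeholder fields).
myusers = LOAD '/user/cloudera/movies/input/users.dat'
    USING PigStorage(':')
    AS (user:int, n1, gender:chararray, n2, age:int);
ratings = LOAD '/user/cloudera/movies/input/ratings.dat'
    USING PigStorage(':')
    AS (user:int, n1, movie:int, n2, rating:int);
movies = LOAD '/user/cloudera/movies/input/movies.dat'
    USING PigStorage(':')
    AS (movie:int, n1, title:chararray);

-- Attach the rater's gender to every rating and keep only the needed fields.
data = JOIN ratings BY user, myusers BY user;
rated = FOREACH data GENERATE
    ratings::movie AS movie,
    myusers::gender AS gender,
    ratings::rating AS rating;

-- Average rating per (movie, gender) pair.
by_movie_gender = GROUP rated BY (movie, gender);
avg_ratings = FOREACH by_movie_gender GENERATE
    FLATTEN(group) AS (movie, gender),
    AVG(rated.rating) AS avg_rating;

-- Split by gender, then pair the two averages for each movie.
female_avg = FILTER avg_ratings BY gender == 'F';
male_avg = FILTER avg_ratings BY gender == 'M';
both = JOIN female_avg BY movie, male_avg BY movie;
gaps = FOREACH both GENERATE
    female_avg::movie AS movie,
    female_avg::avg_rating AS f_avg,
    male_avg::avg_rating AS m_avg,
    (female_avg::avg_rating - male_avg::avg_rating) AS gap;

-- Attach titles so the result is readable.
titled = JOIN gaps BY movie, movies BY movie;
result = FOREACH titled GENERATE
    movies::title AS title,
    gaps::f_avg AS f_avg,
    gaps::m_avg AS m_avg,
    gaps::gap AS gap;

-- Largest positive gap: rated high by women, low by men. Smallest: the reverse.
gap_desc = ORDER result BY gap DESC;
women_prefer = LIMIT gap_desc 1;
gap_asc = ORDER result BY gap ASC;
men_prefer = LIMIT gap_asc 1;

DUMP women_prefer;
DUMP men_prefer;
```

Movies with only a handful of ratings can dominate such a ranking. If that matters, adding COUNT(rated) to avg_ratings and filtering on a minimum count before the join is one way to handle it.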

Regarding hadoop - Movie dataset analysis using Pig, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44002544/
