
hadoop - Why does my Pig query return the wrong values?

Reposted. Author: 行者123. Updated: 2023-12-02 20:26:46

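I am trying to work with the following dataset in Pig:
https://www.kaggle.com/zynicide/wine-reviews/version/4
My query returns the wrong values. The only cause I can think of is missing data in the dataset,
but I don't know whether that is the case, or why exactly I am getting the wrong values.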

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;

DESCRIBE allWinesPricePoints;

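The actual result I get is:
(56203, buttered toast and spice flavors, wrapped in a creamy texture. Should hold for a year or two.")
(61341, sweet tannins. Fresh acidity gives it an extra lift. Give it time. Best 2007-2012.")
(16417, Chardonnay is also well known)
(115384, almond and vanilla)
(136804, almond and vanilla)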

I think the output should be:
(56203, 23)
(61341, 30)
(16417, 16)
(115384, 250)
(136804, 250)

I expected the second value to be a number taken from the price column.

Best answer

Do the following:

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

-- Add the FOREACH below to generate each column explicitly; this helps parse the data correctly.
-- Generate the columns in the same order as they appear in the text file.
allWines= FOREACH allWines GENERATE
id AS id,
country AS country,
description AS description,
designation AS designation,
points AS points,
price AS price,
province AS province,
region_2 AS region_2,
region_1 AS region_1,
variety AS variety,
winery AS winery;

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
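
If the price column still shows description text, one likely cause (an assumption, not confirmed in the original answer) is that PigStorage(',') splits on every comma, including commas inside the quoted description field, so every later column shifts to the right. Below is a minimal sketch that loads the file with piggybank's CSVExcelStorage, which honors double-quoted fields. It assumes piggybank.jar is available and that your Pig version supports the SKIP_INPUT_HEADER option; the jar path and the aliases allWinesPriced, allWinesTyped and allWinesPriceTop5 are illustrative. The schema is copied from the question and should be checked against the header row of the file.

```pig
-- Assumption: piggybank.jar is on a path Pig can resolve; adjust for your installation.
REGISTER piggybank.jar;

-- CSVExcelStorage treats a double-quoted field as one value, so commas inside
-- a description do not start a new column. SKIP_INPUT_HEADER drops the header row.
allWines = LOAD 'winemag-data_first150k.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (id:chararray, country:chararray, description:chararray, designation:chararray,
        points:chararray, price:chararray, province:chararray, region_2:chararray,
        region_1:chararray, variety:chararray, winery:chararray);

-- Missing values may arrive as nulls or as empty strings, so filter out both.
allWinesPriced = FILTER allWines BY price IS NOT NULL AND price != ''
                                 AND points IS NOT NULL AND points != '';

-- Cast price to a number so ORDER sorts numerically rather than as text.
allWinesTyped = FOREACH allWinesPriced GENERATE id, (double)price AS price;
allWinesPriceSorted = ORDER allWinesTyped BY price;
allWinesPriceTop5 = LIMIT allWinesPriceSorted 5;
DUMP allWinesPriceTop5;
```

The cast matters because price is loaded as chararray: sorting it as text would place '100' before '23'.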

Hope this helps.
Let me know if you have any questions.

Regarding "hadoop - Why does my Pig query return the wrong values?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55909565/
