
hadoop - PIG script to split a large text file into parts based on a specified word


I am trying to build a Pig script that takes a textbook file and divides it into chapters, then compares the words in each chapter and returns only the words that appear in every chapter, along with their counts. The chapters are conveniently delimited by CHAPTER - X.

Here is what I have so far:

lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

Sorry if this question is too simple compared to what I usually ask on Stack Overflow. I did google it, but perhaps I was not using the right keywords. I am new to Pig and am trying to learn it for a new assignment at work.

Thanks in advance!
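The piece missing from the script above is a per-chapter key: once every word occurrence carries the chapter it came from, the intersection is just a group-and-filter. A minimal sketch of that step, assuming a hypothetical relation tagged:(chapter:int, word:chararray) with one row per word occurrence:

proj = FOREACH tagged GENERATE chapter, word;
-- one row per distinct (chapter, word) pair
per_chapter = DISTINCT proj;
by_word = GROUP per_chapter BY word;
-- keep words whose distinct-chapter count equals the number of chapters
in_all = FILTER by_word BY COUNT(per_chapter) == 3;  -- 3 chapters assumed here
result = FOREACH in_all GENERATE group AS word;

The accepted answer below constructs exactly such a chapter key, using RANK.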

Best answer

It is a bit lengthy, but you will get the result. You can trim the unnecessary relations to suit your file, though. Appropriate comments are provided in the script.

Input file:

Pig does not know whether integer values in baseball are stored as ASCII strings, Java
serialized values, binary-coded decimal, or some other format. So it asks the load func-
tion, because it is that function’s responsibility to cast bytearrays to other types. In
general this works nicely, but it does lead to a few corner cases where Pig does not know
how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
how to perform casts on it because that bytearray is not generated by a load function.
CHAPTER - X
In a strongly typed computer language (e.g., Java), the user must declare up front the
type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
of different type and adapt as the occasion demands.
CHAPTER - X
In this example, remember we are pretending that the values for base_on_balls and
ibbs turn out to be represented as integers internally (that is, the load function con-
structed them as integers). If Pig were weakly typed, the output of unintended would
be records with one field typed as an integer. As it is, Pig will output records with one
field typed as a double. Pig will make a guess and then do its best to massage the data
into the types it guessed.

Pig script:

A = LOAD 'file' as (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
-- We need to split on CHAPTER X, but the LOAD above gives us one tuple per
-- newline, so group everything and convert that bag to a string; that yields
-- a single tuple with _ as the delimiter.
C = GROUP B ALL;
D = FOREACH C GENERATE BagToString(B) as (line:chararray);
-- There are no commas in the text now, so turn the CHAPTER X delimiter into a
-- comma; passing the result to TOKENIZE then splits the chapters into separate
-- fields, which makes them easy to RANK. The \\s+ tolerates the extra space
-- left behind where the hyphen of "CHAPTER - X" was stripped above.
E = FOREACH D GENERATE REPLACE(line,'_CHAPTER\\s+X_',',') AS (line:chararray);
F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
-- create a separate field per chapter
G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
-- rank each chapter so that counting each word per chapter becomes easy
H = RANK G;
J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
J1 = GROUP J BY (rank_G, line);
J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
-- J2 now has no duplicate word within a chapter, so if we group by word and
-- require the chapter count to exceed 2 (i.e. to equal the 3 chapters here),
-- we are sure the word is present in all chapters.
J3 = GROUP J2 BY word;
J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
J5 = FILTER J4 BY cnt > 2;
J6 = FOREACH J5 GENERATE word,sumval;
DUMP J6;
-- result: word, count across all chapters
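The threshold 2 in J5 is hardcoded for an input with three chapters. A hedged variant, reusing the relations above, that derives the chapter count instead (scalar projection, available since Pig 0.8):

-- G has one tuple per chapter, so counting it gives the number of chapters
nch = FOREACH (GROUP G ALL) GENERATE COUNT(G) AS n;
-- keep only words that occur in every chapter
J5 = FILTER J4 BY cnt == nch.n;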

Output:

(a,8)
(In,5)
(as,6)
(the,9)
(values,4)
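When adapting the script to another book, Pig's standard diagnostic operators help verify each intermediate relation, for example:

-- inspect the schema and contents of the per-chapter counts
DESCRIBE J2;  -- schema: cnt:long, word:chararray, rnk:long
DUMP J2;      -- one (count, word, chapter-rank) tuple per word per chapter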

Regarding "hadoop - PIG script to split a large text file into parts based on a specified word", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33355904/
