
hadoop - How to extract a simple Pig data type from a complex Pig data type

Reposted · Author: 可可西里 · Updated: 2023-11-01 14:58:54

I am trying to write a Bloom filter builder in Pig using the built-in BuildBloom UDF. The syntax for invoking the BuildBloom UDF is:

define bb BuildBloom('hash_type', 'vector_size', 'false_positive_rate');

where the vector size and false positive rate arguments are passed in as chararrays. I don't necessarily know the vector size ahead of time, but it is always available in the script before the BuildBloom UDF is invoked, so I would like to use the built-in COUNT UDF instead of some hard-coded value. Something like this:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE
(long) $0 AS value_fld:long,
(chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);
define bb BuildBloom('jenkins', n, '$false_positive_rate');

The problem is that when I describe n, I get: n: {count: chararray}. Unsurprisingly, the BuildBloom UDF call fails, because it receives a tuple as input where it expects a plain chararray. How do I extract just the chararray (i.e., the integer returned by COUNT, cast to chararray) and assign it to n for use in the call to BuildBloom(...)?
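As an aside, when a Bloom filter builder is given an expected element count and a target false positive rate (rather than an explicit vector size), the bit-vector size and hash count are conventionally derived from the standard Bloom filter formulas. A minimal Python sketch of that sizing math (illustrative only, not BuildBloom's actual source):

```python
import math

def bloom_params(n, p):
    """Standard Bloom filter sizing: given n expected elements and a
    target false positive rate p, return (bit-vector size m, hash count k)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round((m / n) * math.log(2)))              # number of hash functions
    return m, k

# e.g. 1,000,000 elements at a 0.1% false positive rate -> roughly 14.4M bits, k = 10
m, k = bloom_params(1_000_000, 0.001)
```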

Edit: here is the error produced when I try to pass N::count into the BuildBloom(...) UDF. describe N yields: N: {count: chararray}. The offending line (line 40) reads: define bb BuildBloom('jenkins', N::count, '$fpr');

ERROR 1200: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <file buildBloomFilter.pig, line 40, column 32> mismatched input 'N::count' expecting set null
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: Failed to parse: <file buildBloomFilter.pig, line 40, column 32> mismatched input 'N::count' expecting set null
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
... 14 more

Best answer

If you are using the grunt shell, the obvious way to do this is to call DUMP n;, wait for the job to finish running, and then copy the value into your define bb BuildBloom(...) call.

I'm guessing that isn't a very satisfying answer. You most likely want to run this inside a script. Here is a very hacky way to do it. You need three files:

  1. 'n_start.txt', which contains:

    n='
  2. 'n_end.txt', which contains the single character:

    '
  3. 'bloom_build.pig', which contains:

    define bb BuildBloom('jenkins', '$n', '0.0001');

Once you have those, you can run this script:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE
(long) $0 AS value_fld:long,
(chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value')
AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE
(chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);

--the new part
STORE records_count INTO 'n' USING PigStorage(',');
--this will copy what you just stored into a local directory
fs -copyToLocal n n
--this will cat the two static files we created prior to running pig
--with the count we just generated. it will pass it through tr, which will
--strip out the newlines, and then store the result in a file called 'n.txt'
--which we will use as a parameter file
sh cat -s n_start.txt n/part-r-00000 n_end.txt | tr -d '\n' > n.txt
--RUN makes pig call one script within another. Be forewarned that if pig returns
--a message with an error on a certain line, it is the line number of the expanded script
RUN -param_file n.txt bloom_build.pig;
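The cat | tr splice above just glues the two static fragments around the generated count and strips newlines, producing a one-line parameter file such as n='12345'. A minimal Python sketch of the same splice (the count value here is hypothetical):

```python
# Rebuild n.txt the way the `sh cat ... | tr -d '\n'` line does:
# n_start.txt holds "n='", the part file holds the count produced by COUNT,
# and n_end.txt holds the closing "'".
def splice_param_file(start, part, end):
    # Concatenate the three fragments and drop every newline,
    # leaving a single n='<count>' line usable with -param_file.
    return (start + part + end).replace("\n", "")

line = splice_param_file("n='\n", "12345\n", "'\n")
# → "n='12345'"
```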

After this, you can call bb as you did before. It's ugly, and someone more fluent in unix could probably get rid of the n_start.txt and n_end.txt files.

Another cleaner, but more involved, option is to write a new UDF like BuildBloom that extends BuildBloomBase.java but has an empty constructor and handles everything in the exec() method.

Regarding hadoop - how to extract a simple Pig data type from a complex Pig data type, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23373000/
