
apache-pig - Pig - Passing a DataBag to a UDF constructor

Reposted. Author: 行者123. Updated: 2023-12-01 12:45:17

I have a script that loads some data about venues:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);

Then I want to create a UDF whose constructor accepts the venues relation.

So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);

Here is the actual UDF:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    // Collects the venue names from the bag and builds an alternation regex from them.
    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()) {
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect two fields; the first one is the tweet text
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            // return the first venue whose name occurs in the tweet
            for (String venue : venues) {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS)) {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When I run the script, the DEFINE statement triggers an error right before (venues);:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60>  mismatched input 'venues' expecting RIGHT_PAREN

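Obviously I am doing something wrong. Can you help me find out what? Is it that a UDF cannot accept the venues relation as a parameter? Or is a relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!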

PS: I am using Pig version 0.11.1.1.3.0.0-107

Best answer

As @WinnieNicklaus said, you can only pass strings to UDF constructors.

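That said, the solution to your problem is to use the distributed cache: you need to override public List<String> getCacheFiles() so that it returns a list of file names to be made available through the distributed cache. You can then read that file as a local file and build your table.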
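As a rough illustration of the two points above, the sketch below shows a constructor that takes the file path as a string and a getCacheFiles() that returns one entry in the "path on HDFS#local symlink name" form. It is a plain Java stand-in, not the asker's UDF: the class name VenueCacheSpec, the symlink name venues_cache, and any path passed in are assumptions made for the example. In a real UDF the class would extend org.apache.pig.EvalFunc and getCacheFiles() would carry @Override.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the UDF; a real one would extend org.apache.pig.EvalFunc.
final class VenueCacheSpec {
    // Assumed symlink name under which the file shows up in the task's working directory.
    static final String LOCAL_NAME = "venues_cache";

    private final String hdfsPath;

    // DEFINE can only hand over string arguments, so accept the file path, not the relation.
    VenueCacheSpec(String hdfsPath) {
        this.hdfsPath = hdfsPath;
    }

    // One entry per file, written as "<path on HDFS>#<local symlink name>".
    public List<String> getCacheFiles() {
        return Collections.singletonList(hdfsPath + "#" + LOCAL_NAME);
    }
}
```

With a class shaped like this, the DEFINE statement would pass a quoted path, for example DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('venues_extended_2.csv'); instead of the relation name.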

The downside is that Pig has no initialization function, so you have to implement something like

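private boolean initialized = false;

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // make sure the table is read only once
    }
}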

and then call it as the first thing in exec.
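A minimal sketch of that lazy initialization pattern follows, written as plain Java so it runs without Pig on the classpath. The class name LazyVenueMatcher and its String-based exec are assumptions for illustration; a real UDF would extend EvalFunc and take a Tuple. The sketch also assumes the venue name is the first comma-separated column and contains no commas or quotes, and it uses String.contains instead of the regex match from the question, so venue names are treated as literal text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF body; not tied to the Pig API.
final class LazyVenueMatcher {
    private final String localFile;              // local name of the cached file
    private final List<String> venues = new ArrayList<>();
    private boolean initialized = false;

    LazyVenueMatcher(String localFile) {
        this.localFile = localFile;
    }

    // Reads the venue table once; later calls return immediately.
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localFile), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: the name is the first column and has no embedded commas.
                String name = line.split(",", 2)[0].trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Returns the first venue whose name occurs in the tweet, or null if none does.
    public String exec(String tweet) throws IOException {
        init(); // first thing in exec, as described above
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

Because the flag is set only after the file has been read successfully, a failed read is retried on the next call rather than leaving the matcher with an empty table.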

Regarding apache-pig - Pig - Passing a DataBag to a UDF constructor, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20682407/
