
google-cloud-platform - Error when submitting a Pig job to Google Dataproc with a properties file


I am new to Dataproc and am trying to submit a Pig job to Google Dataproc via gcloud:

   gcloud config set project PROJECT

gcloud dataproc jobs submit pig \
    --cluster=cluster-workaround \
    --region=us-east4 \
    --verbosity=debug \
    --properties-file=gs://bucket/cvr_gcs_one.properties \
    --file=gs://bucket-temp/intellibid-intermediat-cvr.pig

with the following properties file:

jarLocation=gs://bucket-data-science/emr/jars/pig.jar
pigScriptLocation=gs://bucket-data-science/emr/pigs
logLocation=gs://bucket-data-science/prod/logs
udf_path=gs://bucket-data-science/emr/jars/udfs.jar
csv_dir=gs://bucket-db-dump/prod
currdate=2022-12-13
train_cvr=gs://bucket-temp/{2022-12-09}
output_dir=gs://analytics-bucket/outoout

Below is a sample of the Pig script uploaded to GCS:

 register $udf_path;

SET default_parallel 300;
SET pig.exec.mapPartAgg true; -- To remove load on combiner

SET pig.tmpfilecompression TRUE; -- To enable compression between MapReduce jobs, mainly when using joins
SET pig.tmpfilecompression.codec gz; -- To specify the type of compression between MapReduce jobs
SET mapreduce.map.output.compress TRUE; -- To enable compression between map and reduce
SET mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapred.map.tasks.speculative.execution false;
SET mapreduce.task.timeout 10800000;
SET mapreduce.output.fileoutputformat.compress true;
SET mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.map.maxattempts 16;
SET mapreduce.reduce.maxattempts 16;
SET mapreduce.job.queuename HIGH_PRIORITY;

define GSUM com.java.udfs.common.javaSUM();
define get_cvr_key com.java.udfs.common.ALL_CTR_MODEL('$csv_dir', 'variableList.ini');
define multiple_file_generator com.java.udfs.common.CVR_KEY_GENERATION('$csv_dir','newcampaignToKeyMap');

train_tmp1 = load '$train_cvr/' using PigStorage('\t','-noschema') as (cookie,AdvID,nviews,ls_dst,ls_src,ls_di,ls_ft,ls_np,tos,nsess,e100_views,e200_views,e300_views,e400_views,e100_tos,e200_tos,e300_tos,e400_tos,uniq_prod,most_seen_prod_freq,uniq_cat,uniq_subcat,search_cnt,click_cnt,cart_cnt,HSDO,os,bwsr,dev,hc_c_v,hc_c_tp,hc_c_up,hc_c_ls,hc_s_v,hc_s_tp,hs_s_up,hc_s_ls,hc_clk_pub,hc_clk_cnt,hc_clk_lm,hp_ls_v,hp_ls_c,hp_ls_s,hp_ms_v,hp_ms_c,hp_ms_s,hu_v,hu_c,hu_s,purchase_flag,hp_ls_cvr,hp_ls_crr,hp_ms_cvr,hp_ms_crr,mpv,gc_c_tp,gc_clk_cnt,gc_c_up,gc_clk_lm,gc_c_v,gc_c_ls,gc_s_v,gc_s_lsts,gc_s_tp,gc_s_up,gc_clk_pub,epoch_ms,gc_ac_s,gc_ac_clk,gc_ac_vclk,udays,hc_vclk_cnt,gc_vclk_cnt,e205_view,e205_tos,AdvID_copy,hc_p_ms_p,hc_c_ms_p,most_seen_cat_freq,hc_p_ls_p,currstage,hc_c_city);

and I get the error below:

INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
ERROR org.apache.pig.impl.PigContext - Undefined parameter : udf_path
2022-12-13 11:58:51,504 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException.
org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : udf_path

I have also tried most of the approaches in the console UI, but could not find good documentation for its two relevant fields: Query parameters ("Specify parameter names and values to insert in place of parameter entries in the query file. The query uses these values when it runs.") and Properties ("A list of key-value pairs to configure the job.").

Can someone guide me on what I am doing wrong here and how I can run Pig scripts in Dataproc?

Best Answer

Entries in --properties-file are applied as job configuration properties, not as Pig parameter substitutions, so parameters referenced as $name in the script (such as $udf_path) stay undefined. Pass them with the --params flag instead, like below:

  gcloud config set project PROJECT

gcloud dataproc jobs submit pig \
    --cluster=cluster-workaround \
    --region=us-east4 \
    --verbosity=debug \
    --properties-file=gs://bucket/cvr_gcs_one.properties \
    --file=gs://bucket-temp/your_pig.pig \
    --params udf_path=gs://your_udfs.jar
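
For completeness, here is a sketch that passes every parameter the sample script actually references ($udf_path, $csv_dir, $train_cvr), reusing the values from the properties file in the question; the --params flag takes a comma-separated list of key=value pairs:

    gcloud dataproc jobs submit pig \
        --cluster=cluster-workaround \
        --region=us-east4 \
        --file=gs://bucket-temp/intellibid-intermediat-cvr.pig \
        --params='udf_path=gs://bucket-data-science/emr/jars/udfs.jar,csv_dir=gs://bucket-db-dump/prod,train_cvr=gs://bucket-temp/{2022-12-09}'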
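
As an aside, the key=value file in the question happens to match Pig's own parameter-file format, so when running Pig directly (for example over SSH on the cluster's master node) the same file can drive parameter substitution via Pig's -param_file flag; a minimal sketch, assuming the script and the properties file have been copied to the local working directory:

    # Pig resolves $udf_path, $csv_dir, etc. from the key=value lines in the file
    pig -param_file cvr_gcs_one.properties intellibid-intermediat-cvr.pig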

Regarding google-cloud-platform - Error when submitting a Pig job to Google Dataproc with a properties file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/74784729/
