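I'm trying to look through the files in my HDFS and evaluate which ones are older than a certain date. I'd like to run an hdfs ls and pass its output to a Pig LOAD command.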
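In an answer to How Can I Load Every File In a Folder Using PIG?, @DonaldMiner includes a shell script that outputs file names; I borrowed it to pass in a list of file names. However, I don't want to load the contents of the files. I only want to load the output of the ls command and treat the file names as text.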
Here is myfirstscript.pig:
test = LOAD '$files' as (moddate:chararray, modtime:chararray, filename:chararray);
illustrate test;
which I invoke like this:
pig -p files="`./filesysoutput.sh`" myfirstscript.pig
where filesysoutput.sh contains:
hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}' | tr '\n' ','
This generates output like the following:
2012-07-27 17:56 /hbase/imagestore/.tableinfo.0000000001,2012-04-23 19:27 /hbase/imagestore/08e36507d743367e1de57c359360b64c/.regioninfo,2012-05-10 12:13 /hbase/imagestore/08e36507d743367e1de57c359360b64c/0/7818124910159371133,2012-05-10 15:01 /hbase/imagestore/08e36507d743367e1de57c359360b64c/1/5537238047267916113,2012-05-09 19:40 /hbase/imagestore/08e36507d743367e1de57c359360b64c/2/6836317764645542272,2012-05-10 07:04 /hbase/imagestore/08e36507d743367e1de57c359360b64c/3/7276147895747401630,...
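Since I only need the date, the time, and the file name, those are the only fields I include in the output that is fed to the Pig script. When I try to run it, however, Pig is clearly trying to load the actual files into the test alias: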
$ pig -p files="`./filesysoutput.sh`" myfirstscript.pig
2013-05-29 17:40:10.773 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:10.827 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20,570 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
2013-05-29 17:40:20,769 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://stage-hadoop101.cluster:8020
2013-05-29 17:40:20,771 [main] WARN org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2013-05-29 17:40:20,773 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:20.836 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20.879 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:21,138 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:21,452 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file myfirstscript.pig, line 3, column 7> pig script failed to validate: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2012-07-27 17:56%20/hbase/imagestore/.tableinfo.0000000001
Details at logfile: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
Best answer
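You could try another approach: use a dummy.txt input file containing a single line, then use the STREAM alias THROUGH command to process the output of hadoop fs -ls, much as you are doing now: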
grunt> dummy = load '/tmp/dummy.txt';
grunt> fs -cat /tmp/dummy.txt;
dummy
grunt> files = STREAM dummy THROUGH
`hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { OFS="\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}'`
AS (moddate:chararray, modtime:chararray, filename:chararray);
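Note that the above is untested. I mocked up something similar with Pig in local mode and it worked (note that I added an OFS setting to the awk script and had to change the grep slightly):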
grunt> files = STREAM dummy THROUGH \
`ls -l | grep "\\-rw" | awk 'BEGIN { OFS = "\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $9}'` \
AS (month:chararray, day:chararray, file:chararray);
grunt> dump files
(Dec,31,CTX.DAT)
(Dec,23,examples.desktop)
(Feb,8,installer.img.gz)
(Feb,8,install.py)
(Apr,25,mapred-site.xml)
(Apr,14,sqlite)
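The reason for setting OFS to a tab is that Pig's default storage function splits fields on tabs, so each printed row maps onto the three declared columns. The same formatting step can be tried outside Pig. The following is a minimal sketch, not part of the original answer: the function names format_listing and older_than are hypothetical, it assumes the eight-column hadoop fs -ls layout implied by the question (date, time, and path in fields 6, 7, and 8), it assumes paths contain no whitespace, and it swaps the grep '\-rw' filter for an awk pattern that keeps lines starting with "-".

```shell
#!/bin/sh
# Hypothetical helpers; assumes `hadoop fs -ls -R` lines of the form
#   perms  replication  owner  group  size  date  time  path

# Read listing lines on stdin and print date, time, and path separated
# by tabs. Only regular files (lines starting with "-") are kept.
format_listing() {
    awk 'BEGIN { OFS = "\t" } /^-/ { print $6, $7, $8 }'
}

# Keep only the rows whose date column sorts before the cutoff.
# Dates in YYYY-MM-DD form compare correctly as plain strings.
older_than() {
    awk -F '\t' -v cutoff="$1" '$1 < cutoff'
}

# Possible usage (the cutoff date is an illustrative value):
#   hadoop fs -ls -R /hbase/imagestore | format_listing | older_than 2012-06-01
```

Because the date filter runs before Pig sees any data, this variant may be enough on its own when the only goal is to find files older than a given date.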
This question, hadoop - pig : How to load the output of an hdfs ls into an alias?, comes from Stack Overflow: https://stackoverflow.com/questions/16824499/