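This example uses Nutch 2.3.1 to crawl data. From the crawled pages I need to extract the title, the URL, the internal links, and the external links of each site. Any suggestions are welcome.
I load the data from HBase into Pig with this command: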
`data9 = load 'hbase://htest15_webpage' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:cnt', '-loadKey true');`
column=f:cnt, timestamp=1487743991250, value=<!DOCTYPE htm
l>\x0D\x0A<!--[if IE 7]>\x0D\x0A<html class="ie ie7" lang=
"en-US" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns
/fb#">\x0D\x0A<![endif]-->\x0D\x0A<!--[if IE 8]>\x0D\x0A<h
tml class="ie ie8" lang="en-US" prefix="og: http://ogp.me/
ns# fb: http://ogp.me/ns/fb#">\x0D\x0A<![endif]-->\x0D\x0A
<!--[if !(IE 7) | !(IE 8) ]><!-->\x0D\x0A<html lang="en-U
S" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#"
>\x0D\x0A<!--<![endif]-->\x0D\x0A<head>\x0D\x0A <meta cha
rset="UTF-8" /> \x0D\x0A <meta name="viewport" content="w
idth=device-width" /> \x0D\x0A \x0D\x0A<title>www.hardhe
ro.com</title>\x0A\x0A<!-- SEO Ultimate (http://www.seodes
ignsolutions.com/wordpress-seo/) -->\x0A\x09<meta name="de
scription" content="hardhero.com blog that provides intere
sting articles on general topics and ideas." />\x0A\x09<me
ta name="keywords" content="Internet Blog,SEO Blog,Interne
t Marketing Guide" />\x0A\x09<meta property="og:type" cont
ent="blog" />\x0A\x09<meta property="og:title" content="ww
w.hardhero.com" />\x0A\x09<meta property="og:url" content=
"http://hardhero.com/" />\x0A\x09<meta property="og:site_n
ame" content="www.hardhero.com" />\x0A\x09<meta name="twit
ter:card" content="summary" />\x0A<!-- /SEO Ultimate -->\x
0A\x0A<link rel="alternate" type="application/rss+xml" tit
le="www.hardhero.com » Feed" href="http://hardhero.c
om/feed/" />\x0A<link rel="alternate" type="application/rs
s+xml" title="www.hardhero.com » Comments Feed" href
="http://hardhero.com/comments/feed/" />\x0A\x09\x09<scrip
t type="text/javascript">\x0A\x09\x09\x09window._wpemojiSe
ttings = {"baseUrl":"http:\x5C/\x5C/s.w.org\x5C/images\x5C
/core\x5C/emoji\x5C/72x72\x5C/","ext":".png","source":{"co
ncatemoji":"http:\x5C/\x5C/hardhero.com\x5C/wp-includes\x5
C/js\x5C/wp-emoji-release.min.js?ver=4.2.11"}};\x0A\x09\x0
9\x09!function(a,b,c){function d(a){var c=b.createElement(
"canvas"),d=c.getContext&&c.getContext("2d");return d&&d.f
illText?(d.textBaseline="top",d.font="600 32px Arial","fla
g"===a?(d.fillText(String.fromCharCode(55356,56812,55356,5
6807),0,0),c.toDataURL().length>3e3):(d.fillText(String.fr
omCharCode(55357,56835),0,0),0!==d.getImageData(16,16,1,1)
.data[0])):!1}function e(a){var c=b.createElement("script"
);c.src=a,c.type="text/javascript",b.getElementsByTagName(
"head")[0].appendChild(c)}var f,g;c.supports={simple:d("si
mple"),flag:d("flag")},c.DOMReady=!1,c.readyCallback=funct
ion(){c.DOMReady=!0},c.supports.simple&&c.supports.flag||(
g=function(){c.readyCallback()},b.addEventListener?(b.addE
ventListener("DOMContentLoaded",g,!1),a.addEventListener("
load",g,!1)):(a.attachEvent("onload",g),b.attachEvent("onr
eadystatechange",function(){"complete"===b.readyState&&c.r
eadyCallback()})),f=c.source||{},f.concatemoji?e(f.concate
moji):f.wpemoji&&f.twemoji&&(e(f.twemoji),e(f.wpemoji)))}(
window,document,window._wpemojiSettings);\x0A\x09\x09</scr
ipt>\x0A\x09\x09<style type="text/css">\x0Aimg.wp-smiley,\
x0Aimg.emoji {\x0A\x09display: inline !important;\x0A\x09b
order: none !important;\x0A\x09box-shadow: none !important
;\x0A\x09height: 1em !important;\x0A\x09width: 1em !import
ant;\x0A\x09margin: 0 .07em !important;\x0A\x09vertical-al
ign: -0.1em !important;\x0A\x09background: none !important
;\x0A\x09padding: 0 !important;\x0A}\x0A</style>\x0A<link
rel='stylesheet' id='es-widget-css-css' href='http://hard
hero.com/wp-content/plugins/email-subscribers/widget/es-wi
dget.css?ver=4.2.11' type='text/css' media='all' />\x0A<li
nk rel='stylesheet' id='shootingstar-style-css' href='htt
p://hardhero.com/wp-content/themes/shootingstar/style.css?
ver=4.2.11' type='text/css' media='all' />\x0A<link rel='s
tylesheet' id='shootingstar-elegantfont-css' href='http:/
/hardhero.com/wp-content/themes/shootingstar/css/elegantfo
>
Best answer
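It is hard to reproduce your use case and give you a proper script, because the content you posted is incomplete and I cannot use it. As a general suggestion, apply XPath to the text with the piggybank XPath UDF, org.apache.pig.piggybank.evaluation.xml.XPath. If your values contain valid, well-formed HTML, you can run a set of XPath queries that return the title and the URLs.
See this link for an example of using Pig with XPath.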
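Crawled pages are often not well-formed XML, in which case an XML-based XPath query may return nothing. As a rough sketch of the extraction the question asks for (title, internal links, external links), the logic can be prototyped outside Pig with Python's standard library. Everything named here (`PageLinkParser`, `extract_page_info`, the sample markup) is hypothetical and for illustration only; treating a link as internal when its host equals the page's host exactly is an assumption, not something the original post specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageLinkParser(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_page_info(page_url, html):
    """Return the title plus internal and external links of one page.

    Assumption: a link is internal when its host matches the host of
    page_url exactly (so "www.example.com" and "example.com" differ).
    """
    parser = PageLinkParser()
    parser.feed(html)
    parser.close()

    site_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc.lower() == site_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return {"title": parser.title.strip(),
            "internal": internal,
            "external": external}


if __name__ == "__main__":
    # Hypothetical markup loosely modelled on the sample cell above.
    sample = ('<html><head><title>www.hardhero.com</title></head><body>'
              '<a href="/feed/">Feed</a>'
              '<a href="http://s.w.org/">Elsewhere</a>'
              '</body></html>')
    print(extract_page_info("http://hardhero.com/", sample))
```

The parser is lenient about malformed markup, which is why it is used here instead of an XML parser. If the values in `f:cnt` do parse as XML, the XPath UDF suggested above remains the more direct route inside Pig.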
Regarding "hadoop - How to extract specific data from web crawl data (Nutch) with a Pig script", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42385643/