
hdfs - Cloudera 5.4.2 : Avro block size is invalid or too large when using Flume and Twitter streaming

Reposted. Author: 行者123. Updated: 2023-12-02 04:42:48

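I ran into a small problem while trying out Cloudera 5.4.2. I was following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

It uses Flume and Twitter streaming to fetch tweets for data analysis. Everything went smoothly: creating the Twitter application, creating the directory on HDFS, configuring Flume, starting to fetch data, and creating a schema on top of the tweets.

Then the problem appeared. The Twitter source converts tweets to Avro format and sends the Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".

So what is an Avro block, and what is the limit on the block size? Can I change it? What does this message actually mean? Is the whole file at fault, or only some records? If the Twitter stream had hit bad data, I would expect it to crash. And if converting the tweets to Avro went fine, then reading the Avro data back should work too, right?

I also tried avro-tools-1.7.7.jar: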

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}

{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)

at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more

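Same problem. I googled a lot and found no answer at all.

Has anyone who hit this problem found a solution? Or could someone who really understands Avro, or the Twitter streaming source underneath, give me a clue?

It is a really interesting problem. Think about it.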
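
On the question of what the block size is: in the Avro object container file format, every data block starts with two longs, the number of objects in the block and the size of the serialized block in bytes. Each is stored as a zigzag-encoded variable-length integer. The reader rejects a size that is negative or does not fit in a Java int, so this is a sanity check on the file contents, not a tunable limit. The sketch below is a minimal Python re-implementation of that header decoding, not the Avro library's code, and the function names are made up for illustration. It shows that a single byte 0x4F at the size position decodes to -40, the value in the stack trace above. That is consistent with the reader landing on bytes that are not a real block header, but it does not tell which bytes were actually in the file.

```python
import io

MAX_JAVA_INT = 2**31 - 1  # the Java reader holds a block in an int-sized buffer


def read_long(stream):
    """Decode one Avro long: a base-128 varint, then zigzag decoding."""
    raw = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        raw |= (byte & 0x7F) << shift  # low 7 bits carry data
        if not (byte & 0x80):          # high bit clear: last byte of the varint
            break
        shift += 7
    # Zigzag: even raw values map to non-negative numbers, odd ones to negative.
    return (raw >> 1) ^ -(raw & 1)


def read_block_header(stream):
    """Read (object_count, block_size) the way a container-file reader would."""
    count = read_long(stream)
    size = read_long(stream)
    if size < 0 or size > MAX_JAVA_INT:
        raise ValueError(
            f"Block size invalid or too large for this implementation: {size}"
        )
    return count, size


if __name__ == "__main__":
    # A plausible header: 2 objects, 40 bytes of data.
    print(read_block_header(io.BytesIO(b"\x04\x50")))  # (2, 40)
    # The same position holding 0x4F decodes to -40 and is rejected.
    print(read_long(io.BytesIO(b"\x4f")))              # -40
```

Because the size is read from wherever the reader currently is in the file, any stray or missing bytes between blocks shift the position and produce an arbitrary value like this one.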

Best answer

Use the Cloudera TwitterSource.

Otherwise you will run into this problem:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.

It should be the Cloudera TwitterSource instead:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

Also, do not just download the prebuilt jar, because our Cloudera version is 5.4.2. If you do, you will get this error:

Cannot run Flume because of JAR conflict

You should compile it with Maven:

https://github.com/cloudera/cdh-twitter-example

Download the source and build flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

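Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Put the jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Note: run yum update first to get the latest versions; otherwise the build (mvn package) fails because of some security issues.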
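
Before pointing a Hive table at the output, it can help to check what the HDFS sink actually wrote. The sketch below assumes one output file has already been copied to the local disk; the function name and the returned labels are hypothetical. It relies only on the fact that an Avro object container file starts with the four bytes "Obj" followed by the byte 1. Treating a leading "{" as line-delimited JSON is a heuristic, and that the Cloudera source emits raw JSON tweets is my understanding rather than something stated above, so verify it against your own files.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def sniff_flume_file(path):
    """Guess the format of a Flume output file from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head.startswith(AVRO_MAGIC):
        return "avro-container"
    if head.lstrip().startswith(b"{"):
        # Heuristic only: looks like one JSON object per line.
        return "json-text"
    return "unknown"
```

For example, calling sniff_flume_file("FlumeData.1458090051232") on a local copy of the file from the question would report whether it at least begins as an Avro container. A result of "avro-container" does not prove the rest of the file is intact.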

Source: the original question, "hdfs - Cloudera 5.4.2 : Avro block size is invalid or too large when using Flume and Twitter streaming", is on Stack Overflow: https://stackoverflow.com/questions/36053306/
