hdfs - Cloudera 5.4.2 : Avro block size is invalid or too large when using Flume and Twitter streaming

Reposted by: 行者123 · Updated: 2023-12-02 04:42:48

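I ran into a small problem while trying out Cloudera 5.4.2. I was following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

It uses Flume and Twitter streaming to fetch tweets for data analysis. Everything went smoothly: creating the Twitter app, creating the directory on HDFS, configuring Flume, starting to fetch data, and creating a schema on top of the tweets.

Then the problem appeared. The Twitter stream converts tweets to Avro format and sends Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".

So what is an Avro block, and what limits the block size? Can I change it? What does this message actually mean? Is the file at fault? Are some of the records at fault? If the Twitter stream had run into bad data, it should have crashed. And if converting the tweets to Avro format went fine, then the Avro data should be readable in turn, right?

I also tried avro-tools-1.7.7.jar: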

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}

{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`･ω･´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)

at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more

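Same problem. I googled a lot and found no answer at all.

If you have run into this problem too, can you give me a solution? Or, if you fully understand Avro or the Twitter stream underneath, could you offer a clue?

This is a really interesting problem. Think about it.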

Best answer

Use the Cloudera TwitterSource.

Otherwise you will run into this problem:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
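As background on the error message itself: per the Avro specification, an object container file is a sequence of blocks. Each block starts with two zigzag-encoded variable-length longs (the object count, then the byte size of the serialized objects) and ends with a 16-byte sync marker. The Python sketch below decodes such a long and rejects an out-of-range size, which is what the message in the trace suggests the Java reader does. The function names are made up for illustration. The idea that the reader landed on a stray byte such as 0x4F is an assumption used to show how a value like -40 can arise, not something the trace proves.

```python
import io


def read_zigzag_long(stream):
    """Decode one Avro long: base-128 varint bytes, then zigzag decoding."""
    shift = 0
    raw = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        raw |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if byte & 0x80 == 0:            # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)      # zigzag: 0, -1, 1, -2, 2, ...


def read_block_header(stream):
    """Read (object count, byte size) at the start of a block (hypothetical helper)."""
    count = read_zigzag_long(stream)
    size = read_zigzag_long(stream)
    # A negative size, or one that does not fit a Java int, cannot be a real block.
    if size < 0 or size > 2**31 - 1:
        raise IOError(f"Block size invalid or too large: {size}")
    return count, size


if __name__ == "__main__":
    # A well-formed header: 2 objects occupying 10 bytes.
    print(read_block_header(io.BytesIO(b"\x04\x14")))   # (2, 10)
    # A single stray byte 0x4F, read as a long, decodes to -40.
    print(read_zigzag_long(io.BytesIO(b"\x4f")))        # -40
```

Read this way, the error is not about a configurable size limit. It indicates that the bytes at the position where a block header was expected are not a valid header, which points to the file contents rather than to a reader setting.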

It should be the Cloudera TwitterSource instead:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

Also, do not just download the prebuilt jar, because our Cloudera version is 5.4.2. If you do, you will get this error:

Cannot run Flume because of JAR conflict

You should build it with Maven:

https://github.com/cloudera/cdh-twitter-example

Download and build flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Put the resulting jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Note: run a yum update to get the latest packages first. Otherwise the build (mvn package) will fail because of some security issues.

Regarding "hdfs - Cloudera 5.4.2 : Avro block size is invalid or too large when using Flume and Twitter streaming", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36053306/
