docker - What causes Flume with a GCS sink to throw an OutOfMemoryException

Reposted by: 行者123. Last updated: 2023-12-02 20:52:13

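I'm using Flume to write to Google Cloud Storage. Flume listens for HTTP on port 9000. It took me a while to get it working (adding the GCS library, using a credentials file...), but it now seems to communicate over the network.

I'm sending very small HTTP requests for testing, and I have plenty of RAM available: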

curl -X POST -d '[{ "headers" : { "timestamp" : "1417444588182", "env" : "dev", "tenant" : "myTenant", "type" : "myType" }, "body" : "some body ONE" }]' localhost:9000

I get this out-of-memory error on the very first request (after which Flume stops working):

2014-11-28 16:59:47,748 (hdfs-hdfs_sink-call-runner-0) [INFO - com.google.cloud.hadoop.util.LogUtil.info(LogUtil.java:142)] GHFS version: 1.3.0-hadoop2
2014-11-28 16:59:50,014 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:467)] process failed
java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:820)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)

(See the complete stack trace as a gist for details.)

Oddly, the folders and the file are created the way I want, but the file is empty.

gs://my_bucket/dev/myTenant/myType/2014-12-01/14-36-28.1417445234193.json.tmp

Is there something wrong with the way I configured Flume + GCS, or is this a bug in GCS.jar?

Where should I look to gather more data?

P.S. I'm running flume-ng inside Docker.

My flume.conf file:

# Name the components on this agent
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem

# Describe/configure the source
a1.sources.http.type =  org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000

# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my_bucket/%{env}/%{tenant}/%{type}/%Y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000
a1.channels.mem.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem

Related question from my Flume/GCS journey: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

Best answer

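When uploading files, the GCS Hadoop FileSystem implementation sets aside a fairly large write buffer (64 MB) for each FSDataOutputStream (that is, each file open for writing). You can change this by setting "fs.gs.io.buffersize.write" to a smaller value, in bytes, in core-site.xml. I imagine 1 MB would be enough for low-volume log collection.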
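As an illustration of the write-buffer setting, a core-site.xml fragment might look like the sketch below. The property name comes from the answer; the value of 1048576 bytes (1 MB) is only the answer's rough suggestion, and the fragment assumes it is merged into whatever core-site.xml Flume already loads rather than replacing it.

```xml
<!-- Sketch only: merge this property into the existing core-site.xml -->
<configuration>
  <property>
    <!-- Per-stream write buffer for the GCS connector, in bytes -->
    <name>fs.gs.io.buffersize.write</name>
    <!-- 1 MB = 1024 * 1024 bytes; an illustrative value, tune as needed -->
    <value>1048576</value>
  </property>
</configuration>
```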

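Also, check what the maximum heap size is set to when the JVM for Flume is launched. The flume-ng script sets a default JAVA_OPTS value of -Xmx20m, which limits the heap to 20 MB. With that default, a single 64 MB write buffer cannot be allocated, which is consistent with the error appearing on the first request. You can set a larger value in flume-env.sh (see conf/flume-env.sh.template in the Flume tarball distribution for details).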
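A minimal sketch of the heap override, assuming a conf/flume-env.sh created from the template mentioned above. The 512 MB ceiling is an arbitrary illustrative value, not something the answer recommends; choose one that fits the container's memory limit and the write-buffer size in use.

```shell
# conf/flume-env.sh (sketch): sourced by the flume-ng launcher at startup.
# Replaces the launcher's default of -Xmx20m.
# The sizes below are illustrative assumptions, not recommended values.
export JAVA_OPTS="-Xms100m -Xmx512m"
```

When running in Docker, this file has to be present in the conf directory passed to flume-ng inside the container for the override to take effect.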

Regarding "docker - What causes Flume with a GCS sink to throw an OutOfMemoryException", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27232966/
