amazon-s3 - Correctly configuring the Kafka Connect S3 Sink TimeBasedPartitioner


Reposted by: 行者123. Last updated: 2023-12-02 11:29:20


I am trying to use the TimeBasedPartitioner of the Confluent S3 sink. Here is my configuration:

{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "file": "test.sink.txt",
    "topics": "xxxxx",
    "s3.region": "yyyyyy",
    "s3.bucket.name": "zzzzzzz",
    "s3.part.size": "5242880",
    "flush.size": "1000",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "timestamp.extractor": "Record",
    "timestamp.field": "local_timestamp",
    "path.format": "YYYY-MM-dd-HH",
    "partition.duration.ms": "3600000",
    "schema.compatibility": "NONE"
  }
}

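The data is binary and I use an Avro schema for it. I want to partition the data by the actual record field "local_timestamp", which is a UNIX timestamp, for example into hourly files.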

I start the connector with the usual REST API call:

curl -X POST -H "Content-Type: application/json" --data @s3-config.json http://localhost:8083/connectors

Unfortunately, the data is not partitioned the way I want. I also tried removing flush.size, since it might interfere. But then I got this error:

{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nMissing required configuration \"flush.size\" which has no default value.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}

Any idea how to set up the TimeBasedPartitioner correctly? I could not find a working example.

Also, how can I debug this kind of problem or find out more about what the connector is actually doing?

Any help or further suggestions are greatly appreciated.

Best answer

After studying the code in TimeBasedPartitioner.java and the logs with

confluent log connect tail -f

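I realized that both the timezone and the locale are mandatory, although the Confluent S3 Connector documentation does not say so. The following configuration fields solved the problem and let me upload correctly partitioned records to the S3 bucket: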

"flush.size": "10000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "local_timestamp",

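Note two things. First, a value for flush.size is also required: files are eventually split into smaller chunks, none larger than the number of records specified by flush.size. Second, path.format is better chosen as shown above, so that a proper directory tree is generated.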
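To illustrate how a path.format like the one above turns into a directory tree, here is a minimal Python sketch. It is not the connector's code: `partition_path` is a hypothetical helper, and it assumes that the timestamp is given in epoch milliseconds and that the timezone is UTC, as in the configuration above.

```python
from datetime import datetime, timezone

# Mirrors "partition.duration.ms" from the configuration above (one hour).
PARTITION_DURATION_MS = 3_600_000


def partition_path(timestamp_ms: int, duration_ms: int = PARTITION_DURATION_MS) -> str:
    """Hypothetical helper: map an epoch-millisecond timestamp to an hourly path.

    Assumes UTC, so rounding down to the window start is a plain modulo.
    """
    bucket_start_ms = timestamp_ms - (timestamp_ms % duration_ms)
    bucket_start = datetime.fromtimestamp(bucket_start_ms / 1000, tz=timezone.utc)
    # Python strftime equivalent of 'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    return bucket_start.strftime("year=%Y/month=%m/day=%d/hour=%H")


if __name__ == "__main__":
    # 2018-01-07 13:20:34 UTC expressed in milliseconds
    print(partition_path(1515331234000))
```

If local_timestamp holds seconds rather than milliseconds, a computation like this places every record in January 1970, so the unit of the field is worth checking first.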

I am still not 100% sure whether the record field local_timestamp is really used to partition the records.
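One way to reduce that uncertainty is to take a record out of an uploaded file and check whether its local_timestamp falls inside the hour encoded in the object key. The sketch below assumes epoch milliseconds, UTC, and the path.format from the answer; `key_matches_timestamp` and the example key are hypothetical, and reading the Avro file itself is left out.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the directory part produced by
# 'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
_PARTITION_RE = re.compile(r"year=(\d{4})/month=(\d{2})/day=(\d{2})/hour=(\d{2})")


def key_matches_timestamp(s3_key: str, timestamp_ms: int) -> bool:
    """Return True if timestamp_ms lies in the hourly window named by s3_key."""
    match = _PARTITION_RE.search(s3_key)
    if match is None:
        raise ValueError(f"no time partition found in key: {s3_key}")
    year, month, day, hour = (int(group) for group in match.groups())
    window_start = datetime(year, month, day, hour, tzinfo=timezone.utc)
    window_end = window_start + timedelta(hours=1)
    record_time = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return window_start <= record_time < window_end


if __name__ == "__main__":
    # Illustrative key only; the real object names depend on the connector.
    key = "topics/xxxxx/year=2018/month=01/day=07/hour=13/xxxxx+0+0000000000.avro"
    print(key_matches_timestamp(key, 1515331234000))
```

If records consistently fail this check but match the hour in which they were written, the partitioner is probably using a different time source than the record field.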

Any comments or improvements are very welcome.
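On the debugging side, the error message quoted in the question points to Kafka Connect's configuration validation endpoint. The sketch below shows one way to prepare such a request and to pick the per-field errors out of the response. The helper names `build_validate_request` and `collect_errors` are hypothetical, the base URL is the one from the curl call in the question, and the response layout (`configs`, `value`, `name`, `errors`) is assumed from the Kafka Connect REST API and should be checked against the version in use.

```python
import json
import urllib.request


def build_validate_request(config: dict, base_url: str = "http://localhost:8083") -> urllib.request.Request:
    """Build a PUT request against the validate endpoint for the connector class.

    `config` is the inner "config" map of the connector JSON, not the wrapper
    object with "name" and "config".
    """
    plugin = config["connector.class"]
    return urllib.request.Request(
        url=f"{base_url}/connector-plugins/{plugin}/config/validate",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def collect_errors(validation: dict) -> dict:
    """Map each config field name to its list of validation errors, if any."""
    return {
        entry["value"]["name"]: entry["value"]["errors"]
        for entry in validation.get("configs", [])
        if entry["value"]["errors"]
    }


# Usage against a running Connect worker (not executed here):
#   request = build_validate_request(config)
#   with urllib.request.urlopen(request) as response:
#       print(collect_errors(json.load(response)))
```

Validating before creating the connector surfaces missing required fields such as flush.size without having to read the worker logs.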

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/48128620/
