google-cloud-platform - How to count the number of lines in the input file in Google Dataflow file processing?


I am trying to count the number of lines in the input file, and I am creating a template with the Cloud Dataflow runner. In the code below, I read a file from a GCS bucket, process it, and then store the output in a Redis instance.

However, I am unable to count the number of lines in the input file.

Main class

public static void main(String[] args) {
    /*
     * Construct a StorageToRedisOptions object using PipelineOptionsFactory.fromArgs
     * to read the options from the command line.
     */
    StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StorageToRedisOptions.class);

    Pipeline p = Pipeline.create(options);
    p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
            .apply("Transforming data...",
                    ParDo.of(new DoFn<String, String[]>() {
                        @ProcessElement
                        public void TransformData(@Element String line, OutputReceiver<String[]> out) {
                            // Split each pipe-delimited line into its fields.
                            String[] fields = line.split("\\|");
                            out.output(fields);
                        }
                    }))
            .apply("Processing data...",
                    ParDo.of(new DoFn<String[], KV<String, String>>() {
                        @ProcessElement
                        public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
                            if (fields[RedisIndex.GUID.getValue()] != null) {
                                // Emit one field:value -> GUID entry per indexed field.
                                out.output(KV.of("firstname:"
                                        .concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("lastname:"
                                        .concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("dob:"
                                        .concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("postalcode:"
                                        .concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
                            }
                        }
                    }))
            .apply("Writing field indexes into redis",
                    RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
                            .withEndpoint(options.getRedisHost(), options.getRedisPort()));
    p.run();
}
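(The post never shows the StorageToRedisOptions interface itself, only the getters the pipeline calls. A minimal sketch consistent with those calls and the command-line flags further below might look like the following; the descriptions and the default port are assumptions, not from the original.)

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical reconstruction: inferred from options.getInputFile(),
// options.getRedisHost(), and options.getRedisPort() used above.
public interface StorageToRedisOptions extends PipelineOptions {

    @Description("Path of the input file(s) to read, e.g. a gs:// glob")
    String getInputFile();
    void setInputFile(String value);

    @Description("Redis host to write to")
    String getRedisHost();
    void setRedisHost(String value);

    @Description("Redis port")
    @Default.Integer(6379) // assumed default; the original may differ
    int getRedisPort();
    void setRedisPort(int value);
}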

Sample input file

xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444
yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222
aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666
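(The RedisIndex enum is not shown either; judging from the sample rows, which read as guid|firstname|lastname|dob|postalcode, and the keys the pipeline builds, a plausible reconstruction is:)

// Hypothetical reconstruction of RedisIndex, inferred from the sample input
// and the field order the pipeline assumes; the original class is not shown.
public enum RedisIndex {
    GUID(0),
    FIRSTNAME(1),
    LASTNAME(2),
    DOB(3),
    POSTAL_CODE(4);

    private final int value;

    RedisIndex(int value) {
        this.value = value;
    }

    public int getValue() {
        return value;
    }
}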

Command to execute the pipeline

mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=my-project-id \
--jobName=dataflow-job \
--inputFile=gs://my-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://pipeline-bucket/stage/ \
--dataflowJobFile=gs://pipeline-bucket/templates/dataflow-template \
--runner=DataflowRunner"

I tried the following code from a StackOverflow solution, but it did not work for me. (That snippet targets the old Dataflow SDK 1.x; DirectPipelineRunner and its EvaluationResults no longer exist in Apache Beam 2.x, so it cannot be used with recent SDKs.)

PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

I have also gone through the Apache Beam documentation but did not find anything helpful there. Any help on this would be appreciated.

Best Answer

I solved this by applying Count.globally() to the PCollection<String> after the pipeline reads the file. Count.globally() yields a PCollection<Long> containing a single element: the total number of elements in the input.

I added the following code:

PCollection<String> lines = p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()));

lines.apply(Count.globally()).apply("Count the total records", ParDo.of(new RecordCount()));

Then I created a new class (RecordCount.java) that extends DoFn and simply logs the count.

RecordCount.java

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RecordCount extends DoFn<Long, Void> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecordCount.class);

    @ProcessElement
    public void processElement(@Element Long count) {
        // SLF4J needs a {} placeholder; without it the count is never printed.
        LOGGER.info("The total number of records in the input file is: {}", count);
    }
}
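As an aside, Beam's Metrics API offers another way to count elements: increment a Counter inside a pass-through DoFn and query it from the PipelineResult once the job finishes. The sketch below is an assumption layered on the same pipeline, not part of the original answer; the namespace "line-count" and counter name "lines-read" are illustrative. Note that querying the result in main() only works when the launcher waits for the job to finish, which is not the case when merely creating a template.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class MetricsLineCount {

    // Pass-through DoFn that bumps a counter for every line it sees.
    static class CountingFn extends DoFn<String, String> {
        private final Counter lineCounter = Metrics.counter("line-count", "lines-read");

        @ProcessElement
        public void processElement(@Element String line, OutputReceiver<String> out) {
            lineCounter.inc();
            out.output(line);
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply("Reading Lines...", TextIO.read().from("gs://my-input-bucket/*.txt"))
         .apply("Counting lines", ParDo.of(new CountingFn()));
        // ... the rest of the pipeline (transform, RedisIO.write()) would chain here ...

        PipelineResult result = p.run();
        result.waitUntilFinish();

        // Query the counter value back from the finished job.
        MetricQueryResults metrics = result.metrics().queryMetrics(
                MetricsFilter.builder()
                        .addNameFilter(MetricNameFilter.named("line-count", "lines-read"))
                        .build());
        for (MetricResult<Long> counter : metrics.getCounters()) {
            System.out.println("Total lines read: " + counter.getAttempted());
        }
    }
}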

Regarding "google-cloud-platform - How to count the number of lines in the input file in Google Dataflow file processing?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63944012/
