Collecting output from Apache Beam pipeline and displaying it to console



I have been working with Apache Beam for a couple of days. I want to iterate quickly on the application I am working on and make sure the pipeline I am building is error-free. In Spark we can use sc.parallelize, and when we apply some action we get a value that we can inspect.

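For comparison, this is the kind of quick check I mean in Spark (a minimal PySpark sketch, assuming a SparkContext named sc is already set up):

# Build a small RDD, transform it, and pull the results back to the driver
words = sc.parallelize(["this is test", "this is another test"]) \
          .flatMap(lambda line: line.split(" ")) \
          .map(lambda w: (w, 1)) \
          .reduceByKey(lambda a, b: a + b)

# collect() is an action, so the values come back as a plain Python list
print(words.collect())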



Similarly, when I was reading about Apache Beam, I found that we can create a PCollection and work with it using the following syntax:




with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["this is test", "this is another test"])
    word_count = (lines
                  | "Word" >> beam.ParDo(lambda line: line.split(" "))
                  | "Pair of One" >> beam.Map(lambda w: (w, 1))
                  | "Group" >> beam.GroupByKey()
                  | "Count" >> beam.Map(lambda (w, o): (w, sum(o))))
    result = pipeline.run()


I actually want to print the result to the console, but I couldn't find any documentation on how to do that.





Is there a way to print the result to the console instead of saving it to a file each time?



More replies

I have the same question as this post. I'm working with Java and don't know how to print intermediate values to the console. I would appreciate it if anybody could help me out.


Recommended answers

You don't need the temp list. In Python 2.7 the following should be sufficient:




def print_row(row):
    print row

(pipeline
 | ...
 | "print" >> beam.Map(print_row)
)

result = pipeline.run()
result.wait_until_finish()


In Python 3.x, print is a function, so the following is sufficient:




(pipeline
 | ...
 | "print" >> beam.Map(print)
)

result = pipeline.run()
result.wait_until_finish()


After exploring further and understanding how I can write test cases for my application, I figured out a way to print the result to the console. Please note that right now I am running everything on a single-node machine, trying to understand the functionality Apache Beam provides and how I can adopt it without compromising industry best practices.

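As a side note on the test-case route mentioned above, Beam ships testing utilities that let you assert on a PCollection's contents without printing anything. A minimal sketch (assuming the DirectRunner) could look like this:

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_word_count():
    with TestPipeline() as p:
        counts = (p
                  | beam.Create(["a b", "a"])
                  | beam.FlatMap(lambda line: line.split(" "))
                  | beam.Map(lambda w: (w, 1))
                  | beam.CombinePerKey(sum))
        # The assertion is checked as part of the pipeline run itself
        assert_that(counts, equal_to([("a", 2), ("b", 1)]))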



So, here is my solution. At the very last stage of our pipeline we can introduce a map function that will print the result to the console, or accumulate the result in a variable that we can print later to see the values.




import apache_beam as beam

# let's have some sample strings
data = ["this is sample data", "this is yet another sample data"]

# create a pipeline
pipeline = beam.Pipeline()
counts = (pipeline | "create" >> beam.Create(data)
          | "split" >> beam.ParDo(lambda row: row.split(" "))
          | "pair" >> beam.Map(lambda w: (w, 1))
          | "group" >> beam.CombinePerKey(sum))

# let's collect our result with a map transformation into the output array
output = []
def collect(row):
    output.append(row)
    return True

counts | "print" >> beam.Map(collect)

# Run the pipeline
result = pipeline.run()

# let's wait until the result is available
result.wait_until_finish()

# print the output
print output
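
For reference, with the sample data above the collected output should contain pairs such as ('this', 2), ('is', 2), ('sample', 2), ('data', 2), ('yet', 1) and ('another', 1), though the order of the elements is not guaranteed.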


Maybe logging info instead of print?



import logging

def _logging(elem):
    logging.info(elem)
    return elem

P | "logging info" >> beam.Map(_logging)


Following an example from PyCharm Edu:




import apache_beam as beam


class LogElements(beam.PTransform):
    class _LoggingFn(beam.DoFn):

        def __init__(self, prefix=''):
            super(LogElements._LoggingFn, self).__init__()
            self.prefix = prefix

        def process(self, element, **kwargs):
            print self.prefix + str(element)
            yield element

    def __init__(self, label=None, prefix=''):
        super(LogElements, self).__init__(label)
        self.prefix = prefix

    def expand(self, input):
        return input | beam.ParDo(self._LoggingFn(self.prefix))


class MultiplyByTenDoFn(beam.DoFn):

    def process(self, element):
        yield element * 10


p = beam.Pipeline()

(p | beam.Create([1, 2, 3, 4, 5])
   | beam.ParDo(MultiplyByTenDoFn())
   | LogElements())

p.run()


Output




10
20
30
40
50
Out[10]: <apache_beam.runners.portability.fn_api_runner.RunnerResult at 0x7ff41418a210>


with beam.Pipeline() as pipeline:
    lines = pipeline | beam.Create(["this is test", "this is another test"])
    word_count = (lines
                  | "Word" >> beam.ParDo(lambda line: line.split(" "))
                  | "Pair of One" >> beam.Map(lambda w: (w, 1))
                  | "Group" >> beam.GroupByKey()
                  | "Count" >> beam.Map(lambda o: (o[0], str(sum(o[1])))))
    word_count | beam.ParDo(lambda x: print(x))
    result = pipeline.run()

This works like a charm !



Output:
('this', '2')
('is', '2')
('test', '2')
('another', '1')
('this', '2')
('is', '2')
('test', '2')
('another', '1')




I know it isn't what you asked for, but why don't you store it in a text file? It's always better than printing it via stdout, and it isn't volatile.

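For example, a minimal sketch using Beam's text sink (the "counts" output prefix is just a placeholder):

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create(["this is test", "this is another test"])
     | beam.FlatMap(lambda line: line.split(" "))
     | beam.Map(lambda w: (w, 1))
     | beam.CombinePerKey(sum)
     | beam.Map(str)                      # WriteToText expects string lines
     | beam.io.WriteToText("counts"))     # writes shards like counts-00000-of-00001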


More replies

Note that if you try to add this in the middle of your pipeline, you may get the error TypeError: 'NoneType' object is not subscriptable from your pipeline. This is because print returns None, which gets passed along to the following transforms. In this case you will need slightly different code that prints the value and then returns it, as sketched below.

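A minimal sketch of that pattern, where pcoll stands for whatever PCollection you want to inspect:

def print_and_return(element):
    print(element)
    return element      # pass the element along to the next transform

pcoll | "debug print" >> beam.Map(print_and_return)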

Nice idea, but this won't work if your pipeline is executed in a distributed manner, for instance on Apache YARN (Hadoop) or on Google Dataflow. There must be another way to collect the results, but I'm still searching for it.

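One approach that does survive distributed execution is Beam's metrics API: counters are aggregated by the runner and can be queried from the PipelineResult afterwards. A rough sketch (it counts elements rather than collecting their values, and the exact result fields may vary by Beam version):

import apache_beam as beam
from apache_beam.metrics import Metrics
from apache_beam.metrics.metric import MetricsFilter

class CountingFn(beam.DoFn):
    def __init__(self):
        self.counter = Metrics.counter(self.__class__, 'elements_seen')

    def process(self, element):
        self.counter.inc()   # aggregated across all workers by the runner
        yield element

p = beam.Pipeline()
(p | beam.Create([1, 2, 3]) | beam.ParDo(CountingFn()))
result = p.run()
result.wait_until_finish()

# Query the aggregated counter from the pipeline result
query = result.metrics().query(MetricsFilter().with_name('elements_seen'))
for counter in query['counters']:
    print(counter.committed)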

When I use pipeline.run() I get this error: 'PBegin' object has no attribute 'windowing'.


This is great for unit tests in a DirectRunner.


In the more general case of not printing, but having the value available at runtime, I do have a use case (although I might be using it wrong). In the context of TensorFlow and TensorFlow Transform, which I am dealing with, I wanted to compute a count during the transform phase, which uses Beam, and then use this value in operations during training. So keeping the count in memory is handier than saving it to a file and loading it again. But as I said, this is not printing.


This is more of a comment than an answer

