How can I monitor stalled tasks?

Translated by: bug小助手 | Updated: 2023-10-26 22:25:50

I am running a Rust app with Tokio in production. In the last version I had a bug, and some requests caused my code to go into an infinite loop.

While each task that entered the loop was stuck, all the other tasks kept working and processing requests. This went on until the number of stalled tasks was high enough to make my program unresponsive.

My problem is that it took our monitoring systems a long time to identify that something had gone wrong. For example, the task that answers Kubernetes' health check kept working, so I wasn't able to tell that I had stalled tasks in my system.

So my question is: is there a way to identify and alert on such cases?

If I could find a way to define a timeout on a task, so that a task that has not returned to the scheduler after X seconds/milliseconds is marked as stalled, that would be a good enough solution for me.

Recommended answers

Using tracing might be an option here: following issue 2655, every tokio task should have a span. Combined with tracing-futures, this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...), you should then be able to feed this information to an analysis tool in order to know:

that a task is blocked (was resumed then never suspended again)

if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue, though somewhat less so)

I think that's about the extent of it: as noted in issue 2510, tokio doesn't yet use the tracing information it generates, and so provides no "built-in" introspection facilities.
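
As a rough illustration of the first check above (a task was entered and never suspended again), here is a minimal std-only sketch that does not use tracing at all. It assumes you are willing to wrap each future yourself; `PollMonitor`, `Monitored`, `NoopWake` and `demo_stall_detected` are hypothetical names, not Tokio or tracing APIs. The wrapper records when a poll starts and clears that timestamp when the poll returns, and a watchdog on a separate OS thread flags any poll that has been running longer than a limit.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{mpsc, Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper shared by a wrapped task and a watchdog.
/// Holds the start time of the poll in progress, or `None` while suspended.
#[derive(Clone, Default)]
struct PollMonitor {
    poll_started: Arc<Mutex<Option<Instant>>>,
}

impl PollMonitor {
    /// How long the current poll has been running, if one is in progress.
    fn running_for(&self) -> Option<Duration> {
        let started = *self.poll_started.lock().unwrap();
        started.map(|t| t.elapsed())
    }

    /// True if the task has been inside a single poll for longer than `limit`.
    fn is_stalled(&self, limit: Duration) -> bool {
        self.running_for().map_or(false, |d| d > limit)
    }
}

/// Hypothetical wrapper future: marks the monitor on every poll entry and exit.
struct Monitored<F> {
    inner: Pin<Box<F>>,
    monitor: PollMonitor,
}

impl<F: Future> Monitored<F> {
    fn new(inner: F, monitor: PollMonitor) -> Self {
        Self { inner: Box::pin(inner), monitor }
    }
}

impl<F: Future> Future for Monitored<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // The lock is released before the inner poll, so the watchdog can read it.
        *self.monitor.poll_started.lock().unwrap() = Some(Instant::now());
        let result = self.inner.as_mut().poll(cx);
        *self.monitor.poll_started.lock().unwrap() = None;
        result
    }
}

/// Waker that does nothing; only needed to poll by hand in the demo.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Demo: a task blocks inside `poll` until the watchdog notices the stall.
fn demo_stall_detected() -> bool {
    let monitor = PollMonitor::default();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    // Watchdog on its own OS thread, so a blocked executor cannot starve it.
    let watched = monitor.clone();
    let watchdog = thread::spawn(move || loop {
        thread::sleep(Duration::from_millis(5));
        if watched.is_stalled(Duration::from_millis(20)) {
            // A real system would alert here; the demo just unblocks the task.
            let _ = release_tx.send(());
            break true;
        }
    });

    // Simulated stall: a blocking receive inside an async block.
    let mut task = Monitored::new(async move { release_rx.recv().ok(); }, monitor.clone());

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let finished = Pin::new(&mut task).poll(&mut cx).is_ready();

    watchdog.join().unwrap() && finished && monitor.running_for().is_none()
}
```

With Tokio you would presumably spawn `Monitored::new(fut, monitor.clone())` instead of polling by hand, and have the watchdog thread raise an alert or fail a health check rather than release the task. Running the watchdog on a plain thread is a deliberate choice: a watchdog task on the same runtime could itself be starved by the stalled tasks.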

Tokio Console is a monitoring solution built by the Tokio team. It can be used to monitor for stalled tasks among other things.

In spirit, it is like the top command but specifically for Tokio.

https://github.com/tokio-rs/console

More replies

Thanks for the answer. This sounds cool, but if I understand it correctly, this solution requires monitoring the log files created by tracing, and logging every event of a task being entered or suspended. Or could I create a task in my app that handles those events instead of using logs?

tracing is actually an instrumentation system. Though it has easy ways to use it as a logging system (if only for easy migration), you should be able to build your own subscriber to process those events however you wish. Take a look at tracing_subscriber and the OpenTelemetry or Gelf subscribers; tracing-gelf looks especially relevant, as it works by spawning a gelf Logger into a separate task.
