
.net - Service segfaults in AWS EC2

Reposted. Author: 行者123. Last updated: 2023-12-04 04:32:50

My service runs on EC2 (under systemd). It is a self-contained application built for .NET Core 2.1.
Occasionally (a few times a week) it crashes with SEGV.

Apr 30 21:20:51 ip-10-4-226-55 kernel: traps: App.Name[26176] general protection ip:7f22da3609da sp:7f1fedf11510 error:0 in libc-2.26.so[7f22da2e3000+1ad000]

Apr 30 21:20:51 ip-10-4-226-55 systemd: appname.service: main process exited, code=killed, status=11/SEGV

Apr 30 21:20:51 ip-10-4-226-55 systemd: Unit appname.service entered failed state.

Apr 30 21:20:51 ip-10-4-226-55 systemd: appname.service failed.


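For some reason no crash dump is created (even though I removed the size limit).
How can I investigate the problem further? What could be the root cause?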

Best answer

How can I investigate the problem further?



I'm on Arch Linux, so things may differ (even though systemd is present on both), but here is what I would try:
  • Does the system create any core dumps at all?

  • Let's dump a harmless core to test.
    In a bash shell:

    sleep 200 & kill -11 "$!"

    This shows the following in dmesg -w:
    [17894.861369] systemd[1]: Started Process Core Dump (PID 31964/UID 0).
    [17895.030166] systemd-coredump[31975]: Process 31963 (bash) of user 1000 dumped core.

    Stack trace of thread 31963:
    #0 0x00007c0aff6c642b kill (libc.so.6)
    #1 0x000056e836d6c56a termsig_handler.part.2 (bash)
    #2 0x000056e836d6c6d3 termsig_handler (bash)
    #3 0x000056e836d3a1b3 execute_simple_command (bash)
    #4 0x000056e836d3b20e execute_command_internal (bash)
    #5 0x000056e836d3b469 execute_command_internal (bash)
    #6 0x000056e836d3cf12 execute_command (bash)
    #7 0x000056e836d247f2 reader_loop (bash)
    #8 0x000056e836d2320d main (bash)
    #9 0x00007c0aff6b21bb __libc_start_main (libc.so.6)
    #10 0x000056e836d235ce _start (bash)

    [17895.030324] systemd[1]: systemd-coredump@5-31964-0.service: Succeeded.

    And coredumpctl -r | head -2 lists it as the most recent one:
    TIME                            PID   UID   GID SIG COREFILE  EXE
    Sat 2019-05-18 21:48:22 CEST 31963 1000 1000 11 present /usr/bin/bash

    Also:
    $ ls -rlat /var/lib/systemd/coredump/|tail -n1
    -rw-r-----+ 1 root root 3907584 18.05.2019 21:48 core.bash.1000.6d7dce73cd2342759a18d47914c16007.31963.1558208902000000
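The test crash above can also be checked from the shell itself, independent of whether a dump was stored. The following is a minimal sketch, assuming a Linux system with a POSIX shell; `crash_test_status` is a hypothetical helper name, not part of systemd or coredumpctl. It kills a throwaway `sleep` with signal 11 and prints the exit status the shell observed, which for a process killed by a signal is 128 plus the signal number.

```shell
# Hypothetical helper: kill a harmless background process with SIGSEGV (11)
# and print the exit status that the shell reports for it.
crash_test_status() {
    sleep 200 &
    pid=$!
    kill -11 "$pid"
    status=0
    # wait returns 128 + signal number when the child was killed by a signal
    wait "$pid" 2>/dev/null || status=$?
    echo "$status"
}

crash_test_status   # prints 139 (128 + 11)
```

If this prints 139 but no new entry shows up in coredumpctl, the process did die from SIGSEGV and the problem is on the dump-collection side.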

    So, since that dump is the most recent one, I can just run coredumpctl gdb to start gdb on it, and then type thread apply all bt full inside gdb to see some information:
    $ coredumpctl gdb
    PID: 31963 (bash)
    UID: 1000 (user)
    GID: 1000 (user)
    Signal: 11 (SEGV)
    Timestamp: Sat 2019-05-18 21:48:22 CEST (3min 51s ago)
    Command Line: -bash
    Executable: /usr/bin/bash
    Control Group: /user.slice/user-1000.slice/session-1.scope
    Unit: session-1.scope
    Slice: user-1000.slice
    Session: 1
    Owner UID: 1000 (user)
    Boot ID: 6d7dce73cd2342759a18d47914c16007
    Machine ID: 5767ef25f523419aaa049f3d74481940
    Hostname: i87k
    Storage: /var/lib/systemd/coredump/core.bash.1000.6d7dce73cd2342759a18d47914c16007.31963.1558208902000000
    Message: Process 31963 (bash) of user 1000 dumped core.

    Stack trace of thread 31963:
    #0 0x00007c0aff6c642b kill (libc.so.6)
    #1 0x000056e836d6c56a termsig_handler.part.2 (bash)
    #2 0x000056e836d6c6d3 termsig_handler (bash)
    #3 0x000056e836d3a1b3 execute_simple_command (bash)
    #4 0x000056e836d3b20e execute_command_internal (bash)
    #5 0x000056e836d3b469 execute_command_internal (bash)
    #6 0x000056e836d3cf12 execute_command (bash)
    #7 0x000056e836d247f2 reader_loop (bash)
    #8 0x000056e836d2320d main (bash)
    #9 0x00007c0aff6b21bb __libc_start_main (libc.so.6)
    #10 0x000056e836d235ce _start (bash)

    GNU gdb (GDB) 8.2.1
    Copyright (C) 2018 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-pc-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from /usr/bin/bash...done.
    [New LWP 31963]
    Core was generated by `-bash'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x00007c0aff6c642b in kill () at ../sysdeps/unix/syscall-template.S:78
    78 ../sysdeps/unix/syscall-template.S: No such file or directory.
    (gdb) thread apply all bt full

    Thread 1 (LWP 31963):
    #0 0x00007c0aff6c642b in kill () at ../sysdeps/unix/syscall-template.S:78
    No locals.
    #1 0x000056e836d6c56a in termsig_handler.part ()
    No symbol table info available.
    #2 0x000056e836d6c6d3 in termsig_handler ()
    No symbol table info available.
    #3 0x000056e836d3a1b3 in execute_simple_command ()
    No symbol table info available.
    #4 0x000056e836d3b20e in execute_command_internal ()
    No symbol table info available.
    #5 0x000056e836d3b469 in execute_command_internal ()
    No symbol table info available.
    #6 0x000056e836d3cf12 in execute_command ()
    No symbol table info available.
    #7 0x000056e836d247f2 in reader_loop ()
    No symbol table info available.
    #8 0x000056e836d2320d in main ()
    No symbol table info available.
    (gdb)

    There is not much to see, because bash was either compiled without debug symbols or had them stripped.
    Recompile bash with extra CFLAGS, exported before running ./configure ... && make:

    export CFLAGS="${CFLAGS} -fstack-protector-strong -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments -O0"

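    (You probably do not want -O0 if you want to preserve your program's current behavior, otherwise it might no longer crash.)
    Re-running the sleep command above to create a new coredump then gives this much richer result: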
    $ coredumpctl gdb
    PID: 29241 (bash)
    UID: 1000 (user)
    GID: 1000 (user)
    Signal: 11 (SEGV)
    Timestamp: Sat 2019-05-18 22:01:41 CEST (13s ago)
    Command Line: -bash
    Executable: /usr/bin/bash
    Control Group: /user.slice/user-1000.slice/session-1.scope
    Unit: session-1.scope
    Slice: user-1000.slice
    Session: 1
    Owner UID: 1000 (user)
    Boot ID: 6d7dce73cd2342759a18d47914c16007
    Machine ID: 5767ef25f523419aaa049f3d74481940
    Hostname: i87k
    Storage: /var/lib/systemd/coredump/core.bash.1000.6d7dce73cd2342759a18d47914c16007.29241.1558209701000000
    Message: Process 29241 (bash) of user 1000 dumped core.

    Stack trace of thread 29241:
    #0 0x00007775d0d2642b kill (libc.so.6)
    #1 0x000060b781bce2c8 termsig_handler (bash)
    #2 0x000060b781b9107b execute_simple_command (bash)
    #3 0x000060b781b8aa1c execute_command_internal (bash)
    #4 0x000060b781b8dde0 execute_connection (bash)
    #5 0x000060b781b8ade5 execute_command_internal (bash)
    #6 0x000060b781b89f45 execute_command (bash)
    #7 0x000060b781b72e66 reader_loop (bash)
    #8 0x000060b781b70906 main (bash)
    #9 0x00007775d0d121bb __libc_start_main (libc.so.6)
    #10 0x000060b781b6fe2e _start (bash)

    GNU gdb (GDB) 8.2.1
    Copyright (C) 2018 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-pc-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from /usr/bin/bash...done.
    [New LWP 29241]
    Core was generated by `-bash'.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x00007775d0d2642b in kill () at ../sysdeps/unix/syscall-template.S:78
    78 ../sysdeps/unix/syscall-template.S: No such file or directory.
    (gdb) thread apply all bt full

    Thread 1 (LWP 29241):
    #0 0x00007775d0d2642b in kill () at ../sysdeps/unix/syscall-template.S:78
    No locals.
    #1 0x000060b781bce2c8 in termsig_handler (sig=11) at sig.c:597
    handling_termsig = 1
    i = -2097452368
    core = 24759
    mask = {__val = {140729597269152, 106341271890333, 106341294191024, 106341271640912, 106341291662592, 106341294178848,
    140729597269200, 106341271910973, 140729597269200, 106341294191024, 106341271640912, 106341271911463,
    106341272462763, 0, 140729597269232, 106341271911163}}
    #2 0x000060b781b9107b in execute_simple_command (simple_command=0x60b78310a8c0, pipe_in=-1, pipe_out=-1, async=1,
    fds_to_close=0x60b7831196b0) at execute_cmd.c:4394
    words = 0x60b78310b1b0
    lastword = 0x7ffe29a79910
    command_line = 0x0
    lastarg = 0x0
    temp = 0x0
    first_word_quoted = 0
    result = 0
    builtin_is_special = 0
    already_forked = 1
    dofork = 1
    old_last_async_pid = -1
    builtin = 0x0
    func = 0x0
    old_builtin = 0
    old_command_builtin = -2098586400
    #3 0x000060b781b8aa1c in execute_command_internal (command=0x60b783107410, asynchronous=1, pipe_in=-1, pipe_out=-1,
    fds_to_close=0x60b7831196b0) at execute_cmd.c:845
    exec_result = 0
    user_subshell = 0
    invert = 0
    ignore_return = 0
    was_error_trap = 0
    my_undo_list = 0x0
    exec_undo_list = 0x0
    tcmd = 0x0
    save_line_number = 1
    ofifo = 0
    nfifo = 0
    osize = 0
    saved_fifo = 0
    ofifo_list = 0x5b0000006e <error: Cannot access memory at address 0x5b0000006e>
    #4 0x000060b781b8dde0 in execute_connection (command=0x60b783119680, asynchronous=0, pipe_in=-1, pipe_out=-1,
    fds_to_close=0x60b7831196b0) at execute_cmd.c:2690
    tc = 0x60b783107410
    --Type <RET> for more, q to quit, c to continue without paging--c
    second = 0x0
    ignore_return = 0
    exec_result = -2098586400
    was_error_trap = 0
    invert = 3
    save_line_number = 0
    #5 0x000060b781b8ade5 in execute_command_internal (command=0x60b783119680, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x60b7831196b0) at execute_cmd.c:1018
    exec_result = 0
    user_subshell = 0
    invert = 0
    ignore_return = 0
    was_error_trap = 32766
    my_undo_list = 0x0
    exec_undo_list = 0x0
    tcmd = 0x0
    save_line_number = -2117800288
    ofifo = 24759
    nfifo = -2096071056
    osize = 24759
    saved_fifo = 0
    ofifo_list = 0x60b781b89d9c <dispose_fd_bitmap> "UH\211\345H\203\354\020H\211}\370H\213E\370H\213@\bH\205\300t\020H\213E\370H\213@\bH\211\307\350\253R\376\377H\213E\370H\211\307\350\237R\376\377\220\311\303UH\211\345SH\203\354\030H\211}\350H\203", <incomplete sequence \350>
    #6 0x000060b781b89f45 in execute_command (command=0x60b783119680) at execute_cmd.c:394
    bitmap = 0x60b7831196b0
    result = 0
    #7 0x000060b781b72e66 in reader_loop () at eval.c:175
    code = 0
    our_indirection_level = 1
    current_command = 0x60b783119680
    #8 0x000060b781b70906 in main (argc=1, argv=0x7ffe29a79918, env=0x7ffe29a79928) at shell.c:805
    i = 20
    code = 0
    old_errexit_flag = 0
    saverst = 0
    locally_skip_execution = 0
    arg_index = 1
    top_level_arg_index = 1
    (gdb)


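    However, it is possible that the coredump is created but systemd cleans it up or deletes it a while later (for example, coredumpctl reports all my coredumps older than 3 days as missing; I don't know why, given my settings; maybe you are seeing a similar problem?), or that it cannot even be created because of space constraints (see the /etc/systemd/coredump.conf settings mentioned below).
    Let's check: is systemd-coredump even set up to run and create the coredump?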
    $ sysctl -a |grep kernel.core
    kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
    kernel.core_pipe_limit = 0
    kernel.core_uses_pid = 1
    $ ls -la /usr/lib/systemd/systemd-coredump
    -rwxr-xr-x 1 root root 55296 13.05.2019 11:46 /usr/lib/systemd/systemd-coredump*
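The same check can be scripted. This is a rough sketch, assuming a Linux host where `/proc/sys/kernel/core_pattern` is readable; `core_pattern_kind` is a hypothetical helper name. It relies only on the documented rule that a pattern starting with `|` is piped to a program, while anything else is treated as a file name template.

```shell
# Hypothetical helper: classify a kernel.core_pattern value.
core_pattern_kind() {
    case $1 in
        '|'*systemd-coredump*) echo "piped to systemd-coredump" ;;
        '|'*)                  echo "piped to another handler" ;;
        *)                     echo "written as a plain file" ;;
    esac
}

# Classify the pattern of the running kernel (output depends on the host).
core_pattern_kind "$(cat /proc/sys/kernel/core_pattern 2>/dev/null)"
```

On the EC2 instance, a result other than "piped to systemd-coredump" would suggest that coredumpctl never receives the dump, and that a core file (if any) should be looked for wherever the pattern points.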


    Does the kernel support core dumps?
    $ zcat /proc/config.gz |grep -i 'core.*dump'
    CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
    CONFIG_COREDUMP=y
    CONFIG_ALLOW_DEV_COREDUMP=y
    # CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
    CONFIG_COREDUMP=y is probably enough.

    Other things I would look at:
    $ systemctl|grep core
    systemd-coredump.socket loaded active listening Process Core Dump Socket
    $ cat /etc/systemd/coredump.conf
    # This file is part of systemd.
    #
    # systemd is free software; you can redistribute it and/or modify it
    # under the terms of the GNU Lesser General Public License as published by
    # the Free Software Foundation; either version 2.1 of the License, or
    # (at your option) any later version.
    #
    # Entries in this file show the compile time defaults.
    # You can change settings by editing this file.
    # Defaults can be restored by simply deleting this file.
    #
    # See coredump.conf(5) for details.

    [Coredump]
    #Storage=external
    #Compress=yes
    Compress=no
    #ProcessSizeMax=2G
    ProcessSizeMax=10G
    #ExternalSizeMax=2G
    ExternalSizeMax=10G
    #JournalSizeMax=767M
    JournalSizeMax=10G
    #MaxUse=
    #KeepFree=
    man 5 coredump.conf shows some information:
           All options are configured in the "[Coredump]" section:

    Storage=
    Controls where to store cores. One of "none", "external", and "journal". When "none", the core dumps may be
    logged (including the backtrace if possible), but not stored permanently. When "external" (the default), cores
    will be stored in /var/lib/systemd/coredump/. When "journal", cores will be stored in the journal and rotated
    following normal journal rotation patterns.

    When cores are stored in the journal, they might be compressed following journal compression settings, see
    journald.conf(5). When cores are stored externally, they will be compressed by default, see below.

    Compress=
    Controls compression for external storage. Takes a boolean argument, which defaults to "yes".

    ProcessSizeMax=
    The maximum size in bytes of a core which will be processed. Core dumps exceeding this size may be stored, but
    the backtrace will not be generated.

    Setting Storage=none and ProcessSizeMax=0 disables all coredump handling except for a log entry.

    ExternalSizeMax=, JournalSizeMax=
    The maximum (uncompressed) size in bytes of a core to be saved.

    MaxUse=, KeepFree=
    Enforce limits on the disk space taken up by externally stored core dumps. MaxUse= makes sure that old core
    dumps are removed as soon as the total disk space taken up by core dumps grows beyond this limit (defaults to 10%
    of the total disk size). KeepFree= controls how much disk space to keep free at least (defaults to 15% of the
    total disk size). Note that the disk space used by core dumps might temporarily exceed these limits while core
    dumps are processed. Note that old core dumps are also removed based on time via systemd-tmpfiles(8). Set either
    value to 0 to turn off size-based clean-up.

    The defaults for all values are listed as comments in the template /etc/systemd/coredump.conf file that is installed
    by default.
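Because the template file lists defaults as comments, it can be hard to see at a glance which settings are actually overridden. As a small sketch (assuming a POSIX shell with awk; `active_coredump_settings` is a hypothetical helper name), the following prints only the uncommented key=value lines of the `[Coredump]` section of a given file.

```shell
# Hypothetical helper: print the uncommented settings of the [Coredump]
# section of a coredump.conf-style file.
active_coredump_settings() {
    awk '
        # Track whether the current section is [Coredump]
        /^[ \t]*\[/ { in_section = ($0 ~ /^[ \t]*\[Coredump\]/); next }
        # Inside it, print lines that look like Key=value (comments are skipped)
        in_section && /^[ \t]*[A-Za-z]+=/ { sub(/^[ \t]+/, ""); print }
    ' "$1"
}

# Show the overrides on this machine, if the file exists.
if [ -r /etc/systemd/coredump.conf ]; then
    active_coredump_settings /etc/systemd/coredump.conf
fi
```

For the file shown above this would list Compress=no and the three 10G size limits. Note that it looks at one file only; it does not merge any drop-in files.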

    $ cd /etc/systemd && grep -nrIFi core
    coredump.conf:12:# See coredump.conf(5) for details.
    coredump.conf:14:[Coredump]
    system.conf:19:DumpCore=yes
    system.conf:20:#DefaultLimitCORE=
    system.conf:21:#^ man 2 setrlimit: RLIMIT_CORE
    system.conf:22:#This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0 no core dump
    user.conf:34:#DefaultLimitCORE=
    user.conf:35:#^ man 2 setrlimit: RLIMIT_CORE
    user.conf:36:#This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0 no core dump

    These look fine to me. (If you change them, you need to run sudo systemctl daemon-reload.)

    See also man 8 systemd-coredump, which says that core dumps are saved in /var/lib/systemd/coredump, and where you can find other useful information (plus a pointer to man 5 core).

    Another thing I have changed:
    $ colordiff -up /etc/security/limits.conf.ORIG /etc/security/limits.conf
    --- /etc/security/limits.conf.ORIG 2017-12-29 21:26:09.000000000 +0100
    +++ /etc/security/limits.conf 2017-12-29 21:26:09.000000000 +0100
    @@ -47,4 +47,11 @@
    #ftp hard nproc 0
    #@student - maxlogins 4

    +#* soft core unlimited
    +#^ this doesn't affect the root user!! what the!
    +#@root soft core unlimited
    +0: soft core unlimited
    +#^ all uids from 0 upwards! so what I thought * was doing!
    +#hmm works with su -, but not with ssh !
    +
    # End of file

    I.e. I am using the line 0: soft core unlimited instead of the commonly recommended * soft core unlimited, although I now notice that Arch Linux recommends: * hard core 0
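Note that limits.conf is applied through PAM sessions, so it may not affect a service started directly by systemd; for a unit, the limit normally comes from LimitCORE= in the unit file or from the DefaultLimitCORE= setting seen in system.conf above. Rather than guessing, the limit a running process actually received can be read from /proc. A minimal sketch, assuming Linux; `core_limit_of` is a hypothetical helper name:

```shell
# Hypothetical helper: print the soft core-file size limit of a process,
# as recorded in /proc/<pid>/limits ("unlimited" or a size in bytes).
core_limit_of() {
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}

# Example: the limit of the current shell. For the crashing service,
# pass its main PID instead.
core_limit_of "$$"
```

A value of 0 here for the service's PID would by itself explain a missing core file.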
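    Another thing I would do is recompile glibc with full debug information and symbols, so that they are available the next time the program crashes in libc-2.26.so. The way I do this is by making sure strip (from the PKGBUILD) is not run, and I use: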
    CPPFLAGS="${CPPFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments -O2"
    CXXFLAGS="${CXXFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments"
    CFLAGS="${CFLAGS} -fno-omit-frame-pointer -ftrack-macro-expansion=2 -ggdb -fvar-tracking-assignments"

    If there is still no coredump (for your program!), look at /proc/<pid>/coredump_filter, described in the kernel's Documentation/filesystems/proc.txt.
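The value in coredump_filter is a hexadecimal bitmask that selects which kinds of memory mappings are written to the dump. As a sketch, assuming the bit layout described in the kernel documentation (bits 0 to 8), the following decodes a mask into names; `decode_coredump_filter` and the short labels are made up for this illustration.

```shell
# Hypothetical helper: decode a coredump_filter value (hex digits, as
# printed by /proc/<pid>/coredump_filter) into the mapping types it enables.
decode_coredump_filter() {
    mask=$(( 0x$1 ))
    bit=0
    for name in anon-private anon-shared file-private file-shared \
                elf-headers hugetlb-private hugetlb-shared \
                dax-private dax-shared; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
        bit=$(( bit + 1 ))
    done
}

# The usual default is 0x33:
decode_coredump_filter 33
# prints: anon-private, anon-shared, elf-headers, hugetlb-private (one per line)
```

For a live process the input would be `$(cat /proc/<pid>/coredump_filter)`. A mask of 0 would mean none of these mapping types are dumped.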
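    UPDATE: since you only have that one dmesg line (and no coredump), this answer might help you get some information out of it. You may need the source code of the glibc 2.26 that CentOS uses, unless you are fine with just reading assembly code ;)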
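One common way to get something out of a lone trap line (offered here as a sketch; it may or may not be exactly what the linked answer does) is to subtract the library's load address from the faulting instruction pointer. Both values are in the kernel message: `ip:` is the instruction pointer and the number before the `+` in the brackets is the start of the mapping. `trap_offset` is a hypothetical helper name, and the result is only an offset into the library if the mapping starts at file offset 0.

```shell
# Hypothetical helper: given a kernel "traps:" line, print the offset of the
# faulting instruction relative to the start of the mapped library.
trap_offset() {
    line=$1
    # Faulting instruction pointer, e.g. 7f22da3609da
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip:\([0-9a-f]*\) .*/\1/p')
    # Mapping start from the trailing "[base+size]", e.g. 7f22da2e3000
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\]$/\1/p')
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

trap_offset 'traps: App.Name[26176] general protection ip:7f22da3609da sp:7f1fedf11510 error:0 in libc-2.26.so[7f22da2e3000+1ad000]'
# prints 0x7d9da
```

The offset can then be looked up in the exact same libc build that was running at the time of the crash, for example with `addr2line -f -e <path to that libc-2.26.so> 0x7d9da` or by disassembling around it. With a different build of the library the offset is meaningless.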

    UPDATE2: try running coredumpctl info 26176; even though there is no core file, you should still see a stack trace, e.g.:
    $ coredumpctl -S '2019-05-04 23:37:56' -U '2019-05-05 23:37:56'
    TIME PID UID GID SIG COREFILE EXE
    Sat 2019-05-04 23:37:56 CEST 3888 0 0 7 missing /usr/bin/mc
    Sat 2019-05-04 23:40:08 CEST 3916 0 0 7 missing /usr/bin/mc
    $ coredumpctl info 3888
    PID: 3888 (mc)
    UID: 0 (root)
    GID: 0 (root)
    Signal: 7 (BUS)
    Timestamp: Sat 2019-05-04 23:37:56 CEST (2 weeks 0 days ago)
    Command Line: mc
    Executable: /usr/bin/mc
    Control Group: /user.slice/user-0.slice/session-5.scope
    Unit: session-5.scope
    Slice: user-0.slice
    Session: 5
    Owner UID: 0 (root)
    Boot ID: ce932e7af1f04bc3af1c9573c70a912d
    Machine ID: 5767ef25f523419aaa049f3d74481940
    Hostname: i87k
    Storage: /var/lib/systemd/coredump/core.mc.0.ce932e7af1f04bc3af1c9573c70a912d.3888.1557005876000000 (inaccessible)
    Message: Process 3888 (mc) of user 0 dumped core.

    Stack trace of thread 3888:
    #0 0x00007f54782d427e __memcmp_avx2_movbe (libc.so.6)
    #1 0x000055db1382fdad n/a (mc)
    #2 0x000055db137cb126 n/a (mc)
    #3 0x000055db1380102d n/a (mc)
    #4 0x000055db13801bff n/a (mc)
    #5 0x000055db137b2d6c n/a (mc)
    #6 0x000055db137b2f65 n/a (mc)
    #7 0x000055db137cc8e2 n/a (mc)
    #8 0x000055db137a6782 n/a (mc)
    #9 0x00007f547819dce3 __libc_start_main (libc.so.6)
    #10 0x000055db137a68fe n/a (mc)

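    Then maybe you can use the trick I mentioned above (in UPDATE) to investigate each address, assuming you have not updated the system since the crash happened!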

    On the topic of ".net - Service segfaults in AWS EC2", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56071337/
