
c - Profiling number-crunching code: 28% of time in fegetexcept() & optimal compiler flags?

Reposted. Author: 行者123. Updated: 2023-12-01 12:48:22

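I am running a DNA strand simulation that involves a lot of floating-point arithmetic. The full code is here: https://github.com/RoaldFre/DNA

After compiling with both gcc and clang, I did some profiling with google-perftools. In both cases, google-perftools reports that about 28% of the time is spent in fegetexcept(). This appears to be a C library function that queries the CPU's floating-point exception flags.

Note that I am using -ffast-math with gcc, which, if I remember correctly, should ignore (all?) floating-point exceptions! I am also using -O4 with clang (is there a separate flag to enable unsafe floating-point instructions?).

Profiling output for the binary compiled with clang: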

Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
263 16.8% 44.9% 263 16.8% cos
224 14.3% 59.3% 224 14.3% Vdihedral
131 8.4% 67.6% 131 8.4% nearestImageVector
102 6.5% 74.2% 102 6.5% Fexclusion
70 4.5% 78.7% 70 4.5% Fdihedral
65 4.2% 82.8% 65 4.2% Fangle
53 3.4% 86.2% 53 3.4% integratorTaskTick
46 2.9% 89.2% 46 2.9% nearestImageDistance
45 2.9% 92.1% 45 2.9% mutiallyExclusivePairForces
32 2.0% 94.1% 32 2.0% FCoulomb
24 1.5% 95.6% 24 1.5% forEveryPairD
17 1.1% 96.7% 17 1.1% calculateForces
14 0.9% 97.6% 14 0.9% atan2
8 0.5% 98.1% 8 0.5% Fstack
8 0.5% 98.7% 8 0.5% pairWrapper
6 0.4% 99.0% 6 0.4% sin
3 0.2% 99.2% 3 0.2% __finite
3 0.2% 99.4% 3 0.2% _init
2 0.1% 99.6% 2 0.1% acos
2 0.1% 99.7% 2 0.1% exp
2 0.1% 99.8% 2 0.1% log
1 0.1% 99.9% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand

Profiling output for the binary compiled with gcc:

Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
352 22.5% 50.6% 352 22.5% nearestImageVector (inline)
263 16.8% 67.5% 263 16.8% cos
131 8.4% 75.8% 131 8.4% measurementTask
52 3.3% 79.2% 331 21.2% Vbond.isra.1.part.2 (inline)
50 3.2% 82.4% 562 36.0% dumpStatsSample.2372 (inline)
46 2.9% 85.3% 46 2.9% nearestImageVector
42 2.7% 88.0% 42 2.7% FdihedralParticle (inline)
33 2.1% 90.1% 59 3.8% Vstack.isra.12.part.13 (inline)
26 1.7% 91.8% 26 1.7% neighbourStackDistance2 (inline)
16 1.0% 92.8% 16 1.0% die
14 0.9% 93.7% 44 2.8% VangleP5SB (inline)
14 0.9% 94.6% 14 0.9% atan2
13 0.8% 95.5% 88 5.6% Vangle.isra.4.part.5 (inline)
12 0.8% 96.2% 343 22.0% Vbond.isra.1 (inline)
8 0.5% 96.7% 8 0.5% printUsage.3788
6 0.4% 97.1% 6 0.4% Fangle.part.8 (inline)
6 0.4% 97.5% 6 0.4% boxFromParticle (inline)
6 0.4% 97.9% 279 17.9% nearestImageDistance (inline)
6 0.4% 98.3% 6 0.4% sin
3 0.2% 98.5% 3 0.2% Vdihedral.isra.9 (inline)
3 0.2% 98.7% 3 0.2% __finite
3 0.2% 98.8% 3 0.2% _init
3 0.2% 99.0% 3 0.2% numParticles (inline)
2 0.1% 99.2% 2 0.1% acos
2 0.1% 99.3% 8 0.5% addToGrid
2 0.1% 99.4% 2 0.1% exp
2 0.1% 99.6% 2 0.1% getAngleBaseInfo (inline)
2 0.1% 99.7% 2 0.1% log
1 0.1% 99.7% 1 0.1% Fbond.part.6 (inline)
1 0.1% 99.8% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 563 36.1% dumpStatsSample.2372
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand
0 0.0% 100.0% 6 0.4% Fangle (inline)
0 0.0% 100.0% 1 0.1% Fbond (inline)
0 0.0% 100.0% 42 2.7% Fdihedral.part.11 (inline)
0 0.0% 100.0% 4 0.3% Fstack.part.14 (inline)
0 0.0% 100.0% 53 3.4% calculateForces.2880
0 0.0% 100.0% 3 0.2% getKineticTemperature (inline)
0 0.0% 100.0% 4 0.3% nearestImageDistance2 (inline)

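Now, I call quite a few functions through function pointers, and gcc is compiling with LTO and -O4. I have reason to believe this may make the profiling output of the gcc binary somewhat unreliable. For example, it reports 16 samples in "die()". That is impossible, because that function halts the program immediately!

Either way, both binaries seem to agree that 28% of the time is spent in fegetexcept(). Can I get rid of all those checks?

Second, my full compiler optimization flags are as follows: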

gcc -march=core2 -O4 -flto -mmmx -msse -msse2 -msse3 -fexcess-precision=fast -ffast-math -finline-limit=2000 -fmerge-all-constants -fmodulo-sched -fmodulo-sched-allow-regmoves -fgcse-sm -fgcse-las -fgcse-after-reload -funsafe-loop-optimizations

clang -march=core2 -O4

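Is there anything I can add to improve performance further? I don't care how long compilation takes; I need every bit of performance I can get! (Regarding clang: I couldn't find many specific performance flags there. Maybe I should go to LLVM bytecode manually and then pass flags to the LLVM compiler at that stage?)

TL;DR:
(1) The code spends 28% of its time in fegetexcept(). Can this be avoided by opting in to "unsafe floating-point code"?
(2) Which flags can I pass to gcc and clang for the best performance, even if that increases compile time?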



Edit

I updated glibc from 2.13-r2 to 2.15-r2, and the profiling output has now changed to:

clang:

Total: 1654 samples
381 23.0% 23.0% 381 23.0% __asin_finite
244 14.8% 37.8% 244 14.8% significand
203 12.3% 50.1% 203 12.3% Vdihedral
141 8.5% 58.6% 141 8.5% nearestImageVector
116 7.0% 65.6% 116 7.0% Fexclusion
81 4.9% 70.5% 81 4.9% integratorTaskTick
70 4.2% 74.7% 70 4.2% Fangle
63 3.8% 78.5% 63 3.8% FdihedralParticle
56 3.4% 81.9% 56 3.4% mutiallyExclusivePairForces
45 2.7% 84.6% 45 2.7% FCoulomb
42 2.5% 87.2% 42 2.5% _init
39 2.4% 89.5% 39 2.4% __isinf
35 2.1% 91.7% 35 2.1% nearestImageDistance
29 1.8% 93.4% 29 1.8% __lgamma_r_finite
21 1.3% 94.7% 21 1.3% forEveryPairD
16 1.0% 95.6% 16 1.0% Fbond
13 0.8% 96.4% 13 0.8% __isnan
11 0.7% 97.1% 11 0.7% __cosh_finite
10 0.6% 97.7% 10 0.6% Fstack
10 0.6% 98.3% 10 0.6% __acosh_finite
9 0.5% 98.9% 9 0.5% pairWrapper
6 0.4% 99.2% 6 0.4% atan2
5 0.3% 99.5% 5 0.3% Fdihedral
5 0.3% 99.8% 5 0.3% calculateForces
2 0.1% 99.9% 2 0.1% GLIBC_2.15
1 0.1% 100.0% 1 0.1% exp

gcc:

Total: 1768 samples
385 21.8% 21.8% 385 21.8% __asin_finite
275 15.6% 37.3% 275 15.6% significand
252 14.3% 51.6% 252 14.3% nearestImageVector
199 11.3% 62.8% 299 16.9% Vdihedral.isra.4.part.5.2808
55 3.1% 66.0% 902 51.0% FdihedralParticle.2836
47 2.7% 68.6% 150 8.5% Fexclusion.part.15 (inline)
44 2.5% 71.1% 87 4.9% FCoulomb.part.16.2891
36 2.0% 73.1% 36 2.0% _init
33 1.9% 75.0% 236 13.3% mutiallyExclusivePairForces.2699
30 1.7% 76.7% 30 1.7% __lgamma_r_finite
29 1.6% 78.3% 29 1.6% isSaneNumber (inline)
28 1.6% 79.9% 28 1.6% feelExclusion (inline)
27 1.5% 81.4% 27 1.5% __isinf
25 1.4% 82.9% 35 2.0% Fangle.part.11.2855
22 1.2% 84.1% 40 2.3% Fangle.part.11 (inline)
22 1.2% 85.4% 24 1.4% randNorm.part.1.3194
20 1.1% 86.5% 20 1.1% __isnan
19 1.1% 87.6% 23 1.3% nearestImageUnitVector (inline)
19 1.1% 88.6% 19 1.1% pairWrapper.3570
18 1.0% 89.6% 105 5.9% langevinBBKhelper.3161
17 1.0% 90.6% 23 1.3% Fbond.part.10 (inline)
15 0.8% 91.5% 20 1.1% Vdihedral.isra.4.part.5 (inline)
14 0.8% 92.3% 14 0.8% length (inline)
13 0.7% 93.0% 13 0.7% getBasePairInfo (inline)
12 0.7% 93.7% 190 10.7% Fexclusion (inline)
12 0.7% 94.3% 15 0.8% Fstack.part.13 (inline)
12 0.7% 95.0% 12 0.7% __acosh_finite
12 0.7% 95.7% 12 0.7% reboxParticles (inline)
9 0.5% 96.2% 23 1.3% randNorm (inline)
8 0.5% 96.7% 8 0.5% __cosh_finite
8 0.5% 97.1% 221 12.5% visitNeighbours.part.1 (inline)
7 0.4% 97.5% 360 20.4% forEveryPairD
7 0.4% 97.9% 7 0.4% sincos
6 0.3% 98.2% 1467 83.0% calculateForces
5 0.3% 98.5% 943 53.3% Fdihedral.part.12 (inline)
4 0.2% 98.8% 33 1.9% debugVectorSanity (inline)
4 0.2% 99.0% 19 1.1% nearestImageDistance (inline)
3 0.2% 99.2% 317 17.9% Vdihedral.isra.4 (inline)
3 0.2% 99.3% 3 0.2% getAngleBaseInfo (inline)
2 0.1% 99.4% 2 0.1% resetForce.2703
2 0.1% 99.5% 2 0.1% tinymt64_generate_doubleOC (inline)
2 0.1% 99.7% 223 12.6% visitNeighbours (inline)
1 0.1% 99.7% 94 5.3% Fangle (inline)
1 0.1% 99.8% 1 0.1% calcInvDebyeLength (inline)
1 0.1% 99.8% 1 0.1% forEveryParticle
1 0.1% 99.9% 131 7.4% forEveryParticleD
1 0.1% 99.9% 1 0.1% munmap
1 0.1% 100.0% 1 0.1% neighbourStackDistance2 (inline)
0 0.0% 100.0% 1 0.1% 0x3e1341e250d56f1d
0 0.0% 100.0% 23 1.3% Fbond (inline)
0 0.0% 100.0% 1690 95.6% __libc_start_main
0 0.0% 100.0% 380 21.5% forEveryPair (inline)
0 0.0% 100.0% 1689 95.5% integratorTaskTick.3198
0 0.0% 100.0% 1690 95.6% main
0 0.0% 100.0% 1690 95.6% run (inline)
0 0.0% 100.0% 1689 95.5% seqTick.2114
0 0.0% 100.0% 1 0.1% taskStop (inline)
0 0.0% 100.0% 1689 95.5% taskTick (inline)

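So it looks like fegetexcept was probably just a wrong name that got attributed to some code inside the glibc math routines. I guess this is a shortcoming of google-perftools?

Part (2) of my question still stands, though: which flags can I pass to gcc and clang for the best performance, even if that increases compile time?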



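Edit 2

Using 'perf' (see, e.g., https://stackoverflow.com/a/10958510/153105) gives decent profiling output. It looks like most of the time is spent in atan2() and cos(), using the sse2 versions. For completeness, I'll add the output: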

# Events: 17K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... .................... .......................................................
#
21.67% hairpin libm-2.15.so [.] __ieee754_atan2_sse2
14.12% hairpin hairpin [.] nearestImageVector
13.94% hairpin libm-2.15.so [.] __cos_sse2
11.94% hairpin hairpin [.] Vdihedral.isra.4.part.5.2808
8.27% hairpin hairpin [.] mutiallyExclusivePairForces.2699
4.81% hairpin hairpin [.] calculateForces
4.45% hairpin hairpin [.] FdihedralParticle.2836
3.89% hairpin hairpin [.] FCoulomb.part.16.2891
2.17% hairpin hairpin [.] langevinBBKhelper.3161
1.85% hairpin hairpin [.] Fangle.part.11.2855
1.83% hairpin libc-2.15.so [.] __isinf
1.64% hairpin hairpin [.] randNorm.part.1.3194
1.45% hairpin libm-2.15.so [.] __ieee754_log_sse2
1.02% hairpin hairpin [.] forEveryPairD
0.93% hairpin libm-2.15.so [.] __ieee754_acos_sse2
0.76% hairpin hairpin [.] pairWrapper.3570
0.76% hairpin hairpin [.] __isnan@plt
0.74% hairpin libc-2.15.so [.] __isnan
0.68% hairpin hairpin [.] __isinf@plt
0.59% hairpin libm-2.15.so [.] __ieee754_exp_sse2
0.58% hairpin libm-2.15.so [.] __sincos
0.55% hairpin hairpin [.] integratorTaskTick.3198
0.29% hairpin hairpin [.] __atan2_finite@plt
0.23% hairpin hairpin [.] cos@plt
0.19% hairpin libm-2.15.so [.] csloww1
0.07% hairpin hairpin [.] resetForce.2703
0.07% hairpin hairpin [.] forEveryParticle
0.06% hairpin libm-2.15.so [.] __dubcos
0.05% hairpin [kernel.kallsyms] [k] mutex_unlock
0.02% hairpin hairpin [.] __log_finite@plt
0.02% hairpin hairpin [.] forEveryParticleD
0.02% hairpin [kernel.kallsyms] [k] do_raw_spin_lock
0.02% hairpin hairpin [.] __acos_finite@plt
0.02% hairpin [kernel.kallsyms] [k] update_cpu_load
0.01% hairpin [kernel.kallsyms] [k] tick_sched_timer
0.01% hairpin [kernel.kallsyms] [k] ktime_get
0.01% hairpin hairpin [.] __exp_finite@plt
0.01% hairpin [kernel.kallsyms] [k] run_timer_softirq
0.01% hairpin [kernel.kallsyms] [k] apic_timer_interrupt
0.01% hairpin [kernel.kallsyms] [k] __cycles_2_ns
0.01% hairpin [kernel.kallsyms] [k] __local_bh_enable
0.01% hairpin [kernel.kallsyms] [k] intel_pmu_disable_all
0.01% hairpin [kernel.kallsyms] [k] r100_mm_rreg
0.01% hairpin [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% hairpin [kernel.kallsyms] [k] update_stats_wait_end.clone.15
0.01% hairpin [kernel.kallsyms] [k] ttwu_do_activate.clone.50
0.01% hairpin [kernel.kallsyms] [k] do_signal
0.01% hairpin [kernel.kallsyms] [k] tty_hung_up_p
0.01% hairpin hairpin [.] main
0.01% hairpin [kernel.kallsyms] [k] prepare_signal
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileData::Evict(ProfileData::Entry const&)
0.01% hairpin [kernel.kallsyms] [k] uhci_check_ports
0.01% hairpin [kernel.kallsyms] [k] copy_siginfo_to_user
0.01% hairpin [kernel.kallsyms] [k] fxrstor_checking
0.01% hairpin [kernel.kallsyms] [k] calc_global_load
0.01% hairpin [kernel.kallsyms] [k] account_group_user_time
0.01% hairpin [kernel.kallsyms] [k] tg_load_down
0.01% hairpin [kernel.kallsyms] [k] irq_enter
0.01% hairpin [kernel.kallsyms] [k] __schedule
0.01% hairpin [kernel.kallsyms] [k] n_tty_write
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileHandler::SignalHandler(int, siginfo*, void*)
0.01% hairpin [kernel.kallsyms] [k] get_cycles
0.01% hairpin [kernel.kallsyms] [k] enqueue_hrtimer
0.01% hairpin hairpin [.] seqTick.2114
0.01% hairpin [kernel.kallsyms] [k] idle_cpu
0.01% hairpin hairpin [.] sincos@plt
0.01% hairpin [kernel.kallsyms] [k] tick_program_event
0.01% hairpin [kernel.kallsyms] [k] clear_page_c
0.01% hairpin [kernel.kallsyms] [k] number.clone.1
0.01% hairpin [kernel.kallsyms] [k] task_waking_fair
0.01% hairpin [kernel.kallsyms] [k] save_i387_xstate
0.01% hairpin [kernel.kallsyms] [k] __rcu_pending
0.01% hairpin [kernel.kallsyms] [k] jiffies_to_timeval
0.01% hairpin [kernel.kallsyms] [k] iowrite16
0.01% hairpin [kernel.kallsyms] [k] hrtimer_interrupt
0.01% hairpin [kernel.kallsyms] [k] finish_task_switch
0.01% hairpin [kernel.kallsyms] [k] clockevents_program_event
0.01% hairpin [kernel.kallsyms] [k] ioread16
0.01% hairpin [kernel.kallsyms] [k] lapic_next_event
0.00% hairpin [kernel.kallsyms] [k] read_tsc
0.00% hairpin [kernel.kallsyms] [k] __zone_watermark_ok
0.00% hairpin libpthread-2.15.so [.] __libc_read
0.00% hairpin [kernel.kallsyms] [k] intel_pmu_enable_all

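Best answer

You should use perf together with Brendan Gregg's scripts to create a flame graph, which gives you a good visual representation of where the time is going. A flame graph will make it obvious which functions are calling fegetexcept, because it is a way of visualizing and summarizing call stacks: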

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

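CPU sampling without call stacks is usually useless.

Make sure you have symbols installed for everything, because many profilers associate a sample with the nearest exported name, which can be badly misleading. That may explain why "die" shows up with some samples.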
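To illustrate the misattribution described above, here is a minimal sketch of how a profiler without full symbols might map a sampled address to a name. The table, the addresses, and the function name nearest_exported_symbol are all made up for illustration; this is not how google-perftools is actually implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical exported-symbol table. */
struct symbol {
    uintptr_t   addr;  /* start address of the exported function */
    const char *name;
};

/*
 * Resolve a sampled program counter the naive way: pick the exported
 * symbol with the highest start address that is still <= pc.
 * The table must be sorted by ascending address.
 * Returns NULL if pc lies below the first symbol.
 */
const char *nearest_exported_symbol(const struct symbol *table, size_t n,
                                    uintptr_t pc)
{
    const char *best = NULL;
    for (size_t i = 0; i < n && table[i].addr <= pc; i++)
        best = table[i].name;
    return best;
}

/* Made-up layout: an unexported static helper sits between "die" and
 * "calculateForces", so it has no entry of its own in the table. */
const struct symbol example_table[] = {
    { 0x1000, "main" },
    { 0x1200, "die" },
    /* 0x1240..0x17ff: unexported helper, invisible to the lookup */
    { 0x1800, "calculateForces" },
};
```

With this layout, a sample taken at address 0x1500 really belongs to the unexported helper, but the lookup reports it as die. A function that exits the program immediately could pick up samples in exactly this way.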

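You can also try loading the program into gdb and setting a breakpoint on fegetexcept. If you have both symbols and source code installed for libc, you can walk up the call stack and see why fegetexcept is being called. My guess is that you are passing out-of-range values to acos or something similar.

This article discusses how to install symbols and source for libc: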

https://randomascii.wordpress.com/2013/01/08/symbols-on-linux-part-one-g-library-symbols/
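On the guess above that out-of-range values reach acos: if that turns out to be the case, one common remedy is to clamp the argument first, since rounding can push a normalized dot product slightly outside [-1, 1]. The helper below is only a sketch under that assumption; clamp_unit is a hypothetical name and does not come from the DNA code base.

```c
/*
 * Clamp x into [-1, 1], the domain of acos() and asin().
 * Under default IEEE semantics a NaN fails both comparisons and is
 * returned unchanged, so genuine bugs still surface. With -ffast-math
 * (which implies -ffinite-math-only) NaN handling is not guaranteed.
 */
double clamp_unit(double x)
{
    if (x > 1.0)
        return 1.0;
    if (x < -1.0)
        return -1.0;
    return x;
}
```

A call site would then look like acos(clamp_unit(c)), where c is the cosine computed from the vectors. Whether this is appropriate depends on confirming, for example with the gdb breakpoint described above, that the slow path is really triggered by arguments just outside the valid range.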

Regarding c - Profiling number-crunching code: 28% of time in fegetexcept() & optimal compiler flags?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12438726/
