While answering a question on Code Review, I discovered an interesting (as in, large) difference in performance between x64 and x86.
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<ModVsOptimization>();
        Console.ReadLine();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static public ulong Mersenne5(ulong dividend)
    {
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        return dividend;
    }
}

public class ModVsOptimization
{
    [Benchmark(Baseline = true)]
    public ulong RawModulo_5()
    {
        ulong r = 0;
        for (ulong i = 0; i < 1000; i++)
        {
            r += i % 5;
        }
        return r;
    }

    [Benchmark]
    public ulong OptimizedModulo_ViaMethod_5()
    {
        ulong r = 0;
        for (ulong i = 0; i < 1000; i++)
        {
            r += Program.Mersenne5(i);
        }
        return r;
    }
}
// * Summary *
BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0
DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0
Method | Mean | Error | StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
RawModulo_5 | 4.601 us | 0.0121 us | 0.0107 us | 1.00 |
OptimizedModulo_ViaMethod_5 | 7.990 us | 0.0060 us | 0.0053 us | 1.74 |
// * Hints *
Outliers
ModVsOptimization.RawModulo_5: Default -> 1 outlier was removed
ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 1 outlier was removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
1 us : 1 Microsecond (0.000001 sec)
// ***** BenchmarkRunner: End *****
// * Summary *
BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
[Host] : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0
DefaultJob : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0
Method | Mean | Error | StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
RawModulo_5 | 8.323 us | 0.0042 us | 0.0039 us | 1.00 |
OptimizedModulo_ViaMethod_5 | 2.597 us | 0.0956 us | 0.0982 us | 0.31 |
// * Hints *
Outliers
ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 2 outliers were removed
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
1 us : 1 Microsecond (0.000001 sec)
// ***** BenchmarkRunner: End *****
The x86 and x64 assemblies have identical IL for the RawModulo_5 method:
.method public hidebysig instance uint64
RawModulo_5() cil managed
{
.custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 01 00 54 02 08 42 61 73 65 6C 69 6E 65 01 ) // ....T..Baseline.
// Code size 31 (0x1f)
.maxstack 3
.locals init ([0] uint64 r,
[1] uint64 i)
IL_0000: ldc.i4.0
IL_0001: conv.i8
IL_0002: stloc.0
IL_0003: ldc.i4.0
IL_0004: conv.i8
IL_0005: stloc.1
IL_0006: br.s IL_0014
IL_0008: ldloc.0
IL_0009: ldloc.1
IL_000a: ldc.i4.5
IL_000b: conv.i8
IL_000c: rem.un
IL_000d: add
IL_000e: stloc.0
IL_000f: ldloc.1
IL_0010: ldc.i4.1
IL_0011: conv.i8
IL_0012: add
IL_0013: stloc.1
IL_0014: ldloc.1
IL_0015: ldc.i4 0x3e8
IL_001a: conv.i8
IL_001b: blt.un.s IL_0008
IL_001d: ldloc.0
IL_001e: ret
} // end of method ModVsOptimization::RawModulo_5
Best answer
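I wanted to analyze the generated assembly to see what was going on. I grabbed your sample code and ran it in Release mode, using Visual Studio 2015 with .NET Framework 4.5.2. The CPU is an Intel Ivy Bridge i5-3570K, in case the JIT makes very specific optimizations for it. I ran the same tests without your benchmark suite, using only a simple Stopwatch and dividing the elapsed ticks by the number of iterations. Here is what I observed: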
RawModulo_5, x86: 13721978 ticks, 13.721978 ticks per iteration
OptimizedModulo_ViaMethod_5, x86: 24641039 ticks, 24.641039 ticks per iteration
RawModulo_5, x64: 23275799 ticks, 23.275799 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 13389012 ticks, 13.389012 ticks per iteration
By these numbers, RawModulo_5 is nearly twice as slow in x64 (about 1.7x), while OptimizedModulo_ViaMethod_5 is about 1.8x faster in x64.
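The Stopwatch measurement described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the timings: the class name StopwatchTiming, the method TicksPerIteration, and the single warm-up call are all assumptions.

```csharp
using System;
using System.Diagnostics;

public static class StopwatchTiming
{
    // Hypothetical helper: runs `body` `iterations` times and returns the
    // average number of Stopwatch ticks per call.
    public static double TicksPerIteration(Func<ulong> body, int iterations)
    {
        if (iterations <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(iterations));
        }

        // One untimed call so that JIT compilation is not part of the measurement.
        ulong sink = body();

        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            // Accumulate the result so the call is not treated as dead code.
            sink += body();
        }
        sw.Stop();

        // Keep the accumulated value observable after the loop.
        GC.KeepAlive(sink);
        return (double)sw.ElapsedTicks / iterations;
    }
}
```

A call such as StopwatchTiming.TicksPerIteration(new ModVsOptimization().RawModulo_5, 1000000) would then give a ticks-per-iteration figure comparable in kind to the ones listed above. The tick length depends on Stopwatch.Frequency on the machine in question.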
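Before going further, I checked whether the outputs of RawModulo_5 and OptimizedModulo_ViaMethod_5 are equal, and they are not! A corrected Mersenne5 implementation is as follows: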
static public ulong Mersenne5(ulong dividend)
{
    dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
    dividend = (dividend >> 16) + (dividend & 0xFFFF);
    dividend = (dividend >> 8) + (dividend & 0xFF);
    dividend = (dividend >> 4) + (dividend & 0xF);
    // there was an extra shift by 4 here
    if (dividend > 14) { dividend = dividend - 15; } // mod 15
    // the 9 used to be a 10
    if (dividend > 9) { dividend = dividend - 10; }
    if (dividend > 4) { dividend = dividend - 5; }
    return dividend;
}
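One way to check these variants is to compare each against i % 5 directly. The sketch below is illustrative and not part of the original post; the class Mod5Check and all of its method names are hypothetical. Working it through by hand, the question's version first disagrees at i = 10 (it returns 5). The version above agrees for every i below 1000, which covers the benchmark loop, but returns 5 for i = 20735. So dropping the second 4-bit fold appears to be safe only for small inputs. A variant that keeps both folds and uses the threshold 9 is included for comparison.

```csharp
using System;

public static class Mod5Check
{
    // Range reduction shared by all variants. Each fold adds the high and low
    // halves, which preserves the value modulo 15 because 2^32, 2^16, 2^8 and
    // 2^4 are all congruent to 1 modulo 15.
    private static ulong Fold(ulong d, int fourBitFolds)
    {
        d = (d >> 32) + (d & 0xFFFFFFFF);
        d = (d >> 16) + (d & 0xFFFF);
        d = (d >> 8) + (d & 0xFF);
        for (int n = 0; n < fourBitFolds; n++)
        {
            d = (d >> 4) + (d & 0xF);
        }
        return d;
    }

    // The three conditional subtractions; `tenThreshold` is 10 in the
    // question's code and 9 in the answer's code.
    private static ulong Reduce(ulong d, ulong tenThreshold)
    {
        if (d > 14) { d -= 15; }
        if (d > tenThreshold) { d -= 10; }
        if (d > 4) { d -= 5; }
        return d;
    }

    // Two 4-bit folds, threshold 10 (as posted in the question).
    public static ulong QuestionVersion(ulong d) { return Reduce(Fold(d, 2), 10); }

    // One 4-bit fold, threshold 9 (as posted in the answer).
    public static ulong AnswerVersion(ulong d) { return Reduce(Fold(d, 1), 9); }

    // Hypothetical variant for comparison: two 4-bit folds, threshold 9.
    public static ulong TwoFoldsThreshold9(ulong d) { return Reduce(Fold(d, 2), 9); }

    // Returns the first i in [0, limit) where `candidate` differs from i % 5,
    // or null when every value in the range agrees.
    public static ulong? FirstMismatch(Func<ulong, ulong> candidate, ulong limit)
    {
        for (ulong i = 0; i < limit; i++)
        {
            if (candidate(i) != i % 5) { return i; }
        }
        return null;
    }
}
```

For example, Mod5Check.FirstMismatch(Mod5Check.AnswerVersion, 1000) finds no mismatch, so the timings for the 1000-iteration loop are unaffected either way.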
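To get the disassembly, I added System.Diagnostics.Debugger.Break() just before the loops and at the start of the Mersenne5 body, so that I had a definite breakpoint at which to grab the generated assembly. By the way, you can get the generated assembly from the Visual Studio UI: when stopped at a breakpoint, right-click the code editor window and choose "Go To Disassembly" from the context menu. I have annotated the assembly to explain what it is doing. Sorry for the crazy syntax highlighting.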
x86, RawModulo_5:
System.Diagnostics.Debugger.Break();
00242DA2 in al,dx
00242DA3 push edi
00242DA4 push ebx
00242DA5 sub esp,10h
00242DA8 call 6D4C0178
ulong r = 0;
00242DAD mov dword ptr [ebp-10h],0 ; setting the low and high dwords of 'r'
00242DB4 mov dword ptr [ebp-0Ch],0
for (ulong i = 0; i < 1000; i++)
; set the high dword of 'i' to 0
00242DBB mov dword ptr [ebp-14h],0
; clear the low dword of 'i' to 0 - the compiler is using 'edi' as the loop iteration var
00242DC2 xor edi,edi
{
r += i % 5;
00242DC4 mov eax,edi
00242DC6 mov edx,dword ptr [ebp-14h]
; edx:eax together are the high and low dwords of 'i', respectively
; this is a short circuit trick so it can avoid working with the high
; dword - you can see it jumps halfway in to the div/mod operation below
00242DC9 mov ecx,5
00242DCE cmp edx,ecx
00242DD0 jb 00242DDC
; 64 bit div/mod operation
00242DD2 mov ebx,eax
00242DD4 mov eax,edx
00242DD6 xor edx,edx
00242DD8 div eax,ecx
00242DDA mov eax,ebx
00242DDC div eax,ecx
00242DDE mov eax,edx
00242DE0 xor edx,edx
; load the current low and high dwords from 'r', then add into
; edx:eax as a pair forming a qword
00242DE2 add eax,dword ptr [ebp-10h]
00242DE5 adc edx,dword ptr [ebp-0Ch]
; store the result back in 'r'
00242DE8 mov dword ptr [ebp-10h],eax
00242DEB mov dword ptr [ebp-0Ch],edx
for (ulong i = 0; i < 1000; i++)
; load the loop variable low and high dwords into edx:eax
00242DEE mov eax,edi
00242DF0 mov edx,dword ptr [ebp-14h]
; increment eax (the low dword) and propagate any carries to
; edx (the high dword)
00242DF3 add eax,1
00242DF6 adc edx,0
; store the low and high dwords back to the high word of 'i' and
; the loop iteration counter, 'edi'
00242DF9 mov dword ptr [ebp-14h],edx
00242DFC mov edi,eax
; test the high dword
00242DFE cmp dword ptr [ebp-14h],0
00242E02 ja 00242E0E
00242E04 jb 00242DC4
; (int) i < 1000
00242E06 cmp edi,3E8h
00242E0C jb 00242DC4
}
return r;
; retrieve the current value of 'r' from memory, return value is
; in edx:eax since the return value is 64 bits
00242E0E mov eax,dword ptr [ebp-10h]
00242E11 mov edx,dword ptr [ebp-0Ch]
00242E14 lea esp,[ebp-8]
00242E17 pop ebx
00242E18 pop edi
00242E19 pop ebp
00242E1A ret
x86, OptimizedModulo_ViaMethod_5:
System.Diagnostics.Debugger.Break();
00242E33 push edi
00242E34 push esi
00242E35 push ebx
00242E36 sub esp,8
00242E39 call 6D4C0178
ulong r = 0;
; same as above, initialize 'r' to zero using low and high dwords
00242E3E mov dword ptr [ebp-10h],0
; this time we're using edi:esi as the loop counter, rather than
; edi and a memory location. probably less register pressure in this
; function, for reasons we'll see...
00242E45 xor ebx,ebx
for (ulong i = 0; i < 1000; i++)
; initialize 'i' to 0, esi is the loop counter low dword, edi is the high dword
00242E47 xor esi,esi
00242E49 xor edi,edi
; push 'i' to the stack, high word then low word
00242E4B push edi
00242E4C push esi
; call Mersenne5 - it got put in the data section since it's static
00242E4D call dword ptr ds:[3D7830h]
; return value comes back as edx:eax, where edx is the high dword
; ebx is the existing low dword of 'r', so it's accumulated into eax
00242E53 add eax,ebx
; the high dword of 'r' is at ebp-10, that gets accumulated to edx with
; the carry result of the last add since it's 64 bits wide
00242E55 adc edx,dword ptr [ebp-10h]
; store edx:ebx back to 'r'
00242E58 mov dword ptr [ebp-10h],edx
00242E5B mov ebx,eax
; increment the loop counter and carry to edi as well, 64 bit add
00242E5D add esi,1
00242E60 adc edi,0
; make sure edi == 0 since it's the high dword
00242E63 test edi,edi
00242E65 ja 00242E71
00242E67 jb 00242E4B
; (int) i < 1000
00242E69 cmp esi,3E8h
00242E6F jb 00242E4B
}
return r;
; move 'r' to edx:eax to return them
00242E71 mov eax,ebx
00242E73 mov edx,dword ptr [ebp-10h]
00242E76 lea esp,[ebp-0Ch]
00242E79 pop ebx
00242E7A pop esi
00242E7B pop edi
00242E7C pop ebp
00242E7D ret
x86, Mersenne5:
System.Diagnostics.Debugger.Break();
00342E92 in al,dx
00342E93 push edi
00342E94 push esi
; esi is the low dword, edi is the high dword of the 64 bit argument
00342E95 mov esi,dword ptr [ebp+8]
00342E98 mov edi,dword ptr [ebp+0Ch]
00342E9B call 6D4C0178
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; this is a LOT of instructions for each step, but at least it's all registers.
; copy edi:esi to edx:eax
00342EA0 mov eax,esi
00342EA2 mov edx,edi
; clobber eax with edx, so now both are the high word. this is a
; shorthand for a 32 bit shift right of a 64 bit number.
00342EA4 mov eax,edx
; clear the high word now that we've moved the high word to the low word
00342EA6 xor edx,edx
; clear the high word of the original 'dividend', same as masking the low 32 bits
00342EA8 xor edi,edi
; (dividend >> 32) + (dividend & 0xFFFFFFFF)
; it's a 64 bit add, so it's the usual add/adc
00342EAA add eax,esi
00342EAC adc edx,edi
; 'dividend' now equals the temporary "variable" that held the addition result
00342EAE mov esi,eax
00342EB0 mov edi,edx
dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same idea as above, but with an actual shift and mask since it's not 32 bits wide
00342EB2 mov eax,esi
00342EB4 mov edx,edi
00342EB6 shrd eax,edx,10h
00342EBA shr edx,10h
00342EBD and esi,0FFFFh
00342EC3 xor edi,edi
00342EC5 add eax,esi
00342EC7 adc edx,edi
00342EC9 mov esi,eax
00342ECB mov edi,edx
dividend = (dividend >> 8) + (dividend & 0xFF);
; same idea, keep going down...
00342ECD mov eax,esi
00342ECF mov edx,edi
00342ED1 shrd eax,edx,8
00342ED5 shr edx,8
00342ED8 and esi,0FFh
00342EDE xor edi,edi
00342EE0 add eax,esi
00342EE2 adc edx,edi
00342EE4 mov esi,eax
00342EE6 mov edi,edx
dividend = (dividend >> 4) + (dividend & 0xF);
00342EE8 mov eax,esi
00342EEA mov edx,edi
00342EEC shrd eax,edx,4
00342EF0 shr edx,4
00342EF3 and esi,0Fh
00342EF6 xor edi,edi
00342EF8 add eax,esi
00342EFA adc edx,edi
00342EFC mov esi,eax
00342EFE mov edi,edx
dividend = (dividend >> 4) + (dividend & 0xF);
00342F00 mov eax,esi
00342F02 mov edx,edi
00342F04 shrd eax,edx,4
00342F08 shr edx,4
00342F0B and esi,0Fh
00342F0E xor edi,edi
00342F10 add eax,esi
00342F12 adc edx,edi
00342F14 mov esi,eax
00342F16 mov edi,edx
if (dividend > 14) { dividend = dividend - 15; } // mod 15
; conditional subtraction
00342F18 test edi,edi
00342F1A ja 00342F23
00342F1C jb 00342F29
; 'dividend' > 14
00342F1E cmp esi,0Eh
00342F21 jbe 00342F29
; 'dividend' = 'dividend' - 15
00342F23 sub esi,0Fh
; subtraction borrow from high word
00342F26 sbb edi,0
if (dividend > 10) { dividend = dividend - 10; }
; same gist for the next two
00342F29 test edi,edi
00342F2B ja 00342F34
00342F2D jb 00342F3A
00342F2F cmp esi,0Ah
00342F32 jbe 00342F3A
00342F34 sub esi,0Ah
00342F37 sbb edi,0
if (dividend > 4) { dividend = dividend - 5; }
00342F3A test edi,edi
00342F3C ja 00342F45
00342F3E jb 00342F4B
00342F40 cmp esi,4
00342F43 jbe 00342F4B
00342F45 sub esi,5
00342F48 sbb edi,0
return dividend;
; move edi:esi into edx:eax for return
00342F4B mov eax,esi
00342F4D mov edx,edi
00342F4F pop esi
00342F50 pop edi
00342F51 pop ebp
00342F52 ret 8
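The first thing to notice is that Mersenne5 is not actually inlined, even though it is marked AggressiveInlining. My guess is that inlining the function into OptimizedModulo_ViaMethod_5 would cause horrible register spilling, and the resulting memory reads and writes would completely defeat the point of inlining the method, so the compiler chose (quite wisely!) not to do it.
Second, Mersenne5 gets call'd 1000 times by OptimizedModulo_ViaMethod_5, so there are 1000 instances of extra call/ret overhead, including the pushes and pops needed to preserve register state across the call boundary.
Third, RawModulo_5 makes no outside calls at all, and even the 64-bit division is optimized so that it skips the high dword wherever possible.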
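The split division in the x86 RawModulo_5 listing (the cmp/jb short circuit followed by one or two div instructions) can be modeled in C# roughly as below. This is a sketch of the technique under my reading of the listing, not the JIT's actual implementation. The names SplitDivision and Rem64By32 are hypothetical, and the final 64-bit % stands in for the hardware's 64-by-32-bit div on edx:eax.

```csharp
using System;

public static class SplitDivision
{
    // Computes value % divisor using at most two 32-bit-divisor steps,
    // mirroring how the x86 code handles a 64-bit dividend in edx:eax.
    public static uint Rem64By32(ulong value, uint divisor)
    {
        if (divisor == 0)
        {
            throw new DivideByZeroException();
        }

        uint high = unchecked((uint)(value >> 32)); // edx
        uint low = unchecked((uint)value);          // eax

        // Short circuit: when the high dword is already smaller than the
        // divisor it is its own remainder, so the first div is skipped.
        uint carry = high < divisor ? high : high % divisor;

        // Second step: divide carry:low by the divisor and keep the remainder.
        ulong partial = ((ulong)carry << 32) | low;
        return (uint)(partial % divisor);
    }
}
```

In the benchmark loop i never exceeds 999, so the high dword is always 0 and only the second step runs on each iteration.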
x64, RawModulo_5:
System.Diagnostics.Debugger.Break();
000007FE98C93CF0 sub rsp,28h
000007FE98C93CF4 call 000007FEF7B079C0
ulong r = 0;
; writing a 32-bit register on x64 zero-extends into the upper dword, so
; xor ecx,ecx clears all of rcx. this is 'r'
000007FE98C93CF9 xor ecx,ecx
for (ulong i = 0; i < 1000; i++)
; same here, this is 'i'
000007FE98C93CFB xor r8d,r8d
{
r += i % 5;
; load 5 as a dword to the low dword of r9
000007FE98C93CFE mov r9d,5
; copy the loop counter to rax for the div below
000007FE98C93D04 mov rax,r8
; clear rdx before the div - writing edx zero-extends into the upper dword
000007FE98C93D07 xor edx,edx
; 64 bit div/mod in one instruction! but it's slow!
000007FE98C93D09 div rax,r9
; rax = quotient, rdx = remainder
; throw away the quotient since we're just doing mod, and accumulate the
; modulus into 'r'
000007FE98C93D0C add rcx,rdx
for (ulong i = 0; i < 1000; i++)
; 64 bit increment to the loop counter
000007FE98C93D0F inc r8
; i < 1000
000007FE98C93D12 cmp r8,3E8h
000007FE98C93D19 jb 000007FE98C93CFE
}
return r;
; return 'r' in rax, since we can directly return a 64 bit var in one register now
000007FE98C93D1B mov rax,rcx
000007FE98C93D1E add rsp,28h
000007FE98C93D22 ret
x64, OptimizedModulo_ViaMethod_5:
System.Diagnostics.Debugger.Break();
000007FE98C94040 push rdi
000007FE98C94041 push rsi
000007FE98C94042 sub rsp,28h
000007FE98C94046 call 000007FEF7B079C0
ulong r = 0;
; same general loop setup as above
000007FE98C9404B xor esi,esi
for (ulong i = 0; i < 1000; i++)
; 'edi' is the loop counter
000007FE98C9404D xor edi,edi
; put rdi in rcx, which is the x64 register used for the first argument
; in a call
000007FE98C9404F mov rcx,rdi
; call Mersenne5 - still no actual inlining!
000007FE98C94052 call 000007FE98C90F40
; accumulate 'r' with the return value of Mersenne5
000007FE98C94057 add rax,rsi
; store back to 'r' - I don't know why in the world the compiler did this
; seems like add rsi, rax would be better, but maybe there's a pipelining
; issue I'm not seeing.
000007FE98C9405A mov rsi,rax
; increment loop counter
000007FE98C9405D inc rdi
; i < 1000
000007FE98C94060 cmp rdi,3E8h
000007FE98C94067 jb 000007FE98C9404F
}
return r;
; put return value in rax like before
000007FE98C94069 mov rax,rsi
000007FE98C9406C add rsp,28h
000007FE98C94070 pop rsi
000007FE98C94071 pop rdi
000007FE98C94072 ret
x64, Mersenne5:
System.Diagnostics.Debugger.Break();
000007FE98C94580 push rsi
000007FE98C94581 sub rsp,20h
000007FE98C94585 mov rsi,rcx
000007FE98C94588 call 000007FEF7B079C0
dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; pretty similar to before actually, except this time we do a real
; shift and mask for the 32 bit part
000007FE98C9458D mov rax,rsi
; 'dividend' >> 32
000007FE98C94590 shr rax,20h
; hilariously, we have to load the mask into edx first. this is because
; there is no AND r/64, imm64 in x64
000007FE98C94594 mov edx,0FFFFFFFFh
000007FE98C94599 and rsi,rdx
; add the shift and the masked versions together
000007FE98C9459C add rax,rsi
000007FE98C9459F mov rsi,rax
dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same logic continues down
000007FE98C945A2 mov rax,rsi
000007FE98C945A5 shr rax,10h
000007FE98C945A9 mov rdx,rsi
000007FE98C945AC and rdx,0FFFFh
000007FE98C945B3 add rax,rdx
; note the redundant moves that happen every time, rax into rsi, rsi
; into rax. so there's still not ideal x64 being generated.
000007FE98C945B6 mov rsi,rax
dividend = (dividend >> 8) + (dividend & 0xFF);
000007FE98C945B9 mov rax,rsi
000007FE98C945BC shr rax,8
000007FE98C945C0 mov rdx,rsi
000007FE98C945C3 and rdx,0FFh
000007FE98C945CA add rax,rdx
000007FE98C945CD mov rsi,rax
dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945D0 mov rax,rsi
000007FE98C945D3 shr rax,4
000007FE98C945D7 mov rdx,rsi
000007FE98C945DA and rdx,0Fh
000007FE98C945DE add rax,rdx
000007FE98C945E1 mov rsi,rax
dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945E4 mov rax,rsi
000007FE98C945E7 shr rax,4
000007FE98C945EB mov rdx,rsi
000007FE98C945EE and rdx,0Fh
000007FE98C945F2 add rax,rdx
000007FE98C945F5 mov rsi,rax
if (dividend > 14) { dividend = dividend - 15; } // mod 15
; notice the difference in jumping logic - the pairs of jumps are now singles
000007FE98C945F8 cmp rsi,0Eh
000007FE98C945FC jbe 000007FE98C94602
; using a single 64 bit add instead of a subtract, the immediate constant
; is the 2's complement of 15. this is okay because there's no borrowing
; to do since we can do the entire sub in one operation to one register.
000007FE98C945FE add rsi,0FFFFFFFFFFFFFFF1h
if (dividend > 10) { dividend = dividend - 10; }
000007FE98C94602 cmp rsi,0Ah
000007FE98C94606 jbe 000007FE98C9460C
000007FE98C94608 add rsi,0FFFFFFFFFFFFFFF6h
if (dividend > 4) { dividend = dividend - 5; }
000007FE98C9460C cmp rsi,4
000007FE98C94610 jbe 000007FE98C94616
000007FE98C94612 add rsi,0FFFFFFFFFFFFFFFBh
return dividend;
000007FE98C94616 mov rax,rsi
000007FE98C94619 add rsp,20h
000007FE98C9461D pop rsi
000007FE98C9461E ret
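The x64 Mersenne5 listing replaces each subtraction with an add of a two's-complement constant, for example add rsi,0FFFFFFFFFFFFFFF1h in place of subtracting 15. A small sketch of why the two are equivalent for 64-bit unsigned values follows; the names TwosComplementDemo and SubtractViaAdd are hypothetical.

```csharp
using System;

public static class TwosComplementDemo
{
    // Subtracts `amount` from `value` by adding the two's complement of
    // `amount`. Unsigned arithmetic wraps modulo 2^64, so the result is the
    // same as value - amount.
    public static ulong SubtractViaAdd(ulong value, ulong amount)
    {
        ulong negated = unchecked(~amount + 1UL); // 15 becomes 0xFFFFFFFFFFFFFFF1
        return unchecked(value + negated);
    }
}
```

Adding the negated constant to zero, as in SubtractViaAdd(0, 15), yields the constant itself, which is the same immediate that appears in the listing.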
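I can't give a definitive explanation of why RawModulo_5 is nearly twice as slow in x64 as in x86, and especially why OptimizedModulo_ViaMethod_5 is so much faster under x64 than under x86 (about 3x in your BenchmarkDotNet numbers, about 1.8x in my Stopwatch runs). For a complete explanation, I think we would need someone like Peter Cordes, who knows far more about instruction timings and pipelining than I do. Here is my intuition about where the advantages and disadvantages come from.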
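The 64-bit div, as it affects RawModulo_5:
A 32-bit div takes 10 micro-ops with a latency of 22 to 29 clocks, whereas a 64-bit div takes 36 micro-ops with a latency of 32 to 95 clocks. The compiler also made an optimization in the x86 RawModulo_5 that bypasses the high-dword div in every case, because the loop stays below int.MaxValue (so the high dword of i is always zero), so in practice it executes a single 32-bit div on each iteration. The 64-bit div latency is therefore 1.45 to 3.27 times higher than the 32-bit div latency. Both versions depend completely on the result of the div, so the x64 code pays a larger performance penalty because of the higher latency. I would venture that the add/adc pair used for the 64-bit add in the x86 RawModulo_5 is a small penalty compared with the huge performance disadvantage of a 64-bit-wide div.

Call overhead in OptimizedModulo_ViaMethod_5:
OptimizedModulo_ViaMethod_5 calls Mersenne5 1000 times in both versions, and the 64-bit version pays much less for it because of the difference between the standard x86 and x64 calling conventions. The x86 version has to push two registers onto the stack to pass the 64-bit variable; Mersenne5 then has to preserve esi and edi, then load the low and high dwords from the stack (into esi and edi, from which they are copied to eax and edx). Finally, Mersenne5 has to restore esi and edi. In the x64 version, the value of i is passed directly in rcx, so no memory access is involved at all. The x64 Mersenne5 only saves and restores rsi; the other registers it uses are simply clobbered.

Far fewer instructions in Mersenne5:
Mersenne5 is more efficient in x64 because it can perform every operation on the 64-bit dividend in a single instruction, whereas the mov and add/adc operations in x86 need pairs of instructions. I have a hunch that the dependency chains are also better in x64, but I don't know enough to speak on that topic.

Better jump behavior in Mersenne5:
The three conditional subtractions at the end of Mersenne5 are implemented far better under x64 than under x86. On x86, each one has two comparisons and three possible conditional jumps. On x64, there is only one comparison and one conditional jump, which is undoubtedly more efficient.

In summary, the 64-bit div does considerable damage to RawModulo_5, while the near halving of the instruction count in Mersenne5 speeds up OptimizedModulo_ViaMethod_5.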
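What I can't explain is how fast the x64 OptimizedModulo_ViaMethod_5 is in your results, even compared with the x86 RawModulo_5. I imagine the answer is that micro-op fusion and pipelining of the Mersenne5 method are much better on x64, or that the JIT on your architecture may use Broadwell-specific knowledge to emit very different instructions.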
Timings with two manually inlined variants added:
RawModulo_5, x86: 13722506 ticks, 13.722506 ticks per iteration
OptimizedModulo_ViaMethod_5, x86: 23640994 ticks, 23.640994 ticks per iteration
OptimizedModulo_TrueInlined, x86: 21488012 ticks, 21.488012 ticks per iteration
OptimizedModulo_TrueInlined2, x86: 21645697 ticks, 21.645697 ticks per iteration
RawModulo_5, x64: 22175326 ticks, 22.175326 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 12822574 ticks, 12.822574 ticks per iteration
OptimizedModulo_TrueInlined, x64: 7612328 ticks, 7.612328 ticks per iteration
OptimizedModulo_TrueInlined2, x64: 7591190 ticks, 7.59119 ticks per iteration
The two manually inlined variants:
public ulong OptimizedModulo_TrueInlined()
{
    ulong r = 0;
    ulong dividend = 0;
    for (ulong i = 0; i < 1000; i++)
    {
        dividend = i;
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        r += dividend;
    }
    return r;
}

public ulong OptimizedModulo_TrueInlined2()
{
    ulong r = 0;
    ulong dividend = 0;
    for (ulong i = 0; i < 1000; i++)
    {
        dividend = (i >> 32) + (i & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        r += dividend;
    }
    return r;
}
Source: "c# - What is the reason behind x64 having such different performance results from x86?" on Stack Overflow: https://stackoverflow.com/questions/44556614/