c# - Why is casting a struct via a pointer slow, while Unsafe.As is fast?

Reposted by: 太空狗 | Updated: 2023-10-29 18:24:23

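Background

I want to make some integer-sized structs (i.e. 32 and 64 bits) that are easily convertible to and from primitive unmanaged types of the same size (i.e. Int32 and UInt32, for the 32-bit struct in particular).

The structs would then expose additional functionality for bit manipulation/indexing that is not directly available on integer types. Basically, it is a kind of syntactic sugar that improves readability and ease of use.

The important part, however, is performance: this extra abstraction should cost essentially nothing (at the end of the day, the CPU should "see" the same bits as if it were dealing with primitive ints).

Example struct

Below is just the very basic struct I came up with. It doesn't have all the functionality, but it is enough to illustrate my questions: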

[StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
public struct Mask32 {
  [FieldOffset(3)]
  public byte Byte1;
  [FieldOffset(2)]
  public ushort UShort1;
  [FieldOffset(2)]
  public byte Byte2;
  [FieldOffset(1)]
  public byte Byte3;
  [FieldOffset(0)]
  public ushort UShort2;
  [FieldOffset(0)]
  public byte Byte4;

  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(int i) => *(Mask32*)&i;
  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(uint i) => *(Mask32*)&i;
}

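Tests

I wanted to test the performance of this struct. In particular, I wanted to see whether it would let me get the individual bytes just as quickly as regular bitwise arithmetic: (i >> 8) & 0xFF (to get the 3rd byte, for example).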
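For reference, the bitwise baseline mentioned above can be sketched as follows. This is only an illustration: `ByteExtraction` and `ByteAt` are hypothetical names that do not appear in the original post, and the index here counts from the least significant byte, so index 1 corresponds to what the post calls the 3rd byte.

```csharp
using System;

public static class ByteExtraction
{
    // Hypothetical helper (not part of the original post).
    // Returns byte n of value, where n = 0 is the least significant byte.
    public static byte ByteAt(int value, int n) => (byte)((value >> (8 * n)) & 0xFF);

    public static void Main()
    {
        int i = 0x11223344;

        // Same computation as (i >> 8) & 0xFF from the text.
        Console.WriteLine(ByteAt(i, 1).ToString("X2")); // prints "33"
    }
}
```

The mask removes any sign-extended bits produced by the arithmetic right shift, so the helper behaves the same for negative inputs.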

Below is the benchmark I came up with:

public unsafe class MyBenchmark {

  const int count = 50000;

  [Benchmark(Baseline = true)]
  public static void Direct() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      //var b1 = i.Byte1();
      //var b2 = i.Byte2();
      var b3 = i.Byte3();
      //var b4 = i.Byte4();
      j += b3;
    }
  }


  [Benchmark]
  public static void ViaStructPointer() {
    var j = 0;
    int i = 0;
    var s = (Mask32*)&i;
    for (; i < count; i++) {
      //var b1 = s->Byte1;
      //var b2 = s->Byte2;
      var b3 = s->Byte3;
      //var b4 = s->Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructPointer2() {
    var j = 0;
    int i = 0;
    for (; i < count; i++) {
      var s = *(Mask32*)&i;
      //var b1 = s.Byte1;
      //var b2 = s.Byte2;
      var b3 = s.Byte3;
      //var b4 = s.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructCast() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      Mask32 m = i;
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaUnsafeAs() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      var m = Unsafe.As<int, Mask32>(ref i);
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

}

Byte1(), Byte2(), Byte3(), and Byte4() are just extension methods that do get inlined and simply get the n-th byte using bitwise operations and a cast:

[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte1(this int it) => (byte)(it >> 24);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte2(this int it) => (byte)((it >> 16) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte3(this int it) => (byte)((it >> 8) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte4(this int it) => (byte)it;

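EDIT: Fixed the code to make sure the variables are actually used. Also commented out 3 of the 4 variables, to really test the struct cast/member access rather than the use of the variables.

Results

I ran these in an optimized Release build on x64.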

Intel Core i7-3770K CPU 3.50GHz (Ivy Bridge), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3410223 Hz, Resolution=293.2360 ns, Timer=TSC
  [Host]     : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0
  DefaultJob : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0


            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  14.47 us | 0.3314 us | 0.2938 us |   1.00 |     0.00 |
  ViaStructPointer | 111.32 us | 0.6481 us | 0.6062 us |   7.70 |     0.15 |
 ViaStructPointer2 | 102.31 us | 0.7632 us | 0.7139 us |   7.07 |     0.14 |
     ViaStructCast |  29.00 us | 0.3159 us | 0.2800 us |   2.01 |     0.04 |
       ViaUnsafeAs |  14.32 us | 0.0955 us | 0.0894 us |   0.99 |     0.02 |

EDIT: New results after fixing the code:

            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  57.51 us | 1.1070 us | 1.0355 us |   1.00 |     0.00 |
  ViaStructPointer | 203.20 us | 3.9830 us | 3.5308 us |   3.53 |     0.08 |
 ViaStructPointer2 | 198.08 us | 1.8411 us | 1.6321 us |   3.45 |     0.06 |
     ViaStructCast |  79.68 us | 1.5478 us | 1.7824 us |   1.39 |     0.04 |
       ViaUnsafeAs |  57.01 us | 0.8266 us | 0.6902 us |   0.99 |     0.02 |

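Questions

The benchmark results surprised me, which is why I have a few questions:

EDIT: Fewer questions remain after changing the code so that the variables are actually used.

Why is the pointer approach so slow?
~~Why does the cast take twice as long as the baseline case? Aren't implicit/explicit operators inlined?~~
How come the new System.Runtime.CompilerServices.Unsafe package (v. 4.5.0) is so fast? I thought it would at least involve a method call...
More generally, how can I make an essentially zero-cost struct that simply acts as a "window" onto some memory, or onto a large primitive type such as UInt64, so that I can manipulate/read that memory more efficiently? What is the best practice here?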

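Best answer

The answer to this seems to be that the JIT compiler can perform certain optimisations better when you use Unsafe.As().

Unsafe.As() is implemented very simply. Conceptually it is just this (the real method is expressed in IL, since C# itself would reject returning a ref TFrom as a ref TTo):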

public static ref TTo As<TFrom, TTo>(ref TFrom source)
{
    return ref source;
}

That's it!
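Because Unsafe.As() returns a ref, the caller can either copy the reinterpreted value or keep the reference and write through it. The following is a minimal sketch of both uses; `UnsafeAsDemo`, `ReadByte3`, and `WithLowByte` are hypothetical names introduced for illustration, and the `Mask32` here is a trimmed copy of the struct from the question. On .NET Framework it assumes the System.Runtime.CompilerServices.Unsafe package is referenced.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Trimmed copy of the question's Mask32 (three of its fields only).
[StructLayout(LayoutKind.Explicit, Size = 4)]
public struct Mask32
{
    [FieldOffset(3)] public byte Byte1;
    [FieldOffset(1)] public byte Byte3;
    [FieldOffset(0)] public byte Byte4;
}

// Hypothetical demo class (not part of the original answer).
public static class UnsafeAsDemo
{
    // Copies the four bytes of value into a Mask32 and reads one field.
    public static byte ReadByte3(int value)
    {
        Mask32 copy = Unsafe.As<int, Mask32>(ref value);
        return copy.Byte3;
    }

    // Keeps the returned ref, so the field write lands directly in value.
    public static int WithLowByte(int value, byte low)
    {
        ref Mask32 view = ref Unsafe.As<int, Mask32>(ref value);
        view.Byte4 = low;
        return value;
    }

    public static void Main()
    {
        Console.WriteLine(ReadByte3(0x11223344).ToString("X2"));
        Console.WriteLine(WithLowByte(0x11223344, 0xFF).ToString("X8"));
    }
}
```

On a little-endian machine such as x64, this prints 33 and then 112233FF, since offset 0 of the struct overlays the least significant byte of the int. No pointer syntax or unsafe context is needed in the calling code.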

Here's a test program I wrote to compare it with the pointer cast:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace Demo
{
    [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
    public struct Mask32
    {
        [FieldOffset(3)]
        public byte Byte1;
        [FieldOffset(2)]
        public ushort UShort1;
        [FieldOffset(2)]
        public byte Byte2;
        [FieldOffset(1)]
        public byte Byte3;
        [FieldOffset(0)]
        public ushort UShort2;
        [FieldOffset(0)]
        public byte Byte4;
    }

    public static unsafe class Program
    {
        static int count = 50000000;

        public static int ViaStructPointer()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var s = (Mask32*)&i;
                total += s->Byte1;
            }

            return total;
        }

        public static int ViaUnsafeAs()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var m = Unsafe.As<int, Mask32>(ref i);
                total += m.Byte1;
            }

            return total;
        }

        public static void Main(string[] args)
        {
            var sw = new Stopwatch();

            sw.Restart();
            ViaStructPointer();
            Console.WriteLine("ViaStructPointer took " + sw.Elapsed);

            sw.Restart();
            ViaUnsafeAs();
            Console.WriteLine("ViaUnsafeAs took " + sw.Elapsed);
        }
    }
}

The results I get on my PC (x64 release build) are as follows:

ViaStructPointer took 00:00:00.1314279
ViaUnsafeAs took 00:00:00.0249446

As you can see, ViaUnsafeAs is indeed much faster.

So let's look at what the compiler has generated:

public static unsafe int ViaStructPointer()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (*(Mask32*)(&i)).Byte1;
    }
    return total;
}

public static int ViaUnsafeAs()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (Unsafe.As<int, Mask32>(ref i)).Byte1;
    }
    return total;
}

OK, there's nothing obvious there. But what about the IL?

.method public hidebysig static int32 ViaStructPointer () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32* s
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0017
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: conv.u
        IL_0009: stloc.2
        IL_000a: ldloc.0
        IL_000b: ldloc.2
        IL_000c: ldfld uint8 Demo.Mask32::Byte1
        IL_0011: add
        IL_0012: stloc.0
        IL_0013: ldloc.1
        IL_0014: ldc.i4.1
        IL_0015: add
        IL_0016: stloc.1

        IL_0017: ldloc.1
        IL_0018: ldsfld int32 Demo.Program::count
        IL_001d: blt.s IL_0006
    }

    IL_001f: ldloc.0
    IL_0020: ret
}

.method public hidebysig static int32 ViaUnsafeAs () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32 m
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0020
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
        IL_000d: ldobj Demo.Mask32
        IL_0012: stloc.2
        IL_0013: ldloc.0
        IL_0014: ldloc.2
        IL_0015: ldfld uint8 Demo.Mask32::Byte1
        IL_001a: add
        IL_001b: stloc.0
        IL_001c: ldloc.1
        IL_001d: ldc.i4.1
        IL_001e: add
        IL_001f: stloc.1

        IL_0020: ldloc.1
        IL_0021: ldsfld int32 Demo.Program::count
        IL_0026: blt.s IL_0006
    }

    IL_0028: ldloc.0
    IL_0029: ret
}

Aha! The only difference here is this:

ViaStructPointer: conv.u
ViaUnsafeAs:      call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
                  ldobj Demo.Mask32

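On the face of it, you would expect conv.u to be faster than the two instructions used for Unsafe.As. However, it seems that the JIT compiler is able to optimise those two instructions much better than the single conv.u.

It's reasonable to ask why that is. Unfortunately I don't have an answer to that yet! I'm almost certain that the call to Unsafe::As<>() is being inlined by the JITTER, and that the result is then being further optimised by the JIT.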

There is some information about the Unsafe class' optimisations here.

Note that the IL generated for Unsafe.As<> is simply this:

.method public hidebysig static !!TTo& As<TFrom, TTo> (
        !!TFrom& source
    ) cil managed aggressiveinlining 
{
    .custom instance void System.Runtime.Versioning.NonVersionableAttribute::.ctor() = (
        01 00 00 00
    )
    IL_0000: ldarg.0
    IL_0001: ret
}

Now I think it becomes clearer why the JITTER can optimise that so well.
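As a side note on the question's last point, the readability goal can also be approached without any reinterpretation, by wrapping the integer and exposing the same shifts that the Direct baseline uses. The following is only a sketch under that assumption: `Bits32` is a hypothetical type that does not appear in the question or the answer, and nothing here has been benchmarked, so whether it matches the baseline on a given runtime would need to be measured.

```csharp
using System;

// Hypothetical wrapper (not from the original post): one uint field,
// with byte accessors implemented as plain shifts and casts.
public readonly struct Bits32
{
    private readonly uint _value;

    public Bits32(uint value) => _value = value;

    // Named as in the question: Byte1 is the most significant byte.
    public byte Byte1 => (byte)(_value >> 24);
    public byte Byte2 => (byte)(_value >> 16);
    public byte Byte3 => (byte)(_value >> 8);
    public byte Byte4 => (byte)_value;

    public static implicit operator Bits32(uint value) => new Bits32(value);
    public static implicit operator Bits32(int value) => new Bits32(unchecked((uint)value));
    public static implicit operator uint(Bits32 bits) => bits._value;
}

public static class Bits32Demo
{
    public static void Main()
    {
        Bits32 b = 0x11223344u;
        Console.WriteLine(b.Byte3.ToString("X2")); // prints "33"
    }
}
```

Unlike the explicit-layout struct, this version does not depend on the machine's byte order, because the accessors work on the integer value rather than on its memory layout.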

Regarding "c# - Why is casting a struct via a pointer slow, while Unsafe.As is fast?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50870942/
