c - What happens if you convert a big int to float


Reposted. Author: 太空宇宙. Updated: 2023-11-04 06:27:01

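This is a general question about what exactly happens when I convert a very large or very small signed integer to float using gcc 4.4.

I see some strange behaviour when doing the conversion. Here are some examples:

The MUST BE value is obtained like this: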

float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));

./btest -f float_i2f -1 0x80800001
input:          10000000100000000000000000000001
absolute value: 01111111011111111111111111111111

exponent:       10011101
mantissa:       00000000011111101111111111111111  (right shifted absolute value)

EXPECT:         11001110111111101111111111111111  (sign|exponent|mantissa)
MUST BE:        11001110111111110000000000000000  (sign ok, exponent ok,
                                                     mantissa???)

./btest -f float_i2f -1 0x3f7fffe0

EXPECT:    01001110011111011111111111111111
MUST BE:   01001110011111100000000000000000

./btest -f float_i2f -1 0x80004999

EXPECT:    11001110111111111111111101101100
MUST BE:   11001110111111111111111101101101    (<- 1 added at the end)

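What bothers me is that, if I simply shift the integer value to the right, the mantissa differs in both examples. The trailing zeros, for instance: where do they come from?

I only see this behaviour with very large/small values. Values in the range -2^24 to 2^24 work fine.

I wonder if someone can tell me what is happening here. Which extra steps are taken for very large/small values?

This is an add-on question to: function to convert float to int (huge integers), which is not as general as this one.

EDIT, code: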

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* calculate mantissa */
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  int res = sign << 31;
  res |= (e << 23);
  res |= m;

  return res;
}

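EDIT 2:

After the comments from Adams and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (fortunately only 1 bit off now).

Now, when I do a test run, I get mostly good results, but with some rounding errors like these: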

input:  0xfefffff5
result: 11001011100000000000000000000101
GOAL:   11001011100000000000000000000110  (1 too low)

input:  0x7fffff
result: 01001010111111111111111111111111
GOAL:   01001010111111111111111111111110  (1 too high)

unsigned float_i2f(int x) {
  if (x == 0) return 0;
  /* get sign of x */
  int sign = (x>>31) & 0x1;

  /* absolute value of x */
  int a = sign ? ~x + 1 : x;

  /* calculate exponent */
  int e = 158;
  int t = a;
  while (!(t >> 31) & 0x1) {
    t <<= 1;
    e--;
  };

  /* mask to check which bits get shifted out when rounding */
  static unsigned masks[24] = {
    0, 1, 3, 7, 
    0xf, 0x1f, 
    0x3f, 0x7f, 
    0xff, 0x1ff, 
    0x3ff, 0x7ff, 
    0xfff, 0x1fff, 
    0x3fff, 0x7fff, 
    0xffff, 0x1ffff, 
    0x3ffff, 0x7ffff, 
    0xfffff, 0x1fffff, 
    0x3fffff, 0x7fffff
  };

  /* mask to check whether to round up or down */
  static unsigned HOmasks[24] = {
    0,
    1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
  };

  int S = a & masks[8];
  int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
  m &= 0x7fffff;

  if (S > HOmasks[8]) {
    /* round up */
    m += 1;
  } else if (S == HOmasks[8]) {
    /* round down */
    m = m + (m & 1);
  }

  /* special case where last bit of exponent is also set in mantissa
   * and mantissa itself is 0 */
  if (m & (0x1 << 23)) {
    e += 1;
    m = 0;
  }

  int res = sign << 31;
  res |= (e << 23);
  res |= m;
  return res;
}

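Does anyone know where the problem is?

Best answer

A 32-bit float uses some of its bits for the exponent, so it cannot represent all 32-bit integer values exactly.

A 64-bit double can store any 32-bit integer value exactly.

Wikipedia has an abbreviated entry on IEEE 754 floating point, and a lot of detail on the internals of floating point at IEEE 754-1985 (the current standard is IEEE 754:2008). It notes that a 32-bit float uses 1 bit for the sign and 8 bits for the exponent, leaving 23 explicit bits and 1 implicit bit for the mantissa, which is why absolute values up to 2²⁴ can be represented exactly.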

I thought that it was clear that a 32 bit integer can't be exactly stored into a 32bit float. My question is: What happens IF I store an integer bigger 2^24 or smaller -2^24? And how can I replicate it?

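Once the absolute value is greater than 2²⁴, the 24 effective bits of a 32-bit float's mantissa can no longer represent every integer value exactly, so only the leading 24 binary digits are reliably available. Floating-point rounding also kicks in.

You can demonstrate this with code like the following:

#include <inttypes.h>
#include <stdio.h>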

typedef union Ufloat
{
    uint32_t    i;
    float       f;
} Ufloat;

static void dump_value(uint32_t i, uint32_t v)
{
    Ufloat u = { .i = v };
    printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}

int main(void)
{
    uint32_t lo = 1 << 23;
    uint32_t hi = 1 << 28;
    Ufloat u;

    for (uint32_t v = lo; v < hi; v <<= 1)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    lo = (1 << 24) - 16;
    hi = lo + 64;

    for (uint32_t v = lo; v < hi; v++)
    {
        u.f = v;
        dump_value(v, u.i);
    }

    return 0;
}

Sample output:

0x00800000: 0x4B000000 =   8.3886080e+06 =  0X1.000000P+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x02000000: 0x4C000000 =   3.3554432e+07 =  0X1.000000P+25
0x04000000: 0x4C800000 =   6.7108864e+07 =  0X1.000000P+26
0x08000000: 0x4D000000 =   1.3421773e+08 =  0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 =   1.6777200e+07 =  0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 =   1.6777201e+07 =  0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 =   1.6777202e+07 =  0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 =   1.6777203e+07 =  0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 =   1.6777204e+07 =  0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 =   1.6777205e+07 =  0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 =   1.6777206e+07 =  0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 =   1.6777207e+07 =  0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 =   1.6777208e+07 =  0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 =   1.6777209e+07 =  0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA =   1.6777210e+07 =  0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB =   1.6777211e+07 =  0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC =   1.6777212e+07 =  0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD =   1.6777213e+07 =  0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE =   1.6777214e+07 =  0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF =   1.6777215e+07 =  0X1.FFFFFEP+23
0x01000000: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000001: 0x4B800000 =   1.6777216e+07 =  0X1.000000P+24
0x01000002: 0x4B800001 =   1.6777218e+07 =  0X1.000002P+24
0x01000003: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000004: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000005: 0x4B800002 =   1.6777220e+07 =  0X1.000004P+24
0x01000006: 0x4B800003 =   1.6777222e+07 =  0X1.000006P+24
0x01000007: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000008: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x01000009: 0x4B800004 =   1.6777224e+07 =  0X1.000008P+24
0x0100000A: 0x4B800005 =   1.6777226e+07 =  0X1.00000AP+24
0x0100000B: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000C: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000D: 0x4B800006 =   1.6777228e+07 =  0X1.00000CP+24
0x0100000E: 0x4B800007 =   1.6777230e+07 =  0X1.00000EP+24
0x0100000F: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000010: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000011: 0x4B800008 =   1.6777232e+07 =  0X1.000010P+24
0x01000012: 0x4B800009 =   1.6777234e+07 =  0X1.000012P+24
0x01000013: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000014: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000015: 0x4B80000A =   1.6777236e+07 =  0X1.000014P+24
0x01000016: 0x4B80000B =   1.6777238e+07 =  0X1.000016P+24
0x01000017: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000018: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x01000019: 0x4B80000C =   1.6777240e+07 =  0X1.000018P+24
0x0100001A: 0x4B80000D =   1.6777242e+07 =  0X1.00001AP+24
0x0100001B: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001C: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001D: 0x4B80000E =   1.6777244e+07 =  0X1.00001CP+24
0x0100001E: 0x4B80000F =   1.6777246e+07 =  0X1.00001EP+24
0x0100001F: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000020: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000021: 0x4B800010 =   1.6777248e+07 =  0X1.000020P+24
0x01000022: 0x4B800011 =   1.6777250e+07 =  0X1.000022P+24
0x01000023: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000024: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000025: 0x4B800012 =   1.6777252e+07 =  0X1.000024P+24
0x01000026: 0x4B800013 =   1.6777254e+07 =  0X1.000026P+24
0x01000027: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000028: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x01000029: 0x4B800014 =   1.6777256e+07 =  0X1.000028P+24
0x0100002A: 0x4B800015 =   1.6777258e+07 =  0X1.00002AP+24
0x0100002B: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002C: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002D: 0x4B800016 =   1.6777260e+07 =  0X1.00002CP+24
0x0100002E: 0x4B800017 =   1.6777262e+07 =  0X1.00002EP+24
0x0100002F: 0x4B800018 =   1.6777264e+07 =  0X1.000030P+24

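The first part of the output shows that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. More precisely (but less concisely), any integer whose absolute value has a binary representation of no more than 24 significant digits (any trailing digits being zeros) can be represented exactly. The values cannot necessarily be printed exactly, but that is a separate concern from storing them exactly.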

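The second (larger) part of the output shows that integer values up to 2²⁴-1 can be represented exactly. The value 2²⁴ itself is also exactly representable, but 2²⁴+1 is not, so it appears the same as 2²⁴. By contrast, 2²⁴+2 can be represented with just 24 binary digits followed by 1 zero, and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though the "round to even" mode is in effect; that is why the results show 1 value and then 3 values.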

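(As an aside, there is no way to stipulate that the double passed to printf(), converted from float by the rules of the default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6), should be printed as a float; the h modifier would logically be used, but it is not defined.)

This article is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25701319/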
