OpenQMC API
Loading...
Searching...
No Matches
float.h
Go to the documentation of this file.
1// SPDX-License-Identifier: Apache-2.0
2// Copyright Contributors to the OpenQMC Project.
3
7
8#pragma once
9
10#include "gpu.h"
11
12#include <cstdint>
13
30
31namespace oqmc
32{
33
34constexpr auto floatOneOverTwoPower32 = 1.0f / (1ull << 32);
35
45OQMC_HOST_DEVICE inline float uintToFloat(std::uint32_t value)
46{
47 // There are various methods for converting an integer to a float, each with
48 // a different balance of speed, quality and complexity.
49
50 // Option 1: Default Conversion and Clamp Maximum
51 //
52 // A simple method is to cast the integer to a floating point
53 // type, using the default runtime rounding mode (round nearest), and then
54 // multiply by the reciprocal of the (exclusive) integer maximum (1/2^32).
55 // This method is fast and retains a lot of precision, but has the
56 // undesirable property of floating point values rounding up to the nearest
57 // representable number to reduce error. This gives the potential for an
58 // output equal to one, requiring a min operation. It also means the
59 // probability density of representable values is not uniform. For example,
60 // output 0.5 is 50% more likely than the next value.
61
62 // Option 2: Truncate Precision and Default Conversion
63 //
64 // Marc Reynolds gives an option where 8 bits of precision is first removed
65 // prior to division:
66 // (http://marc-b-reynolds.github.io/distribution/2017/01/17/DenseFloat.html)
67 // This removes the need for any min operations, but also loses precision
68 // below one half; significant precision is lost for small values.
69
70 // Option 3: Count Leading Zeros and Bitwise Manipulation
71 //
72 // A high-quality reference method is detailed in Section 2.1 of
73 // 'Quasi-Monte Carlo Algorithms (not only) for Graphics Software' by
74 // Keller, Wächter and Binder. This remaps the integer bits to the floating
75 // point exponent and mantissa to provide optimal uniform probability
76 // density. The computational cost however is high, due to instructions
77 // for counting leading zeroes and manipulating the mantissa and exponent
78 // bit patterns directly. It also uses a bitwise cast that can impact
79 // optimiser analysis on some platforms. See a97ad21 for details.
80
81 // Option 4: Adjust Integer and Default Conversion (Current Implementation)
82 //
83 // The following implementation provides results identical to the option 3,
84 // while using a simpler approach that is more efficient on common
85 // architectures.
86 //
87 // The reference method generates values equivalent to modifying the runtime
88 // environment to round down, and then performing the conversion and scale:
89 // `std::fesetround(FE_DOWNWARD);`
90 // `return static_cast<float>(value) * floatOneOverTwoPower32;`
91 //
92 // Unfortunately, modifying the runtime rounding mode is often expensive,
93 // prohibited or ignored (see #pragma STDC FENV_ACCESS). It also affects
94 // subsequent operations unless the previous state is restored.
95
96 // However, it is possible to achieve the same effect by adjusting the
97 // integer so that its conversion to floating point will always round down
98 // with the default runtime mode (round nearest). This retains the optimal
99 // uniform probability density at a lower computational cost.
100
101 // When an integer is converted to floating point, the hardware identifies
102 // the position of the leading one bit. The mantissa stores the following 23
103 // bits, while the exponent encodes the position of that leading one.
104 // The next three bits (guard, round, and sticky) govern the rounding
105 // decision. The sticky bit represents the logical OR of all bits shifted
106 // out beyond the round bit.
107
108 // For example, consider a 32-bit input integer with three leading zeroes:
109 //
110 // <-----------32 bits------------>
111 // Input: 0001mmmmmmmmmmmmmmmmmmmmmmmgrsss
112 // |<-------23 bits------->|
113 // 1 g
114 // Key:
115 // 0 = leading zeros
116 // 1 = leading one
117 // m = mantissa
118 // g = guard
119 // r = round
120 // s = sticky
121 //
122 // To force rounding down during float conversion, it is both necessary and
123 // sufficient to clear the guard bit.
124 //
125 // Fortunately, the guard bit (when present) is always located 24 bits
126 // to the right of the leading one. This allows it to be cleared
127 // efficiently without explicitly determining the position of either bit.
128 // The integer is shifted right by 24 bits to generate a mask that clears
129 // the guard bit and leaves the mantissa unmodified.
130
131 // For example, with three leading zeroes the following bit patterns
132 // are used:
133 //
134 // Input : 0001mmmm mmmmmmmm mmmmmmmm mmmgrsss
135 // Mask : 00000000 00000000 00000000 0001mmmm mask = input >> 24
136 // Safe : 0001mmmm mmmmmmmm mmmmmmmm mmm0???? safe = input & ~mask
137 // ^
138 // guard bit cleared
139 //
140 // The round and sticky bits are left undefined but do not affect the
141 // outcome. With the guard bit cleared, the adjusted integer lies
142 // strictly below the rounding tie-break, ensuring that both the round
143 // and sticky bits are ignored.
144 //
145 // For inputs less than 2^24, the conversion to floating point is exact.
146 // In this case, the mask evaluates to zero and the operation has no effect.
147
148 // On common architectures this reduces to two or three instructions.
149 // For example, ARMv8 requires two instructions: one that combines the
150 // shift-not-and operations, and one for the conversion to float with
151 // free scale by power-of-two constant.
152
153 const auto mask = value >> 24;
154 const auto safe = value & ~mask;
155
156 return static_cast<float>(safe) * floatOneOverTwoPower32;
157}
158
159} // namespace oqmc
#define OQMC_HOST_DEVICE
Definition gpu.h:13
EncodeKey decodeBits16(std::uint16_t value)
Decode a value back into a key.
Definition encode.h:81
float uintToFloat(std::uint32_t value)
Convert an integer into a [0, 1) float.
Definition float.h:45
Definition bntables.h:21
constexpr auto floatOneOverTwoPower32
0x1p-32
Definition float.h:34