AArch64 NEON has the URSQRTE instruction, which gets closer to the OP's question than you might think: view a 32-bit value as a fixed-point number with 32 fractional bits (so the representable range is the evenly spaced values from 0 through 1-ε, where ε = 2^-32); URSQRTE then computes the approximate inverse square root, halves it, and clamps it to that range. Fixed-point numbers aren't quite integers, and approximate inverse square root isn't quite square root, but it might get you somewhere close.
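To make the fixed-point interpretation concrete, here's a scalar sketch in Julia of the behavior as described above. It's my own model of that description, not the Arm pseudocode (the real instruction only produces a low-precision estimate):

```julia
# Model of the *described* semantics, not the Arm pseudocode: treat a UInt32
# as the fixed-point value v / 2^32 in [0, 1), take the inverse square root,
# halve it, and clamp back into the representable range.
function ursqrte_model(v::UInt32)
    x = v / 2.0^32                          # fixed point, 32 fractional bits
    est = 0.5 / sqrt(x)                     # halved inverse sqrt (Inf at v = 0)
    clamped = clamp(est, 0.0, 1 - 2.0^-32)  # v = 0 saturates to 1 - ε
    return UInt32(floor(clamped * 2.0^32))  # back to the fixed-point encoding
end
```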
The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.
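Estimate instructions like these are normally just a starting point; the usual pattern (which the companion FRSQRTS step instruction exists to accelerate) is to refine the estimate with Newton-Raphson iterations. A sketch of that refinement:

```julia
# One Newton-Raphson step for y ≈ 1/sqrt(x): y' = y * (3 - x*y^2) / 2.
# Each step roughly doubles the number of correct bits.
rsqrt_step(x, y) = y * (3 - x * y^2) / 2

x = 2.0
y = 0.7                  # rough initial estimate of 1/sqrt(2) ≈ 0.70711
y = rsqrt_step(x, y)     # ≈ 0.707
y = rsqrt_step(x, y)     # ≈ 0.7071068
```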
NEON is SIMD, so I would presume these instructions let you vectorize those calculations and run them in parallel over a lot of data, more efficiently than breaking them down into simpler operations done one by one.
Yes, but the part that got me was the halving of the result followed by the clamping. SIMD generally makes sense, but for something like this to exist, there's usually something very specific (a certain video codec, for example) that greatly benefits from such a complex instruction.
It's probably not about avoiding extra instructions for performance, but about making the range of the result more useful and avoiding overflow. In other words, the entire instruction might be useless if it didn't do these things.
The halving and clamping are nothing particularly remarkable in the context of using fixed-point numbers (scaled integers) while avoiding overflow. Reciprocal square root itself is a fundamental operation for DSP algorithms, and of course computer graphics.
This is a fairly generic instruction really, though FRSQRTE likely gets more real world use.
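To put numbers on the overflow point: 1/sqrt(x) ≥ 1 for every x ≤ 1, so an unhalved result would clamp for every input in a [0, 1) fixed-point format; halving keeps the result representable whenever x > 0.25. A two-line illustration:

```julia
x = 0.5
1 / sqrt(x)      # ≈ 1.414, overflows a [0, 1) fixed-point format
0.5 / sqrt(x)    # ≈ 0.707, representable after halving
```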
Unfortunately things aren't so simple, as when doing JIT compilation, LuaJIT _will_ try to shorten the lifetimes of local variables. Using the latest available version of LuaJIT (https://github.com/LuaJIT/LuaJIT/commit/0d313b243194a0b8d239...), the following reliably fails for me:
```lua
local ffi = require"ffi"

local function collect_lots()
  for i = 1, 20 do collectgarbage() end
end

local function f(s)
  local blob = ffi.new"int[2]"
  local interior = blob + 1
  interior[0] = 13 -- should become the return value
  s:gsub(".", collect_lots)
  return interior[0] -- kept alive by blob?
end

for i = 1, 60 do
  local str = ("x"):rep(i - 59)
  assert(f(str) == 13) -- can fail!!
end
```
Well, that is from 3 weeks ago. If that still stands, then either it's a bug or the documentation is wrong.
What are the rules for keeping a GC object alive? What earthly useful meaning can "Lua stack" have in the FFI GC documentation, if not local variable bindings, since those are the only user-visible exposure of it in the language?
From the LuaJIT docs:
So e.g. if you assign a cdata array to a pointer, you must keep the cdata object holding the array alive as long as the pointer is still in use:
```lua
ffi.cdef[[
typedef struct { int *a; } foo_t;
]]

local s = ffi.new("foo_t", ffi.new("int[10]")) -- WRONG!

local a = ffi.new("int[10]") -- OK
local s = ffi.new("foo_t", a)
-- Now do something with 's', but keep 'a' alive until you're done.
```
What on earth does "OK" here mean, if not the local variable binding? That's the expectation, because it's what it says on the tin.
This then isn't a discussion about fundamental issues or "impossibilities" with GC, but about language implementations not following their own specifications, or not having them at all.
Since LuaJIT does not have an explicit pinning interface, the expectation that a local variable binding keeps its referent alive until the end of scope is pretty basic. If your bug case is the expected behavior, then even the line `interior[0] = 13` is undefined, and so is everything after `local s` in the documentation; i.e., you can do absolutely nothing with a pointed-to cdata until you pin it in a table. Who would want to use that?
You're absolutely right. I'm not particularly familiar with LuaJIT, so when I read the article I got the impression that the LuaJIT GC semantics weren't documented. Looks like the LuaJIT behavior is well defined and the implementation isn't keeping its own promises.
Some compute might be on the AMX units (dedicated matrix multiplication coprocessor, closely attached to the CPU, distinct from both ANE and GPU). They gained bf16 support in M2.
Using your cdf framing, while there is a point about [0, 1) versus (0, 1] intervals, the bigger point of the article is about whether said cdf holds for any IEEE-754 double-precision p, or whether it only holds for p of the form i*2^-53 (for integer i).
Oh, I completely agree there's value in improving the accuracy of the cdf, especially for 32-bit floats! There it can be quite a surprise that tests like `random() < x` can be off by as much as 0.6% for values of x on the order of 1f-5, and 6% for values on the order of 1f-6, even though those are an order of magnitude or two above `eps(1f0)`.
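Those figures are easy to reproduce (my sketch, assuming the usual 24-bit construction where `rand(Float32)` returns i·2^-24 for uniform i in 0:2^24-1, so the mass below x is ⌈x·2^24⌉/2^24):

```julia
# Mass below x under the 24-bit model: P(rand(Float32) < x) = ceil(x * 2^24) / 2^24.
p_model(x) = ceil(Float64(x) * 2^24) / 2^24

x5 = nextfloat(167 * 2f0^-24)   # ≈ 9.954f-6, just above a grid point
x6 = nextfloat(16 * 2f0^-24)    # ≈ 9.537f-7
(p_model(x5) - x5) / x5         # ≈ 0.0060, i.e. ~0.6%
(p_model(x6) - x6) / x6         # ≈ 0.0625, i.e. ~6%
```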
Following your post, I've found the following fast, straightforward, and SIMD-friendly implementation that uses all 64 bits of a random UInt64 for a [0, 1) distribution:
```julia
function random_float(rng)
    r = rand(rng, UInt64)
    last_bit = r & -r           # lowest set bit; equals 2^k with probability 2^-(k+1)
    # Float64(2^k) has bit pattern (1023 + k) << 52, so this computes the
    # exponent field (1022 - k) << 52, i.e. a double in [2^-(k+1), 2^-k)
    exponent = UInt64(2045)<<52 - reinterpret(UInt64, Float64(last_bit))
    exponent *= !iszero(r)      # map r == 0 to 0.0 rather than a huge exponent
    # remaining high bits (with the sampled bit cleared) fill the mantissa
    fraction = (r ⊻ last_bit)>>(8*sizeof(UInt64) - 52)
    return reinterpret(Float64, exponent | fraction)
end
```
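For what it's worth, a quick sanity check of the version above (my addition, seeding the built-in Xoshiro generator):

```julia
using Random

rng = Random.Xoshiro(0)
xs = (random_float(rng) for _ in 1:10^6)
@assert all(x -> 0.0 <= x < 1.0, xs)   # every sample lands in [0, 1)
```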
Oh, cute. It looks like they ever-so-slightly overweight the probability of values whose mantissa is entirely zeroes, though. For example, the probability of hitting exactly 0.5 should be 2^-54 + 2^-55 (under round-to-nearest of an ideal uniform real, 0.5 captures half of the 2^-53 gap above it plus half of the 2^-54 gap below it), whereas Zig looks to give it 2^-54 + 2^-54.