AArch64 NEON has the URSQRTE instruction, which gets closer to the OP's question than you might think: view a 32-bit value as a fixed-point number with 32 fractional bits (so the representable range is the evenly spaced values from 0 through 1-ε, where ε = 2^-32); URSQRTE then computes the approximate inverse square root, halves it, and clamps it to that range. Fixed-point numbers aren't quite integers, and approximate inverse square root isn't quite square root, but it might get you somewhere close.
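To make the fixed-point interpretation concrete, here's a scalar sketch in Julia of the behavior as described above. It's my own model of that description, not the Arm pseudocode (the real instruction only produces a low-precision estimate):

```julia
# Model of the *described* semantics, not the Arm pseudocode: treat a UInt32
# as the fixed-point value v / 2^32 in [0, 1), take the inverse square root,
# halve it, and clamp back into the representable range.
function ursqrte_model(v::UInt32)
    x = v / 2.0^32                          # fixed point, 32 fractional bits
    est = 0.5 / sqrt(x)                     # halved inverse sqrt (Inf at v = 0)
    clamped = clamp(est, 0.0, 1 - 2.0^-32)  # v = 0 saturates to 1 - ε
    return UInt32(floor(clamped * 2.0^32))  # back to the fixed-point encoding
end
```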
The related FRSQRTE instruction is much more conventional, operating on 32-bit floats, again giving approximate inverse square root.
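Estimate instructions like these are normally just a starting point; the usual pattern (which the companion FRSQRTS step instruction exists to accelerate) is to refine the estimate with Newton-Raphson iterations. A sketch of that refinement:

```julia
# One Newton-Raphson step for y ≈ 1/sqrt(x): y' = y * (3 - x*y^2) / 2.
# Each step roughly doubles the number of correct bits.
rsqrt_step(x, y) = y * (3 - x * y^2) / 2

x = 2.0
y = 0.7                  # rough initial estimate of 1/sqrt(2) ≈ 0.70711
y = rsqrt_step(x, y)     # ≈ 0.707
y = rsqrt_step(x, y)     # ≈ 0.7071068
```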
NEON is SIMD, so I would presume these instructions let you vectorize those calculations and run them in parallel over a lot of data, more efficiently than breaking them down into simpler operations done one by one.
Yes, but the part that got me was the halving of the result followed by the clamping. SIMD generally makes sense, but for something like this to exist, there's usually something very specific (a certain video codec, for example) that greatly benefits from such a complex instruction.
It's probably not about avoiding extra instructions for performance, but about making the range of the result more useful and avoiding overflow. In other words, the entire instruction might be useless if it didn't do these things.
The halving and clamping are nothing particularly remarkable in the context of using fixed-point numbers (scaled integers) while avoiding overflow. Reciprocal square root itself is a fundamental operation for DSP algorithms, and of course computer graphics.
This is a fairly generic instruction really, though FRSQRTE likely gets more real world use.
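To put numbers on the overflow point: 1/sqrt(x) ≥ 1 for every x ≤ 1, so an unhalved result would clamp for every input in a [0, 1) fixed-point format; halving keeps the result representable whenever x > 0.25. A two-line illustration:

```julia
x = 0.5
1 / sqrt(x)      # ≈ 1.414, overflows a [0, 1) fixed-point format
0.5 / sqrt(x)    # ≈ 0.707, representable after halving
```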
Unfortunately things aren't so simple, as when doing JIT compilation, LuaJIT _will_ try to shorten the lifetimes of local variables. Using the latest available version of LuaJIT (https://github.com/LuaJIT/LuaJIT/commit/0d313b243194a0b8d239...), the following reliably fails for me:
```lua
local ffi = require"ffi"

local function collect_lots()
  for i = 1, 20 do collectgarbage() end
end

local function f(s)
  local blob = ffi.new"int[2]"
  local interior = blob + 1
  interior[0] = 13 -- should become the return value
  s:gsub(".", collect_lots)
  return interior[0] -- kept alive by blob?
end

for i = 1, 60 do
  local str = ("x"):rep(i - 59)
  assert(f(str) == 13) -- can fail!!
end
```
Well, that is from 3 weeks ago. If that still stands, then either it's a bug or the documentation is wrong.
What are the rules for keeping a GC object alive? What earthly useful meaning can "Lua stack" have in the FFI GC documentation, if not local variable bindings, since those are the only user-visible exposure of it in the language?
From the LuaJIT docs:
So e.g. if you assign a cdata array to a pointer, you must keep the cdata object holding the array alive as long as the pointer is still in use:
```lua
ffi.cdef[[
typedef struct { int *a; } foo_t;
]]

local s = ffi.new("foo_t", ffi.new("int[10]")) -- WRONG!

local a = ffi.new("int[10]") -- OK
local s = ffi.new("foo_t", a)
-- Now do something with 's', but keep 'a' alive until you're done.
```
What on earth does "OK" here mean, if not the local variable binding? That's the expectation, because it's what it says on the tin.
This then isn't a discussion about fundamental issues or "impossibilities" with GC, but about language implementations not following their own specifications, or not having them at all.
Since LuaJIT does not have an explicit pinning interface, the expectation that a local variable binding keeps its referent alive until the end of scope is pretty basic. If your bug case is the expected behavior, then even the line `interior[0] = 13` is undefined, and so is everything after `local s` in the documentation; i.e., you can do absolutely nothing with a pointed-to cdata until you pin it in a table. Who would want to use that?
You're absolutely right. I'm not particularly familiar with LuaJIT, so when I read the article I got the impression that the LuaJIT GC semantics weren't documented. Looks like the LuaJIT behavior is well defined and the implementation isn't keeping its own promises.
Some compute might be on the AMX units (dedicated matrix multiplication coprocessor, closely attached to the CPU, distinct from both ANE and GPU). They gained bf16 support in M2.
Using your cdf framing, while there is a point about [0, 1) versus (0, 1] intervals, the bigger point of the article is about whether said cdf holds for any IEEE-754 double-precision p, or whether it only holds for p of the form i*2^-53 (for integer i).
Oh, I completely agree there's value in improving the accuracy of the cdf, especially for 32-bit floats! There it can be quite a surprise that tests like `random() < x` can be off by as much as 0.6% for values of x on the order of 1f-5, and 6% for values on the order of 1f-6, even though those are an order of magnitude or two above `eps(1f0)`.
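Those figures are easy to reproduce (my sketch, assuming the usual 24-bit construction where `rand(Float32)` returns i·2^-24 for uniform i in 0:2^24-1, so the mass below x is ⌈x·2^24⌉/2^24):

```julia
# Mass below x under the 24-bit model: P(rand(Float32) < x) = ceil(x * 2^24) / 2^24.
p_model(x) = ceil(Float64(x) * 2^24) / 2^24

x5 = nextfloat(167 * 2f0^-24)   # ≈ 9.954f-6, just above a grid point
x6 = nextfloat(16 * 2f0^-24)    # ≈ 9.537f-7
(p_model(x5) - x5) / x5         # ≈ 0.0060, i.e. ~0.6%
(p_model(x6) - x6) / x6         # ≈ 0.0625, i.e. ~6%
```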
Following your post, I've found the following fast, straightforward, and SIMD-friendly implementation that uses all 64 bits of a random UInt64 for a [0, 1) distribution:
```julia
function random_float(rng)
    r = rand(rng, UInt64)
    last_bit = r & -r           # lowest set bit; equals 2^k with probability 2^-(k+1)
    # Float64(2^k) has bit pattern (1023 + k) << 52, so this computes the
    # exponent field (1022 - k) << 52, i.e. a double in [2^-(k+1), 2^-k)
    exponent = UInt64(2045)<<52 - reinterpret(UInt64, Float64(last_bit))
    exponent *= !iszero(r)      # map r == 0 to 0.0 rather than a huge exponent
    # remaining high bits (with the sampled bit cleared) fill the mantissa
    fraction = (r ⊻ last_bit)>>(8*sizeof(UInt64) - 52)
    return reinterpret(Float64, exponent | fraction)
end
```
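For what it's worth, a quick sanity check of the version above (my addition, seeding the built-in Xoshiro generator):

```julia
using Random

rng = Random.Xoshiro(0)
xs = (random_float(rng) for _ in 1:10^6)
@assert all(x -> 0.0 <= x < 1.0, xs)   # every sample lands in [0, 1)
```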
Oh, cute. It looks like they ever-so-slightly overweight the probability of values whose mantissa is entirely zeroes, though. For example, the probability of hitting exactly 0.5 should be 2^-54 + 2^-55 (under round-to-nearest of an ideal uniform real, 0.5 captures half of the 2^-53 gap above it plus half of the 2^-54 gap below it), whereas Zig looks to give it 2^-54 + 2^-54.