If the attribute export-subst is set for a file then Git will expand several placeholders when adding this file to an archive. The expansion depends on the availability of a commit ID, i.e., if git-archive[1] has been given a tree instead of a commit or a tag then no replacement will be done. The placeholders are the same as those for the option --pretty=format: of git-log[1], except that they need to be wrapped like this: $Format:PLACEHOLDERS$ in the file. E.g. the string $Format:%H$ will be replaced by the commit hash. However, only one %(describe) placeholder is expanded per archive to avoid denial-of-service attacks.
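Something like the following shows the expansion in action (the file name is arbitrary):

    $ echo 'VERSION export-subst' >> .gitattributes
    $ echo 'commit: $Format:%H$' > VERSION
    $ git add .gitattributes VERSION && git commit -m 'stamp archives with the commit hash'
    $ git archive -o snapshot.tar HEAD
    # VERSION inside snapshot.tar now reads e.g. "commit: 1a2b3c4d..." while the
    # checked-in file still contains the literal placeholder.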
+99999999, it's seriously annoying not to be able to have a lock "superimposed" on a pointer (as is done in lock-free data structures, and particularly relevant for "intermediate" data structures where some operations can proceed lock-free but others block [e.g. hash table resizing]).
Also interesting is that https://luajit.org/status.html now states “LuaJIT is actively developed and maintained” (whereas for the last ~5 years, “actively” isn’t a word I’d have used), and makes reference to a TBA development branch.
This makes me soooo happy to hear, because of all the forks of LuaJIT that appeared after Mike stopped maintaining it ~5 years ago, none seems to carry the baton very well.
LuaJIT is truly an engineering marvel that more folks should adopt.
If it's actively being maintained again, hopefully that will happen.
What I don't quite get is the following:
"Please note: The main LuaJIT author (Mike Pall) is working on unrelated projects and cannot accept bigger sponsorships at this time. But other community members may be open to sponsorship offers — please ask on the LuaJIT mailing list for any takers."
Per https://www.stateof.ai/compute, one of the players in the market has ten thousand GPUs in a private cloud. Out-computing just that one player is hard enough, let alone out-computing the whole market.
That seems like another case of excessive complexity leading to surprising results.
Presumably the additional complexity was thought to be beneficial to performance, but PE's simple list of names with ordinal hints, or just ordinals alone, was sufficient for decent performance even on the much slower systems for which it was initially designed. (Its predecessor, NE, was similar.)
There are different kinds of complexity. Thanks to ordinals, PE has two ways to link against a symbol (by name and by ordinal), and if you use ordinals then you need a .def file, which has to be kept consistent over time if you want to keep your DLL ABI-compatible. That adds a bunch of developer-facing complexity. In contrast, improved exported-symbol data structures such as Mach-O’s export tries or ELF’s DT_GNU_HASH are mostly just implementation details that developers don’t need to care about.
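For illustration, a minimal .def file with ordinals looks something like this (the names are made up); every later version of the DLL has to keep handing out the same ordinals, or by-ordinal imports in already-shipped binaries break:

    LIBRARY example
    EXPORTS
        ; exported by name and assigned a fixed ordinal
        do_thing        @1
        ; exported by ordinal only (no name in the export table)
        do_other_thing  @2 NONAME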
As far as I know, when not using ordinals, Windows’ dynamic linker resolves symbols by binary-searching the export table, which is sorted by symbol name. This is almost identical to the mechanism that Mach-O relied on prior to the introduction of the export trie, and macOS’s dynamic linker isn’t particularly inefficient. So the only time PE wins is when using ordinals, with their associated complexity.
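As a toy Python model of the two resolution paths being compared (just an illustration, not the actual loader; the tables and addresses are made up):

    import bisect

    # Toy model of a PE export directory: names sorted lexicographically,
    # a parallel table mapping each name to an ordinal, and an address
    # table indexed by ordinal. (Real PE also has an ordinal base; ignored here.)
    export_names  = ["bar", "baz", "foo", "qux"]
    name_ordinals = [2, 3, 0, 1]
    address_table = [0x1000, 0x1040, 0x1080, 0x10c0]

    def resolve_by_name(name):
        # O(log n) string comparisons, like the loader's binary search.
        i = bisect.bisect_left(export_names, name)
        if i < len(export_names) and export_names[i] == name:
            return address_table[name_ordinals[i]]
        raise KeyError(name)

    def resolve_by_ordinal(ordinal):
        # O(1): just index the address table.
        return address_table[ordinal]

    print(hex(resolve_by_name("foo")))   # 0x1000, via binary search
    print(hex(resolve_by_ordinal(0)))    # 0x1000, directly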
Also, if you compare today’s systems to the systems that PE was designed for, today’s processors are much faster, but today’s programs are also much larger with a greater number of symbols being imported and exported. And performance expectations are higher… well, at least when it comes to low-level system components. (User-facing app launch times may well be worse, but that’s a more complicated problem.)
From a hardware perspective, vector instructions operate on small 1D vectors, whereas tensor instructions operate on small 2D matrices. I say “instructions”, but it’s really only matrix multiply or matrix multiply and accumulate - most other instructions are fine staying as 1D.
If there is a matrix multiply at the hardware level, it's fair to give it a name other than vectorization. For example, the dimensions and the partitioning of large matrices to fit the hardware would be specific to that design, and very different from rolling things out over 1D arrays.
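A toy numpy sketch of the distinction being drawn here (the tile size is made up; real hardware fixes the tile shape per instruction):

    import numpy as np

    T = 4  # hypothetical hardware tile size

    def mma_tile(A, B, C):
        # What a single "tensor" (matrix multiply-accumulate) instruction
        # computes on one T-by-T tile: D = A @ B + C.
        return A @ B + C

    # Partitioning a larger matmul into tile-sized MMA operations; the tile
    # shape and this partitioning are what's specific to each design, as
    # opposed to rolling the work out over 1D vector registers.
    N = 8
    A = np.random.rand(N, N).astype(np.float32)
    B = np.random.rand(N, N).astype(np.float32)
    C = np.zeros((N, N), dtype=np.float32)
    for i in range(0, N, T):
        for j in range(0, N, T):
            for k in range(0, N, T):
                C[i:i+T, j:j+T] = mma_tile(A[i:i+T, k:k+T], B[k:k+T, j:j+T], C[i:i+T, j:j+T])

    assert np.allclose(C, A @ B, rtol=1e-4)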
Assuming that you're after "round to nearest with ties toward even", then the quoted numpy code gets very close to `vcvtps2ph`, and one minor tweak gets it to bitwise identical: replace `ret += (ret == 0x7c00u)` with `ret |= 0x200`. Alternatively, the quoted Maratyszcza code gets to the same place if you replace `& 0x7c00u` with `& 0x7dffu`.
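For anyone wanting to poke at this without the intrinsics, numpy's own f32 -> f16 conversion can be used to see the two behaviours being matched (ties-to-even rounding, and the quiet-NaN bit 0x0200 in the f16 encoding). This is just an illustration of the target behaviour, not the code quoted above:

    import numpy as np

    def f16_bits(x):
        # Bit pattern of numpy's f32 -> f16 conversion (round to nearest, ties to even).
        h = np.array([x], dtype=np.float32).astype(np.float16).view(np.uint16)[0]
        return hex(int(h))

    # 1 + 2^-11 is exactly halfway between f16 1.0 (0x3c00) and the next value up
    # (0x3c01); ties-to-even picks the even mantissa.
    print(f16_bits(1 + 2**-11))      # 0x3c00
    print(f16_bits(1 + 3 * 2**-11))  # 0x3c02 (halfway between 0x3c01 and 0x3c02)

    # NaN becomes an f16 quiet NaN: the 0x7c00 exponent pattern with bit 0x0200 set.
    print(f16_bits(float("nan")))    # 0x7e00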
The first niche that came to mind was x86 code running under Rosetta 2; despite ARM having an equivalent to F16C, Rosetta 2 doesn’t translate AVX, and F16C doesn’t have a non-AVX encoding.
Indeed. Worth noting that Accelerate.framework provides fast and correct bulk f16 <-> f32 conversions as `vImageConvert_Planar16FtoPlanarF` and `vImageConvert_PlanarFtoPlanar16F`, and that the arm conversion instructions are unconditionally available for apps that compile for arm64 (they're part of the base ARMv8 ISA), so any _new_ code shouldn't need to worry about this.
Using the notation from the article, N+K is sufficient for RS(N,K). One point of confusion is that different authors use different notation; some use RS(num data shards, num parity shards), some use RS(total num shards, num data shards), and some use RS(total num shards, num parity shards). Per the article, I'll use RS(num data shards, num parity shards).
As for where the +1 comes from, the clue is in the "noting that you shouldn't use the value 0 in the encoding matrix" remark. The TLDR is that the +1 isn't required, and arises from an (incorrect) attempt to fix an incorrect construction. The non-TLDR is rather long.

First, we need to move from polynomials (per the article) to matrices (per the quoted remark). For this move, let F denote the polynomial from the article; it so happens that F(k+1) can be expressed as a linear combination of F(1), F(2), ..., F(k). Similarly, F(k+2) can be expressed as a linear combination of F(1), F(2), ..., F(k). This continues to be true up to F(k+t) (and beyond). These various linear combinations can be written as a k-by-t matrix, which is what the quoted remark means by "encoding matrix".

Second, once thinking with matrices rather than with polynomials, people want to construct the encoding matrix directly, rather than deriving it from a polynomial. In this direct construction, the requirement is that every square submatrix (of any size) of the k-by-t matrix is invertible. Accordingly, no element of the k-by-t matrix can be zero, as the 1-by-1 submatrix containing just that zero element isn't invertible.

Third, one common (but incorrect) direct construction is to create a k-by-t Vandermonde matrix. Such a matrix is usually constructed from some number of distinct elements, but if zero is used as such an element, then the resultant matrix will contain zeroes, which is problematic. Excluding zero causes the Vandermonde construction to _sometimes_ work, but it still doesn't _always_ work.

Per https://www.corsix.org/content/reed-solomon-for-software-rai..., there's a slightly different Vandermonde construction that _does_ always work, and also a Cauchy construction that always works, both of which work for zero. Both of these correct constructions have a strong parallel to the polynomial construction: they involve choosing N+K distinct field elements, which is akin to choosing the N+K distinct x co-ordinates for the polynomial.
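To make the polynomial-to-matrix step above concrete, here's a toy Python sketch over the prime field GF(257) (real implementations typically work in GF(2^8), and the shard counts here are arbitrary). It derives the encoding matrix from Lagrange coefficients and then checks the every-square-submatrix-invertible property:

    import random
    from itertools import combinations

    P = 257        # toy prime field GF(257); real code usually uses GF(2^8)
    K, T = 4, 3    # K data shards, T parity shards

    def inv(a):
        return pow(a % P, P - 2, P)   # modular inverse (Fermat's little theorem)

    def lagrange_coeff(xs, i, e):
        # Coefficient of F(xs[i]) when the degree-(K-1) interpolating polynomial
        # through the points xs is evaluated at x = e.
        c = 1
        for j, xj in enumerate(xs):
            if j != i:
                c = c * (e - xj) * inv(xs[i] - xj) % P
        return c

    # Data shards are F(1..K), parity shards are F(K+1..K+T). enc[i][j] is the
    # weight of data shard i in parity shard j -- the "encoding matrix".
    data_xs   = list(range(1, K + 1))
    parity_xs = list(range(K + 1, K + T + 1))
    enc = [[lagrange_coeff(data_xs, i, e) for e in parity_xs] for i in range(K)]

    # Sanity check: for a random polynomial F of degree < K, the parity values
    # really are these linear combinations of the data values.
    coeffs = [random.randrange(P) for _ in range(K)]
    F = lambda x: sum(c * pow(x, d, P) for d, c in enumerate(coeffs)) % P
    for j, e in enumerate(parity_xs):
        assert F(e) == sum(F(x) * enc[i][j] for i, x in enumerate(data_xs)) % P

    def det(m):
        # Determinant mod P via Gaussian elimination; 0 means "not invertible".
        m, n, d = [row[:] for row in m], len(m), 1
        for c in range(n):
            piv = next((r for r in range(c, n) if m[r][c] % P), None)
            if piv is None:
                return 0
            if piv != c:
                m[c], m[piv], d = m[piv], m[c], -d
            d = d * m[c][c] % P
            for r in range(c + 1, n):
                f = m[r][c] * inv(m[c][c]) % P
                m[r] = [(m[r][x] - f * m[c][x]) % P for x in range(n)]
        return d % P

    # Every square submatrix of the K-by-T encoding matrix is invertible, which
    # is exactly the property the direct constructions are trying to achieve.
    for size in range(1, min(K, T) + 1):
        for rows in combinations(range(K), size):
            for cols in combinations(range(T), size):
                assert det([[enc[r][c] for c in cols] for r in rows]) != 0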