
If you divide that giant piece of silicon into 400k processors, and then only use the ones that actually work...

I wonder if they figure that out every time the CPU boots, or at the factory. At this scale, maybe it makes sense to do it all in parallel at boot, or even dynamically during runtime.

There may also be edge-case cores that sort of work, but then fail at different temperatures or after aging?



They'll aim to catch the logic failures and memory failures during wafer test at the factory. This testing is done at room temperature and at elevated (hot) temperature, with margins built in to allow for ageing. If they want to ship a decent product they'll also need to repeat the memory testing every boot, and ideally during runtime too, though maybe the latter isn't a big deal for something like this.
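To make "repeat the memory testing every boot" concrete, here's a minimal march-style sketch in Python (purely illustrative; real chips do this with hardware MBIST engines, and the word-addressable `mem` interface is an assumption):

    # Minimal march-style memory test: write/read patterns in both address
    # directions and report any word that doesn't read back what was written.
    def march_test(mem, size):
        bad = set()
        for a in range(size):                # ascending: write all zeros
            mem[a] = 0x00000000
        for a in range(size):                # ascending: read zeros, write ones
            if mem[a] != 0x00000000:
                bad.add(a)
            mem[a] = 0xFFFFFFFF
        for a in reversed(range(size)):      # descending: read ones, write zeros
            if mem[a] != 0xFFFFFFFF:
                bad.add(a)
            mem[a] = 0x00000000
        return sorted(bad)                   # addresses to map out or flag

    print(march_test([0] * 1024, 1024))      # -> [] on a good memory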

EDIT: to add a bit more and possibly address the original question (which I think keveman may have misunderstood), there will usually be some hardware dedicated to controlling the chip's redundancy. Part of that is often an OTP (one-time-programmable) fuse-type thing that can be programmed during wafer test to mark the parts of the chip that don't work. Something (software or hardware) reads that during boot and avoids using those parts of the chip.
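Roughly what that boot-time step could look like, as a sketch (the one-bit-per-core fuse encoding and the names here are assumptions, not the vendor's actual scheme):

    # Decode an OTP fuse map into the set of usable cores. Assumed layout:
    # one bit per core, bit set = core marked bad at wafer test.
    NUM_CORES = 400_000

    def usable_cores(fuse_words):
        # fuse_words: 32-bit words read out of the OTP block at boot
        good = []
        for core in range(NUM_CORES):
            blown = (fuse_words[core // 32] >> (core % 32)) & 1
            if not blown:
                good.append(core)
        return good   # routing/scheduling only ever sees these cores

    # All-zero fuses = every core passed wafer test.
    print(len(usable_cores([0] * ((NUM_CORES + 31) // 32))))   # -> 400000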


Sure, that makes sense.

With this many cores it seems like the probability that a core dies during a multi-hour job (or, if it's used for inference, during a very long-lived realtime job) is pretty high, so the software at all layers would need to handle that kind of failure. It probably doesn't today, since we haven't seen a 400k-core chip before.
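A quick back-of-envelope supports that intuition (Python, with a completely made-up per-core failure rate just to show how the numbers compound across 400k cores):

    # If one core fails with probability p in a given hour, the chance that at
    # least one of N cores fails during a T-hour job is 1 - (1 - p)**(N * T).
    p = 1e-9        # assumed per-core failure probability per hour (made up)
    N = 400_000     # cores
    T = 10          # job length in hours

    print(1 - (1 - p) ** (N * T))   # ~0.004, i.e. ~0.4% even at this tiny rate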



