
If you divide that giant piece of silicon into 400k processors, and then only use the ones that actually work...

I wonder if they figure that out every time the CPU boots, or at the factory. At this scale, maybe it makes sense to do it all in parallel at boot, or even dynamically during runtime.

There may also be edge-case cores that sort of work, but then fail at different temperatures or after aging?



They'll aim to catch the logic failures and memory failures during wafer test at the factory. This testing is done at room temperature and at elevated (hot) temperature, with margins built in to allow for ageing. If they want to ship a decent product they'll also need to repeat the memory testing every boot, and ideally during runtime too, though maybe the latter isn't a big deal for something like this.
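To make "repeat the memory testing every boot" concrete, here's a minimal march-style sketch in Python (purely illustrative; real chips do this with hardware MBIST engines, and the word-addressable `mem` interface is an assumption):

    # Minimal march-style memory test: write/read patterns in both address
    # directions and report any word that doesn't read back what was written.
    def march_test(mem, size):
        bad = set()
        for a in range(size):                # ascending: write all zeros
            mem[a] = 0x00000000
        for a in range(size):                # ascending: read zeros, write ones
            if mem[a] != 0x00000000:
                bad.add(a)
            mem[a] = 0xFFFFFFFF
        for a in reversed(range(size)):      # descending: read ones, write zeros
            if mem[a] != 0xFFFFFFFF:
                bad.add(a)
            mem[a] = 0x00000000
        return sorted(bad)                   # addresses to map out or flag

    print(march_test([0] * 1024, 1024))      # -> [] on a good memory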

EDIT: to add a bit more and possibly address the original question (which I think keveman may have misunderstood), there will usually be some hardware dedicated to controlling the chip's redundancy. Part of that is often an OTP (one-time-programmable) fuse-type thing that can be programmed during wafer test to mark the parts of the chip that don't work. Something (software or hardware) reads that during boot and avoids using those parts of the chip.
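Roughly what that boot-time step could look like, as a sketch (the one-bit-per-core fuse encoding and the names here are assumptions, not the vendor's actual scheme):

    # Decode an OTP fuse map into the set of usable cores. Assumed layout:
    # one bit per core, bit set = core marked bad at wafer test.
    NUM_CORES = 400_000

    def usable_cores(fuse_words):
        # fuse_words: 32-bit words read out of the OTP block at boot
        good = []
        for core in range(NUM_CORES):
            blown = (fuse_words[core // 32] >> (core % 32)) & 1
            if not blown:
                good.append(core)
        return good   # routing/scheduling only ever sees these cores

    # All-zero fuses = every core passed wafer test.
    print(len(usable_cores([0] * ((NUM_CORES + 31) // 32))))   # -> 400000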


Sure, that makes sense.

With this many cores it seems like the probability that a core dies during a multi-hour job (or, if it's used for inference, during a very long-lived realtime job) is pretty high, so the software at all layers would need to handle that kind of failure. It probably doesn't today, since we haven't seen a 400k-core chip before.
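A quick back-of-envelope supports that intuition (Python, with a completely made-up per-core failure rate just to show how the numbers compound across 400k cores):

    # If one core fails with probability p in a given hour, the chance that at
    # least one of N cores fails during a T-hour job is 1 - (1 - p)**(N * T).
    p = 1e-9        # assumed per-core failure probability per hour (made up)
    N = 400_000     # cores
    T = 10          # job length in hours

    print(1 - (1 - p) ** (N * T))   # ~0.004, i.e. ~0.4% even at this tiny rate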



