
> Modern GPUs have a rated lifespan in the 3-7 year range, depending on usage.

That statement absolutely needs a source. Is "usage" 100% load 24/7? What is the failure rate after 7 years? Are the failures unrepairable, i.e. not just a broken fan?



I’ve never heard of this and I was an Ethereum miner. We pushed the cards as hard as they would go and they seemed fine after. As long as the fan was still going they were good.


So Intel used to claim a 100,000+ hour lifetime on their chips. They didn't actually test them for that long, because that is 11.4 years. But it was basically saying: these things will last at full speed way beyond any reasonable lifetime. Many chips could probably go way beyond that.
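
The 11.4 years is just the straight unit conversion:

    # 100,000 hours expressed as years of continuous operation.
    print(100_000 / (24 * 365))   # ~11.4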

I think it was about 15 years back that they stopped saying that. Once we passed the 28nm mark it started to become apparent that they couldn't really state that anymore.

It makes sense: as parts get smaller, they become more fragile under general usage.

With your GPUs, yeah, they are probably still fine, but they could already be halfway through their lifetime; you wouldn't know it until the failure point. Add in the silicon lottery and it gets more complicated.


One thing to realize is that the lifetime is a statistical thing.

I design chips in modern tech nodes (currently using 2nm). What we get from the fab is a statistical model of device failure modes. Aging is one of them. When transistors gradually age they get slower due to an increased threshold voltage. This eventually causes failure at a point where timing is tight. When it happens varies greatly due to initial conditions and the exact conditions the chip has been in (temperature, VDD, number of on-off cycles, even the workload). After an aging failure the chip will still work if the clock frequency is reduced. There are sometimes aging monitors on-chip which try to catch it early and scale down the clock.
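
As a toy illustration of that derating idea (every constant and the linear Vth-to-delay relation below are made up, nothing from a real PDK):

    # Toy model of aging-induced threshold-voltage drift and clock derating.
    # Every constant here is invented for illustration; real aging models
    # come from the foundry's reliability kit and are far more involved.

    VTH0 = 0.30              # fresh threshold voltage (V), assumed
    DRIFT_PER_YEAR = 0.004   # Vth shift per year under stress (V), assumed
    DELAY0_NS = 0.50         # critical-path delay when fresh (ns), assumed
    DELAY_SENS = 2.0         # extra delay (ns) per volt of Vth shift, assumed
    TIMING_BUDGET_NS = 0.55  # the path must fit in this period or we derate

    for year in range(11):
        vth = VTH0 + DRIFT_PER_YEAR * year
        delay = DELAY0_NS + DELAY_SENS * (vth - VTH0)
        fmax_ghz = 1.0 / delay
        # Crude "aging monitor": when the path no longer fits the budget,
        # scale the clock down instead of letting the chip fail outright.
        note = " (derate clock)" if delay > TIMING_BUDGET_NS else ""
        print(f"year {year:2d}: Vth={vth:.3f} V  delay={delay:.3f} ns  "
              f"fmax={fmax_ghz:.2f} GHz{note}")

The real models also fold in recovery effects and process spread, but the shape is the same: slow drift, then a cliff wherever your timing margin runs out.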

There are catastrophic failures too, like gate insulator breakdown, electromigration or mechanical failures of IO interconnect. The last one is orders of magnitude more likely than anything else these days.


For mining, if a GPU was failing in such a way that it gave completely wrong output for functions during mining, that would only be visible as a lower effective hash rate, which you might not even notice unless you did periodic testing against known-target hashes.

For graphics, the same defect could be severe enough to render the GPU completely useless.
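
The check itself is cheap, something along these lines (SHA-256 and the function names are just stand-ins for the real mining hash and the GPU code path):

    # Spot-check the hashing path against a trusted reference implementation.
    # SHA-256 stands in for the real mining hash; "gpu_hash" is a placeholder
    # for whatever routine actually runs on the card under test.
    import hashlib

    def reference_hash(data: bytes) -> str:
        # Trusted CPU implementation used to generate expected digests.
        return hashlib.sha256(data).hexdigest()

    def gpu_hash(data: bytes) -> str:
        # Placeholder for the GPU code path under test; here it just works.
        return hashlib.sha256(data).hexdigest()

    TEST_INPUTS = [b"known header 1", b"known header 2", b"known header 3"]

    def self_test() -> bool:
        # A card silently computing wrong hashes fails this check, even though
        # in normal operation it would only look like a slightly lower hash rate.
        return all(gpu_hash(d) == reference_hash(d) for d in TEST_INPUTS)

    print("hash path OK" if self_test() else "hash path corrupted")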


Yeah, chip aging is characterized at max temperature, max current, and the worst process corner. And it's nonlinear, so running at <10% duty cycle could reduce aging to almost nothing.
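
Rough sketch of why duty cycle matters so much, using an Arrhenius-style acceleration factor (the activation energy and temperatures are illustrative, not from any datasheet):

    # Toy Arrhenius-style picture: the aging rate grows roughly exponentially
    # with temperature, so time spent cool or idle contributes almost nothing.
    # The activation energy and temperatures are illustrative assumptions.
    import math

    K_EV = 8.617e-5   # Boltzmann constant, eV/K
    EA_EV = 0.7       # assumed activation energy, eV

    def accel(temp_c: float, ref_c: float = 55.0) -> float:
        t, ref = temp_c + 273.15, ref_c + 273.15
        return math.exp(EA_EV / K_EV * (1.0 / ref - 1.0 / t))

    duty = 0.10                           # 10% of the time under full load
    hot, idle = accel(95.0), accel(40.0)  # relative to the 55 C reference
    weighted = duty * hot + (1.0 - duty) * idle
    print(f"hot {hot:.1f}x, idle {idle:.2f}x, duty-weighted {weighted:.2f}x")

Even with made-up numbers, the exponential temperature term is what makes the hot hours dominate and the idle hours nearly free.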


Has chip aging finally surpassed thermal cycling as the primary cause of component failure in datacenters?


I don't know but I would guess not. Solder is really weak compared to silicon.
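
The usual back-of-the-envelope for the solder side is a Coffin-Manson style power law; a sketch with assumed constants:

    # Coffin-Manson style estimate for solder fatigue: cycles to failure
    # scale roughly as (temperature swing)^-q. The exponent and the reference
    # point below are illustrative assumptions, not measured values.

    Q = 2.5                  # assumed fatigue exponent for solder joints
    REF_DT_K = 40.0          # reference thermal swing, K
    REF_CYCLES = 10_000.0    # assumed cycles to failure at the reference swing

    def cycles_to_failure(delta_t_k: float) -> float:
        return REF_CYCLES * (REF_DT_K / delta_t_k) ** Q

    for dt in (20.0, 40.0, 80.0):
        print(f"swing {dt:.0f} K -> ~{cycles_to_failure(dt):,.0f} cycles")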


Every now and then, I get a heartfelt chuckle from HN.

By 'modern' they must mean the latest generation, so we'll have to wait and see. I was imagining an RTX 5090 sitting unused for 7 years and then not working, or one used 24x7 for 3 years and then failing.


Electromigration and device aging are huge issues. I can't imagine a modern GPU having a lifetime longer than 3 years at 100C.

Though it can be mitigated with redundancy, at the cost of performance.
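
The standard first-order way to reason about the electromigration part is Black's equation; a sketch with invented constants just to show the temperature sensitivity:

    # Black's equation for electromigration MTTF: A * J^-n * exp(Ea / kT).
    # A, n, Ea and the operating points are illustrative guesses; only the
    # relative comparison between the two temperatures means anything here.
    import math

    K_EV = 8.617e-5            # Boltzmann constant, eV/K
    A, N, EA = 1.0, 2.0, 0.9   # arbitrary scale, assumed exponent and Ea (eV)

    def mttf(j_rel: float, temp_c: float) -> float:
        t_k = temp_c + 273.15
        return A * j_rel ** -N * math.exp(EA / (K_EV * t_k))

    ratio = mttf(1.0, 100.0) / mttf(1.0, 60.0)
    print(f"100 C vs 60 C at the same current density: "
          f"~{ratio:.0%} of the EM lifetime")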


Just look at the warranties: you have to go to the Quadro series to get industrial warranty lengths.




