I am currently running a new EVGA GTX 470 and an older 8800 GTX in my Windows 7 system. I use the cards for BOINC and other CUDA-related tasks. The system is well ventilated, and using the EVGA tool I keep a constant eye on their temperatures.

I usually run both cards at or near 100% for 12-14 hours at a time, 5-7 days a week. Both cards settle at about 80C (never seen them run hotter). I leave the fan control set to auto, and based on what the graphs show the fans are running at about 40% at those temperatures. It looks like the high end for these cards is around 100C. Will there be any problems running these cards for this long at those temperatures? When I pump up the fan speed I can get their temps down to 50-60C even under full work. (They seem to idle at about 40-50C with no work.) I'm guessing that running these cards at 80C for 12 hours a day, 7 days a week should not cause any issues, but I wanted to check with the experts here first. If the temps are too high for that period of time I can always up the fan speed. (I would rather not lower the time I process on the cards.)
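As an aside (not from the original thread): if you want to log the same temperature and load numbers outside a vendor GUI like the EVGA tool, here is a minimal sketch that polls nvidia-smi from Python. It assumes your driver's nvidia-smi supports the `--query-gpu` flags (older consumer drivers may not), and the 30-second interval and CSV filename are arbitrary choices.

```python
import csv
import subprocess
import time

# Fields to request from nvidia-smi; supported on reasonably recent drivers.
FIELDS = "index,name,temperature.gpu,utilization.gpu,fan.speed"

def read_gpus():
    """Return one list of values per GPU, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True)
    return [line.split(", ") for line in out.strip().splitlines()]

if __name__ == "__main__":
    with open("gpu_temps.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS.split(","))
        while True:                      # Ctrl+C to stop
            for row in read_gpus():
                writer.writerow(row)
            f.flush()
            time.sleep(30)               # poll every 30 s (arbitrary)
```

Left running alongside the workload, this gives a plain CSV you can graph later instead of watching the monitoring tool live.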
I did talk to a system builder who was incredulous that you'd ever run consumer cards 24/7. His point was that failures were rare, but clearly far more common when cards are stressed with 24/7 use. The professional cards (Tesla) run cooler (they have lower clocks) and have been pre-tested more thoroughly, so you'll have many, many fewer failures than with consumer cards stressed like that. The interesting point was the economics of board failure. One dead board will crash a production machine, perhaps one machine in a big server farm. But now you've interfered with whatever compute you're doing, and many throughput computations aren't robust to failure: they often need manual intervention and/or can leave the whole server farm waiting, so a single board failure can cause big issues for your entire farm. And finally, if you do have a failure, you now need the skill and experience to find and replace the bad board (which seems trivial, but it's a problem as machine counts grow!), and that replacement costs money and time. And that's with a catastrophic failure… a transient failure is even worse. So his final point was that it was well worth paying the significant extra cost for a Tesla, for the simple reason that the higher MTBF reduced these stresses and costs. This isn't exactly what the original poster was asking, but I mention it here because it's interesting.

Frankly, I haven't had any problem with using consumer cards 24/7. Working for a company that requires apps to run for months (years) of 24/7 high-load computation… my only hardware failures were a bad UPS and a bad PSU.

The only advice I can give you, if you HAVE to use a consumer card, is to underclock… Heat is your biggest issue for the core chip: even if the chip is rated to run at a core temp of 140C+, you won't want it running that hot… Electromigration aside, the RAM, the PCB, the heatsink/thermal padding/paste, etc. are all going to have other issues running at such high temperatures for extended periods, with the added issue of vibration that air cooling and a typical consumer card and case setup bring, especially with low-quality consumer parts… In the end, though, you'll probably find your disk drives and PSU fail before your GPU, and you tend to replace consumer cards every year or two anyway, so I'd be more worried about the rest of your system having issues before your GPU does… Just my 2 cents (I'm sure there's even more to consider).

The cards stabilize at about 65C with the fans at 70%. I'm running the cards now at 99% utilization for about 18 hours, then giving them a 6-hour break.
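Again as an editorial aside: if you go the "18 hours on, 6 hours off" route, a small watchdog makes the duty cycle and a temperature ceiling explicit instead of relying on memory. The sketch below is only a sketch: `start_work()`/`stop_work()` are placeholder hooks you would wire to your own workload, the 85C limit is an arbitrary example rather than a recommendation, and it reuses the same nvidia-smi query as above.

```python
import subprocess
import time

TEMP_LIMIT_C = 85        # arbitrary ceiling; pick your own comfort level
WORK_HOURS = 18          # the "18 hours on" part of the duty cycle
REST_HOURS = 6           # the "6 hours off" part

def max_gpu_temp():
    """Highest reported core temperature across all GPUs, in Celsius."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return max(int(t) for t in out.split())

def start_work():
    # Placeholder: start your workload here, e.g. resume the BOINC client.
    print("work started")

def stop_work():
    # Placeholder: stop or suspend the workload here.
    print("work stopped")

while True:
    start_work()
    deadline = time.time() + WORK_HOURS * 3600
    while time.time() < deadline:
        if max_gpu_temp() >= TEMP_LIMIT_C:
            print("temperature limit hit, stopping early")
            break
        time.sleep(60)                   # check once a minute
    stop_work()
    time.sleep(REST_HOURS * 3600)        # the cool-down break
```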