AMD EPYC freezes after 1044 days of continuous use and no one will fix it


Yesterday, June 2, Tom’s Hardware reported, citing its sources, that AMD had discovered a serious problem with its EPYC 7002 Rome server processors. If one of these processors runs without interruption for 1044 days (almost three years), which is quite normal for servers, its cores will simply freeze. Owners of EPYC 7002 Rome systems will then have to restart the machine completely, because there is no other way to recover. And most importantly, AMD is not going to fix it.


A brief notice from AMD itself says that the second-generation EPYC server processors have an issue that prevents cores from exiting the Core C6 (CC6) power-saving state after a long period of uninterrupted operation. The manufacturer also noted that 1044 days is not an exact figure: the failure may occur earlier or later, depending on the frequency of REFCLK, the reference clock the processors use to keep time, and on some other factors. AMD has not explained why the failure occurs, so the root cause of the hang is still not publicly known.

According to the company, there are currently two workarounds: server owners can either reboot the system, which resets the roughly 1044-day timer, or disable the Core C6 power-saving state entirely. Neither option is likely to suit owners of these servers. The power-saving state saves a lot of money on energy, so few will want to turn it off, and waiting for a core to freeze before resetting the system is not a convenient solution either, especially for critical parts of the infrastructure.
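The reboot workaround can be planned ahead rather than left until a core hangs. Below is a minimal Python sketch of an uptime check, not anything AMD provides. The 1044-day figure is the approximate value from the notice, while the 90-day safety margin, the function names, and reading uptime from `/proc/uptime` (Linux only) are assumptions made for illustration.

```python
# Approximate figure from AMD's notice; the real value varies with REFCLK.
HANG_THRESHOLD_DAYS = 1044
SECONDS_PER_DAY = 86400.0


def days_remaining(uptime_seconds: float,
                   threshold_days: float = HANG_THRESHOLD_DAYS) -> float:
    """Days left before uptime reaches the threshold (negative if past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot_soon(uptime_seconds: float, margin_days: float = 90.0) -> bool:
    """True when uptime is within margin_days of the threshold.

    The 90-day default is an arbitrary illustrative margin, chosen because
    the failure may occur somewhat earlier than 1044 days.
    """
    return days_remaining(uptime_seconds) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


def report(uptime_seconds: float) -> str:
    """Build a one-line status message for logs or monitoring."""
    left = days_remaining(uptime_seconds)
    status = "schedule a reboot" if needs_reboot_soon(uptime_seconds) else "ok"
    return f"{left:.1f} days before the ~{HANG_THRESHOLD_DAYS}-day mark: {status}"
```

On a Linux host, `print(report(read_uptime_seconds()))` could be run from a scheduled job, so that the reboot lands in a normal maintenance window.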

AMD sees it differently and does not consider the bug a serious problem. The company stated that server maintenance and security updates clearly happen more often than once every three years (more precisely, once every 2.86 years, which is 1044 days), so in practice it is hard to run into such a freeze. Moreover, more than one full 1044-day period has already passed since the processors launched in 2019, yet the first reports of the problem have appeared only now, and in very small numbers.

All TechWeek writers are independent and come from many different countries. Some English misspellings and grammar mistakes may occur.