Meta reveals how it has been keeping silent errors out of its systems

As technology advances, its flaws are gradually worked out. But nothing is perfect, and some of those flaws remain in the form of system errors that can show up on any platform on the internet.

Meta runs a huge platform, and errors occasionally occur on its servers. These errors fall into two types: 'silent errors' and 'normal errors'. A silent error is one that leaves no trace in the system log when it happens, hence the name. Such errors can appear when a device runs at too high a temperature for too long, or simply as the device ages. The flaw can cause internal hardware problems: circuits that behave incorrectly, leading to data loss and to the hardware carrying out the wrong instructions or commands. Almost every platform on the internet faces this problem, and it is a dangerous one because it can disrupt a system while going undetected.
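To illustrate why a silent error is hard to notice, here is a minimal Python sketch. It is not Meta's method; the helper names `flip_bit` and `is_intact` and the sample data are assumptions made for this example. It simulates a single flipped bit, which raises no exception and writes nothing to a log, and shows that the damage only becomes visible when the data is compared against a checksum recorded earlier.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (a simulated silent corruption)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Compare the data's CRC-32 with the checksum recorded while it was known good."""
    return zlib.crc32(data) == expected_crc


# Hypothetical sample record; the checksum is taken while the data is still correct.
original = b"account balance: 1000"
expected_crc = zlib.crc32(original)

# The corruption itself produces no error or log entry.
corrupted = flip_bit(original, 3)

print(is_intact(original, expected_crc))
print(is_intact(corrupted, expected_crc))
```

The corrupted copy has the same length and type as the original, so nothing downstream would complain unless a check like this is run deliberately.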

In a recently published paper, Meta revealed how it detects and removes these errors when they appear in its systems. The company uses a mixture of two types of tests. The first is run when machines are offline for maintenance checks and any needed repairs; the second is a set of smaller tests run throughout production. According to Meta, the latter achieves coverage in a shorter time, while the former provides more coverage but takes longer.

For a company of Meta's size, it is essential that its data stays intact and its systems work efficiently. To keep an eye out for silent errors, also known as silent data corruptions (SDCs), Meta runs regular tests to identify and remove them. The strategies the tech giant uses include silicon testing and infrastructure testing.
Silicon testing checks the silicon chips that go into Meta's machines for SDCs. This process takes several months, and a device that is poorly built or has a flaw will produce poor test results. Infrastructure testing is itself split into two types: out-of-production and in-production testing. Out-of-production tests run while a machine is offline. Machines are not taken offline specifically for testing; instead, the tests run whenever a machine is already out of service for reasons such as maintenance or repairs. To detect silent errors, the company uses a tool called Fleetscanner, which scans the fleet's servers for them. Once an error is detected, it is reported and removed.
The main drawback of out-of-production testing is that it is slow, so in terms of time, in-production testing is more beneficial. In-production testing relies on a tool named ripple, which can execute commands with only milliseconds between them.
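The difference between the two infrastructure testing modes can be sketched in a few lines of Python. This is an illustrative assumption, not Meta's tooling: the `TESTS` list, the `run_tests` function, and the `budget_seconds` parameter are hypothetical names. Each test pairs a deterministic computation with its known correct answer. With no time budget, every test runs, as on an offline machine. With a small budget, only the tests that fit into the time slice run, as they would alongside production work.

```python
import time
from typing import Callable, List, Optional, Tuple

# (name, computation, expected result). The expected values are fixed, known answers.
KnownAnswerTest = Tuple[str, Callable[[], int], int]

TESTS: List[KnownAnswerTest] = [
    ("sum_of_squares", lambda: sum(i * i for i in range(1000)), 332833500),
    ("power", lambda: 3 ** 20, 3486784401),
]


def run_tests(
    tests: List[KnownAnswerTest],
    budget_seconds: Optional[float] = None,
) -> List[str]:
    """Run known-answer tests and return the names of those with a wrong result.

    budget_seconds=None models out-of-production testing: run everything.
    A small budget models in-production testing: stop once the slice is used up.
    """
    failures: List[str] = []
    start = time.monotonic()
    for name, compute, expected in tests:
        if budget_seconds is not None and time.monotonic() - start >= budget_seconds:
            break  # out of time; the remaining tests wait for a later slice
        if compute() != expected:
            failures.append(name)  # a mismatch hints at a possible silent error
    return failures


# Offline machine: run the full suite.
print(run_tests(TESTS))

# Machine in production: run only what fits in a 5 ms slice.
print(run_tests(TESTS, budget_seconds=0.005))
```

On healthy hardware both calls report no failures. The trade-off shows up in the second call, which may skip tests it has no time for and so relies on many short runs to build up coverage.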

