Facebook Architects Around Silent Data Corruption


Silent but deadly: few things are more destructive than data corruptions that slip past the various error capture tools in hardware and that, even in software, can be hard to spot before they have infected an entire application. This is especially devastating at Facebook scale, but engineering teams at the social giant have discovered strategies to keep a local problem from going global. A single hardware-rooted error can cascade into a massive problem when multiplied at hyperscale, and for Facebook, keeping this at bay takes a combination of hardware resiliency, production detection mechanisms, and a broader fault-tolerant software architecture. In 2018, Facebook’s infrastructure team started an effort to understand the roots of silent data corruption, how fleet-wide fixes might look, and what the associated detection strategies could cost in terms of overhead. Engineers found that many of the cascading errors originate in production CPUs, but not always from the “soft errors” of radiation or synthetic fault injection. Rather, they found that these errors can strike CPUs seemingly at random yet in repeatable ways. ECC is useful, but it is focused on problems in SRAM, and other elements are susceptible as well. The Facebook engineering team that reported on…
