hardware - How do Google, Facebook, etc. deal with memory corruption?
I am wondering how Google, Facebook, etc. deal with hardware errors: memory corruption, calculation errors in the CPU, and so on. Given the increasing density (and shrinking feature size) of circuits, it seems the frequency of hardware errors is going up, not down. Also, the big providers like Google and Facebook have so many machines that memory corruption must be an everyday occurrence, so I am wondering what kind of policy they have in this regard. After all, most algorithms assume that the underlying hardware is operating correctly and that data doesn't change in memory; if it does, all bets are off.

The corruption is not necessarily confined to the specific data affected by the error; it could conceivably spread to other computations. For instance, if an error affects a locking/synchronization protocol, it could cause data hazards between threads or nodes. If it corrupts a database, it could violate invariants that are assumed elsewhere, which could in turn cause other nodes that discover the corruption to fail. I have seen in practice how erroneous data in a database (an invalid timestamp in a configuration-related row) caused a whole system to fail, because the application validated the timestamp whenever it read the row!
Hopefully, most of the time the errors just result in a node crashing, perhaps before any data is committed (for instance, if an operating system structure gets corrupted). But since the errors occur at random, they can occur anywhere, and an error could live on without ever being noticed.
It must be quite challenging. I am also thinking that the big providers must occasionally see errors or stack traces in their logs that cannot be explained by code inspection/analysis, because the situation simply could not happen if the code had executed "as written". That can be quite hard to conclude, and a lot of investigation may be spent on an error before deciding that it must have been a hardware fault.
Of course this is not limited to the big service providers, since these errors can occur anywhere. But the big service providers are far more exposed to them, and it would make sense for them to have a policy in this area.
I can see different ways in which this could be addressed:
1) Be pragmatic and repair errors as you go along. Often the repair is simply to reboot the machine. In cases where customer data has been corrupted and the customer complains, fix that too.
2) Hardening of the code running on the individual nodes. I don't know which techniques are used; one example would be calculating results twice and comparing them before committing (see the first sketch after this list). Of course this incurs overhead, and the comparison logic itself can be subject to corruption, but maybe that is a fairly low risk since it requires an error in that specific area. Also, the logic has to be duplicated.
3) Different nodes running in lock-step, with comparisons being done between the nodes before results are allowed to be committed (see the second sketch after this list).
4) Large-scale architectural initiatives to reduce the damage from a localized error: making sure to compare DB content against previous backup(s) to detect bit rot (before blindly making a backup of the current data), having various integrity checks in place, and building resiliency into other nodes in case of corrupted data (not relying on invariants holding, etc.); in short, "being liberal in what you accept".
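
As a rough illustration of option 2, here is a minimal Python sketch that runs a computation twice and only commits when both runs agree. The `commit` function and the guarded computation are hypothetical placeholders for the example, not anything a specific provider is known to use:

```python
def commit(value):
    """Hypothetical placeholder for whatever 'committing' means in the application."""
    print(f"committed: {value!r}")

def compute_twice_and_commit(fn, *args):
    """Run the computation twice and commit only if both runs agree.

    A transient fault (e.g. a bit flip) that corrupts one of the runs is
    likely to make the two results differ, in which case we fail loudly
    instead of committing bad data.
    """
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError(
            f"result mismatch: {first!r} != {second!r} (possible hardware fault)"
        )
    commit(first)

# Usage: any deterministic, side-effect-free computation can be guarded this way.
compute_twice_and_commit(sum, [1, 2, 3, 4])
```

Note that this only helps for deterministic computations, and it roughly doubles the cost of each guarded step.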
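And here is a toy version of option 3, simulating lock-stepped "nodes" with worker processes and a majority vote. Real lock-step systems do this in hardware or across separate machines; this sketch is only meant to show the compare-before-commit idea:

```python
from multiprocessing import Pool
from collections import Counter

def work(x, y):
    """Stand-in for the real computation; must be deterministic and picklable."""
    return x * y + 1

def replicate_and_vote(fn, args, replicas=3):
    """Run the same computation in several worker processes and accept the
    result only if a strict majority of the replicas agree on it."""
    with Pool(processes=replicas) as pool:
        results = pool.starmap(fn, [args] * replicas)
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError(f"no majority among replica results: {results!r}")
    return winner

if __name__ == "__main__":
    print(replicate_and_vote(work, (6, 7)))  # prints 43 if all replicas agree
```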
There might be other things I haven't thought of, which is the reason I am asking this question :)
At least memory content can be made reliable:
https://en.wikipedia.org/wiki/ECC_memory
There are other error detection/correction codes used at various levels as well (checksums, hashes, etc.).
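
To make the checksum idea concrete, here is a small sketch of an application-level integrity check: a SHA-256 digest is stored alongside the data at write time and verified at read time, so silent corruption (bit rot) is detected instead of being propagated. The `store`/`load` functions are made up for the example:

```python
import hashlib
import json

def checksum(payload: bytes) -> str:
    """SHA-256 digest of the raw payload, stored next to the data itself."""
    return hashlib.sha256(payload).hexdigest()

def store(record: dict) -> dict:
    """Serialize a record and attach its checksum at write time."""
    payload = json.dumps(record, sort_keys=True).encode()
    return {"payload": payload, "sha256": checksum(payload)}

def load(stored: dict) -> dict:
    """Recompute and verify the checksum before trusting the data."""
    if checksum(stored["payload"]) != stored["sha256"]:
        raise ValueError("checksum mismatch: stored data is corrupt")
    return json.loads(stored["payload"])

# Usage: corruption of either the payload or the stored digest is caught on load.
blob = store({"timestamp": "2016-01-01T00:00:00Z", "enabled": True})
print(load(blob))
```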