Abel Avram has posted an interesting analysis of the causes and solutions of the December 22nd Skype outage that affected millions of users.
In short, the outage was caused by a bug in the undelivered message code. This bug had been fixed in a subsequent version, but 50% of Skype users were still using the buggy version. Because Skype is a peer-to-peer application, when 40% of Skype clients crashed as the undelivered messages attempted delivery, it put undue strain on the remaining Skype users' machines, thus causing a cascading network failure.
Most interesting are the lessons, which in retrospect seem a little obvious:
One important lesson to be learned is this: many users do not update their software if they don't have to.... Apparently Skype is considering a Google Chrome-style invisible update.
Skype has decided to review their testing processes to determine better ways of detecting and avoiding bugs which could affect the system.
Skype "will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems."
Updating, testing, and adequate capacity. Aren't these pretty much the cornerstones of effective IT?
-- Rick Wanner - rwanner at isc dot sans dot org - http://namedeplume.blogspot.com/ - Twitter:namedeplume (Protected)
(c) SANS Internet Storm Center. http://isc.sans.org Creative Commons Attribution-Noncommercial 3.0 United States License.