Wednesday, 01 July 2009 10:03
Incident Report
Summary
Today (6/30/09) at approximately 12:12pm we experienced an issue with one of our virtual machine (VM) servers. The server (hollow.engr.ucsb.edu) has a failing disk and was undergoing maintenance, part of which involved moving data to a second disk in the system. During this work hollow began to freeze up and the VMs running on it became unresponsive. These included nis0, spamwatch, syslog, and mx1. Some of these VMs provide authentication and other important services, so the service interruption was widespread. It lasted about an hour and 15 minutes, until the system was rebooted. The duration was partly due to the incident happening during the lunch hour, when some staff were off campus.
Details of the Incident
hollow.engr uses LVM, and a pvmove operation was in progress when the server froze. Data was being migrated off the failing disk, and the system apparently seized up when the reads reached a corrupt region of that disk.
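For illustration only, the sketch below shows one way a read-only probe could look for unreadable regions on a source disk before a migration is started. It is not what was run on hollow. The names `find_unreadable_ranges` and `scan_device` and the 1 MiB chunk size are assumptions made for this example. A scan like this reads through the same kernel I/O path as pvmove, so on a badly failing disk it could stall in the same way, and cached data may mask errors. Treat it as an early warning, not a guarantee.

```python
import errno
import os
from typing import Callable, List, Tuple

CHUNK = 1024 * 1024  # bytes read per probe; an arbitrary choice for this sketch


def find_unreadable_ranges(
    read_at: Callable[[int, int], bytes],
    total_size: int,
    chunk: int = CHUNK,
) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges whose reads failed with EIO."""
    bad: List[Tuple[int, int]] = []
    offset = 0
    while offset < total_size:
        length = min(chunk, total_size - offset)
        try:
            read_at(offset, length)
        except OSError as exc:
            # EIO is the usual errno for a failed media read. Anything else
            # (permissions, bad descriptor) is re-raised, not hidden.
            if exc.errno != errno.EIO:
                raise
            bad.append((offset, length))
        offset += length
    return bad


def scan_device(path: str) -> List[Tuple[int, int]]:
    """Probe a block device or regular file read-only (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # device or file size in bytes
        return find_unreadable_ranges(lambda off, n: os.pread(fd, n, off), size)
    finally:
        os.close(fd)
```

If `scan_device` returned a non-empty list for the source disk, that would be a reason to reconsider an in-place pvmove and fall back to restoring from the nightly copies instead.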
The Notification Process
We were notified of the outage by the campus monitoring service and by user reports. Our own monitoring server (syslog.engr) was one of the impacted VMs.
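The fact that the monitoring VM went down together with the services it watches can be checked for mechanically. The following is a minimal sketch, assuming a simple mapping of VM name to physical host. The `PLACEMENT` table reflects the four VMs named in this report; the function name `colocated_with_monitor` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# VM -> physical host, as described in this report (all four ran on hollow).
PLACEMENT: Dict[str, str] = {
    "nis0": "hollow",
    "spamwatch": "hollow",
    "syslog": "hollow",
    "mx1": "hollow",
}


def colocated_with_monitor(
    placement: Dict[str, str], monitors: Set[str]
) -> Dict[str, List[str]]:
    """Map each host that runs a monitor VM to the other VMs sharing that host.

    If such a host fails, those VMs go down at the same time as the monitor,
    so local monitoring cannot report the outage.
    """
    by_host: Dict[str, List[str]] = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)

    result: Dict[str, List[str]] = {}
    for host, vms in by_host.items():
        if any(vm in monitors for vm in vms):
            others = sorted(vm for vm in vms if vm not in monitors)
            if others:
                result[host] = others
    return result


if __name__ == "__main__":
    print(colocated_with_monitor(PLACEMENT, {"syslog"}))
```

An empty result would mean that no monitored VM shares a physical host with the monitoring VM.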
Technical Details / Fix Actions
The VMs residing on hollow are copied to backup servers nightly. Tomorrow morning the VM copies will be brought up in an orderly fashion and the failing disk on hollow will be replaced.
Conclusion
Some of the LVM utilities do not handle physical corruption of the underlying media. In the future we will fully migrate VMs off hardware that may be failing, which should help isolate the impact of such failures.