Service Outage (6/30/9)
Wednesday, 01 July 2009 10:03
Incident Report

Summary
Today (6/30/9) at approximately 12:12 pm we experienced an
issue with one of our VM (Virtual Machine) servers.  The
server (hollow.engr.ucsb.edu) has a failing disk and was
undergoing maintenance, part of which involved moving data
to a second disk in the system.  During this move hollow
began to freeze up and the VMs running on it became
unresponsive.  These included nis0, spamwatch, syslog, and
mx1.  Some of these provide authentication and other
important services, so the service interruption was
widespread.  It lasted about an hour and 15 minutes, until
the system was rebooted.  The duration was also partly due
to the incident happening during the lunch hour, when some
staff were off campus.

Details of the Incident
hollow.engr uses LVM, and a pvmove operation was in
progress when it froze.  Data was being migrated off the
failing disk, and apparently the system seized up when the
reads reached a corrupt point on the disk.
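
One precaution suggested by this failure is to look for unreadable
regions on the source disk before starting a migration.  The following
is a minimal Python sketch of that idea, not the procedure used on
hollow.  The function name and chunk size are assumptions made for
illustration, and a read-only scan of a badly failing disk can itself
stall, so it is best run where a hang is tolerable.

```python
import os


def find_unreadable_regions(path, chunk_size=1024 * 1024):
    """Read `path` sequentially; return (offset, length) pairs that raised I/O errors.

    `path` may be a regular file or a block device (hypothetical usage;
    reading a device normally requires root).
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of a file or (on Linux) a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # The kernel reported a read error for this chunk; record it and move on.
                bad.append((offset, length))
                offset += length
                continue
            if not data:
                break  # unexpected end of data
            offset += len(data)
    finally:
        os.close(fd)
    return bad
```

An empty result only means every chunk could be read at the time of the
scan; it does not guarantee that a later pvmove will succeed.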

The Notification Process
We were notified of this via the campus monitoring service
and user reports, because our own monitoring server
(syslog.engr) was one of the impacted VMs.
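
Because the monitoring server went down together with the services it
watches, a check run from a machine outside the VM server would still
have been able to raise the alarm.  The sketch below shows one simple
form of such a check in Python.  The function names are hypothetical,
and the hosts and ports passed in would have to be chosen by the
operator; none of them are taken from our actual configuration.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_services(targets, timeout=3.0):
    """Given {name: (host, port)}, return the names that did not accept a connection."""
    return [
        name
        for name, (host, port) in targets.items()
        if not is_reachable(host, port, timeout)
    ]
```

A TCP connect only shows that something is listening on the port.  It
does not prove the service behind it is working correctly.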

Technical Details / Fix Actions
The VMs residing on hollow are copied over to backup
servers nightly.  Tomorrow morning the VM copies will be
brought up in an orderly fashion and the disk replaced on
hollow.
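
One way to read "orderly" is dependency order: a VM is started only
after the VMs it relies on are up.  This report does not record the
actual order or dependencies, so the dependency map in the Python sketch
below is purely hypothetical and only illustrates the technique
(a topological sort using the standard library's graphlib module,
available in Python 3.9 and later).

```python
from graphlib import TopologicalSorter


def startup_order(depends_on):
    """Return VM names ordered so that each VM comes after everything it depends on.

    `depends_on` maps a VM name to the set of VMs that must be up first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# HYPOTHETICAL dependencies, for illustration only; the real ones are not
# stated in this report.
deps = {
    "nis0": set(),
    "syslog": set(),
    "mx1": {"nis0"},
    "spamwatch": {"nis0", "mx1"},
}

order = startup_order(deps)
print(order)
```

If the map contains a cycle, TopologicalSorter raises graphlib.CycleError
instead of returning an order, which is a useful early warning that the
dependencies were written down incorrectly.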

Conclusion
Some of the LVM utilities do not handle physical corruption
of media.  In the future we will fully migrate VMs off any
hardware that may be failing, to help isolate the impact of
such failures.
 