Incident Reports

We post information about service maintenance and disruptions here, as well as major system upgrades and other changes. To be kept informed of service disruptions by email, please sign up for notices at the address below. (Note: the site uses SSL with a self-signed certificate; please accept the certificate on the initial connection.)

 https://listserv.engr.ucsb.edu/mailman/listinfo/incident-reports

Service Outage (11/23/09)
Monday, 23 November 2009 17:29
Incident Report

Summary
At approximately 11:05am on 11/23/09, the LDAP authentication
server known as ldap1 (AKA accounts.engr.ucsb.edu) failed: it
stopped serving LDAP requests and became unreachable over the
network.  The virtual machine (VM) was rebooted but did not come
back up.  In its stead, an older copy of the VM was brought up,
and services were restored at approximately 11:48am.  The incident
had a wide impact: file, mail, and web services were all disrupted
because of the numerous dependencies among the services involved.

Details of the Incident
ldap1.engr.ucsb.edu stopped providing information services,
and a number of services on other servers that rely on it stopped
working.  File serving from hal1.engr.ucsb.edu stopped, the COE web
server no longer served web files, and COE mail was unavailable
because information lookups were failing.  The outage was prolonged
because the failover mechanism of the LDAP service did not work as
expected on various LDAP clients.

The Notification Process
The problem was first noticed by an ECI staff member, and shortly
afterwards the automatic alerts confirmed the service outage.  Once
the problem was confirmed, MSOs and IT staff of the major COE
departments were notified by phone.

Conclusion
The root cause of this incident was the instability of the VM known
as ldap1.engr.ucsb.edu, but the failure of some LDAP clients to fail
over gracefully exacerbated the problem.  To help prevent this type of
incident in the future, there are plans to put LDAP slave servers on
the major service servers that rely on LDAP information.  This should
help isolate failures of this kind so that they do not cascade to
other services.
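
The client-side failover behaviour discussed in this report can be illustrated with a minimal Python sketch. It is not the configuration used on the COE systems: the function name first_reachable, the replica hostname in the usage comment, and the 2-second timeout are assumptions made for illustration. The idea is that a client tries each LDAP server in order with a short connection timeout, so an unreachable primary costs a few seconds rather than stalling the services that depend on it.

```python
import socket


def first_reachable(servers, timeout=2.0):
    """Return the first (host, port) that accepts a TCP connection, or None.

    `servers` is an ordered list of (host, port) pairs, primary first.
    A short timeout keeps a dead primary from blocking the caller for long.
    """
    for host, port in servers:
        try:
            # Only tests that the port accepts connections; it does not
            # verify that the LDAP service behind it answers queries.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Refused, timed out, or unresolvable: try the next server.
            continue
    return None


# Hypothetical usage (389 is the standard LDAP port; the replica
# hostname is a placeholder, not a real server):
#   first_reachable([("ldap1.engr.ucsb.edu", 389),
#                    ("ldap-replica.example.edu", 389)])
```

Placing a replica on each major service host, as planned above, would correspond to putting a local address near the front of such a list.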
 
IMAP service outage Friday, 23 October
Wednesday, 28 October 2009 09:04

Last Friday morning, around 5:30am, the imaps service (IMAP over SSL) became unresponsive, preventing some users from accessing their email. This affected only people using imaps, not people using imap with TLS. The problem took some time to diagnose, and service was fully restored around 10:30am.

The cause was traced to the xinetd program (which listens on various ports and starts services) silently refusing to start the imaps process when connections arrived. Restarting xinetd made the previously unseen problem go away. The same failure occurred a second time, with the imap protocol, a couple of days later.

The onset of both failures coincided with two other logged anomalies: connections to the ldap0 authentication server being refused, and NFS traffic to the hal1 fileserver timing out (which itself appeared to be due to failing connections to ldap0).

When we examined the system yesterday morning at the time the event recurred, it became clear that ldap0 was under very high load, causing it to refuse to serve requests. The load was caused by the scheduled VM guest snapshot being performed on the system at that time. This brings to light a deficiency of the virtualization technology we currently use: disk I/O on the host operating system, and within the guest, can debilitate the performance of the guest.

We made three changes to alleviate this: we temporarily ceased the VM snapshots, modified the ldap client rollover configuration to provide more timely rollover to backup ldap servers, and configured the monit program on imap.engr.ucsb.edu to watch connections to services provided by xinetd and to restart xinetd if the connections fail.
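
The watchdog idea behind the monit change can be sketched in Python as follows. This is not the monit configuration that was deployed; the function names, the timeout, and the restart command in restart_xinetd are assumptions for illustration. The sketch probes each service port and calls a restart hook if any probe fails. Note that a plain TCP connect may not catch every failure of the kind described above: if xinetd still accepts the connection but never starts the service, a fuller check would also wait for the IMAP greeting.

```python
import socket
import subprocess


def port_accepts(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_restart(host, ports, restart, timeout=3.0):
    """Probe each port; call restart() once if any probe fails.

    Returns the list of ports that failed, so the caller can log them.
    """
    failed = [p for p in ports if not port_accepts(host, p, timeout)]
    if failed:
        restart()
    return failed


def restart_xinetd():
    # Hypothetical restart command; the real command depends on the
    # host's init system and is not stated in the report.
    subprocess.run(["service", "xinetd", "restart"], check=False)


# Hypothetical usage, run periodically on the mail host
# (143 and 993 are the standard imap and imaps ports):
#   check_and_restart("localhost", [143, 993], restart_xinetd)
```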

Watching the system this morning revealed no further anomalous behavior. 

 
Planned mail and fileserver outage for Tuesday, 20 Oct. 6am-7am PDT
Thursday, 15 October 2009 13:31
An outage of email and fileserver services is planned for Tuesday, 20 October from 6am-7am PDT. During this period, hardware providing these services will be physically moved from its current location into a new rack.

While the move is in progress, email and files provided by the hal1.engr.ucsb.edu fileserver (primarily home directories for instructional labs) will be unavailable.
 
Engineering 2 network router failure on 13 October
Thursday, 15 October 2009 13:24

Shortly after 3pm PDT on 13 October 2009, the main building network router in Engineering 2 suffered a catastrophic failure of its supervisor card. The failure took the router and all network traffic to Engineering 2 offline, and also affected some instructional lab computers that use network services located in E2.

The campus OIT lent us a spare card, which we installed to bring the network back online. The card is functioning properly, and network services provided by the router were restored around 6:30pm PDT.

The failed card is under a service contract with Cisco, and a replacement card has been delivered. A brief downtime will be scheduled in the near future to install the replacement card.

 
Fileserver/Mail service outage 6 October 2009 11:30am PDT
Tuesday, 06 October 2009 12:01
This morning around 11:30am, power was interrupted to the circuit feeding a disk array on the hal1.engr.ucsb.edu fileserver. This made many home directories and mail files inaccessible.

Service was restored about 20 minutes later, after the system was rebooted. We are examining the circuit to see if it has a physical fault.
 
Copyright © 2009 The Regents of the University of California, All Rights Reserved.