Chief Operating Officer
Published on: 31-Aug-2017
As CIO of ProQuest, I’d like to apologize for the interruption in our services that began on July 17 at around 3:50 PM EDT. It’s our mission to provide better research, better learning, and better insights. Regrettably, a highly unusual incident with a key piece of hardware caused an outage of some ProQuest systems.
Our teams worked through the night to restore service, and all ProQuest services are now operating normally. We are working closely with our vendors to fully analyze the root cause of this incident and ensure that any issues that are identified are addressed quickly.
This outage is especially disappointing for me because every year we invest millions of dollars to build and deliver the best products possible, with substantial investments in robust platforms from leading hardware and cloud vendors. While we are proud of our long track record of availability of ProQuest Platforms, I assure you that we will do everything we can to learn from this event and continue to improve our products and services.
To minimize further customer impact, we have decided to delay the ProQuest maintenance window scheduled for July 29, 2017. We will provide an update shortly with the new timing for the window.
Below I’ve included a summary of what happened, additional technical details, and an FAQ that attempts to answer questions you may have.
If you have additional questions, please don’t hesitate to contact me at firstname.lastname@example.org.
Richard C. Belanger
An investigation by the ProQuest engineering team determined that the issue was caused by the failure of a core hardware component: a fault-tolerant storage platform. Although the platform has multiple layers of redundancy, we experienced a complete failure with no warning. The failure impacted not only the storage environment but also the management environment, which significantly delayed the restoration process. Over 1,200 virtual servers were impacted by this outage, and all of them required intervention to restart and bring back online. While this is a significant number of servers, it represents just 20% of our overall environment.
We are committed to preventing the recurrence of such an outage and are undertaking an architectural review to evaluate changes to our environment to improve the resiliency of our products.
- What happened?
  ProQuest experienced an extended outage of some platforms due to the hardware failure of a critical component. Once the component was repaired, ProQuest services came back online.
- Why did it take so long to restore service?
  The device that failed has sophisticated diagnostics and is directly connected to our vendor for real-time analysis, health checks, and troubleshooting. Unfortunately, the hardware diagnostics didn't generate normal error messages; the device just silently and immediately failed. This caused many services that monitor our environment to fail as well and delayed our identification of the problem by about an hour. According to our vendor, this type of failure should not be possible.
- ProQuest is a major online service; don’t you plan for this sort of thing? Why don’t you use fault-tolerant servers?
  We have extensive fault tolerance designed into our products. We have redundant servers, redundant storage, redundant power, redundant HVAC, redundant networking, etc. Unfortunately, the failure we experienced took down multiple components that were designed to provide fault tolerance. We are working with our hardware vendor to fully understand and correct the problem.
- I thought ProQuest uses the Amazon AWS public cloud? Was this an Amazon problem?
  While we use the public cloud extensively at ProQuest, some of our core services, including authentication, are currently hosted in our data center. All of our cloud systems were operating normally, but due to the issues with ProQuest services based in the data center, some of our platforms were inaccessible to customers. We are looking to migrate those services to AWS.
- What are you going to do to prevent this from happening again?
  We are taking a number of steps to prevent this type of failure going forward:
  - In cooperation with our vendor, we are conducting a complete audit of our storage environment to ensure that all of our storage platforms pass vendor health tests, have all of the necessary fault tolerance, and are configured correctly. Although this platform was installed and certified directly by the vendor, we want to be sure there aren’t any underlying issues.
  - We are reviewing our physical architecture to identify options to distribute our services more broadly across our hardware infrastructure and further expand the already high levels of redundancy in the system.
  - We are continuing with our plans to move more of our services environment into the Amazon AWS public cloud. While the public cloud is also not perfect, it provides many more options for fault tolerance.
- What are you doing to improve customer communications? It took too long for me to be informed of this issue.
  We work to be transparent with our customers in all communications. Given the specific issues with this outage, we were unable to post a downtime message on the affected platforms. We will continue to use multiple communication channels, including the ProQuest Support Center, the ProQuest Blog, Twitter, and Facebook, to share regular updates. We provided 29 updates via Twitter and 12 on Facebook during the interruption. We are reviewing our procedures and will continue our efforts to improve customer communications. We encourage all of our customers to follow us on Facebook or Twitter to receive real-time updates on ProQuest products and services.