July 6, 2023
We want to provide you with some additional information about the service disruption that occurred on July 6th, 2023.
Issue summary
To explain this event, we need to share a little about the internals of ScheduleInterpreter® storage. While the majority of ScheduleInterpreter® services and the platform run within the main ScheduleInterpreter® tier 1 (T1) storage, ScheduleInterpreter® makes use of multiple tier 2 (T2) storage units to host non-critical data, including monitoring and internal platform analytics. These T2 storage units provide cost-effective solutions for all subscribers and allow ScheduleInterpreter® to collect large volumes of information without affecting the price of the overall services. On July 4, 2023, the part of the database that hosts non-critical statistical data was moved to a T2 storage unit. During that move, an error was made that prevented the database from correctly allocating the necessary resources. At 3:16 PM CDT on July 6, 2023, an automated process restarted the core database service in an attempt to flush data that had accumulated in the transaction log. After the automated restart, the database server was unable to identify the location of the T2 storage hosting the statistical data, which led to service interruptions.
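As an illustration of the failure mode described above, the following is a minimal sketch of a pre-restart check that refuses to restart a database service when any configured storage location cannot be found. It is not ScheduleInterpreter® tooling: the function names, the location names (t1_core, t2_statistics), and the idea of representing storage units as directories are all assumptions made for the example.

```python
import os
import tempfile


def missing_storage_locations(locations):
    """Return the names of locations whose path is not an existing directory.

    `locations` maps a hypothetical storage name to a filesystem path.
    """
    return [name for name, path in locations.items() if not os.path.isdir(path)]


def safe_to_restart(locations):
    """Return (ok, missing): ok is True only if every location resolves."""
    missing = missing_storage_locations(locations)
    return (len(missing) == 0, missing)


if __name__ == "__main__":
    # Simulate one reachable storage unit and one that was moved without
    # its new location being registered correctly (hypothetical layout).
    with tempfile.TemporaryDirectory() as t1_dir:
        locations = {
            "t1_core": t1_dir,
            "t2_statistics": os.path.join(t1_dir, "not-mounted"),
        }
        ok, missing = safe_to_restart(locations)
        if not ok:
            print("Restart blocked; unresolved storage:", ", ".join(missing))
```

Under these assumptions, an automated restart would only proceed when safe_to_restart reports no missing locations, so a misplaced non-critical storage unit would surface as a blocked restart rather than as a failed one.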
ScheduleInterpreter® service impact
ScheduleInterpreter® subscribers' workloads were directly impacted. At 4:32 PM CDT, the error was identified and the process to restore the services started. At 4:54 PM CDT, the error was corrected and the platform was reactivated. The reactivation created a large spike in resource usage, resulting in slower-than-normal performance. At 5:22 PM CDT, the platform was restored to its normal working conditions.
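The timestamps in this report can be turned into phase durations with a short calculation. The sketch below uses only the four times stated in this report (all CDT, all on July 6, 2023); the phase labels and function names are introduced here for illustration.

```python
from datetime import datetime

# Times as stated in this report (CDT, July 6, 2023); labels are illustrative.
TIMELINE = [
    ("automated restart", "3:16 PM"),
    ("error identified", "4:32 PM"),
    ("error corrected and platform reactivated", "4:54 PM"),
    ("normal performance restored", "5:22 PM"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day clock times such as '3:16 PM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Return (label, minutes) for each pair of consecutive timeline entries."""
    return [
        (f"{a[0]} to {b[0]}", minutes_between(a[1], b[1]))
        for a, b in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, minutes in phase_durations(TIMELINE):
        print(f"{label}: {minutes} min")
    total = minutes_between(TIMELINE[0][1], TIMELINE[-1][1])
    print(f"total: {total} min")
```

By this calculation, 76 minutes passed between the restart and identification of the error, 22 minutes between identification and correction, and 28 minutes of degraded performance followed, for 126 minutes in total.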
Event communication
We understand that events like this are more impactful and frustrating when information about what’s happening isn’t readily available. The non-critical status of the data moved to a T2 storage unit delayed our resolution of this event. Our Support Desk Contact also relies on the ScheduleInterpreter® database, so our ability to communicate was impacted as well. We have been working on several enhancements to our Support Services to ensure we can communicate with customers more reliably and quickly during operational issues. We expect to release a Service Health Dashboard early next year that will make it easier to understand service impact, along with a new support system architecture to ensure we do not have delays in communicating with subscribers.
In closing
Finally, we want to apologize for the impact this event caused for our subscribers. While we are proud of our track record of availability, we know how critical our services are to our subscribers, their applications and end users, and their businesses. We know this event impacted many subscribers in significant ways. We will do everything we can to learn from this event and use it to improve availability of our platform even further.