Dealing with waves of write IOPS inside Sinefa's data warehouse
At 2:30AM AEST on 3 August, communication between the Sinefa portal (the system that ingests metadata from probes) and the database became degraded. The database's flow control mechanism began rejecting all new connections, so customers could not see Probes or their associated data. The flow control mechanism of the Sinefa DB was reset at 8:50AM AEST, restoring service to normal. Probes can buffer unsent metadata, so most of the metadata collected between 2:30AM and 8:50AM was subsequently sent to the cloud.
The root cause of the issue was the rate and size of new direct API queries hitting the Sinefa portal. Our direct API volumes have grown significantly over the last 3 months. Both the number of API calls and the size of the queries produced far larger wait queues on the DB, and these wait queues are shared with users of the UI. An influx of API queries grew the wait queues to the point where large batches of queries were cancelled by the DB, resulting in errors. The high error rate triggered the DB's default security controls, which shut down communication with the offending host (the Sinefa portal).
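The blocking behaviour described above follows a common database pattern: consecutive errors from one host are counted, and once they cross a threshold, all new connections from that host are rejected until an operator resets the counter (in MySQL, for example, this is governed by max_connect_errors and cleared with FLUSH HOSTS). A minimal sketch of the pattern; the class name, threshold, and method names are illustrative, not our DB's actual implementation:

```python
class HostErrorBlocker:
    """Blocks a host after consecutive errors cross a threshold,
    mimicking the DB security control described above."""

    def __init__(self, threshold=100):
        self.threshold = threshold  # illustrative value, not the real setting
        self.errors = {}            # host -> consecutive error count
        self.blocked = set()

    def record_error(self, host):
        if host in self.blocked:
            return
        self.errors[host] = self.errors.get(host, 0) + 1
        if self.errors[host] >= self.threshold:
            # Past this point, all new connections from the host are rejected.
            self.blocked.add(host)

    def record_success(self, host):
        # A clean exchange resets the consecutive-error counter.
        self.errors[host] = 0

    def is_blocked(self, host):
        return host in self.blocked

    def reset(self, host):
        # Manual operator reset, analogous to the 8:50AM flow control reset.
        self.blocked.discard(host)
        self.errors[host] = 0
```

The key property is that mass query cancellations look exactly like a misbehaving client to this control, which is why a burst of cancelled API queries was enough to cut the portal off entirely.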
To resolve the issue and protect system health, we have increased timeouts and adjusted DB security controls to handle the extra calls. We are also in the process of applying intelligent filtering to screen out expensive and unnecessary direct API calls, which can produce long queue lengths. The longer-term plan is to move direct API calls to a separate service with its own processing queue, so they no longer compete with the UI.
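The filtering step above can be thought of as a gate in front of the shared DB wait queue: reject queries that are obviously expensive, and cap the per-client call rate. A minimal sketch under assumed names and limits (ApiGate, the one-week range cutoff, and the per-minute cap are all illustrative, not Sinefa's actual values):

```python
import time

# Illustrative limits, not actual production settings.
MAX_RANGE_SECONDS = 7 * 24 * 3600   # reject queries spanning more than a week
MAX_CALLS_PER_MINUTE = 60           # per-client rate cap


class ApiGate:
    """Screens direct API calls before they reach the shared DB wait queue."""

    def __init__(self, max_calls=MAX_CALLS_PER_MINUTE):
        self.max_calls = max_calls
        self.windows = {}  # client_id -> (window_start, call_count)

    def allow(self, client_id, range_seconds, now=None):
        now = time.time() if now is None else now
        # Drop obviously expensive queries outright.
        if range_seconds > MAX_RANGE_SECONDS:
            return False
        start, count = self.windows.get(client_id, (now, 0))
        if now - start >= 60:          # start a fresh one-minute window
            start, count = now, 0
        if count >= self.max_calls:    # rate cap exceeded for this window
            return False
        self.windows[client_id] = (start, count + 1)
        return True
```

Rejected calls fail fast at the gate instead of queuing behind UI traffic, which is the property that keeps error bursts from cascading into the DB's security controls.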