Investigating intermittent connection issues between nodes
Incident Report for Citus Data
Postmortem

On Friday, April 27 at 01:12 UTC we began observing connectivity issues from the Citus coordinator node to some data nodes. These issues non-deterministically affected a subset of formations that had undergone re-scaling.

The connectivity issues were the result of a configuration change we rolled out at that time. The change was the first step in resolving an underlying issue we've identified in pgbouncer. More on that later.

The configuration process writes out a file mapping Citus nodes to pgbouncer addresses. If this was the first time the host file was written, or the order of hosts was identical to the previous version of the file, there was no problem. Scale events, however, could change the order in which the list of hosts was generated, and the routines intended to stabilize that ordering were not functioning, resulting in mis-mapped hosts.
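To make the failure mode concrete, here is a minimal sketch in Python, with hypothetical node names, ports, and file format, of the kind of stabilization step that was not functioning: sorting the node list before writing the host file keeps each data node mapped to the same pgbouncer address even if a scale event changes the order in which nodes are returned.

    import sys

    BASE_PORT = 6432  # hypothetical first local port for outbound pgbouncer listeners

    def render_host_file(data_nodes):
        """Render a host file mapping Citus data nodes to local pgbouncer addresses.

        data_nodes: iterable of (hostname, port) tuples, possibly returned in a
        different order after a scale event.
        """
        lines = []
        # Sorting is the stabilization step: without it, a reordered node list
        # silently remaps hosts onto different pgbouncer ports.
        for offset, (host, port) in enumerate(sorted(data_nodes)):
            lines.append(f"{host}:{port} 127.0.0.1:{BASE_PORT + offset}")
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        nodes = [("data-node-2.internal", 5432), ("data-node-1.internal", 5432)]
        sys.stdout.write(render_host_file(nodes))

Without the sort, the same nodes supplied in a different order would land on different pgbouncer addresses, which is the kind of mis-mapping described above.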

In rolling this out we began with a number of internal servers and monitored those. As all appeared clear, we began gradually rolling the change out to the fleet of clusters. Early in the process we observed some connectivity issues on smaller clusters and paused to investigate. That investigation revealed the misconfiguration of the hosts file. We then split into two efforts: some of us manually patched the affected formations while other team members focused on identifying the broader cause of the issue.

These changes were the first step in an effort to resolve edge-case behavior in pgbouncer. We use pgbouncer heavily within the Citus Cloud architecture to provide lower latency and higher performance. It runs outbound on the coordinator node and inbound on all data nodes. When pgbouncer exhausts its connection limit and receives an error, it enters a state where it stops resolving DNS entries for an indeterminate amount of time.
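As a rough illustration of that topology (hypothetical hostnames, ports, and credentials; not our actual configuration), the coordinator reaches a data node through its local outbound pgbouncer listener rather than connecting directly, which is where the latency and performance benefit comes from:

    import psycopg2  # PostgreSQL driver, assumed available on the coordinator

    # Hypothetical addresses, for illustration only.
    DATA_NODE = {"host": "data-node-1.internal", "port": 5432}
    LOCAL_PGBOUNCER = {"host": "127.0.0.1", "port": 6432}  # outbound pool on the coordinator

    def connect_direct():
        # A direct connection pays the full connection-setup cost on the data node.
        return psycopg2.connect(dbname="citus", user="citus", **DATA_NODE)

    def connect_pooled():
        # Going through the coordinator's local outbound pgbouncer reuses an
        # already-open server connection to the data node.
        return psycopg2.connect(dbname="citus", user="citus", **LOCAL_PGBOUNCER)

    if __name__ == "__main__":
        with connect_pooled() as conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())

The host file described above provides this node-to-pgbouncer-address mapping, which is why a mis-mapped entry shows up as connectivity issues.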

Going forward we're implementing a few explicit changes to how rollouts like this are handled. First, we'll roll changes of this kind out to dev formations before working up to production clusters. Second, we're going to randomize lists that are assumed to have no explicit ordering, so that defects caused by implicit ordering assumptions surface more predictably, as sketched below.
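A rough sketch of that second change (Python, with a hypothetical stand-in for the real generator): a test can shuffle any list whose ordering is supposed to be irrelevant and assert that the generated output does not change, so an accidental dependence on input order fails immediately rather than only after a scale event.

    import random

    def render_host_file(data_nodes):
        # Stand-in for the real host-file generator, which should be order-insensitive.
        return "\n".join(f"{host} 127.0.0.1:{6432 + i}"
                         for i, host in enumerate(sorted(data_nodes)))

    def test_host_file_is_order_independent():
        nodes = ["data-node-1.internal", "data-node-2.internal", "data-node-3.internal"]
        expected = render_host_file(nodes)
        for _ in range(10):
            shuffled = nodes[:]
            random.shuffle(shuffled)  # simulate the reordering a scale event can cause
            assert render_host_file(shuffled) == expected, "output depends on input order"

    if __name__ == "__main__":
        test_host_file_is_order_independent()
        print("host file generation is independent of input order")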

Posted May 03, 2018 - 18:12 UTC

Resolved
This incident has been resolved.
Posted Apr 27, 2018 - 03:04 UTC
Monitoring
Connection availability has been restored to all data nodes. We're continuing to monitor the situation.
Posted Apr 27, 2018 - 01:38 UTC
Identified
We've identified the cause of the availability issues affecting smaller formations and connections to a particular data node on those formations. We're in the process of resolving connectivity to those clusters.
Posted Apr 27, 2018 - 01:21 UTC
Investigating
We're investigating intermittent connection issues between some data nodes and coordinators.
Posted Apr 27, 2018 - 01:12 UTC