Last night we had an incident caused by unforeseen mysql behaviour in one of our ha database clusters which almost made desaster recovery necessary.
For our databases, we form clusters of two nodes and control the services with the pacemaker cluster manager. The data resides on DRBD, so both nodes can access it if they are the designated primary node. We configure pacemaker to control a service ip address, the aforementioned DRBD device, the filesystem on this block device and a MySQL database. The cluster manager regularly queries the resources about their states. And if any reported state differs from the expected one, the cluster manager initiates actions to return to the desired state.
So for example, if the cluster manager knows it started the webserver resource on a particular node and that node reports the webserver to not be running, it considers that an unexpected state (ie an error) and will, depending on your configuration, restart the webserver on that node or start it on another cluster member that is eligible to run the service. Such a „recovery path“ usually involves stopping the resource, verifying that the resource stopped successfully and then starting it again (on whatever node is configured to be or considered to be the best node for the resource).
When the mysql resource agent starts the database, it actually invokes the mysqld_safe wrapper script which then starts the actual database process „mysqld“ as a child process.
Last night, one of our database clusters experienced one of those unexpected states. The currently active node, alice, apparently lost its primary network connection
Feb 9 22:14:02 alice attrd: : notice: attrd_perform_update: Sent update 965: ping=0
and therefore the cluster manager decided alice had to give up all resources it ran at that moment. So it started a recovery path that looked like this:
- stop the database on node alice
- unconfigure the database’s ip address on node alice
- unmount the filesystem on node alice
- demote the drbd device on node alice
- promote the drbd device on node bob
- mount the filesystem on node bob
- configure the database’s ip address on node bob
- start the database on node bob
That is the configured and sane thing to do.
Feb 9 22:14:02 alice crmd: : info: te_rsc_command: Initiating action 57: stop MysqlServer_stop_0 on alice (local) Feb 9 22:14:02 alice lrmd: : info: rsc:MysqlServer stop (pid 20114) Feb 9 22:14:02 alice mysqld: 160209 22:14:02 [Note] /usr/sbin/mysqld: Normal shutdown
Each operation run on a resource has its individual timeout. For a database with dozens of gigabytes of allocated memory and hundreds of queries executed each second, the timeout for the stop operation has to be rather long in order to allow for a clean shutdown of the database process. For this particular database, the stop timeout is three minutes long. Unfortunately, in this case, stopping the database seemed to take even longer than that. If the resource agent for the database encounters such a situation, it tries to prevent running into a fencing situation by forcefully terminating the database process and then reporting a successful stop.
Feb 9 22:16:58 alice mysql: INFO: MySQL failed to stop after 175s using SIGTERM. Trying SIGKILL... Feb 9 22:16:58 alice mysql: INFO: MySQL stopped Feb 9 22:16:58 alice lrmd: : info: operation stop on MysqlServer for client 6292: pid 20114 exited with return code 0
In the meantime however, the network connectivity recovered
Feb 9 22:16:42 alice attrd: : notice: attrd_perform_update: Sent update 967: ping=100
So the recovery path changed from „move everything over to bob“ to „restart the database on alice“ and since alice reported it successfully stopped the database, the cluster manager then initiated the start of the database:
Feb 9 22:16:58 alice crmd: : info: te_rsc_command: Initiating action 58: start MysqlServer_start_0 on alice (local) Feb 9 22:16:58 alice lrmd: : info: rsc:MysqlServer start (pid 21051) Feb 9 22:16:58 alice mysql: INFO: MySQL already running
The start operation on alice however, claimed mysql was already running. Which should be impossible since it was just terminated by sending a KILL signal. But actually, if the cluster tries to start a service and the agent reports that the service is already running, that is okay and does not require further actions since it matches the desired state.
mysql then logged a startup process and complained it was not shutdown correctly before
Feb 9 22:17:08 alice mysqld: 160209 22:17:08 InnoDB: Database was not shut down normally! Feb 9 22:17:08 alice mysqld: InnoDB: Starting crash recovery.
This leads me to assume the kill -KILL was actually successful. But since „start“ was successful just before, the cluster was fine with its state until, a few seconds later, while the recovery was still going on, the cluster checked the database resource’s state and found
Feb 9 22:17:51 alice mysql: ERROR: MySQL is not running Feb 9 22:17:51 alice crmd: : WARN: update_failcount: Updating failcount for MysqlServer on alice after failed monitor: rc=7 (update=value++, time=1455052671) Feb 9 22:17:51 alice attrd: : notice: attrd_perform_update: Sent update 970: fail-count-MysqlServer=1 Feb 9 22:17:51 alice pengine: : WARN: unpack_rsc_op: Processing failed op MysqlServer_last_failure_0 on alice: not running (7)
Since this does again not meet the desired state of the resource, it initiated recovery again. The recovery path consisted of stopping and starting the resource on alice.
Feb 9 22:17:51 alice pengine: : notice: LogActions: Recover MysqlServer (Started alice) Feb 9 22:17:51 alice crmd: : info: te_rsc_command: Initiating action 6: stop MysqlServer_stop_0 on alice (local) Feb 9 22:17:51 alice lrmd: : info: rsc:MysqlServer stop (pid 21259)
This stop operation reported that mysql was not running, which is what the cluster wanted at that moment
Feb 9 22:17:51 alice mysql: INFO: MySQL is not running Feb 9 22:17:51 alice lrmd: : info: operation stop on MysqlServer for client 6292: pid 21259 exited with return code 0
and then tried to start it again
Feb 9 22:17:51 alice pengine: : notice: LogActions: Start MysqlServer (alice) Feb 9 22:17:51 alice lrmd: : info: rsc:MysqlServer start (pid 21281)
This is the second start operation initiated by the cluster. This one however, does not report that mysqld is already running and therefore spawns a mysqld_safe process in the background. This likely writes a new pid to the pidfile and then loops trying to connect to the database.
The new process encountered it cannot access the database files
Feb 9 22:17:58 alice mysqld: InnoDB: Unable to lock ./ibdata1, error: 11 Feb 9 22:17:58 alice mysqld: InnoDB: Check that you do not already have another mysqld process
which is likely caused by the recovery still going on in the first mysql process. So at this time, we have two mysqld_safe processes that each spawned a mysqld process that both try to access the same database files. The data can only be accessed by the first mysqld process. The pidfile however, holds the pid of the second mysqld process.
There’s a long story beyond this point, which I might tell you some day and ultimately resolved in a usable database, but the current question is: How did this happen?
Well … apparently … if you send SIGKILL to mysqld, its controlling mysqld_safe restarts it.
root@alice:~# ps -ef|grep mysql root 21722 21702 0 12:25 pts/1 00:00:00 grep mysql root 22826 1 0 09:24 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe mysql 23084 22826 0 09:25 ? 00:00:02 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 root 23085 22826 0 09:25 ? 00:00:00 logger -t mysqld -p daemon.error root@alice:~# kill -9 23084 root@alice:~# ps -ef|grep mysql mysql 21756 22826 3 12:25 ? 00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 root 21757 22826 0 12:25 ? 00:00:00 logger -t mysqld -p daemon.error root 21786 21702 0 12:25 pts/1 00:00:00 grep mysql root 22826 1 0 09:24 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe
Which is a terrible thing to happen in a cluster manager controlled environment where the cluster manager has to be the only instance that starts and stops things.
There is a patch in progress that will hopefully fix this thing for everyone else using MySQL in a pacemaker cluster.