Testing Failures and Fault Recovery of PHP and Mysql

Often I’m tasked with having to architect highly available and fault tolerant systems of varying styles; architecting and designing is one thing, but when it’s built you then have to test, test, and test some more.  In the course of doing this over the years I have a number of horror-stories, and I’ve also picked up some cool tricks and I have a running list of often over looked areas. A fairly recent addition to that list of cool tricks had to deal with fault tolerant PHP applications.

The architecture was fairly simple: it was a couple of worker backends that would check a queue for tasks, processes them, and check the queue again or sleep. Everything worked except that after long periods of quiet time (like overnight) the mysql connection would be closed out due to inactivity.

The error message is pretty simple:

Error message: MySQL server has gone away

In troubleshooting this I found that there is a simple check to see if the connection is alive (as opposed to trying to do a simple read and wrapping that in a “try-catch” loop to handle the failure, which was my first thought of how to solve). It’s called a mysql-ping and works just like real ping, here’s the php page about it

Here’s what my code ended up looking like

/* check if server is alive */
if ($mysqli->ping()) {
printf ("Our connection is ok!\n");
} else {
print "reconnect...."
mysqli_connect("localhost",
…..
}

Sounds good, now how to test? Enter tcpkill. It’s a utility used to kill tcp connections. here’s the page explaining it

And here it is in action looking for my mysql connection which would hit the mysql server on it’s default port of 3306


[root@elastictask-util ~]# netstat -na | grep 3306
tcp        0      0 10.25.10.253:45577          10.25.10.65:3306            TIME_WAIT
tcp        0      0 10.25.10.253:45591          10.25.10.65:3306            ESTABLISHED

Once I see my connection I then can kill it and see how my code did with checking the connection before trying to query the DB.

[root@elastictask-util ~]# tcpkill port 45591
tcpkill: listening on eth0 [port 45591]
10.25.10.253:45591 > 10.25.10.65:3306: R 59975602:59975602(0) win 0
10.25.10.253:45591 > 10.25.10.65:3306: R 59975731:59975731(0) win 0
10.25.10.253:45591 > 10.25.10.65:3306: R 59975989:59975989(0) win 0
10.25.10.65:3306 > 10.25.10.253:45591: R 1386633333:1386633333(0) win 0
10.25.10.65:3306 > 10.25.10.253:45591: R 1386633395:1386633395(0) win 0
10.25.10.65:3306 > 10.25.10.253:45591: R 1386633519:1386633519(0) win 0

In practice this worked great and I was able to test my code and customize it for various scenarios.

When creating highly available and fault tolerant systems it’s important to test all the various connecting pieces, just because you can make a connection or open a socket doesn’t mean that that connection or socket will stay open forever. Sometimes you need to balance the overhead and time of recreating and tearing down the connection each time versus using some other tools or checks (which might not show up in common examples) in order to come up with the solution that works best for your most important factors. And for me that was mysql ping to very easily check the connection and reestablish it if it was broken.