In an effort to simplify our infrastructure, in May 2017 we migrated our PostgreSQL database cluster at Global Access from a dedicated server to a virtual machine (VM) running on Enterprise Advanced Cloud IaaS. Our application stores mission-critical customer billing data. It is the foundation of our revenue and was developed as what we call a ‘classic app’.
We had kept the database on a physical server because we didn’t want to change anything unless it was really necessary. Also, until recently, we were convinced that virtual machines weren’t powerful enough to support a database at our scale – around 1,602,024 row updates daily. However, after analysing virtual-machine capabilities and performance more closely, this assumption turned out to be completely incorrect.
The original setup of the database
Our database was running on a server with mirrored SSD disks to quickly deliver reads of over 10,000,000 rows for our billing reports. It also had a synchronous replica to a VM running in our cloud. Back in 2010 when we planned this infrastructure, our tests showed that a VM was slower than the dedicated server, and we used the cloud to mitigate any hardware failure of the aforementioned server. Having said that, the hardware server never actually had any issues. It stayed up for hundreds of days at a time, restarting only for kernel updates and security patches.
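A synchronous replica of this kind can be set up in PostgreSQL with a handful of settings. The sketch below is purely illustrative – the standby name, host and user are assumptions, not our actual configuration – and reflects the 9.x-era layout with a separate recovery file on the standby:

```conf
# postgresql.conf on the primary – each commit waits for the standby's confirmation
wal_level = replica
max_wal_senders = 3
synchronous_commit = on
synchronous_standby_names = 'cloud_replica'   # assumed standby name

# recovery.conf on the standby VM (PostgreSQL 9.x era) – connection details assumed
standby_mode = 'on'
primary_conninfo = 'host=db-primary.example port=5432 user=replicator application_name=cloud_replica'
```

With `synchronous_commit = on`, the primary only reports a transaction as committed once the standby has confirmed receipt of its WAL, which is what guarantees the replica is never behind at the moment of a crash.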
The decision to move into the cloud
Having a dedicated hardware component in a system that otherwise consists of virtual machines meant we needed specific knowledge to manage that server. And because our team had become more widely distributed, we had to rely on someone being physically present in the data center to fix any problems.
Another thing we were uneasy about was that if the server failed, it would take some time for us to notice, and we would then have to promote the mirrored virtual machine to take the place of the dedicated server. To be honest, the VM was never configured exactly the same as the primary server – it was only a replica of the data.
To our surprise, when we looked at the performance of the VM recently, all queries from our software ran faster on the VM than on the dedicated hardware. We had not noticed that the underlying hardware running the virtual infrastructure had improved considerably over time and had become more than sufficient for our requirements. We therefore decided to move into the virtual infrastructure, and the setup became simpler in the process: we no longer needed to maintain the replication setup, because the virtual infrastructure handles it for us.
The results of virtualisation
Preventing data loss is much easier now
Replication of our database was always necessary so that we could recover the data in the event of a server crash. The recovery process would take some time, but we had the certainty that critical bits would not suddenly disappear.
With the Enterprise Advanced Cloud however, we have a system in place that replicates the data written to the disk across two data centers automatically. Even if a complete power outage or fire occurs in one of them, the data is still safe.
Moreover, every VM is mirrored transparently, so there is no need to run two VMs mimicking each other’s setup. Preliminary tests showed that even when the complete storage system in one of the data centers was turned off for maintenance, there was no measurable performance impact on database queries while the data was retrieved from the ‘other side’.
Time to full recovery became extremely brief
As already described, running two servers simultaneously meant that if one of them had problems, we would have to tediously bring the other into operation – manually changing IPs and making sure the configuration still worked. Should an issue occur at night or early in the morning, an immediate fix would be difficult.
Now, on a high-availability cluster, the VM can restart on the operational side with the same network, hardware and software settings as well as the same data of course. It would take longer for us to call our technical department after noticing that something went wrong than for the VM to restart and continue providing query results.
Incremental backup is still required
One thing did remain from the previous setup. We still perform regular backups of our data using pgBackRest for the very unlikely case that someone runs ‘DELETE FROM users;’ instead of ‘SELECT * FROM users;’ in production. However, there is very little chance of this happening, thanks to our high-quality software, which has undergone comprehensive, automated tests.
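For reference, a typical pgBackRest routine looks roughly like the following. The stanza name and repository path are illustrative assumptions, not our production values:

```conf
# /etc/pgbackrest.conf – stanza "billing" and all paths are assumed for illustration
# [global]
# repo1-path=/var/lib/pgbackrest
# [billing]
# pg1-path=/var/lib/postgresql/9.6/main

# initialise the stanza once
pgbackrest --stanza=billing stanza-create

# periodic full backup, with incrementals in between
pgbackrest --stanza=billing --type=full backup
pgbackrest --stanza=billing --type=incr backup

# point-in-time restore to just before an accidental DELETE (timestamp assumed)
pgbackrest --stanza=billing --type=time --target="2017-05-01 09:00:00" restore
```

The point-in-time restore is exactly what protects against the accidental-DELETE scenario: the incremental backups plus archived WAL let us roll the database forward to the last moment before the mistake, rather than only to the most recent full backup.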