In general, we divide our story into three parts: the dark, the bright, and the shiny.
The dark story is about PHP, MySQL with full-text search, and everything in one
big pile of code. It's monolithic, cumbersome, and slow to develop. It worked
well for quite some time, but we knew that at some point we would not be able
to keep this going. All software ran on a bunch of machines that were
provisioned and maintained by hand.
So let's fast forward to the bright story.
We migrated most of the components into a service-oriented architecture. We
develop most of the things you see on rebuy.de.
Our new services usually have two communication methods. Each service offers
a RESTish API for synchronous calls. Additionally, they interconnect via
RabbitMQ for an asynchronous information flow.
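As a sketch of the asynchronous path, a service could publish an event like this, using Python with the `pika` client (the `events` exchange and the `order.created` routing key are illustrative, not rebuy's actual names):

```python
import json
from datetime import datetime, timezone


def build_event(event_type: str, payload: dict) -> bytes:
    """Serialize a service event into the JSON body placed on the queue."""
    envelope = {
        "type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")


def publish_event(routing_key: str, body: bytes, host: str = "localhost") -> None:
    """Publish a pre-serialized event to a topic exchange on RabbitMQ."""
    import pika  # pip install pika; imported lazily, needs a running broker

    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    channel.exchange_declare(exchange="events", exchange_type="topic", durable=True)
    channel.basic_publish(exchange="events", routing_key=routing_key, body=body)
    connection.close()
```

A consuming service would bind its own queue to the exchange with a pattern such as `order.*` and react in its own time, while synchronous reads would still go through the other service's REST API.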
Our services have evolved quite differently. There are very new and shiny ones,
based on Spring 4, Spring Boot and PostgreSQL. They are completely isolated
from each other. There are also some older ones, built on Spring 3, which share
a database with other services.
We managed to put most of server provisioning into Puppet code. This also
helped us to create sandbox environments on a VMware cluster and on the
developer machines with Vagrant. However, adding new services or machines still
required a lot of manual work.
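A sandbox like this can be brought up with a short Vagrantfile that hands the box over to Puppet; the base box and manifest layout below are assumptions for illustration:

```ruby
# Vagrantfile: boots a local sandbox VM and provisions it with Puppet.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"             # example base box
  config.vm.network "private_network", ip: "192.168.50.10"

  # Reuse the same Puppet code that provisions the real servers.
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "puppet/manifests"  # assumed repository layout
    puppet.manifest_file  = "site.pp"
  end
end
```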
Lastly, we have a few very old PHP-based services. These are among the oldest
services still in use.
Our big plan™ is to finally migrate the last remaining services away from the
dark times and use only isolated services.
The next big step was to migrate away from a static number of manually provisioned
machines and into the cloud. This lets us add capacity on demand
(e.g. for TV spots) and automate everything.
We created a Kubernetes cluster on AWS and are using Terraform for provisioning
it. This means our whole infrastructure is checked into source control which
makes all changes comprehensible and reproducible.
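As an illustration of what "infrastructure in source control" looks like, a single worker node might be declared in Terraform roughly like this (the AMI, instance type, and names are placeholders, not our actual configuration):

```hcl
# One Kubernetes worker node, declared as code and reviewed like any other change.
resource "aws_instance" "k8s_worker" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI
  instance_type = "m4.large"               # placeholder instance type

  tags = {
    Name = "k8s-worker"
    Role = "kubernetes-node"
  }
}
```

Because such files live in version control, every infrastructure change arrives as a reviewable diff and can be replayed to rebuild the environment.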
Using Kubernetes improved our failure handling significantly. All of our critical
services are highly available, since the replica count of a service is
just a number in its configuration. Failures of single
replicas are handled by Kubernetes, which restarts the service. In most cases
this resolves the issue, and the developers can look into the problem
the next day. Even failures of whole AWS instances are handled well:
Kubernetes migrates the replicas to healthy instances, and AWS recreates
the failed nodes, which automatically rejoin the Kubernetes cluster.
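Scaling really is just a number in a manifest; a minimal, hypothetical Deployment (the service name and image are invented for illustration) shows where that number lives:

```yaml
# Hypothetical service: raising availability means editing `replicas`.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3          # bump this number to add capacity, e.g. before a TV spot
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
      - name: example-service
        image: registry.example.com/example-service:1.0.0
```

If a pod or its node dies, the Deployment's controller notices that fewer than three replicas are running and schedules a replacement on a healthy node.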
The new infrastructure also makes it trivial for developers to create new
services. With that comes the advantage of being able to split big and complex
services into smaller ones. It also got a lot easier to create a staging
environment that is almost identical to the production cluster, and to deploy
applications there for testing. Automating the cluster creation for our
development teams will be the next step in our story.