
Infrastructure transition: from DIY Kubernetes to managed solutions in GCP

Monday, April 29, 2024
Sebastiaan Viaene
Engineering manager

A customer of ours recently asked us to take a look at their cloud infrastructure. The project, which they had set up and maintained over several years, used a microservice architecture running on a self-managed Kubernetes cluster hosted on DigitalOcean. We were tasked with analyzing the risks and making the system more robust and scalable within a short time frame.

Initial infrastructure and risk assessment

[Figure: initial infrastructure diagram]

We set out to create a risk assessment, defining priorities and a roadmap to rebuild their infrastructure. The first thing we noticed was that everything was self-managed in the Kubernetes cluster:

  • The SQL database, along with its backup management, ran inside the cluster itself
  • Load balancing and SSL certificate management were also part of the cluster
  • Messaging between microservices went through a queueing system whose messages were stored in a Redis database, also hosted in the cluster

All of this added up to a huge amount of configuration and manual work just to keep the ship afloat.

Phase 1: migrating to Google Cloud Platform (GCP)

[Figure: phase 1 architecture on Google Cloud Platform]

Given that the application ran almost entirely on Kubernetes, we decided it would be best to move it to the most mature Kubernetes environment on the market: Google Kubernetes Engine, or GKE for short, Google's scalable managed solution. Its Autopilot mode automatically scales the infrastructure with minimal configuration, and pricing is usage-based. With most other providers you pay for servers 24/7 and still have to worry about running into hard resource limits. Fun fact: Kubernetes was originally created at Google in 2014.

Along with migrating the cluster to a new environment, we also extracted some of the self-managed components in one go:

  • The PostgreSQL database was extracted to Cloud SQL
  • Load balancing and certificate management are now managed by Cloud Load Balancing
  • The Redis database used as a queue backend was extracted to GCP Memorystore

All of these components provide automatic updates, backups and deletion protection out of the box, which lets the developers focus on their core tasks instead of maintaining everything themselves.
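For illustration, here is a minimal sketch of what connecting a service to the extracted Cloud SQL instance could look like in TypeScript. The environment variable names are our own assumptions, not the customer's actual configuration; in GKE or Cloud Run the instance is typically reached over a private IP or through the Cloud SQL Auth Proxy.

```typescript
// Minimal sketch: connecting a Node.js service to Cloud SQL for PostgreSQL.
// All connection details are hypothetical and come from the environment.
import { Pool } from "pg";

const pool = new Pool({
  host: process.env.DB_HOST, // e.g. the instance's private IP, or 127.0.0.1 via the Auth Proxy
  port: Number(process.env.DB_PORT ?? 5432),
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  max: 10, // keep the pool small; Cloud SQL enforces per-instance connection limits
});

// Simple readiness check against the managed database.
export async function healthCheck(): Promise<boolean> {
  const { rows } = await pool.query("SELECT 1 AS ok");
  return rows[0].ok === 1;
}
```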

Phase 2: embracing managed solutions

[Figure: phase 2 architecture with managed solutions]

The cluster was segmented into microservices, each responsible for a specific task. We categorized these services into three types:

  • Continuously operational services
  • Event-driven services
  • Request-based services

The continuously operational services listened for connections and needed to stay up for real-time result processing. The event-driven services, however, only needed to respond to triggered events. Finally, there was a Node.js REST API that responded to requests from the front-end application. The way the cluster was set up, all of these services ran 24/7 and generated costs around the clock. The setup also required a lot of configuration files and wouldn't scale without piling even more configuration on top.

Piece by piece, we started extracting components from the cluster and replacing them with managed services from the GCP suite:

  • The REST API was extracted to Cloud Run.
  • Event-driven services were extracted to a combination of Pub/Sub and Cloud Functions.

This significantly lowered the operational cost of the solution while making it more robust and scalable at the same time. The API now only generates costs when requests are made, and the event-driven services only when an event is actually triggered. Both Cloud Run and Cloud Functions scale horizontally based on the volume of incoming requests and events.
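To make the event-driven pattern concrete, here is a minimal sketch in TypeScript: a producer publishes to a Pub/Sub topic, and a Cloud Function (2nd gen) handles each message. The topic name, payload shape and function name are hypothetical, not the customer's actual code.

```typescript
// Minimal sketch of the Pub/Sub + Cloud Functions pattern.
import { PubSub } from "@google-cloud/pubsub";
import * as functions from "@google-cloud/functions-framework";

// Producer side: publish an event, e.g. from a service still in the cluster.
const pubsub = new PubSub();
export async function publishDataReceived(sourceId: string): Promise<void> {
  await pubsub.topic("data-received").publishMessage({ json: { sourceId } });
}

// Consumer side: a Cloud Function triggered by the topic. It only runs
// (and only generates cost) when a message actually arrives.
functions.cloudEvent<{ message: { data: string } }>(
  "handleDataReceived",
  (event) => {
    // Pub/Sub delivers the payload base64-encoded inside the CloudEvent.
    const payload = JSON.parse(
      Buffer.from(event.data!.message.data, "base64").toString()
    );
    console.log(`Processing event for source ${payload.sourceId}`);
  }
);
```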

We left the continuously operational services in the cluster, as we were quite happy with how they communicated with each other there and saw no added advantage in extracting them.

Phase 3: enhancing monitoring and alerting

At the core of the application was a large stream of data coming in from various sources, which was then processed by various components within the infrastructure. At any point there could be a drop-off because something failed: the data might be corrupt, a service might be down, or any of a number of other reasons.

When we started, this was all a black box. If anything went wrong, we'd spend a lot of time tracing the data to find where it had actually failed. And since the data came from external sources, the fault might not even be in our system.

By logging in the right places in the code and creating Log-based Metrics in GCP, we were able to map everything out and create dashboards and alerts that were triggered whenever anything went wrong (a sketch of the logging pattern follows the list below):

  • We knew instantly when something went wrong, because GCP automatically alerted us through multiple channels
  • We didn't have to hunt for the culprit, because it was clear from the logs and the metrics
  • All incidents were automatically recorded, so we could start recognizing patterns and creating fixes accordingly
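Here is a minimal sketch of the structured logging this relies on. On GKE, Cloud Run and Cloud Functions, a JSON line written to stdout is parsed by Cloud Logging: the "severity" field sets the log level and the other fields become queryable payload fields. The field names below (stage, sourceId) are our own example, not the customer's schema.

```typescript
// Minimal sketch: structured logs that Log-based Metrics can query.
function logStage(
  severity: "INFO" | "WARNING" | "ERROR",
  message: string,
  fields: Record<string, unknown> = {}
): void {
  // One JSON object per line; Cloud Logging parses it into jsonPayload.
  console.log(JSON.stringify({ severity, message, ...fields }));
}

// A log-based metric can then count entries matching e.g.
//   jsonPayload.stage="ingest" AND severity="ERROR"
// and an alerting policy fires when that count spikes.
logStage("ERROR", "Dropped corrupt record", { stage: "ingest", sourceId: "feed-42" });
```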

Another request we received from the customer was to add audit logs. The way we handled this is a nice demonstration of how the GCP ecosystem makes it straightforward to achieve something like this without custom services or complex configuration.

  • We created a specific log type and a structured log that tracked every request on the REST API (create, read, update, delete). This was the only code change we had to make (sketched after this list)
  • We then created a Log Sink in GCP that queried for that specific log type
  • The sink rerouted those logs to a Cloud Logging bucket
  • We added a one-year retention policy to the bucket, making its contents immutable. We even locked the retention policy itself so it could not be removed
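As an illustration of that single code change, here is a minimal sketch assuming an Express-based Node.js API. The "logType" marker, the field names and the x-user-id header are our own assumptions; the Log Sink's filter would then match something like jsonPayload.logType="audit".

```typescript
// Minimal sketch: one structured audit log line per REST API request.
import express from "express";

const app = express();

app.use((req, res, next) => {
  // Log once the response is finished, so the status code is known.
  res.on("finish", () => {
    console.log(
      JSON.stringify({
        severity: "INFO",
        logType: "audit", // the marker the Log Sink filters on (hypothetical naming)
        method: req.method, // maps to create/read/update/delete
        path: req.path,
        status: res.statusCode,
        user: req.header("x-user-id") ?? "anonymous", // hypothetical auth header
        timestamp: new Date().toISOString(),
      })
    );
  });
  next();
});
```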

Conclusion

In summary, transitioning from a self-managed Kubernetes setup on DigitalOcean to a fully managed solution on Google Cloud Platform has streamlined operations and enhanced scalability for the customer's infrastructure. By migrating key components to GCP's managed services and utilizing its monitoring and logging features, the system has become more efficient, reliable, and cost-effective. This shift highlights the value of embracing managed solutions and leveraging cloud platforms for infrastructure optimization.

