The value of platform engineering for the public sector: HMRC case study

January 7, 2025

Discover how the migration to ECK-managed Elasticsearch offers better resource efficiency, improved observability, and seamless scaling for HMRC’s evolving platform strategy

A platform is most vulnerable during a change or release. It is important to make changes and releases as routine and automated as possible, on a platform with the solid foundations required to build at scale. Running the platform day to day should be uneventful. Observability should cover all areas of the platform so that you know when a change is needed. This predictability means that when a change is required, you make it in the safest possible way.

Cloud offerings have matured, particularly over the last couple of years. Cloud providers have the capacity to allow massive enterprises to scale and run their entire business from the Cloud. Regulators are now requiring some sectors to have multi-Cloud infrastructure, which further complicates an already congested and complex marketplace.

At times, it can be hard to describe the work we do. It’s not always visible, but it’s undeniably vital. Production downtime is expensive and is often a prime metric, but what about development and test environments? What is the true cost of lost developer productivity due to a platform outage? What metrics are you using, and what targets do you have?

Platform engineering and DevOps

We are a platform engineering company, and we’re often asked how what we do differs from DevOps – a term now commonplace in the industry. The short answer is that – as platform engineers – we focus on building platform services and capabilities that enable developers to build and deploy applications faster, safer, more consistently and more reliably than if they’d started from scratch. DevOps is a set of techniques, processes and practices that enables and encourages collaboration, automation, and continuous improvement across the entire software delivery lifecycle. For us, DevOps enables platform engineering and vice-versa.

Often, in platform engineering, we face the choice between automation and abstraction. When is it appropriate to choose one over the other? As a consultancy, it is our job to create value and transfer ownership of that infrastructure, code or process. If we overcomplicate during planning and implementation, we will fail at knowledge transfer. If we don’t automate enough of the process, or don’t put enough guardrails in place for teams to follow after we leave, the platform is likely to suffer from a lack of maintenance.

Bringing code quality and security scanning tools together in your DevSecOps pipelines is common. However, it can be challenging to ensure scans are carried out before code makes it to production while also lightening the load on the security team members responsible for the tests. As the amount of code grows and rulesets and waivers become more commonplace, it can be difficult to keep track of everything. Automation or abstraction, in this instance, can give users an audit trail, alerts and, perhaps more importantly, recorded reasons for changes.
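As an illustration only, the idea of waivers with an audit trail can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not a description of any particular scanning tool: the `Waiver` record, the rule IDs and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Waiver:
    """A hypothetical waiver record: every exception carries its own audit trail."""
    rule_id: str      # identifier of the scanner rule being waived (made up here)
    reason: str       # why the finding is accepted
    approved_by: str  # who signed it off
    expires: date     # last day the waiver is valid


def blocking_findings(findings, waivers, today):
    """Return the rule IDs that should still fail the pipeline."""
    active = {w.rule_id for w in waivers if w.expires >= today}
    return sorted(set(findings) - active)


def expired_waivers(waivers, today):
    """Return waivers that have lapsed and need review, e.g. to raise an alert."""
    return [w for w in waivers if w.expires < today]


waivers = [
    Waiver("RULE-001", "False positive in test fixture", "security-team", date(2025, 6, 30)),
    Waiver("RULE-002", "Fix scheduled for next release", "security-team", date(2025, 1, 1)),
]
today = date(2025, 3, 1)

# RULE-001 is waived, RULE-002's waiver has lapsed, RULE-003 was never waived.
print(blocking_findings(["RULE-001", "RULE-002", "RULE-003"], waivers, today))
# prints ['RULE-002', 'RULE-003']
```

Keeping the reason, approver and expiry next to each waiver means the pipeline can enforce the rules automatically, while the security team only needs to look at the waivers that `expired_waivers` returns.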

Case Study: HMRC Elasticsearch Migration

Overview

HMRC required the migration of Elasticsearch from their Pega platform to a standalone Elasticsearch deployment, ensuring minimal disruption to services while maintaining search functionality and data integrity. The project required a structured approach to identify dependencies, migrate workloads, and implement an alternative solution, all while ensuring compliance with HMRC’s strict operational requirements.

Problem

The existing platform relied on Elasticsearch for indexing and search capabilities, deeply embedded within various services and data pipelines. Removing Elasticsearch from the platform posed significant challenges, including minimising downtime, maintaining search performance, and ensuring data consistency across dependent systems. The transition needed to be executed seamlessly, without impacting business-critical operations.

Solution

We executed a phased migration to move Elasticsearch from the Pega platform to a standalone deployment using Elastic Cloud on Kubernetes (ECK). Our approach included:

Provisioning multiple ECK-Managed Elasticsearch Clusters – Deploying Elasticsearch on Kubernetes for improved automation, scalability, and operational efficiency.
Data Migration & Dependency Mapping – Identifying and migrating all services relying on Elasticsearch to the new deployment.
Performance & Compliance Validation – Testing security, scalability, and performance before decommissioning Elasticsearch from the Pega platform.
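To give a feel for what provisioning an ECK-managed cluster involves, the sketch below builds a minimal ECK `Elasticsearch` resource as a Python dictionary and prints it as JSON. It is an illustration under stated assumptions, not HMRC’s actual configuration: the cluster name, namespace, version and node count are placeholders, and a real deployment would add storage, resource limits and security settings.

```python
import json


def eck_elasticsearch_manifest(name, namespace, version, node_count):
    """Build a minimal ECK Elasticsearch custom resource as a plain dict."""
    return {
        "apiVersion": "elasticsearch.k8s.elastic.co/v1",
        "kind": "Elasticsearch",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "version": version,
            # One node set with identical nodes; the ECK operator
            # creates and manages the underlying Kubernetes resources.
            "nodeSets": [{"name": "default", "count": node_count}],
        },
    }


# All values below are placeholders chosen for illustration.
manifest = eck_elasticsearch_manifest("search", "search-prod", "8.17.0", 3)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -` on a cluster where the ECK operator is already installed. Generating one manifest per cluster from the same function is one way to keep multiple clusters consistent.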

Outcome

Migrating to ECK-managed Elasticsearch provided HMRC with a scalable, resilient, and easier-to-manage search solution. The transition was completed without downtime, ensuring continued service availability. By leveraging Kubernetes-native automation, we simplified operational management, reducing the overhead of manual maintenance. The new deployment offers better resource efficiency, improved observability, and seamless scaling, aligning with HMRC’s evolving platform strategy.

Maintenance and staying evergreen

Maintenance is important, but it is different in the Cloud compared with traditional on-premises environments. It’s never been easier to rebuild from a known good position rather than constantly patching VMs. Kubernetes, now such an important technology, provides rolling updates, meaning maintenance such as a kernel patch can happen without taking your service offline.

When working with the Cloud, there are many more moving pieces than in a traditional SysOps approach. You’re moving backwards if you’re not performing maintenance or scheduling dependency updates at least a couple of times a year. Lack of maintenance can also see you falling out of vendor support windows. Terraform and other infrastructure as code (IaC) tools are themselves evolving, while their providers are updated frequently to adopt new Cloud features or simply to squash bugs. If you don’t keep up with these changes and deprecations, you will quickly find yourself in a position where a large amount of work is required to bring your code up to date and keep deployments reliable.

The value of automation is that maintenance becomes simple, and alerts can be created to show when and where attention is needed on what could be classed as technical debt.
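A simple version of such an alert can be sketched as a check of each component’s end-of-support date against today’s date. This is a minimal sketch, assuming you maintain the dates yourself: the component names and dates below are invented for illustration, and the 90-day warning threshold is an arbitrary choice.

```python
from datetime import date, timedelta


def support_status(end_of_support, today, warn_days=90):
    """Classify each component as 'out of support', 'expiring soon' or 'ok'."""
    report = {}
    for component, eol in end_of_support.items():
        if eol < today:
            report[component] = "out of support"
        elif eol - today <= timedelta(days=warn_days):
            report[component] = "expiring soon"
        else:
            report[component] = "ok"
    return report


# Hypothetical inventory: names and dates are made up for this example.
components = {
    "kubernetes-minor": date(2025, 2, 28),
    "terraform-provider": date(2025, 12, 31),
    "base-image": date(2024, 11, 30),
}

for name, status in support_status(components, today=date(2025, 1, 7)).items():
    print(f"{name}: {status}")
```

Run on a schedule, a check like this turns a support window into something a team is warned about in advance rather than something discovered during an incident.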

We are a Kubernetes Certified Service Provider

We have worked with Kubernetes for years, and it’s been fascinating watching containers move from Docker and docker-compose files to the complex operating system that Kubernetes is now. We love this technology, and we were so pleased this year to be awarded Kubernetes Certified Service Provider (KCSP) status. Indeed, we are one of only two official KCSPs on G-Cloud 14.

There are now flavours of Kubernetes from Azure, AWS, GCP, Oracle, Red Hat, SUSE, Taikun and Canonical, as well as smaller distributions that run at the edge. Selecting the right option can be daunting, especially when balancing vendor support, runtime costs, maintenance overhead, and training requirements.

Multi-Cloud is becoming a reality for many clients, too. Whether it’s a regulatory requirement or just a desire to increase resilience and redundancy, the reality is that Kubernetes orchestration is now in a place where clusters can be managed across Cloud providers. Some of the solutions already mentioned can provide a level of visibility and orchestration, but the design and operation of such systems require experience.

External influences are rapidly shaping the market. Often, we get caught up in the ‘how’ of what we do and forget to reconsider the ‘why’.

So what is value to you? Where are you focussing your attention? Stability? Cost? Resiliency? Faster feature delivery? All of these things are important, but the value delivered from each is subjective. Perhaps we can enable your teams to unlock the same potential we’ve helped our clients achieve.

Please Note: This is a Commercial Profile