A Comprehensive Guide to Managing Large-Scale Infrastructure with GitOps
GitOps adoption keeps growing. However, there still seems to be some confusion as to what GitOps is, how it differs from regular CI/CD pipelines, and how best to adopt it. In this post we will quickly cover what GitOps is, and then the three main lessons learned from using GitOps to manage infrastructure at scale, both on premises and in the cloud.
GitOps Overview
GitOps is a set of principles enabling the operation of a system via version controlled, declarative configuration. More specifically, the OpenGitOps project defines four principles which define whether a system or set of systems is managed via GitOps:
- Declarative: A system managed by GitOps must have its desired state expressed declaratively.
- Versioned and Immutable: Desired state is stored in a way that enforces immutability, versioning and retains a complete version history.
- Pulled Automatically: Software agents automatically pull the desired state declarations from the source.
- Continuously Reconciled: Software agents continuously observe actual system state and attempt to apply the desired state.
Note that `git` is not referenced anywhere, as GitOps is not bound to any particular tooling. In layman's terms, however, many consider any system operated via `git` to be a GitOps system. As we will see, this is not quite correct.
GitOps is More than CI/CD Pipelines
Taking the "layman's definition" from above, any system that has CI/CD via pipelines triggered on repository changes would be a GitOps system. This is not accurate. Consider an IaC pipeline which applies declaratively defined infrastructure (such as a standard `opentofu apply` in a pipeline, or a Docker build followed by a `kubectl apply`). While such a system adheres to the first two principles, it does not adhere to the latter two. This implies that changes made directly to the target system are not corrected (reconciled) until the next time the pipeline runs. Similarly, if the pipeline fails for whatever reason, the target system silently diverges from the desired state: the configuration drift is not even detected, let alone reconciled.
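As a minimal sketch of such a push-based pipeline (GitHub Actions syntax; the job and repository layout are hypothetical), notice that it satisfies the first two principles, but nothing ever pulls or reconciles state afterwards:

```yaml
# Push-based IaC pipeline: declarative (principle 1) and versioned
# (principle 2), but the desired state is only applied when this job
# runs -- there is no automatic pull and no continuous reconciliation
# (principles 3 and 4).
name: apply-infrastructure
on:
  push:
    branches: [main]
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # assumes the OpenTofu CLI (`tofu`) is available on the runner
      - name: Apply declared infrastructure
        run: |
          tofu init
          tofu apply -auto-approve
```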
This is an important distinction between "standard CI/CD" and GitOps. Simply having something declared as code does not make it GitOps.
The Advantages of GitOps
GitOps has many advantages over standard ways of managing systems. The advantages of having a declarative desired state, version controlling it, and interacting with the system only via `git` (or whatever version control system you use) are tremendous, ranging from improved security and higher efficiency to better change visibility. These are well known to most people and will thus not be covered here.
Drift detection and automatic reconciliation are the two other aspects that make GitOps absolutely amazing. This is especially true in the current day and age, with the proliferation of complex systems being worked on by many people concurrently. Being able to observe that the system is not in its desired state has massive advantages, for example in standard SRE operations. Continuous reconciliation ensures that manual operational tasks are kept to a minimum, and that systems cannot degrade over time as small undesired changes creep in.
Tooling
In this post we will mostly focus on using GitOps to manage resources handled via the Kubernetes API, but it should be noted that GitOps as a concept is in no way restricted to Kubernetes. In the Kubernetes space there are two major players for GitOps: ArgoCD and FluxCD. We will not go into the details of each tool's advantages, other than saying that, in our own experience, ArgoCD might be more developer focused, while FluxCD might suit platform engineers with more Kubernetes experience who want more flexibility.
The rest of this post is tool agnostic and everything we are talking about can be done with either tool (but some aspects might be easier to do with one or the other).
Infrastructure: Disambiguation
Before we dive into how to structure your GitOps configuration, it might make sense to draw a line as to where infrastructure starts and where it ends. We consider infrastructure everything that is part of the platform provided to an application team. Hence this line might vary depending on the maturity of the platform you provide your teams. If we consider a simple Kubernetes platform with little additional abstraction for its users, the infrastructure would contain the Kubernetes platform itself as well as all system components that are shared between the teams, such as a central monitoring stack, a central credential management solution, centralized policy enforcement of specific Kubernetes resources, and the like.
The lower end of this spectrum (the Kubernetes platform itself) will likely not be managed by GitOps. That is simply because the GitOps tooling itself typically needs to run somewhere, and also needs to be bootstrapped somehow. Some tools such as FluxCD allow the GitOps controller to manage itself, but even in these cases the runtime for the controller needs to exist when the controller is initially installed, and is thus typically not part of the GitOps configuration.
Now that this is cleared up, let us consider how the configuration should be managed.
App-of-Apps
A very popular pattern for managing configuration via GitOps is the "app-of-apps" pattern. This was popularized by ArgoCD, but is also applicable to other tooling. We will use ArgoCD in the example below, but the same can be implemented using FluxCD Kustomizations.
Let us consider a component from our infrastructure that we want to manage via GitOps. Typically, we would need to tell the GitOps controller how to manage this component. For instance, let us assume the component is installed via raw Kubernetes manifests. Then we would tell the GitOps controller which repository contains these manifests and in which namespace to install them. Depending on the controller you are using, you might also configure additional parameters such as how often it should be reconciled, whether it depends on other components, and so on. In ArgoCD jargon this would be an "Application" (the origin of the "app-of-apps" name), and would look as follows:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sealed-secrets
  namespace: argocd
spec:
  project: default
  source:
    chart: sealed-secrets
    repoURL: https://bitnami-labs.github.io/sealed-secrets
    targetRevision: 1.16.1
    helm:
      releaseName: sealed-secrets
  destination:
    server: "https://kubernetes.default.svc"
    namespace: kubeseal
```
You would then apply this `Application` resource to Kubernetes. From that point on, your component is managed by GitOps: any changes you push to the manifests repository are reflected on the Kubernetes cluster.
Then a second infrastructure component needs to be installed, and you repeat the process. The result is a second `Application` which installs and manages that component. You might also want to pin your deployment to a version (such as version `1.16.1` of the Helm chart above). This implies that lifecycle operations such as upgrades require a change to this `Application` manifest, and thus a call against the Kubernetes API to edit it.
The end result is a set of `Application` resources, some of which you periodically modify when lifecycling a component. Now imagine you need to deploy your infrastructure elsewhere (for instance to a second Kubernetes cluster in our example), or maybe even a couple dozen times. Then you need to manage this entire set of `Application` resources on every platform. A better approach is to add an abstraction layer, which itself deploys the `Application` resources via GitOps. Hence you put all your `Application` resources into a repository, and define another, "higher level" `Application` which deploys this repository. This means that when deploying to new platforms, you only need to deploy that one "higher level" `Application`, and any changes to the component `Application` resources can be made via Git, conforming to our GitOps approach. This "higher level" `Application` is only there to deploy the component `Application`s, thus the name "app-of-apps". The resulting structure is sketched below.
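As a minimal sketch (the repository URL and path are hypothetical), the root `Application` might look as follows; the sync policy shown enables the continuous pruning and self-healing discussed earlier:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infrastructure          # the "higher level" Application
  namespace: argocd
spec:
  project: default
  source:
    # hypothetical repository holding all component Application manifests
    repoURL: https://git.example.com/platform/infrastructure.git
    targetRevision: main
    path: applications
  destination:
    server: "https://kubernetes.default.svc"
    namespace: argocd           # component Applications live next to ArgoCD
  syncPolicy:
    automated:
      prune: true     # remove Applications deleted from the repository
      selfHeal: true  # continuously reconcile manual changes away
```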
It should be noted that this also massively helps when customizing platforms. Typically, components cannot be deployed truly one-to-one in several places, but require slight configuration differences. Consider for instance the hostnames for the UIs of your components: two of these components deployed in different locations cannot share the same hostname and routing. Using an "app-of-apps" approach allows you to define variables on the top-level application and inject them into the downstream applications so that they can slightly adapt the way they are installed. We will not dive deeper into how this is done as it is highly dependent on the tooling you use (ArgoCD uses `ApplicationSet`, FluxCD uses variable substitution), but a small sketch follows below.
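For illustration only, an ArgoCD `ApplicationSet` with a list generator might stamp out one `Application` per platform roughly like this (cluster names and URLs are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: sealed-secrets
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          # one entry per target platform, each with its own variables
          - cluster: staging
            url: https://staging.example.com:6443
          - cluster: production
            url: https://production.example.com:6443
  template:
    metadata:
      name: "sealed-secrets-{{cluster}}"
    spec:
      project: default
      source:
        chart: sealed-secrets
        repoURL: https://bitnami-labs.github.io/sealed-secrets
        targetRevision: 1.16.1
        helm:
          releaseName: sealed-secrets
      destination:
        server: "{{url}}"   # variable injected by the generator
        namespace: kubeseal
```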
Consolidating your Configuration
In the organisation where I first used GitOps at scale, we deployed all our components as Helm charts to a Kubernetes cluster. Each component was essentially contained within two different repositories in our version control system:
- the source code repository which typically built a Docker image as an artefact
- the Helm chart definition which referenced the Docker image from above
When we then introduced GitOps, we decided to add a third repository containing the exact deployment definition (in our case the `Application` declarations) for the component. Using the app-of-apps pattern from above, we could then reference each of these "GitOps repositories" and deploy specific overlays (customizations) of the `Application` to specific platforms. This worked well for quite some time. However, over time the number of components we managed increased, and so did the number of target platforms to which these components needed to be deployed. This led to quite a few issues.

When a new target platform was introduced, all such "GitOps repositories" needed to be updated to contain a new overlay customizing the `Application` to the specific platform. This is very tedious when you have several dozen such repositories.
Moreover, components had dependencies on other components. This meant that we were referencing components within a repository that were defined in another repository. While not problematic in itself, this becomes very tricky when one component depends on a configuration value of another component. The configuration value is then duplicated in both repositories and becomes difficult to maintain. While this sounds like we did not properly separate the components, such cases are very common in infrastructure configurations. Consider for instance a deployment of an ingress controller which defines a hostname suffix for its routes: every component deployed on the same Kubernetes platform that creates a route/ingress will need to use exactly that hostname suffix in order to have valid routing, as the sketch below illustrates.
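To make the duplication concrete, consider this sketch (component names and the hostname suffix are hypothetical): the suffix configured for the ingress controller must be repeated verbatim in every component's `Ingress`, even when the two live in different repositories:

```yaml
# Repository A configures the ingress controller with a hostname
# suffix, e.g. via a Helm value like `domainSuffix: apps.example.com`.
# Repository B must then repeat that exact suffix in its routes:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ui
  namespace: monitoring
spec:
  rules:
    - host: grafana.apps.example.com  # "apps.example.com" duplicated here
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
```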
The above issue also results in tricky situations when configurations need to be changed for components that are dependent on one another. If the deployment configuration is separated into different repositories, PRs to these repositories need to be synchronized to ensure the deployment occurs at the same time.
Finally, distributing the deployment configuration over so many repositories meant that it became increasingly difficult to keep an overview of what is deployed on a given target platform. One would need to navigate through dozens of repositories to verify that everything is configured correctly.
After identifying these issues we decided to move all our configuration into a single repository. This repository contains a templated definition of the entire set of components to be deployed. A set of platform definitions within the same repository then feeds values into the templates to ensure consistent configuration. This massively helped us address the issues mentioned above. On top of that, it allows versioning the "template" and thus enables rollouts of a versioned infrastructure layer. You can find an example repository of such a structure designed with FluxCD here: FluxCD Monorepo Demo.
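Purely as an illustration (the file path and keys are hypothetical, not taken from the linked demo), a platform definition in such a monorepo could be a simple values file that the shared component templates consume:

```yaml
# platforms/prod-eu/values.yaml -- one such file per target platform
platform:
  name: prod-eu
  hostnameSuffix: apps.prod-eu.example.com  # consumed by every ingress template
  registry: registry.prod-eu.example.com    # consumed by every image reference
components:
  sealed-secrets:
    enabled: true
    chartVersion: 1.16.1   # lifecycling happens with a single PR here
```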
GitOps Bridge
The last challenge we want to address in this blog post is a concept called a "GitOps bridge". In public cloud environments, there is typically a relatively sharp divide between the infrastructure deployed via Terraform (or any similar tool) and the infrastructure deployed via GitOps. For instance, one might deploy an Azure Kubernetes Service and some surrounding services (such as the required network, a container registry, etc.) via Terraform, and then deploy components and applications within the AKS cluster using GitOps. The issue we face here is that the GitOps configuration very often depends on the Terraform configuration. Consider for instance the container registry: its address is set up by Terraform, but is used in every image declaration in the GitOps configuration. One option is to duplicate such values in the respective configurations; another is to use a GitOps bridge.
The GitOps bridge is an abstract concept for passing configuration values from tooling such as Terraform as inputs to the GitOps configuration. How this is done in practice very much depends on which tools you use. For instance, with Terraform and FluxCD, a common way to achieve this is to have Terraform write a ConfigMap into the AKS cluster where the FluxCD controller will run, containing all variables (and their values) required by the GitOps configuration. The FluxCD controller then supports injecting variables from a ConfigMap via variable substitution.
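A minimal sketch of the FluxCD side, assuming Terraform has written a ConfigMap named `cluster-vars` (the variable name and registry address are hypothetical):

```yaml
# Written by Terraform during cluster creation
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-vars
  namespace: flux-system
data:
  registry_url: myplatform.azurecr.io
---
# FluxCD Kustomization consuming the bridge: any ${registry_url}
# occurrence in the reconciled manifests is substituted at apply time
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure
  prune: true
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars
```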
Using a GitOps bridge has the advantage that changes in the Terraform configurations are much less likely to break the GitOps configuration that builds on top of it. Moreover, it allows Terraform to directly bootstrap the entire GitOps setup when creating new platforms without the need to manually redefine the required variables in the GitOps repository.
Summary
So, to recap, we have looked at what GitOps really is (and isn't). Understanding these basics is critical to implementing GitOps correctly in your projects. On top of that, we looked at three best practices:
- Use an app-of-apps pattern to improve resiliency when you need to recreate platforms.
- Consider using a mono-repository for all your GitOps configuration as your setup grows.
- Have a look at GitOps bridges to improve automation when setting up platforms and to keep your Terraform and GitOps configurations consistent.
I hope this has helped you understand a bit better how to use GitOps at scale. If you have any questions or comments, feel free to let me know below.