
Progressive rollout

Description

No matter how good the automated test campaign is, some issues will always be spotted in production, as users tend not to use the application the way we intend them to.

It is important to minimize the impact of these issues on the end users.

To achieve that, we'll use Argo Rollouts to perform canary testing when deploying a new version of the application.

We will route only a small percentage of the production traffic to the new version, and launch an analysis to detect whether errors occur on this canary that do not occur on the current production version.

If so, we'll roll back automatically to the production version; otherwise, we can keep increasing the canary traffic percentage progressively until we reach 100%.


The canary Ingress is deployed and controlled by the Argo Rollouts controller.

Implementation

The Rollout resource

For Argo Rollouts to be able to control the deployment, we start by using the Rollout resource.

This resource only references the existing Deployment and adds the progressive rollout strategy and analysis.

If you look at the demo app Helm chart, we set the replicas to 0 on the Deployment when the Rollout is used, so that the Rollout can take over the scaling:

spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ if .Values.rollout.enabled }}0{{ else }}{{ .Values.replicaCount }}{{ end }}
  {{- end }}

And reference the Deployment in the Rollout resource:

{{- if .Values.rollout.enabled }}
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "my-app.fullname" . }}-rollout
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  analysis:
    successfulRunHistoryLimit: {{ .Values.rollout.successfulRunHistoryLimit }}
    unsuccessfulRunHistoryLimit: {{ .Values.rollout.unsuccessfulRunHistoryLimit }}
  selector:
    matchLabels:
      {{- include "my-app.selectorLabels" . | nindent 6 }}
  workloadRef: # Reference to the deployment
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "my-app.fullname" . }}
  revisionHistoryLimit: {{ .Values.rollout.revisionHistoryLimit }}
  strategy: 
    {{ toYaml .Values.rollout.strategy | nindent 4 }}
{{- end }}
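
For reference, the rollout block that these templates consume from the chart values could look roughly like this. This is a minimal sketch: only the keys come from the templates above, the default values are assumptions.

rollout:
  enabled: false                     # switched to true only for production (see below)
  revisionHistoryLimit: 3            # assumed default
  successfulRunHistoryLimit: 5       # assumed default
  unsuccessfulRunHistoryLimit: 5     # assumed default
  strategy: {}                       # the canary strategy, filled in per environment (detailed later)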

The rollout is enabled only in the production values:

rollout:
  enabled: true

Traffic split management

We'll use the Nginx ingress controller deployed previously to split the traffic between the production version and the canaries.

Corresponding documentation here.
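
Concretely, the Nginx canary feature relies on a second Ingress carrying canary annotations. Argo Rollouts generates this Ingress from the stable one and keeps its weight in sync with the rollout steps; below is a minimal sketch of what it ends up looking like (resource name, host and weight are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-prod-my-app-ingress-canary                 # generated by Argo Rollouts; name is illustrative
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"    # kept in sync with the current setWeight step
spec:
  rules:
    - host: my-app.example.com                         # copied from the stable ingress; host assumed
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-prod-my-app-canary-svc       # the canary Service defined below
                port:
                  number: 80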

Argo Rollouts takes care of creating and managing the canary ingress. We just need to define a Service for the canaries. We can deploy it only if the rollout is enabled:

{{- if .Values.rollout.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "my-app.fullname" . }}-canary-svc
  labels:
    {{- include "my-app.labels" . | nindent 4 }}
spec:
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9000
      targetPort: 9000
  selector:
    {{- include "my-app.selectorLabels" . | nindent 4 }}
{{ end }}

The Argo Rollouts controller will also take care of adding the canary ReplicaSet's pod template hash to this Service's selector, so that it only routes traffic to the canary pods.

Source here.
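
For illustration, once patched by the controller, the canary Service's selector ends up looking roughly like this (the chart's selector labels are assumed and the hash value is made up):

selector:
  app.kubernetes.io/name: my-app               # from the chart's selectorLabels (assumed)
  app.kubernetes.io/instance: app-prod         # from the chart's selectorLabels (assumed)
  rollouts-pod-template-hash: 7bd4f9c6d5       # injected by Argo Rollouts; matches only the canary ReplicaSet's pods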

The analysis

To ensure that the new version doesn't introduce new errors, we'll define an AnalysisTemplate comparing the HTTP success rates of the demo app's stable and canary versions.

This is a very simple validation, but you could think about extending it to any metric or signal available around your application.

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: http-server-success-rate-comparison
spec:
  args:
  - name: container-name
  - name: stable-hash
  - name: latest-hash
  metrics:
  - name: success-rate
    interval: 30s # Run every 30 seconds
    # NOTE: prometheus queries return results in the form of a vector.
    # So it is common to access the index 0 of the returned array to obtain the value
    successCondition: len(result) == 0 || isNaN(result[0]) || result[0] <= 0
    failureLimit: 0
    provider:
      prometheus:
        address: http://prometheus-operated.infra.svc.cluster.local:9090 # the local prometheus instance
        query: |
          (
            sum(
              irate(
                http_server_requests_seconds_count{container="{{args.container-name}}",pod=~"{{args.container-name}}-rollout-{{args.stable-hash}}.*",outcome=~"SUCCESS"}[30s]
              )
            ) 
            /
            sum(
              irate(
                http_server_requests_seconds_count{container="{{args.container-name}}",pod=~"{{args.container-name}}-rollout-{{args.stable-hash}}.*"}[30s]
              )
            )
          ) -
          (
            sum(
              irate(
                http_server_requests_seconds_count{container="{{args.container-name}}",pod=~"{{args.container-name}}-rollout-{{args.latest-hash}}.*",outcome=~"SUCCESS"}[30s]
              )
            ) 
            /
            sum(
              irate(
                http_server_requests_seconds_count{container="{{args.container-name}}",pod=~"{{args.container-name}}-rollout-{{args.latest-hash}}.*"}[30s]
              )
            )
          )

Source here.

It takes as arguments the production (stable) and canary ReplicaSet hashes, so that it can run the Prometheus queries against each version and compare the results.

Here the operation computes the success rate of each version (rate of successful requests over rate of all requests) and subtracts the canary's value from the production's.

If the result is less than or equal to zero, it means the canary didn't raise more errors than production. For example, with a stable success rate of 0.99 and a canary success rate of 0.97, the query returns 0.02: the success condition is not met and, with failureLimit set to 0, the analysis fails.

The Rollout canary strategy

In the Rollout strategy, we can now tie all the different pieces together.

Here, the analysis will run in the background while the rollout progresses.

We'll start with 20% of the traffic on the canary and increase it progressively. The rollout is aborted automatically if the analysis fails at any point during the execution.

  strategy:
    canary:
      analysis:
        # Arguments taken by the Analysis
        args:
        - name: container-name
          value: app-prod-my-app
        - name: stable-hash
          valueFrom:
            podTemplateHashValue: Stable
        - name: latest-hash
          valueFrom:
            podTemplateHashValue: Latest
        templates:
        - clusterScope: true
          templateName: http-server-success-rate-comparison # Ref to the AnalysisTemplate
      canaryService: app-prod-my-app-canary-svc # Ref the canary service defined earlier
      scaleDownDelaySeconds: 3
      stableService: app-prod-my-app-svc # Ref to the production service
      steps: # Canary steps
      - setWeight: 20
      - pause:
          duration: 30s
      - setWeight: 40
      - pause:
          duration: 30s
      - setWeight: 60
      - pause:
          duration: 30s
      - setWeight: 80
      - pause:
          duration: 30s
      trafficRouting:
        nginx:
          stableIngress: app-prod-my-app-ingress # Ref to the production ingress

Source here.