KINTO Tech Blog
PlatformEngineering

Implementing Blue/Green Deployment with GitHub Actions + ECS


Introduction

Hello. I’m Shimamura from the Platform Group’s Operation Tool Management Team, where I work in platform engineering, focusing on tool development and operations.
I'm Yamada, also part of the Platform Group’s Operation Tool Management Team, where I focus on developing in-house tools.

At KINTO Technologies, we use Amazon ECS + Fargate as our application platform and GitHub Actions for CI/CD. For Blue/Green deployment on ECS, CODE_DEPLOY is the option primarily used for the DeploymentController, and we believe there are few real-world examples where EXTERNAL (third-party control) is implemented. At the CI/CD Conference 2023 hosted by CloudNative Days, we also encountered an example of migrating from ECS to Kubernetes specifically to enable Blue/Green deployments ("Chiisaku Hajimeru Blue/Green Deployment," i.e., "Blue/Green Deployment That Starts Small").

However, we wondered whether it might be possible to perform Blue/Green deployments on ECS without being bound by CodeDeploy's constraints. We also felt that offering multiple deployment methods would benefit the departments developing applications, so we began exploring our options. Even though CODE_DEPLOY is the more common setting and there is little documentation on using EXTERNAL for this purpose, we successfully built a system around EXTERNAL and now provide it to the application teams.
We'll share this as a real-world example of implementing Blue/Green deployment on ECS (Fargate) with an external pipeline tool.

Background

Issues

  • Relying solely on ECS rolling updates may not fully meet the requirements for future releases.
  • It’s essential to offer a variety of deployment methods and deploy applications in a way that aligns with their specific characteristics.

Solution

As a first step, we decided to introduce Blue/Green deployment on ECS. Canary releases may present challenges down the road, but since we were able to implement Blue/Green deployment in this form, we expect to be able to adapt it, for example by setting the traffic ratio and other parameters via the CLI.

Design

Checking with CODE_DEPLOY

If you search for “ECS Blue/Green deployment,” you will find a wide variety of information. Rather than leaving it at that, we’d like to summarize the key points and the overall setup.
CODE_DEPLOY diagram

This is the configuration: you configure the various settings in CodeDeploy, new tasks are created from the task definition, and the traffic ratio is adjusted according to the deployment settings.
You can switch over all at once, test a portion first, or shift traffic gradually, depending on your needs.

Requirements we initially thought might be unattainable

When we reviewed the environment and operation under CodeDeploy, certain aspects raised concerns for us. It could all come down to specific settings, so if you have any insights, please feel free to share.

  • We plan to verify the operation by running a test system for a certain period, allowing for customer review and other checks.
    • The system can be maintained for about a day, but the deployment will fail if the switchover button isn't pressed once that timeframe has elapsed.
  • We’d like the option to terminate the old application at a chosen time after the switchover.
    • In CodeDeploy, a time limit can be configured, but it doesn’t allow for arbitrary timing (see the sketch after this list).
  • Reverting back through the console appears to be a complex process.
    • The process becomes cumbersome because, due to the permissions setup, you need to use SwitchRole to access it from the console.
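
For reference, the behaviors above map to settings on the CodeDeploy deployment group. Below is a rough Terraform sketch of the knobs involved; it is not our actual configuration, all names are hypothetical, and the referenced cluster, service, listener, target groups, and IAM role are assumed to be defined elsewhere.

resource "aws_codedeploy_app" "ecs" {
  compute_platform = "ECS"
  name             = "sample-app" # hypothetical
}

resource "aws_codedeploy_deployment_group" "ecs" {
  app_name               = aws_codedeploy_app.ecs.name
  deployment_group_name  = "sample-app-dg"             # hypothetical
  service_role_arn       = aws_iam_role.codedeploy.arn # assumed role
  deployment_config_name = "CodeDeployDefault.ECSAllAtOnce"

  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  ecs_service {
    cluster_name = aws_ecs_cluster.main.name
    service_name = aws_ecs_service.app.name
  }

  blue_green_deployment_config {
    # How long CodeDeploy waits for the manual switchover; with STOP_DEPLOYMENT,
    # the deployment fails if traffic is not rerouted within this window.
    deployment_ready_option {
      action_on_timeout    = "STOP_DEPLOYMENT"
      wait_time_in_minutes = 1440
    }
    # The old (blue) task set is terminated after a fixed wait,
    # not at an arbitrary time of our choosing.
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 60
    }
  }

  load_balancer_info {
    target_group_pair_info {
      prod_traffic_route {
        listener_arns = [aws_lb_listener.prod.arn]
      }
      target_group {
        name = aws_lb_target_group.blue.name
      }
      target_group {
        name = aws_lb_target_group.green.name
      }
    }
  }
}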

Overall configuration with EXTERNAL

Diagram

Component (Element)

  • Terraform: A product for writing various services, AWS among them, as code (IaC). Our in-house design patterns and modules are built with Terraform.
  • GitHub Actions: The CI/CD tool included with GitHub. At KINTO Technologies, we use GitHub Actions for tasks such as building and releasing applications. A GitHub Actions pipeline deploys the new application and transitions traffic away from the old one.
  • ECS (Elastic Container Service): The runtime environment for our applications. The DeploymentController can be set to ECS, CODE_DEPLOY, or EXTERNAL; this example implements it with EXTERNAL.
  • DeploymentController: We view this as a kind of control plane for ECS (or at least, that’s how we see it internally).
  • TaskSet: A collection of tasks linked to an ECS service. You can create one via the CLI, but apparently not via the console. Task sets let you run multiple task definition revisions in parallel for a single service (see the CLI reference). Setting one up requires an ALB, a Target Group, and several other components, so there is quite a bit of configuration involved.
  • ALB ListenerRule: A rule for routing requests to Target Groups within the ALB. In Blue/Green deployment, modifying this link toggles the traffic flow between the old and new applications.

Restrictions

  • The DeploymentController in ECS can only be set during service creation, meaning it cannot be modified for existing services.
  • When using EXTERNAL, the platform version isn’t fixed by the service; it’s specified when creating a TaskSet.
  • The service’s launch type shows as EC2. However, if you specify FARGATE when creating a TaskSet, the tasks are launched on Fargate.

Implementation

Terraform

At KINTO Technologies, we use Terraform as our IaC tool. We have also packaged this configuration as a module, and here I’ll outline the key points that came up while modifying the module.

ListenerRule

Using GitHub Actions, we modify the ListenerRule to update the TargetGroup, so we configure ignore_changes to prevent unnecessary updates from Terraform.
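
A minimal sketch of what this looks like; the resource names, priority, and host header value are hypothetical, and the listener and target groups are assumed to be defined elsewhere.

resource "aws_lb_listener_rule" "app" {
  listener_arn = aws_lb_listener.prod.arn # assumed listener
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn
  }

  condition {
    host_header {
      values = ["domain.com"]
    }
  }

  # The GitHub Actions workflow rewrites which host header forwards to which
  # target group at release time, so ignore those attributes here to keep
  # `terraform apply` from reverting the switch.
  lifecycle {
    ignore_changes = [action, condition]
  }
}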

ECS service

  • NetworkConfiguration
  • LoadBalancer
  • ServiceRegistries

When the DeploymentController is EXTERNAL, these three options cannot be configured on the service. If you generate them with dynamic blocks or similar, make sure they are not created in this case. The service also won’t be registered in CloudMap, so if you plan to integrate it with AppMesh or similar services, you’ll need to account for this. There’s no issue with using AppMesh for communication between ECS services, even if one of them is configured for Blue/Green deployment.
Since Blue/Green deployment runs the old and new versions in parallel, registering the service in CloudMap and allowing communication could result in unintended or erroneous access, so we believe the current behavior is likely correct.
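
A minimal sketch of the service definition under these constraints; the names are hypothetical and the cluster is assumed to be defined elsewhere.

resource "aws_ecs_service" "app" {
  name          = "sample-service"        # hypothetical
  cluster       = aws_ecs_cluster.main.id # assumed cluster
  desired_count = 2

  # EXTERNAL can only be chosen when the service is created;
  # it cannot be changed on an existing service.
  deployment_controller {
    type = "EXTERNAL"
  }

  # With EXTERNAL, network_configuration, load_balancer, and service_registries
  # are not set on the service itself; they are supplied per TaskSet
  # (for example via `aws ecs create-task-set`).
}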

IAM policy for roles for CI/CD

In addition to the ECS system, various other permissions are also required. A sample is as follows.

cicd_policy.tf(sample)
resource "aws_iam_policy" "cicd-bg-policy" {
  name = "cicd-bg_policy"
  path = "/"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = [
          "iam:PassRole"
        ]
        Effect   = "Allow"
        Resource = "arn:aws:iam::{ACCOUNT}:role/{ROLE名}"
      },
      {
        Action = [
          "ecs:DescribeServices"
        ]
        Effect   = "Allow"
        Resource = "arn:aws:ecs:{REGION}:{ACCOUNT}:service/{ECS_CLUSTER_NAME}/{ECS_SERVICE_NAME}"
      },
      {
        Action = [
          "ecs:CreateTaskSet",
          "ecs:DeleteTaskSet"
        ]
        Effect   = "Allow"
        Resource = "*"
        Condition = {
          StringLike = {
            "ecs:service" = [
              "arn:aws:ecs:{REGION}:{ACCOUNT}:service/{ECS_CLUSTER_NAME}/{ECS_SERVICE_NAME}"
            ]
          }
        }
      },
      {
        Action = [
          "ecs:RegisterTaskDefinition",
          "ecs:DescribeTaskDefinition"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "elasticloadbalancing:ModifyRule"
        ]
        Effect   = "Allow"
        Resource = "arn:aws:elasticloadbalancing:{REGION}:{ACCOUNT}:listener-rule/app/{ALB_NAME}/*"
      },
      {
        Action = [
          "elasticloadbalancing:DescribeLoadBalancers",
          "elasticloadbalancing:DescribeListeners",
          "elasticloadbalancing:DescribeRules",
          "elasticloadbalancing:DescribeTargetGroups"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
      {
        Action = [
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups"
        ]
        Effect   = "Allow"
        Resource = "*"
      },
    ]
  })
}

Please replace the ECS cluster name, ECS service name, and ALB name with the appropriate values, and make sure they match the scope of your CI/CD roles and any other applicable permissions. The CreateTaskSet and DeleteTaskSet permissions are not restricted to specific resources; instead, a condition pins them to the target ECS service. The elasticloadbalancing Describe* permissions, along with ec2:DescribeSubnets and ec2:DescribeSecurityGroups, are used by the workflows to look up status information.
elasticloadbalancing:ModifyRule is, needless to say, necessary for rewriting the ListenerRule at release time.
The ModifyRule statement is scoped to the ALB name rather than to individual rule ARNs, since listener rule ARNs contain randomly assigned values.

GitHub Actions

At KINTO Technologies, we use GitHub Actions for our CI/CD tool.
Our process involves developing standardized CI/CD workflows within the Platform Group and then supplying them to the app development teams.

Workflow overview

In the workflows for this project, we created a Blue/Green deployment system according to the steps below. In this article, we will only cover the deployment workflow.
Workflow diagram

Key considerations and points of caution

As the provider of these workflows to the app development teams, we paid close attention to the following key points:

  • An implementation that minimizes parameter specification at runtime to reduce the risk of errors or misoperations.
  • Since these workflows require manual execution, all parameters that can be retrieved via the CLI are gathered within the workflows themselves. This approach ensures that incorrect parameters aren’t specified at runtime.
  • Simplified workflow setup
    • Implementation that uses secrets as little as possible
    • The AWS resource names are set through environment variables, with fixed values used for all except system-specific ones. This approach minimizes the need for configuration.
    • Registering all the AWS resource ARNs as secrets would remove the need for in-workflow processing to look up ARNs from resource names, reducing the amount of code. However, to minimize the initial configuration workload, we implemented a CLI-driven process that resolves ARNs from resource names, so almost no manual configuration is required.

Workflow implementation

Here, we would like to explain the main processes of each workflow using sample code.
All the workflows follow roughly the same pattern:
Get the AWS credentials → get the required parameters via the CLI → run validation checks → execute.

Creating the task set

The workflow’s runtime parameters are the ECR (Elastic Container Registry) image tag and the target environment.
Before creating the task set, perform validation checks to ensure that the target group is suitable for testing and that the image tags for the runtime parameters exist in the ECR. After that, create the task definition from the image tags. Once the task definition has been created, you get the parameters (the subnets, security groups, and task definition) that will be needed when creating the task set, then run the CLI to create it.

jobs:
  ...
  ## Check the target group to be used
  check-available-targetGroup:
    ...
  ## Create the task definition from the ECR images
  deploy-task-definition:
    ...
  ## Create the task set
  create-taskset:
    runs-on: ubuntu-latest
    needs: deploy-task-definition
    steps:
      # Get the AWS Credentials
      - name: Set AWS Credentials
        ...
      - name: Get the target group
        ...
      # Create the task set
      - name: Create TaskSet
        run: |
          # Get the task definition ARN
          taskDefinition=`aws ecs describe-task-definition\
            --task-definition ${{ env.TASK_DEFINITION }}\
            | jq -r '.taskDefinition.taskDefinitionArn'`
          echo $taskDefinition
          # Get the subnets
          subnetList=(`aws ec2 describe-subnets | jq -r '.Subnets[] | select(.Tags[]?.Value | startswith("${{ env.SUBNET_PREFIX }}")) | .SubnetId'`)
          if [ "$subnetList" == "" ]; then
            echo !! Unable to get the subnets, so processing will be aborted.
            exit 1
          fi
          # Get the security groups
          securityGroupArn1=`aws ec2 describe-security-groups | jq -r '.SecurityGroups[] | select(.Tags[]?.Value == "${{ env.SECURITY_GROUP_1 }}") | .GroupId'`
          if [ "$securityGroupArn1" == "" ]; then
            echo !! Unable to get the security groups, so processing will be stopped.
            exit 1
          fi
          securityGroupArn2=`aws ec2 describe-security-groups | jq -r '.SecurityGroups[] | select(.Tags[]?.Value == "${{ env.SECURITY_GROUP_2 }}") | .GroupId'`
          if [ "$securityGroupArn2" == "" ]; then
            echo !! Unable to get the security groups, so processing will be stopped.
            exit 1
          fi
          echo ---------------------------------------------
          echo Creating the task set
          aws ecs create-task-set\
            --cluster ${{ env.CLUSTER_NAME }}\
            --service ${{ env.SERVICE_NAME }}\
            --task-definition ${taskDefinition}\
            --launch-type FARGATE\
            --network-configuration "awsvpcConfiguration={subnets=["${subnetList[0]}","${subnetList[1]}"],securityGroups=["${securityGroupArn1}","${securityGroupArn2}"]}"\
            --scale value=100,unit=PERCENT\
            --load-balancers targetGroupArn="${{ env.createTaskTarget }}",containerName=application,containerPort=${{ env.PORT }}

Switching listener rules

The workflow for switching listener rules begins by retrieving and verifying the number of task sets currently running.
If only the production environment’s task set is running (with a single task set), and you switch between the listener rules for the production and test environments, the task set associated with the production environment will be removed. To prevent this issue, our implementation checks the number of running task sets. If there is only one or fewer, the process halts without switching listener rules.
After that, it switches the production and test listener rules. There is no single CLI command that swaps two listener rules, so although we call it “switching,” strictly speaking we run modify-rule against each rule to change it. Because the two rule changes run in parallel, we add a sleep command to stagger their timing so that a small timing difference can’t leave both listener rules pointing at the test environment.

env:
  RULE_PATTERN: host-header ## http-header / host-header / path-pattern / source-IP, etc.
  PROD_PARAM: domain.com
  TEST_PARAM: test.domain.com
  ...
jobs:
  ## If there is one task set or fewer running, prevent the host header from being changed
  check-taskSet-counts:
    runs-on: ubuntu-latest
    steps:
      ## Get the AWS Credentials
      - name: Set AWS Credentials
        ...
      # Validation
      - name: Check TaskSet Counts
        run: |
          taskSetCounts=(`aws ecs describe-services --cluster ${{ env.CLUSTER_NAME }}\
            --service ${{ env.SERVICE_NAME }}\
            --region ${{ env.AWS_REGION }}\
            | jq -r '.services[].taskSets | length'`)
          if [ "$taskSetCounts" == "" ]; then
            echo !! Unable to get the number of running task sets, so processing will be aborted.
            exit 1
          fi
          echo Number of running task sets: $taskSetCounts
          if [ $taskSetCounts -le 1 ]; then
            echo !! The number of running task sets is 1 or less, so processing will be aborted.
            exit 1
          fi
  ## Switch between ALB listener rules (production, test)
  change-listener-rule-1:
    runs-on: ubuntu-latest
    needs: check-taskSet-counts
    steps:
      ## Get the AWS Credentials
      - name: Set AWS Credentials
        ...
      - name: Change Listener Rules
        run: |
          # Get the ALB ARN from the ALB name
          albArn=`aws elbv2 describe-load-balancers --names ${{ env.ALB_NAME }} | jq -r .LoadBalancers[].LoadBalancerArn`
          # Get the listener ARN from the ALB ARN
          listenerArn=`aws elbv2 describe-listeners --load-balancer-arn ${albArn} | jq -r .Listeners[].ListenerArn`
          # Get the listener rule ARN from the listener ARN
          listenerRuleArnList=(`aws elbv2 describe-rules --listener-arn ${listenerArn} | jq -r '.Rules[] | select(.Priority != "default") | .RuleArn'`)
          pattern=`aws elbv2 describe-rules --listener-arn ${listenerArn}\
            | jq -r --arg listener_rule ${listenerRuleArnList[0]} '.Rules[] | select(.RuleArn  == $listener_rule) | .Conditions[].Values[]'`
          if [ "$pattern" == "" ]; then
            echo !! Unable to get the listener rule, so processing will be stopped.
            exit 1
          fi
          echo ---------------------------------------------
          echo Current rule pattern: $pattern
          echo ---------------------------------------------
          if [ $pattern == "${{ env.TEST_PARAM }}" ]; then
            aws elbv2 modify-rule --rule-arn ${listenerRuleArnList[0]} --conditions Field="${{ env.RULE_PATTERN }}",Values="${{ env.PROD_PARAM }}"
          else
            sleep 5s
            aws elbv2 modify-rule --rule-arn ${listenerRuleArnList[0]} --conditions Field="${{ env.RULE_PATTERN }}",Values="${{ env.TEST_PARAM }}"
          fi
          echo ---------------------------------------------
          echo Rule pattern after change
          aws elbv2 describe-rules --listener-arn ${listenerArn}\
            | jq -r --arg listener_rule ${listenerRuleArnList[0]} '.Rules[] | select(.RuleArn  == $listener_rule) | .Conditions[].Values[]'
  ## Switch between ALB listener rules (production, test)
  change-listener-rule-2:
    ...
    The processing is the same as for change-listener-rule-1, and only the specification of listenerRuleArnList elements differs
    ...

Deleting the task set

In the task set deletion workflow, the only runtime parameters are the environments.
If you specify the task set ID to be deleted as a parameter, the workflow only requires a single CLI command to delete that task set ID. This simplifies the process to a single line, aside from obtaining AWS credentials and other setup steps. However, if you accidentally specify a task set ID that is currently in production, there is a risk that the production task set could be deleted, leaving only the test environment active.
Therefore, we implemented a solution where the runtime parameters are limited to the environments only. The workflow retrieves and deletes the task set for the test environment directly within the workflow implementation.

env:
  TEST_PARAM: test.domain.com # Host header for testing
  ...
jobs:
  ## Delete the task set
  delete-taskset:
    runs-on: ubuntu-latest
    steps:
      ## Get the AWS Credentials
      - name: Set AWS Credentials
        ...
      # Get the target group linked to the test host header
      - name: Get TargetGroup
        run: |
          # Get the ALB ARN from the ALB name
          albArn=`aws elbv2 describe-load-balancers --names ${{ env.ALB_NAME }} | jq -r .LoadBalancers[].LoadBalancerArn`
          # Get the listener ARN from the ALB ARN
          listenerArn=`aws elbv2 describe-listeners --load-balancer-arn ${albArn} | jq -r .Listeners[].ListenerArn`
          # Get the target group linked to the test rules from the listener’s ARN and the test host header
          testTargetGroup=`aws elbv2 describe-rules --listener-arn ${listenerArn}\
            | jq -r '.Rules[] | select(.Conditions[].Values[] == "${{ env.TEST_PARAM }}") | .Actions[].TargetGroupArn'`
          echo "testTargetGroup=${testTargetGroup}" >> $GITHUB_ENV
      # Get the task set ID linked to the test host header’s target group by the listener rules
      - name: Get TaskSetId
        run: |
          taskId=`aws ecs describe-services\
            --cluster ${{ env.CLUSTER_NAME }}\
            --service ${{ env.SERVICE_NAME }}\
            --region ${{ env.AWS_REGION }}\
            | jq -r '.services[].taskSets[] | select(.loadBalancers[].targetGroupArn == "${{ env.testTargetGroup }}") | .id'`
          if [ "$taskId" == "" ]; then
            echo !! Unable to find the task set linked to the test host header’s target group, so processing will be aborted.
            exit 1
          fi
          echo The task set ID to be deleted
          echo $taskId
          echo "taskId=${taskId}" >> $GITHUB_ENV
      # Delete the task set from the task set ID obtained
      - name: Delete TaskSet
        run: |
          aws ecs delete-task-set --cluster ${{ env.CLUSTER_NAME }} --service ${{ env.SERVICE_NAME }} --task-set ${{ env.taskId }}

Next steps

We plan to refine the ALB ListenerRule component and explore enabling a canary release, but first, we need user feedback. For now, we are rolling it out to the application side to gather insights and improvements.
In our GitHub Actions workflows, we minimized the use of secrets as much as possible. However, they still require setting numerous environment variables, and we aim to reduce this dependency in the future.
For instance, we could potentially configure it so that only system-specific values are set via environment variables, minimizing the need for additional variable settings. We are also looking into whether we can switch between listener rules safely and instantaneously.

Impressions

As mentioned earlier, there are likely very few real-world examples of Blue/Green deployment with ECS + EXTERNAL (using GitHub Actions). We’ve reached this point by building a system from scratch, with no existing documentation to guide us. In hindsight, while implementing GitHub Actions workflows wasn’t inherently difficult, we were able to come up with several effective ideas to create workflows that are both straightforward (with minimal setup) and safe to use. Looking ahead, we aim to enhance this system by having people use it and then refining it based on their feedback.

Summary

The Operation Tool Management Team oversees and develops tools used internally throughout the organization. We leverage tools and solutions created by other teams within the Platform Group. Based on the company's requirements, we either develop new tools from scratch or migrate existing components as needed. If you’re interested in these activities or would like to learn more, please don’t hesitate to reach out to us.
