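How to port an on-premises batch job (cron + shell script) to AWS Lambda

Hello, this is 大和 from the Engineering Group.
Across the Engineering Group we have been steadily moving off on-premises infrastructure, and have shut down many databases and servers so far.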

www.m3tech.blog

More recently, the Consumer team's work was also covered.

www.m3tech.blog

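The Multi-Device team I belong to also retired the on-premises databases and servers for the services we manage by the end of the previous fiscal year.

As the Consumer team's article also explains, moving off on-premises takes more than migrating the server applications and databases: existing batch jobs, monitoring, and the like have to move too. This article shows how to migrate one kind of simple batch job, a shell script run by cron, to AWS.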

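System architecture
Building the infrastructure
Summary

System architecture

The source system is assumed to be an on-premises server whose cron runs a particular shell script once a day.

The system after migration looks like this. *1
cron is replaced by CloudWatch Events (Amazon EventBridge), and the shell script runs on AWS Lambda. The reason for using DynamoDB is explained later.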

[Figure: system architecture diagram]
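Moving from cron to CloudWatch Events also changes the schedule syntax. As a rough sketch, the hypothetical helper below (`to_eventbridge_cron` is an illustration, not part of the original setup) converts a simple five-field crontab schedule into the six-field `cron()` form. It only handles schedules whose day-of-week field is `*`, because day-of-week numbering differs between the two systems. CloudWatch Events evaluates the expression in UTC, so a schedule written in local time on the on-premises server would need its hour shifted by hand.

```shell
#!/bin/bash

# Convert a simple 5-field crontab schedule (minute hour dom month dow)
# into a CloudWatch Events cron() expression (minute hour dom month dow year).
# Hypothetical helper: only handles schedules whose day-of-week is "*".
to_eventbridge_cron() {
    local minute hour dom month dow extra
    read -r minute hour dom month dow extra <<< "$1"

    if [ -z "$dow" ] || [ -n "$extra" ]; then
        echo "expected exactly 5 fields: $1" >&2
        return 1
    fi
    if [ "$dow" != "*" ]; then
        # Day-of-week numbering differs between crontab and CloudWatch Events,
        # so this sketch refuses to guess.
        echo "day-of-week schedules are not handled: $1" >&2
        return 1
    fi

    # cron() requires "?" in either day-of-month or day-of-week.
    echo "cron(${minute} ${hour} ${dom} ${month} ? *)"
}

to_eventbridge_cron '0 0 * * *'
```

For the daily job assumed in this article, the result matches the `schedule_expression` used in the Terraform configuration.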

Building the infrastructure

We build the infrastructure with Terraform. The configuration was verified with Terraform 1.0.5.

Configuring a Lambda function attached to a VPC

modules/example/variables.tf:

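# ARNs of the Lambda layers described later
variable "lambda_layer_arns" {
  type = list(string)
}

# IDs of the security groups the function is allowed to reach
variable "security_group_ids" {
  type = list(string)
}

# IDs of the VPC subnets to place the function in
variable "subnet_ids" {
  type = list(string)
}

# ID of the VPC to place the function in
variable "vpc_id" {
  type = string
}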

modules/example/main.tf:

data "archive_file" "example" {
  type        = "zip"
  output_path = "${path.module}/src/example/function.zip"

  source {
    content  = file("${path.module}/src/example/function.sh")
    filename = "function.sh"
  }

  source {
    content  = file("${path.module}/src/lock.sh")
    filename = "lock.sh"
  }
}

resource "aws_lambda_function" "example" {
  filename         = data.archive_file.example.output_path
  function_name    = "example"
  handler          = "function.handler"
  layers           = var.lambda_layer_arns
  role             = aws_iam_role.example_lambda.arn
  runtime          = "provided"
  source_code_hash = data.archive_file.example.output_base64sha256

  vpc_config {
    security_group_ids = [aws_security_group.example.id]
    subnet_ids         = var.subnet_ids
  }
}

resource "aws_cloudwatch_log_group" "example" {
  name              = "/aws/lambda/${aws_lambda_function.example.function_name}"
  retention_in_days = 14
}

# IAM settings
resource "aws_iam_role" "example_lambda" {
  name               = "example-lambda"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}

data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role_policy_attachment" "example_lambda_execution_role" {
  role       = aws_iam_role.example_lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_role_policy" "example_lambda" {
  role   = aws_iam_role.example_lambda.name
  policy = data.aws_iam_policy_document.example_lambda.json
}

data "aws_iam_policy_document" "example_lambda" {
  statement {
    actions = [
      "dynamodb:UpdateItem",
    ]
    resources = [
      aws_dynamodb_table.lambda_locks.arn,
    ]
  }
}

# Security Group
resource "aws_security_group" "example" {
  name   = "example"
  vpc_id = var.vpc_id

  egress = [
    {
      description      = "Egress"
      from_port        = 0
      to_port          = 0
      protocol         = "-1"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
      security_groups  = null
      self             = null
      prefix_list_ids  = null
    },
    {
      description      = "Egress to API"
      from_port        = 443
      to_port          = 443
      protocol         = "tcp"
      cidr_blocks      = null
      ipv6_cidr_blocks = null
      security_groups  = var.security_group_ids
      self             = null
      prefix_list_ids  = null
    },
  ]
}

# Allow CloudWatch Events to invoke the function
resource "aws_lambda_permission" "example" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.example.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.example.arn
}

resource "aws_cloudwatch_event_rule" "example" {
  name                = aws_lambda_function.example.function_name
  schedule_expression = "cron(0 0 * * ? *)"
}

resource "aws_cloudwatch_event_target" "example" {
  arn  = aws_lambda_function.example.arn
  rule = aws_cloudwatch_event_rule.example.name
}

# DynamoDB
resource "aws_dynamodb_table" "lambda_locks" {
  name         = "LambdaLocks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "key"

  attribute {
    name = "key"
    type = "S"
  }

  ttl {
    attribute_name = "ttl"
    enabled        = true
  }
}

Configuring the Lambda layers

Following the official AWS documentation, we prepare an environment for running shell scripts on Lambda.

docs.aws.amazon.com

We also add a Lambda layer that provides awscli; it is covered later.

modules/layers/src/bash_runtime/bootstrap:

#!/bin/sh

set -euo pipefail

# Initialization - load function handler
source $LAMBDA_TASK_ROOT/"$(echo $_HANDLER | cut -d. -f1).sh"

# Processing
while true
do
    HEADERS="$(mktemp)"
    # Get an event. The HTTP request will block until one is received
    EVENT_DATA=$(curl -sS -LD "$HEADERS" -X GET "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next")

    # Extract request ID by scraping response headers received above
    REQUEST_ID=$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)

    # Run the handler function from the script
    RESPONSE=$($(echo "$_HANDLER" | cut -d. -f2) "$EVENT_DATA")

    # Send the response
    curl -X POST "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/$REQUEST_ID/response"  -d "$RESPONSE"
done
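The two text-processing steps in bootstrap can be tried locally without the Lambda Runtime API. The sketch below wraps them in helper functions whose names are hypothetical, and feeds them a made-up header dump standing in for the real response. It shows how a handler value such as `function.handler` is split into a script file and a function name, and how the request ID is scraped from the saved headers.

```shell
#!/bin/bash

# Split a handler string such as "function.handler" the same way bootstrap does.
handler_file()     { echo "$(echo "$1" | cut -d. -f1).sh"; }
handler_function() { echo "$1" | cut -d. -f2; }

# Pull the request ID out of a saved header dump with the same
# grep/tr/cut pipeline that bootstrap uses.
extract_request_id() {
    grep -Fi Lambda-Runtime-Aws-Request-Id "$1" | tr -d '[:space:]' | cut -d: -f2
}

# Made-up header dump standing in for the Runtime API response.
HEADERS="$(mktemp)"
printf 'HTTP/1.1 200 OK\r\nLambda-Runtime-Aws-Request-Id: 8476a536-e9f4-11e8-9739-2dfe598c3fcd\r\n\r\n' > "$HEADERS"

echo "file:     $(handler_file 'function.handler')"
echo "function: $(handler_function 'function.handler')"
echo "request:  $(extract_request_id "$HEADERS")"
```

This is why the Terraform configuration sets `handler = "function.handler"`: the part before the dot selects function.sh and the part after it names the shell function to call.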

modules/layers/main.tf:

resource "aws_lambda_layer_version" "aws_cli" {
  filename   = "${path.module}/src/aws_cli/layer.zip"
  layer_name = "aws_cli"

  compatible_runtimes = ["provided"]
}

data "archive_file" "bash_runtime" {
  type        = "zip"
  source_file = "${path.module}/src/bash_runtime/bootstrap"
  output_path = "${path.module}/src/bash_runtime/layer.zip"
}

resource "aws_lambda_layer_version" "bash_runtime" {
  filename   = data.archive_file.bash_runtime.output_path
  layer_name = "bash_runtime"

  compatible_runtimes = ["provided"]
}

modules/layers/outputs.tf:

output "aws_cli_layer_arn" {
  value = aws_lambda_layer_version.aws_cli.arn
}

output "bash_runtime_layer_arn" {
  value = aws_lambda_layer_version.bash_runtime.arn
}

The awscli Lambda layer is best built on Amazon Linux, so we add a shell script that can be run on any CI service. Use a Docker image such as amazonlinux:latest. *2

modules/layers/src/aws_cli/build.sh:

#!/bin/sh
set -eux

AWS_CLI_VERSION=2.4.22

mkdir build
cd build || exit 1

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64-${AWS_CLI_VERSION}.zip" -o awscliv2.zip
unzip awscliv2.zip
./aws/install -i aws-cli

rm aws-cli/v2/current
ln -s $AWS_CLI_VERSION aws-cli/v2/current

mkdir bin
ln -s ../aws-cli/v2/current/bin/aws bin/aws
ln -s ../aws-cli/v2/current/bin/aws_completer bin/aws_completer

zip -ry ../layer.zip aws-cli bin
cd -
rm -r build
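build.sh replaces the `current` link with a relative one and links `bin/aws` relatively as well. A likely reason is that the layer contents end up in a different directory at runtime (Lambda extracts layers under /opt), so links that embed the build path would break. The sketch below illustrates that property with a fake directory tree and a fake `aws` script; none of it touches the real installer, and `make_fake_layout` is a hypothetical helper.

```shell
#!/bin/bash

# Build a stand-in for the layout produced by build.sh:
#   <root>/aws-cli/v2/<version>/bin/aws and <root>/bin/aws
make_fake_layout() {
    local root="$1" version="$2"
    mkdir -p "$root/aws-cli/v2/$version/bin" "$root/bin"
    printf '#!/bin/sh\necho "fake aws %s"\n' "$version" \
        > "$root/aws-cli/v2/$version/bin/aws"
    # Relative links, as in build.sh
    ln -s "$version" "$root/aws-cli/v2/current"
    ln -s ../aws-cli/v2/current/bin/aws "$root/bin/aws"
}

WORK_DIR="$(mktemp -d)"
make_fake_layout "$WORK_DIR/build" "2.4.22"

# Relocate the whole tree, as happens when the zip is extracted elsewhere.
mv "$WORK_DIR/build" "$WORK_DIR/opt"

# The links still resolve because they never mention the build directory.
RELOCATED_OUTPUT="$(sh "$WORK_DIR/opt/bin/aws")"
echo "$RELOCATED_OUTPUT"
```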

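Configuring the shell script

Place the shell script brought over from on-premises, modifying parts of it as needed.
handler is the entry point, so the processing goes there (this example assumes the job does nothing more than call curl).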

modules/example/src/example/function.sh:

#!/bin/bash
set -eux

API_URL='https://example.com/api/foo'
USER_AGENT='Example/1.0.0'

source "$LAMBDA_TASK_ROOT/lock.sh"

function handler() {
    if ! check_lock "example"; then
        echo "Skip"
        exit 0
    fi

    curl -A "${USER_AGENT}" "${API_URL}"
}
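One detail worth checking is the `exit 0` in the skip branch. bootstrap calls the handler inside `$(...)`, which runs in a subshell, so the exit ends only that subshell and the text printed so far becomes the response. The sketch below reproduces this locally with a stubbed `check_lock` that always reports the lock as taken; the stub and the placeholder `echo` are assumptions for illustration.

```shell
#!/bin/bash

# Stub standing in for lock.sh: pretend today's lock was already taken.
check_lock() { return 1; }

handler() {
    if ! check_lock "example"; then
        echo "Skip"
        exit 0
    fi
    echo "called API"   # placeholder for the real curl call
}

# bootstrap invokes the handler inside $(...), i.e. in a subshell,
# so "exit 0" does not terminate the calling script.
RESPONSE=$(handler '{}')
echo "response=${RESPONSE}; the runtime loop is still alive"
```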

modules/example/src/lock.sh:

#!/bin/bash

function check_lock() {
    local NAME="$1"

    # Assumes the job runs once per day
    local LOCK_KEY="${NAME}-$(date '+%Y-%m-%d')"
    local TTL=$(date -d '+1 hour' '+%s')

    local RES=$(aws dynamodb update-item \
        --table-name LambdaLocks \
        --key "{\"key\": {\"S\": \"${LOCK_KEY}\"}}" \
        --expression-attribute-names '{"#ttl": "ttl"}' \
        --expression-attribute-values "{\":ttl\": {\"N\": \"${TTL}\"}}" \
        --update-expression 'SET #ttl = if_not_exists(#ttl, :ttl)' \
        --return-values ALL_OLD)

    [ -z "${RES}" ]
}
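The key and TTL computed at the top of `check_lock` can be checked without touching DynamoDB. The following is a minimal sketch, assuming GNU date and UTC; `lock_key` and `lock_ttl` are hypothetical helpers that take the current time as an argument instead of reading the clock. The optional format argument shows how the key could be made finer-grained for a job that runs more often than once a day.

```shell
#!/bin/bash

# Hypothetical variants of the key/TTL computation in lock.sh.
# The current time is passed in as epoch seconds so results are reproducible.
# Requires GNU date (-d); times are formatted in UTC.
lock_key() {
    local name="$1" now_epoch="$2" format="${3:-%Y-%m-%d}"
    echo "${name}-$(date -u -d "@${now_epoch}" "+${format}")"
}

lock_ttl() {
    local now_epoch="$1" lifetime_seconds="${2:-3600}"
    echo $(( now_epoch + lifetime_seconds ))
}

NOW=1648684800   # 2022-03-31 00:00:00 UTC

lock_key example "$NOW"                 # daily job:  example-2022-03-31
lock_key example "$NOW" '%Y-%m-%dT%H'   # hourly job: example-2022-03-31T00
lock_ttl "$NOW"                         # one hour later: 1648688400
```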

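check_lock checks DynamoDB for a record of a previous run so that the job is not executed more than once. CloudWatch Events rules guarantee at-least-once invocation, so on rare occasions the same scheduled event triggers the function more than once.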

docs.aws.amazon.com

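The usual way to avoid problems from this is to keep the API idempotent, but here we assume the API is not idempotent and suppress duplicate runs instead (run conditions vary, so adjust as needed). Note that DynamoDB TTL deletion can lag by up to 48 hours, so an expired item may still be present in the table. The partition key therefore has to be designed so that keys from different runs do not collide; in this example it includes the date.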

docs.aws.amazon.com
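The effect of the lock under duplicate delivery can be simulated locally. The sketch below is not the DynamoDB implementation: it swaps in a temporary directory as a stand-in for the table and relies on `mkdir` failing when the entry already exists, which gives the same "first writer wins" outcome for this demonstration. `check_lock_local` and `invoke` are hypothetical names.

```shell
#!/bin/bash

# Local stand-in for the DynamoDB table: one directory per lock key.
# mkdir fails if the directory already exists, so only the first caller
# for a given key succeeds.
LOCK_DIR="$(mktemp -d)"

check_lock_local() {
    mkdir "${LOCK_DIR}/$1" 2>/dev/null
}

RUN_COUNT=0
invoke() {
    if ! check_lock_local "example-2022-03-31"; then
        echo "Skip"
        return 0
    fi
    RUN_COUNT=$(( RUN_COUNT + 1 ))   # stands in for the non-idempotent API call
    echo "Run"
}

invoke   # first delivery of the scheduled event
invoke   # duplicate delivery of the same event
echo "API calls: ${RUN_COUNT}"
```

Because the key contains the date, a run on the following day uses a new key and is not blocked by an expired item that has not been deleted yet.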

Defining the resources

Finally, define the resources using the modules created above.

main.tf:

module "layers" {
  source = "./modules/layers"
}

module "example" {
  source = "./modules/example"

  # Set these from the attributes of resources you have already defined
  security_group_ids = [...]
  subnet_ids         = [...]
  vpc_id             = ...

  # List the layers in the order they should be applied
  lambda_layer_arns = [
    module.layers.bash_runtime_layer_arn,
    module.layers.aws_cli_layer_arn,
  ]
}

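Summary

This article showed how to port a shell script run by cron using CloudWatch Events + AWS Lambda. For light batch jobs that finish within Lambda's maximum execution time of 15 minutes, porting to Lambda is a reasonable stopgap when moving off on-premises. If you know of a simpler or more convenient approach, please tell us about it in a casual interview.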

jobs.m3.com

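*1: Strictly speaking, without a VPC endpoint, traffic to DynamoDB goes over the internet. In our case the existing VPC already had one configured.

*2: Note that the awscli zip can be downloaded for any architecture, so the build can also be run on another OS.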