AWS Systems Manager (SSM) Cross Region Replication

Replicate SSM parameters to another region using AWS Lambda & SQS.

Overview of SSM Replication

This blog post will explain in detail how to set up cross region replication for AWS Parameter Store. As of the writing of this blog post, AWS does not have a native feature for replicating parameters in SSM. If you are using SSM Parameter Store instead of Secrets Manager and are seeking a way to replicate parameters for DR/Multi-Region purposes, this post may help you.

Diagram showing the architecture setup:

Serverless Framework Setup

I used the Lamby cookie-cutter as the framework for this Lambda, which made a lot of the initial setup very easy! Please take a look at that site and set up your serverless framework for the work ahead. I will first share the CloudFormation template used, then share the code that makes the replication work and explain in detail what's happening.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: AWS SSM regional replication for multi-region setup

Parameters:

  StageEnv:
    Type: String
    Default: dev
    AllowedValues:
      - test
      - dev
      - staging
      - prod

Mappings:
  KmsMap:
    us-east-1:
      dev: 'arn:aws:kms:us-east-1:123456:key/super-cool-key1'
      staging: 'arn:aws:kms:us-east-1:123456:key/super-cool-key2' 
      prod: 'arn:aws:kms:us-east-1:123456:key/super-cool-key3'
    us-east-2:
      dev: 'arn:aws:kms:us-east-2:123456:key/super-cool-key1'
      staging: 'arn:aws:kms:us-east-2:123456:key/super-cool-key1'
      prod: 'arn:aws:kms:us-east-2:123456:key/super-cool-key1' 
  DestinationMap:
    us-east-1: 
      target: "us-east-2"

Resources:

  ReplicationQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub 'SSM-SQS-replication-${StageEnv}-${AWS::Region}'
      VisibilityTimeout: 1000

  LambdaRegionalReplication:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: .
      Handler: lib/ssm_regional_replication.handler
      Runtime: ruby2.7
      Timeout: 900
      MemorySize: 512
      Environment:
        Variables:
          STAGE_ENV: !Ref StageEnv
          TARGET_REGION: !FindInMap [DestinationMap, !Ref AWS::Region, target]
          SKIP_SYNC: 'skip_sync'
      Events:
        InvokeFromSQS:
          Type: SQS
          Properties:
            Queue: {"Fn::GetAtt" : [ "ReplicationQueue", "Arn" ]}
            BatchSize: 1
            Enabled: true
        ReactToSSM:
          Type: EventBridgeRule
          Properties:
            Pattern:
              detail-type:
                - Parameter Store Change 
              source:
                - aws.ssm
      Policies:
      - Statement:
        - Sid: ReadSSM
          Effect: Allow
          Action:
          - ssm:GetParameter
          - ssm:GetParameters
          - ssm:PutParameter
          - ssm:DeleteParameter
          - ssm:AddTagsToResource
          - ssm:ListTagsForResource
          Resource: 
          - !Sub "arn:aws:ssm:*:${AWS::AccountId}:parameter/*"
      - Statement:
        - Sid: DecryptSSM
          Effect: Allow
          Action:
          - kms:Decrypt
          - kms:Encrypt
          Resource: 
          - !FindInMap [KmsMap, us-east-1, !Ref StageEnv]
          - !FindInMap [KmsMap, us-east-2, !Ref StageEnv]
  LambdaFullReplication:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: .
      Handler: lib/ssm_full_replication.handler
      Runtime: ruby2.7
      Timeout: 900
      MemorySize: 512
      Environment:
        Variables:
          STAGE_ENV: !Ref StageEnv
          TARGET_REGION: !FindInMap [DestinationMap, !Ref AWS::Region, target]
          SKIP_SYNC: 'skip_sync'
      Events:
        DailyReplication:
          Type: Schedule
          Properties:
            Description: Cron job to run replication at 13:30 UTC (9:30am EDT) every Wednesday
            Enabled: True 
            Name: DailySSMReplication
            Schedule: "cron(30 13 ? * 4 *)"
      Policies:
      - Statement:
        - Sid: SQSPerms
          Effect: Allow
          Action:
          - sqs:SendMessage
          Resource: 
          - !Sub "arn:aws:sqs:*:${AWS::AccountId}:SSM-SQS-replication-*"
      - Statement:
        - Sid: ReadSSM
          Effect: Allow
          Action:
          - ssm:GetParameter
          - ssm:GetParameters
          - ssm:PutParameter
          - ssm:AddTagsToResource
          - ssm:ListTagsForResource
          - ssm:DescribeParameters
          Resource: 
          - !Sub "arn:aws:ssm:*:${AWS::AccountId}:*"
          - !Sub "arn:aws:ssm:*:${AWS::AccountId}:parameter/*"
      - Statement:
        - Sid: DecryptSSM
          Effect: Allow
          Action:
          - kms:Decrypt
          - kms:Encrypt
          Resource: 
          - !FindInMap [KmsMap, us-east-1, !Ref StageEnv]
          - !FindInMap [KmsMap, us-east-2, !Ref StageEnv]

The above template does a number of things. It creates the SQS queue, a regional replication Lambda that is event-based, and a full replication Lambda that is cron-based. Under the 'Mappings' section I have "KmsMap", which maps each region and stage to its aws/ssm KMS key. If you use other keys for your SSM entries, enter their ARNs here. If you use many keys across your SSM parameters, simply add them to the Lambda's policy statement, for example:

      - Statement:
        - Sid: DecryptSSM
          Effect: Allow
          Action:
          - kms:Decrypt
          - kms:Encrypt
          Resource: 
          - !FindInMap [KmsMap, us-east-1, !Ref StageEnv]
          - !FindInMap [KmsMap, us-east-2, !Ref StageEnv]
          - 'arn:aws:kms:us-east-1:123456:key/my-managed-key1' 

The other mapping, DestinationMap, sets up the source and target regions. My original SSM parameters are in us-east-1, so the target is us-east-2 in this case. The SQS queue holds the names of all of the parameters found by LambdaFullReplication. Because a Lambda cannot run for longer than 900 seconds, there's a high chance a single function would time out before it got through all of your parameters. Instead, LambdaFullReplication only sends the parameter names to the SQS queue, and LambdaRegionalReplication then performs the put action to the destination region, one message at a time. The VisibilityTimeout is set to 1000 seconds to allow some wiggle room beyond the Lambda timeout (900 seconds). The full replication Lambda runs every Wednesday (or whatever frequency you'd like) for a few reasons:

  1. to do the initial get/put for the parameters and
  2. to catch any parameters whose skip_sync tag has been added or removed since the last run

I will discuss the skip_sync tag in detail when discussing the code. The regional replication Lambda runs whenever there's an entry in the SQS queue to process, or whenever a parameter changes, which is driven by the event rule.
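
To make the two-stage flow concrete, here is a minimal in-memory sketch. Nothing in it talks to AWS: the `source`, `target`, and `queue` objects and the `full_replication` / `regional_replication` helpers are hypothetical stand-ins for Parameter Store in each region, the SQS queue, and the two Lambdas, and the parameter names are made up.

```ruby
# In-memory stand-ins (assumptions for illustration only, not AWS calls).
source = {
  '/app/db/host'     => { value: 'db.internal', tags: [] },
  '/app/region/name' => { value: 'us-east-1',   tags: ['skip_sync'] }
}
target = {}
queue  = []

# Stage 1: the full replication Lambda only enqueues parameter names.
def full_replication(source, queue)
  source.each_key { |name| queue << name }
end

# Stage 2: the regional replication Lambda handles one name per message,
# skipping anything tagged skip_sync.
def regional_replication(name, source, target)
  param = source.fetch(name)
  return :skipped if param[:tags].include?('skip_sync')

  target[name] = param[:value]
  :synced
end

full_replication(source, queue)
results = queue.map { |name| [name, regional_replication(name, source, target)] }.to_h

puts results.inspect
puts target.inspect
```

Splitting the work this way keeps the scheduled function cheap (it only lists and enqueues), while each queued name is handled by a separate, short invocation.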

Code Setup

Next I will discuss and share the Ruby code that actually does the work. There are three Ruby files that make up this Lambda function: parameter_store.rb, ssm_regional_replication.rb, and ssm_full_replication.rb. I will share the code along with comments explaining what is happening in each file.

require 'aws-sdk-ssm'

# Create ParameterStore class, to be shared by both regional
# and full replication lambda.
class ParameterStore
  # The ParameterStore class uses "attr_accessor" to create accessors
  # for the client, response, name, and tag_list.
  attr_accessor :client, :response, :name, :tag_list

  # Initialize from a hash so that the client & name
  # are available to the other instance methods.
  def initialize(h)
    self.client = h[:client] # the SSM client for the source region
    self.name = h[:name] # gets the name of the param & assigns it to name instance var
  end

  # Class method: takes the client & name args, builds a new
  # ParameterStore, and loads the parameter through the
  # instance method `find_by_name` below.
  def self.find_by_name(client, name)
    new(client: client, name: name).find_by_name(name)
  end

  # NOTE: this method must stay public. `self.find_by_name` above calls it
  # with an explicit receiver, which Ruby does not allow for private methods,
  # and a bare `private` here would also hide skip_sync? and sync_to_region
  # from the handler.
  def find_by_name(name)
    # the begin block lets us rescue throttling errors
    # and retry the get_parameter call
    begin
      # store the result of the Ruby SDK get_parameter call
      # in the response instance variable,
      # passing the name with the with_decryption option set
      self.response = client.get_parameter({
        name: name,
        with_decryption: true,
      })
      # rescue to look for AWS SSM throttling errors
    rescue Aws::SSM::Errors::ThrottlingException
      p "Sleeping for 60 seconds while getting parameters."
      sleep(60)
      # will re-run what is in begin block
      retry
    end
    self
  end


  # memoizes the tags in the `@tag_list` instance var.
  # `||=` is Ruby's conditional assignment, which means
  # if `@tag_list` is already set, then skip this part,
  # if not set, then set it to what is on the right side of equals sign.
  # the purpose is to set @tag_list equal to
  # the response from `list_tags_for_resource`¹,
  # called with resource_type set to Parameter and
  # resource_id set to name
  def tag_list
    @tag_list ||= client.list_tags_for_resource({resource_type: 'Parameter', resource_id: name})
  end

  # calls the `tag_list` method above and uses the `.any?`
  # boolean method on the tags in the response to check
  # whether any tag key equals the `skip_sync` tag.
  # If the tag exists, the handler skips this parameter
  # and the replication will not occur.
  # If it does not exist, then it proceeds.
  # You may want to skip syncing for region-specific resources.
  # If you want to replicate a param that was initially tagged skip_sync,
  # simply remove the tag and on the next run, the param will sync.
  def skip_sync?
    tag_list[:tag_list].any? { |tag| tag[:key] == $skip_tag }
  end

  # Calls the Ruby SDK `put_parameter` method on the `client_target` client.
  # `put_parameter` replicates name, value, and type, with overwrite enabled. This method
  # also copies the tags from the tag_list method onto the target parameter by name.
  def sync_to_region(client_target)
    client_target.put_parameter({
      name: response['parameter']['name'], # required
      value: response['parameter']['value'], # required
      type: response['parameter']['type'], # accepts String, StringList, SecureString
      overwrite: true,
    })
    client_target.add_tags_to_resource({resource_type: 'Parameter', resource_id: name, tags: tag_list.to_h[:tag_list]})    
  end
end
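
One thing to be aware of: the `retry` in `find_by_name` keeps looping for as long as SSM throttles, which can eat into the Lambda's 900 second timeout. As a sketch only, the same begin/rescue/retry pattern can be capped at a fixed number of attempts. `ThrottledError`, `with_throttle_retry`, `max_attempts`, and the injectable `sleeper` are hypothetical names for this example; in the real class the rescued error would be `Aws::SSM::Errors::ThrottlingException`.

```ruby
# Hypothetical stand-in for Aws::SSM::Errors::ThrottlingException.
class ThrottledError < StandardError; end

# Runs the block, retrying on ThrottledError up to max_attempts times.
# `sleeper` is injectable so the example does not really wait.
def with_throttle_retry(max_attempts: 3, wait: 60, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue ThrottledError
    # give up and re-raise once the attempt budget is used
    raise if attempts >= max_attempts

    sleeper.call(wait)
    retry
  end
end

# Fake call that is throttled twice, then succeeds.
calls = 0
waits = []
value = with_throttle_retry(sleeper: ->(s) { waits << s }) do
  calls += 1
  raise ThrottledError if calls < 3

  'parameter-value'
end

puts value         # the block's return value once it succeeds
puts waits.inspect # one recorded wait per throttled attempt
```

If the attempts run out, the error propagates and the invocation fails, so the SQS message becomes visible again after the visibility timeout and can be retried later.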


The next file I will discuss is the ssm_full_replication.rb piece of the code. As you may gather from the name, this is responsible for full replication.

# this pulls the AWS sdk gem
require 'aws-sdk-ssm'
require 'aws-sdk-sqs'
require_relative 'parameter_store'

# Declare global variables which are set to the
# respective values from CloudFormation template.
$target_region = ENV['TARGET_REGION'] or raise "Missing TARGET_REGION variable."
$skip_tag = ENV['SKIP_SYNC'] or raise "Missing skip_sync tag."
$stage_env = ENV['STAGE_ENV']

# `region` is hard-coded to us-east-1, the source region where the queue lives.
# var `sqs_client` set to new SQS client connection in source region
# var `sts_client` set to new STS client conn in source region (used to look up the account id).
# call `send_message` on `sqs_client` var with queue_url & message_body as params.
def send_params_to_sqs(name)
  region = "us-east-1"
  sqs_client = Aws::SQS::Client.new(region: region)
  sts_client = Aws::STS::Client.new(region: region)

  sqs_client.send_message(
    queue_url: "https://sqs.#{region}.amazonaws.com/#{sts_client.get_caller_identity.account}/SSM-SQS-replication-#{$stage_env}-#{region}",
    message_body: name
  )
end

# sets new SSM client connection in source region
# and new SSM client_target connection in target region
def handler(event:, context:)
  client = Aws::SSM::Client.new
  client_target = Aws::SSM::Client.new(region: $target_region)

  # next_token starts as nil so that the first call returns the first page
  next_token = nil
  # loop until every page of parameters has been read; the begin
  # block lets the rescue below catch throttling errors.
  loop do 
    begin
      # describe_batch is set to the response from the
      # describe_parameters² call on the client variable.
      @describe_batch = client.describe_parameters({
        # parameter_filter limits request results to what we need
        parameter_filters: [
          {
            key: "Type",
            values: ["String", "StringList", "SecureString"]
          },
        ],
        # next_token is set to next set of items to return
        next_token: next_token,
      })
      # iterate over the parameters in this page and
      # send each param name to the send_params_to_sqs method
      @describe_batch.parameters.each do |item|
        send_params_to_sqs(item.name)
      end
      # break ends the loop when there is no next_token, i.e. this was the last page.
      break if @describe_batch.next_token.nil?
      next_token = @describe_batch.next_token
      # exception handling: on a throttling error, pause for 60 seconds.
      # the loop then requests the same page again because next_token is unchanged.
    rescue Aws::SSM::Errors::ThrottlingException
      p "Sleeping for 60 seconds while describing parameters."
      sleep(60)
    end
  end
end
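
The `next_token` loop in the handler is easier to follow in isolation. This sketch runs the same pattern against a hypothetical `FakeSsmClient` whose `describe_parameters` returns two names per page. The `Page` struct and the `all_parameter_names` helper are made up for the example and only mimic the `parameters` and `next_token` fields that the handler reads.

```ruby
# Mimics only the two response fields the handler reads.
Page = Struct.new(:parameters, :next_token)

# Hypothetical client: serves names two at a time, using the
# start index (as a string) as the pagination token.
class FakeSsmClient
  def initialize(names)
    @names = names
  end

  def describe_parameters(next_token: nil)
    start = next_token.to_i
    batch = @names[start, 2] || []
    following = start + 2
    Page.new(batch, following < @names.length ? following.to_s : nil)
  end
end

# Same loop shape as the handler: fetch a page, process it,
# stop when no next_token comes back.
def all_parameter_names(client)
  names = []
  next_token = nil
  loop do
    page = client.describe_parameters(next_token: next_token)
    names.concat(page.parameters)
    break if page.next_token.nil?

    next_token = page.next_token
  end
  names
end

client = FakeSsmClient.new(%w[/a /b /c /d /e])
names = all_parameter_names(client)
puts names.inspect
```

Because `next_token` is only advanced after a page has been fully processed, re-running the loop body after a throttling error asks for the same page again rather than skipping it.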

The last file to share is the ssm_regional_replication.rb file. This file is event based and does the regional replication.

# this pulls the AWS sdk gem
require 'aws-sdk-ssm'
require_relative 'parameter_store'

# Global vars for file
$target_region = ENV['TARGET_REGION'] or raise "Missing TARGET_REGION variable."
$skip_tag = ENV['SKIP_SYNC'] or raise "Missing skip_sync tag."

# EventBridge (CloudWatch Events) sends events in a different format from SQS-triggered invocations,
# so this method grabs the operation and name and handles both formats.
def massage_event_data(event)
  # pull out values from an EventBridge (CloudWatch Events) invocation
  operation = event.fetch('detail', {})['operation']
  name      = event.fetch('detail', {})['name']
  return operation,name if operation && name
  # otherwise this is an SQS invocation: the message body is the
  # parameter name, and it is always treated as an update
  operation = 'Update' 
  name      = event.fetch('Records', []).first['body']
  return operation,name 
end

def handler(event:, context:)
  # set vars called operation and name. output from prev. method.
  # create new client & target vars for SSM
  operation,name = massage_event_data(event)
  client = Aws::SSM::Client.new
  client_target = Aws::SSM::Client.new(region: $target_region)

  # this logic runs event based code. If the operation from
  # the event is equal to either Update or Create,
  # the ps var uses the ParameterStore find_by_name class method
  # and passes the client & name.
  if operation == 'Update' || operation == 'Create'
    ps = ParameterStore.find_by_name(client, name)

    # if the param has a skip_sync tag, then the message in the puts
    # string is written to the CloudWatch logs. if there is no tag,
    # it syncs to target region.
    if ps.skip_sync?
      puts "This parameter has been opted out, not replicating parameter."
    else
      ps.sync_to_region(client_target)
    end

  # if the operation is delete in the source region, then the delete_parameter method is called on the
  # client_target and it's also deleted from the target_region to ensure parity.
  elsif operation == 'Delete'
    response = client_target.delete_parameter({
      name: name, # required. go into event, reference the detail key, and the value name
    })
  end
end
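
To see what `massage_event_data` does with each trigger, the sketch below runs a copy of it against two trimmed-down sample payloads. The payloads are assumptions that keep only the keys the method reads (`detail.operation` and `detail.name` for the event rule, the first record's `body` for SQS); real events carry many more fields. The parameter name `/app/db/password` is made up.

```ruby
# Copy of the method from ssm_regional_replication.rb.
def massage_event_data(event)
  operation = event.fetch('detail', {})['operation']
  name      = event.fetch('detail', {})['name']
  return operation, name if operation && name

  operation = 'Update'
  name      = event.fetch('Records', []).first['body']
  return operation, name
end

# Trimmed "Parameter Store Change" event (assumed shape).
eventbridge_event = {
  'source'      => 'aws.ssm',
  'detail-type' => 'Parameter Store Change',
  'detail'      => { 'operation' => 'Delete', 'name' => '/app/db/password' }
}

# Trimmed SQS event: the body is the name sent by the full replication Lambda.
sqs_event = {
  'Records' => [{ 'body' => '/app/db/password' }]
}

from_eventbridge = massage_event_data(eventbridge_event)
from_sqs         = massage_event_data(sqs_event)

puts from_eventbridge.inspect
puts from_sqs.inspect
```

SQS messages carry no operation, so they are always treated as an Update. That is why the queue path can create or overwrite parameters in the target region but never delete them; deletes only flow through the event rule.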

References to AWS API docs page:

  1. list_tags_for_resource-instance_method
  2. describe_parameters

If you want to be sure that no parameters are missed, you can always set up CloudWatch alarms that fire if your Lambda has any failed invocations or if your SQS queue isn't sending any messages. I hope that this has helped others who are looking for a way to replicate SSM parameters in AWS from one region to another. That's the end of the code. I know it is a lot to digest, so if you have any questions please leave a comment and I'll do my best to follow up.


by Katherine Cisneros