Optimizing AWS Batch Workloads: Concurrent Batch Jobs Execution using Lambda, S3, and ECR
Scaling workloads seamlessly by triggering 70 batch jobs concurrently

Situation: While building Machine Learning models for 70 different countries (one model per country), we needed to run a specific set of post-processing tasks for each country. To streamline this and trigger 70 separate Batch jobs concurrently, each mapped to a distinct row in a 70-row configuration file, I combined AWS Batch, Lambda, S3, ECS, and ECR into a single, efficient workflow.
Task: The task at hand is to set up a serverless system that can concurrently and seamlessly trigger the 70 AWS Batch jobs, each based on one row of the configuration file. To accomplish this, we need a well-coordinated system in which a Lambda function acts as the orchestrator and the AWS Batch code serves as the worker for these jobs.
Before we move on to the real action, let's see what an AWS Batch workflow looks like at a high level:

AWS Batch is a container-based, fully managed service that allows you to run batch-style workloads at any scale. The following flow describes how AWS Batch runs each job (a minimal boto3 sketch follows the list):
- Create a job definition that specifies how jobs are to be run, supplying an IAM role, memory and CPU requirements, and other configuration options.
- Submit jobs to a managed AWS Batch job queue, where jobs reside until they are scheduled onto a compute environment for processing.
- AWS Batch evaluates the CPU, memory, and GPU requirements of each job in the queue and scales the compute resources in a compute environment to process the jobs.
- AWS Batch scheduler places jobs in the appropriate AWS Batch compute environment for processing.
- Jobs exit with a status and write results to user-defined storage.
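To make that flow concrete, here is a minimal boto3 sketch of the lifecycle, assuming a job queue and job definition already exist (the names my-job-queue, my-job-definition, and demo-job are placeholders); it is only a preview of the code we build up later in this post.

import time
import boto3

batch = boto3.client('batch')

# Submit a single job to an existing queue using an existing job definition
# ('my-job-queue' and 'my-job-definition' are placeholder names)
response = batch.submit_job(
    jobName='demo-job',
    jobQueue='my-job-queue',
    jobDefinition='my-job-definition',
)
job_id = response['jobId']

# Poll until the job reaches a terminal state (SUCCEEDED or FAILED)
while True:
    status = batch.describe_jobs(jobs=[job_id])['jobs'][0]['status']
    print(f"Job {job_id} is {status}")
    if status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(30)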
It’s time for Execution!
Action: To address this task, we’ll take the following actions:
1. Lambda Orchestrator: We will create an AWS Lambda function to orchestrate the job execution. This Lambda function will read the configuration file as soon as it is uploaded to S3 (via a PUT event) and concurrently initiate the corresponding AWS Batch jobs, passing each job the country-specific parameters from the config.csv file.
1.1 Configuration File: The configuration file, containing 70 distinct rows, will serve as the blueprint for the individual jobs. Each row represents a unique job to be executed by AWS Batch. Store this file in an S3 bucket or within your Lambda function itself if it’s not too large.
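For illustration, a tiny version of such a configuration file could be generated and uploaded like this; the column names mirror the placeholders used later in this post, while the bucket name and the column meanings are assumptions, and the real file has 70 rows, one per country.

import awswrangler as wr
import pandas as pd

# A small, hypothetical version of config.csv (the real file has 70 rows, one per country)
config = pd.DataFrame({
    "country": ["US", "DE", "IN"],
    "column_1": [30, 60, 90],        # e.g. a forecast horizon in days
    "column_2": [1, 2, 3],           # e.g. a model version
    "column_3": [True, False, True], # e.g. a feature flag
})

# Uploading the file to the bucket the Lambda function watches fires the S3 PUT event
wr.s3.to_csv(config, path="s3://your-bucket/path-to-config-file/config.csv", index=False)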
Below is the code that sits inside the lambda_handler function, which is triggered by an S3 PUT event.

# Module-level imports used by the Lambda function
import boto3
import numpy as np
import awswrangler as wr

# Get the details of the file that has been PUT into the designated bucket
BUCKET = event['Records'][0]['s3']['bucket']['name']
KEY = event['Records'][0]['s3']['object']['key']
print(BUCKET, KEY)
config_file_path = f"s3://{BUCKET}/path-to-config-file/{KEY}"

try:
    # read the input config file
    config = wr.s3.read_csv(path=config_file_path)
except Exception as e:
    print(f"An error occurred in reading the files: {e}")
1.2 Creating the Batch Compute Environment, Job Queue, Job Definition, and Jobs
To run a job using AWS Batch, we need to set up several components, including compute environments, job queues, and job definitions. Below are the prerequisites and steps to create these components:
- Compute Resources (EC2/Fargate): AWS Batch runs jobs on Amazon EC2 instances or AWS Fargate. With a managed compute environment, AWS Batch provisions and scales these resources for you; the software and dependencies your jobs need live in the container image rather than on the instances themselves.
- IAM Role: Create an IAM role that AWS Batch jobs can assume. This role should grant permissions to access AWS resources that your batch jobs may need. Ensure that the role has the appropriate permissions for the job tasks.
- Create a Job Queue: In the AWS Batch console, navigate to “Job queues” and configure your job queue. Set the priority for the queue (queues with higher priority values are evaluated first).
- Create a Job Definition: In the AWS Batch console, navigate to “Job definitions.” Choose “Create” to create a new job definition and specify the IAM role that your jobs will assume.
If you are using a Docker container, define the container image, command, environment variables, and resource requirements. Also, configure job retries if needed (a boto3 sketch of these setup steps follows below).
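If you prefer to script this setup rather than click through the console, a minimal boto3 sketch of the same steps could look like the following; the resource names, subnet and security-group IDs, IAM role ARNs, and ECR image URI are all placeholders you would adapt to your account.

import boto3

batch = boto3.client('batch')

# 1. A managed Fargate compute environment (placeholder network and role settings)
batch.create_compute_environment(
    computeEnvironmentName='post-processing-ce',
    type='MANAGED',
    computeResources={
        'type': 'FARGATE',
        'maxvCpus': 256,
        'subnets': ['subnet-xxxxxxxx'],
        'securityGroupIds': ['sg-xxxxxxxx'],
    },
    serviceRole='arn:aws:iam::123456789012:role/AWSBatchServiceRole',
)

# 2. A job queue attached to that compute environment
#    (wait for the compute environment to become VALID before creating the queue)
batch.create_job_queue(
    jobQueueName='job-queue',
    priority=1,
    computeEnvironmentOrder=[
        {'order': 1, 'computeEnvironment': 'post-processing-ce'},
    ],
)

# 3. A container job definition pointing at the Docker image pushed to ECR (covered later)
batch.register_job_definition(
    jobDefinitionName='job_definition',
    type='container',
    platformCapabilities=['FARGATE'],
    containerProperties={
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/post-processing:latest',
        'resourceRequirements': [
            {'type': 'VCPU', 'value': '1'},
            {'type': 'MEMORY', 'value': '2048'},
        ],
        'executionRoleArn': 'arn:aws:iam::123456789012:role/ecsTaskExecutionRole',
        'jobRoleArn': 'arn:aws:iam::123456789012:role/BatchJobRole',
    },
)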
Finally, we are all set to run our jobs on AWS Batch. Now, there are two ways to achieve this: via the console or on the fly.
The console way-
- In the AWS Batch console, navigate to “Jobs.”
- Choose “Submit job.”
- Configure your job:
- Job Definition: Select the job definition you created.
- Job Queue: Choose the job queue.
- Job Name: Provide a name for the job.
- Command: Specify the command to be executed.
- Additional job parameters as needed.
- Review and submit the job.
On the fly-
The code below iterates through the rows of the configuration file, submits an AWS Batch job for each country, and passes country-specific environment variables taken from the configuration file.
If any errors occur during job submission, the code catches and handles those exceptions. Finally, it returns a response indicating that the batch jobs were submitted successfully.
# Initialize the AWS Batch client once, outside the loop
batch_client = boto3.client('batch')

# Specify the job definition and job queue (placeholder names)
job_definition = 'job_definition'
job_queue = 'job-queue'

for idx in np.arange(config.shape[0]):
    row = config.iloc[idx]
    country = str(row["country"])
    column_1 = str(row["column_1"])
    column_2 = str(row["column_2"])
    column_3 = str(row["column_3"])

    # Name each job after its country so it is easy to find in the console
    job_name = country

    try:
        print(f"Running {country} on batch")
        # Submit the job with country-specific environment variables
        response = batch_client.submit_job(
            jobName=job_name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            containerOverrides={
                "environment": [
                    {"name": "country", "value": country},
                    {"name": "column_1", "value": column_1},
                    {"name": "column_2", "value": column_2},
                    {"name": "column_3", "value": column_3},
                ]
            }
        )
        # Get the jobId from the response
        job_id = response['jobId']
        print(f'Batch job for {country} submitted successfully with job ID {job_id}')
    except Exception as e:
        print(f'Error submitting the batch job for {country}: {str(e)}')

return {
    'statusCode': 200,
    'body': 'Batch jobs submitted successfully'
}
Here’s a detailed explanation of how this function works:
- Iteration: The function iterates through each row of the configuration data frame (config) using the NumPy function np.arange() to generate the indices.
- Data Extraction: For each iteration, it extracts the relevant data from the current row of the configuration data frame: the country name, column_1, column_2, and column_3. These variables are used to configure the AWS Batch job.
- AWS Batch Client Initialization: The AWS Batch client is initialized (once, before the loop) using the Boto3 library, which allows Python to interact with AWS services.
- Job Specification: The function specifies the details required for the AWS Batch job, including the job definition, the job name (based on the country), and the job queue. These details determine how the job will be executed in the AWS Batch environment.
- Job Submission: Inside a try block, the function submits the AWS Batch job by calling the batch_client.submit_job() method. It provides the job name, job queue, job definition, and a set of container environment variables specific to the country and the associated data extracted from the configuration file.
For more details, refer to this GitHub repository.
2. Worker Function
The Python script below is invoked as the worker for each country's AWS Batch job. Let's elaborate on how this script works:
import os

import awswrangler as wr
import pandas as pd

from worker_class import main

if __name__ == "__main__":
    # Access job parameters from environment variables
    # (cast them to the data types you need; environment variables always arrive as strings)
    country = str(os.environ.get('country'))
    column_1 = int(os.environ.get('column_1'))
    column_2 = int(os.environ.get('column_2'))
    # Note: bool() on any non-empty string is True, so parse the flag explicitly
    column_3 = os.environ.get('column_3', 'False').lower() == 'true'

    # read the file for the specific country (placeholder path)
    print(f"Running {country}")
    df = wr.s3.read_parquet(path=f"s3://path-to-country-file/")

    main(country=country,
         column_1=column_1,
         column_2=column_2,
         column_3=column_3,
         df=df)
    print(f"{country} Completed.")
- The script starts by importing the necessary libraries and modules: os for interacting with environment variables, awswrangler for reading data from AWS resources, pandas for data manipulation, and worker_class (a custom module) for invoking the main function.
- It then checks whether the script is being run as the main module using the if __name__ == "__main__": block. This is a common Python pattern that ensures the code within this block executes only when the script is run directly (not when it is imported as a module).
- The script accesses job parameters from environment variables. These variables are expected to be set when the script is executed as part of an AWS Batch job. The key parameters are country, column_1, column_2, and column_3, retrieved with the os.environ.get() method.
In summary, this script is designed to be used as a worker function in an AWS Batch job. It retrieves job-specific parameters from environment variables, reads data from an S3 location, and calls a custom main function from an external module to perform data processing and calculations.
For more details, refer to this GitHub repository.
Integration of Docker, ECR, and ECS
To make this workflow scalable, we integrate Docker, ECR, and ECS before this code is executed as part of an AWS Batch job. Here is how these services fit into the overall architecture:
- Docker: Docker is a containerization platform that allows you to package an application and its dependencies into a container image. In this context, Docker would be used to create a container image that encapsulates the code and dependencies required to run the processing tasks for the AWS Batch job. This container image is typically defined in a Dockerfile.
- Amazon Elastic Container Registry (ECR): ECR is a fully managed container registry provided by AWS. It is used to store and manage Docker container images. After creating a Docker image, you can push it to an ECR repository. Once the image is stored in ECR, it can be easily accessed by AWS services, including ECS and AWS Batch, for running containerized tasks.
- Amazon Elastic Container Service (ECS): ECS is a container orchestration service that allows you to run and manage Docker containers. You can define and configure ECS tasks and services that specify which container images to use, how many instances of the containers to run, and how they interact with other AWS resources.
Here’s a high-level explanation of how the integration would work in the overall workflow:
- You would create a Dockerfile to define the environment and dependencies for your processing tasks.
- You would build a Docker image from the Dockerfile. This image encapsulates your code, dependencies, and runtime environment.
- You would push the Docker image to an ECR repository. This makes the image available for use by AWS services.
- In your Lambda and AWS Batch job definitions, you would specify the ECR repository and the specific Docker image to use (see the sketch after this list).
- When the Lambda or AWS Batch job is initiated, it pulls the Docker image from the ECR repository and runs the task using the defined image.
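As a rough, hypothetical sketch of how the ECR piece ties together (the repository name post-processing is an assumption, and the Docker CLI commands are shown only as comments):

import boto3

ecr = boto3.client('ecr')

# Create the ECR repository that will hold the worker image
repo = ecr.create_repository(repositoryName='post-processing')
repo_uri = repo['repository']['repositoryUri']
print(repo_uri)  # e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com/post-processing

# The image itself is built and pushed with the Docker CLI, roughly:
#   aws ecr get-login-password | docker login --username AWS --password-stdin <registry>
#   docker build -t post-processing .
#   docker tag post-processing:latest <repo_uri>:latest
#   docker push <repo_uri>:latest

# The AWS Batch job definition then points at that image, e.g.
#   containerProperties={'image': f'{repo_uri}:latest', ...}
# as in the setup sketch earlier in this post.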
Lastly, you can monitor and manage your jobs from the AWS Batch console or using the AWS CLI. You can view job status, logs, and other job-related information in CloudWatch, and use SQS if you want to enable notifications for internal stakeholders.
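For example, you can check a job's status and pull its CloudWatch logs programmatically; the sketch below assumes the default /aws/batch/job log group and a job ID returned by submit_job().

import boto3

batch = boto3.client('batch')
logs = boto3.client('logs')

job_id = '00000000-aaaa-bbbb-cccc-000000000000'  # replace with a job ID from submit_job()

# Look up the job's current status and (once it has started) its log stream
job = batch.describe_jobs(jobs=[job_id])['jobs'][0]
print(job['jobName'], job['status'], job.get('statusReason'))

# AWS Batch containers write their logs to CloudWatch under /aws/batch/job by default
log_stream = job['container'].get('logStreamName')
if log_stream:
    events = logs.get_log_events(
        logGroupName='/aws/batch/job',
        logStreamName=log_stream,
        limit=50,
    )
    for event in events['events']:
        print(event['message'])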
Hope you learned something new today!
If you enjoyed reading this article, comment “Hell Yes!” in the comments section and let me know if you have any feedback.
Feel free to follow me on Medium and GitHub, or say hi on LinkedIn. I am excited to discuss AI, ML, NLP, and MLOps!
Appendix —
1. The end-to-end code for this module is saved to this GitHub repository.
2. For AWS Batch best practices, refer to this guide: https://docs.aws.amazon.com/batch/latest/userguide/best-practices.html