Optimizing AWS Batch Workloads: Concurrent Batch Jobs Execution using Lambda, S3, and ECR
Scaling workloads seamlessly by triggering 70 batch jobs concurrently

Situation: While building Machine Learning models for 70 different countries (one model per country), we needed to run a specific set of post-processing tasks for each country. To streamline this and trigger 70 separate Batch jobs concurrently, each mapped to a distinct row in a 70-row configuration file, I combined AWS Batch, Lambda, S3, ECS, and ECR into a single, efficient workflow.
Task: The task at hand is to set up a serverless system that can concurrently and seamlessly trigger the 70 AWS Batch jobs, each based on one row of the configuration file. To accomplish this, we need a well-coordinated system in which a Lambda function acts as the orchestrator and the AWS Batch code serves as the worker for these jobs.
Before we move on to the real action, let's see what an AWS Batch workflow looks like at a high level:

AWS Batch is a container-based, fully managed service that allows you to run batch-style workloads at any scale. The following flow describes how AWS Batch runs each job (a minimal boto3 sketch follows the list):
- Create a job definition that specifies how jobs are to be run, supplying an IAM role, memory and CPU requirements, and other configuration options.
- Submit jobs to a managed AWS Batch job queue, where jobs reside until they are scheduled onto a compute environment for processing.
- AWS Batch evaluates the CPU, memory, and GPU requirements of each job in the queue and scales the compute resources in a compute environment to process the jobs.
- AWS Batch scheduler places jobs in the appropriate AWS Batch compute environment for processing.
- Jobs exit with a status and write results to user-defined storage.
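To make that flow concrete, here is a minimal boto3 sketch of the lifecycle, assuming a job queue and job definition already exist (the names my-job-queue, my-job-definition, and demo-job are placeholders); it is only a preview of the code we build up later in this post.

import time
import boto3

batch = boto3.client('batch')

# Submit a single job to an existing queue using an existing job definition
# ('my-job-queue' and 'my-job-definition' are placeholder names)
response = batch.submit_job(
    jobName='demo-job',
    jobQueue='my-job-queue',
    jobDefinition='my-job-definition',
)
job_id = response['jobId']

# Poll until the job reaches a terminal state (SUCCEEDED or FAILED)
while True:
    status = batch.describe_jobs(jobs=[job_id])['jobs'][0]['status']
    print(f"Job {job_id} is {status}")
    if status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(30)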
It’s time for Execution!
Action: To address this task, we’ll take the following actions:
1. Lambda Orchestrator: We will create an AWS Lambda function to orchestrate the job execution. This Lambda function will read the configuration file as soon as it is uploaded to S3 (via a PUT event) and concurrently initiate the corresponding AWS Batch jobs, passing each job the country-specific parameters from the config.csv file.
1.1 Configuration File: The configuration file, containing 70 distinct rows, will serve as the blueprint for the individual jobs. Each row represents a unique job to be executed by AWS Batch. Store this file in an S3 bucket or within your Lambda function itself if it’s not too large.
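For illustration, a tiny version of such a configuration file could be generated and uploaded like this; the column names mirror the placeholders used later in this post, while the bucket name and the column meanings are assumptions, and the real file has 70 rows, one per country.

import awswrangler as wr
import pandas as pd

# A small, hypothetical version of config.csv (the real file has 70 rows, one per country)
config = pd.DataFrame({
    "country": ["US", "DE", "IN"],
    "column_1": [30, 60, 90],        # e.g. a forecast horizon in days
    "column_2": [1, 2, 3],           # e.g. a model version
    "column_3": [True, False, True], # e.g. a feature flag
})

# Uploading the file to the bucket the Lambda function watches fires the S3 PUT event
wr.s3.to_csv(config, path="s3://your-bucket/path-to-config-file/config.csv", index=False)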
Below is the code that sits inside the lambda_handler function, which is triggered by an S3 PUT event.

# Module-level imports used by the Lambda function
import boto3
import numpy as np
import awswrangler as wr

# Get the details of the file that has been PUT into the designated bucket
BUCKET = event['Records'][0]['s3']['bucket']['name']
KEY = event['Records'][0]['s3']['object']['key']
print(BUCKET, KEY)
config_file_path = f"s3://{BUCKET}/path-to-config-file/{KEY}"

try:
    # read the input config file
    config = wr.s3.read_csv(path=config_file_path)
except Exception as e:
    print(f"An error occurred in reading the files: {e}")
1.2 Creating the Batch Compute Environment, Job Queue, Job Definition, and Jobs
To run a job using AWS Batch, we need to set up several components, including compute environments, job queues, and job definitions. Below are the prerequisites and steps to create these components:
- Compute Resources (EC2/Fargate): AWS Batch runs jobs on Amazon EC2 instances or AWS Fargate. With a managed compute environment, AWS Batch provisions and scales these resources for you; the software and dependencies your jobs need live in the container image rather than on the instances themselves.
- IAM Role: Create an IAM role that AWS Batch jobs can assume. This role should grant permissions to access AWS resources that your batch jobs may need. Ensure that the role has the appropriate permissions for the job tasks.
- Create a Job Queue: In the AWS Batch console, navigate to “Job queues” and configure your job queue. Set the priority for the queue (queues with higher priority values are evaluated first).
- Create a Job Definition: In the AWS Batch console, navigate to “Job definitions.” Choose “Create” to create a new job definition and specify the IAM role that your jobs will assume.
If you are using a Docker container, define the container image, command, environment variables, and resource requirements. Also, configure job retries if needed (a boto3 sketch of these setup steps follows below).
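If you prefer to script this setup rather than click through the console, a minimal boto3 sketch of the same steps could look like the following; the resource names, subnet and security-group IDs, IAM role ARNs, and ECR image URI are all placeholders you would adapt to your account.

import boto3

batch = boto3.client('batch')

# 1. A managed Fargate compute environment (placeholder network and role settings)
batch.create_compute_environment(
    computeEnvironmentName='post-processing-ce',
    type='MANAGED',
    computeResources={
        'type': 'FARGATE',
        'maxvCpus': 256,
        'subnets': ['subnet-xxxxxxxx'],
        'securityGroupIds': ['sg-xxxxxxxx'],
    },
    serviceRole='arn:aws:iam::123456789012:role/AWSBatchServiceRole',
)

# 2. A job queue attached to that compute environment
#    (wait for the compute environment to become VALID before creating the queue)
batch.create_job_queue(
    jobQueueName='job-queue',
    priority=1,
    computeEnvironmentOrder=[
        {'order': 1, 'computeEnvironment': 'post-processing-ce'},
    ],
)

# 3. A container job definition pointing at the Docker image pushed to ECR (covered later)
batch.register_job_definition(
    jobDefinitionName='job_definition',
    type='container',
    platformCapabilities=['FARGATE'],
    containerProperties={
        'image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/post-processing:latest',
        'resourceRequirements': [
            {'type': 'VCPU', 'value': '1'},
            {'type': 'MEMORY', 'value': '2048'},
        ],
        'executionRoleArn': 'arn:aws:iam::123456789012:role/ecsTaskExecutionRole',
        'jobRoleArn': 'arn:aws:iam::123456789012:role/BatchJobRole',
    },
)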
Finally, we are all set to run our jobs on AWS Batch. Now, there are two ways to achieve this: via the console or on the fly.
The console way-
- In the AWS Batch console, navigate to “Jobs.”
- Choose “Submit job.”
- Configure your job:
- Job Definition: Select the job definition you created.
- Job Queue: Choose the job queue.
- Job Name: Provide a name for the job.
- Command: Specify the command to be executed.
- Additional job parameters as needed.
- Review and submit the job.
On the fly-
The code below iterates through the rows of the configuration file, submits an AWS Batch job for each country, and passes country-specific environment variables taken from the configuration file.
If any errors occur during job submission, the code catches and handles those exceptions. Finally, it returns a response indicating that the batch jobs were submitted successfully.
# Initialize the AWS Batch client once, outside the loop
batch_client = boto3.client('batch')

# Specify the job definition and job queue (placeholder names)
job_definition = 'job_definition'
job_queue = 'job-queue'

for idx in np.arange(config.shape[0]):
    row = config.iloc[idx]
    country = str(row["country"])
    column_1 = str(row["column_1"])
    column_2 = str(row["column_2"])
    column_3 = str(row["column_3"])

    # Name each job after its country so it is easy to find in the console
    job_name = country

    try:
        print(f"Running {country} on batch")
        # Submit the job with country-specific environment variables
        response = batch_client.submit_job(
            jobName=job_name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            containerOverrides={
                "environment": [
                    {"name": "country", "value": country},
                    {"name": "column_1", "value": column_1},
                    {"name": "column_2", "value": column_2},
                    {"name": "column_3", "value": column_3},
                ]
            }
        )
        # Get the jobId from the response
        job_id = response['jobId']
        print(f'Batch job for {country} submitted successfully with job ID {job_id}')
    except Exception as e:
        print(f'Error submitting the batch job for {country}: {str(e)}')

return {
    'statusCode': 200,
    'body': 'Batch jobs submitted successfully'
}
Here’s a detailed explanation of how this function works:
- Iteration: The function iterates through each row of the configuration data frame (config) using the NumPy function np.arange() to generate the indices.
- Data Extraction: For each iteration, it extracts the relevant data from the current row of the configuration data frame: the country name, column_1, column_2, and column_3. These variables are used to configure the AWS Batch job.
- AWS Batch Client Initialization: The AWS Batch client is initialized (once, before the loop) using the Boto3 library, which allows Python to interact with AWS services.
- Job Specification: The function specifies the details required for the AWS Batch job, including the job definition, the job name (based on the country), and the job queue. These details determine how the job will be executed in the AWS Batch environment.
- Job Submission: Inside a try block, the function submits the AWS Batch job by calling the batch_client.submit_job() method. It provides the job name, job queue, job definition, and a set of container environment variables specific to the country and the associated data extracted from the configuration file.
For more details, refer to this GitHub repository.
2. Worker Function
The Python script below is invoked as the worker for each country's AWS Batch job. Let's elaborate on how this script works:
import os

import awswrangler as wr
import pandas as pd

from worker_class import main

if __name__ == "__main__":
    # Access job parameters from environment variables
    # (cast them to the data types you need; environment variables always arrive as strings)
    country = str(os.environ.get('country'))
    column_1 = int(os.environ.get('column_1'))
    column_2 = int(os.environ.get('column_2'))
    # Note: bool() on any non-empty string is True, so parse the flag explicitly
    column_3 = os.environ.get('column_3', 'False').lower() == 'true'

    # read the file for the specific country (placeholder path)
    print(f"Running {country}")
    df = wr.s3.read_parquet(path=f"s3://path-to-country-file/")

    main(country=country,
         column_1=column_1,
         column_2=column_2,
         column_3=column_3,
         df=df)
    print(f"{country} Completed.")
- The script starts by importing the necessary libraries and modules: os for interacting with environment variables, awswrangler for reading data from AWS resources, pandas for data manipulation, and worker_class (a custom module) for invoking the main function.
- It then checks whether the script is being run as the main module using the if __name__ == "__main__": block. This is a common Python pattern that ensures the code within this block executes only when the script is run directly (not when it is imported as a module).
- The script accesses job parameters from environment variables. These variables are expected to be set when the script is executed as part of an AWS Batch job. The key parameters are country, column_1, column_2, and column_3, retrieved with the os.environ.get() method.
In summary, this script is designed to be used as a worker function in an AWS Batch job. It retrieves job-specific parameters from environment variables, reads data from an S3 location, and calls a custom main function from an external module to perform data processing and calculations.
For more details, refer to this GitHub repository.
Integration of Docker, ECR, and ECS
To make this workflow scalable, we integrate Docker, ECR, and ECS before this code is executed as part of an AWS Batch job. Here is how these services fit into the overall architecture:
- Docker: Docker is a containerization platform that allows you to package an application and its dependencies into a container image. In this context, Docker would be used to create a container image that encapsulates the code and dependencies required to run the processing tasks for the AWS Batch job. This container image is typically defined in a Dockerfile.
- Amazon Elastic Container Registry (ECR): ECR is a fully managed container registry provided by AWS. It is used to store and manage Docker container images. After creating a Docker image, you can push it to an ECR repository. Once the image is stored in ECR, it can be easily accessed by AWS services, including ECS and AWS Batch, for running containerized tasks.
- Amazon Elastic Container Service (ECS): ECS is a container orchestration service that allows you to run and manage Docker containers. You can define and configure ECS tasks and services that specify which container images to use, how many instances of the containers to run, and how they interact with other AWS resources.
Here’s a high-level explanation of how the integration would work in the overall workflow:
- You would create a Dockerfile to define the environment and dependencies for your processing tasks.
- You would build a Docker image from the Dockerfile. This image encapsulates your code, dependencies, and runtime environment.
- You would push the Docker image to an ECR repository. This makes the image available for use by AWS services.
- In your Lambda and AWS Batch job definitions, you would specify the ECR repository and the specific Docker image to use (see the sketch after this list).
- When the Lambda or AWS Batch job is initiated, it pulls the Docker image from the ECR repository and runs the task using the defined image.
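As a rough, hypothetical sketch of how the ECR piece ties together (the repository name post-processing is an assumption, and the Docker CLI commands are shown only as comments):

import boto3

ecr = boto3.client('ecr')

# Create the ECR repository that will hold the worker image
repo = ecr.create_repository(repositoryName='post-processing')
repo_uri = repo['repository']['repositoryUri']
print(repo_uri)  # e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com/post-processing

# The image itself is built and pushed with the Docker CLI, roughly:
#   aws ecr get-login-password | docker login --username AWS --password-stdin <registry>
#   docker build -t post-processing .
#   docker tag post-processing:latest <repo_uri>:latest
#   docker push <repo_uri>:latest

# The AWS Batch job definition then points at that image, e.g.
#   containerProperties={'image': f'{repo_uri}:latest', ...}
# as in the setup sketch earlier in this post.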
Lastly, you can monitor and manage your jobs from the AWS Batch console or using the AWS CLI. You can view job status, logs, and other job-related information in CloudWatch, and use SQS if you want to enable notifications for internal stakeholders.
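For example, you can check a job's status and pull its CloudWatch logs programmatically; the sketch below assumes the default /aws/batch/job log group and a job ID returned by submit_job().

import boto3

batch = boto3.client('batch')
logs = boto3.client('logs')

job_id = '00000000-aaaa-bbbb-cccc-000000000000'  # replace with a job ID from submit_job()

# Look up the job's current status and (once it has started) its log stream
job = batch.describe_jobs(jobs=[job_id])['jobs'][0]
print(job['jobName'], job['status'], job.get('statusReason'))

# AWS Batch containers write their logs to CloudWatch under /aws/batch/job by default
log_stream = job['container'].get('logStreamName')
if log_stream:
    events = logs.get_log_events(
        logGroupName='/aws/batch/job',
        logStreamName=log_stream,
        limit=50,
    )
    for event in events['events']:
        print(event['message'])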
Hope you learned something new today!
If you enjoyed reading this article, comment “Hell Yes!” in the comments section and let me know if you have any feedback.
Feel free to follow me on Medium and GitHub, or say hi on LinkedIn. I am excited to discuss AI, ML, NLP, and MLOps!
Appendix —
1. The end-to-end code for this module is saved to this GitHub repository.
2. For AWS Batch best practices, refer to this guide: https://docs.aws.amazon.com/batch/latest/userguide/best-practices.html