Big Data Processing Using Distributed Maps and AWS Step Functions

Akash Mathur
AWS in Plain English
9 min read · Nov 11, 2023


Processing large datasets often requires parallelizing a task across the individual elements of the dataset. Apache Spark is the common industry answer for such scenarios, but it isn't always feasible or optimal: adopting Spark introduces its own complexities, such as writing and debugging User-Defined Functions (UDFs).

In such cases, a compelling alternative is AWS Step Functions' Distributed Map. When dealing with large datasets or performing repetitive tasks across many items, orchestrating and parallelizing the workflow becomes crucial.

Problem Statement

I need to transform a large-scale dataset, where the transformation applies a complex function to each element of a specific column. The current approach, df[column].apply(), runs sequentially and struggles to scale with large datasets.
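For context, here is a minimal sketch of that sequential baseline (the file name, column name, and clean_text logic are illustrative placeholders, mirroring the post-processing we will move into Lambda later):

import re
import pandas as pd

def clean_text(text: str) -> str:
    # Illustrative transformation: keep letters, collapse whitespace
    text = re.sub(r'[^A-Za-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

df = pd.read_csv("data.csv")  # placeholder input file

# Sequential: rows are processed one at a time on a single core,
# so runtime grows linearly with the number of rows.
df["column_name"] = df["column_name"].apply(clean_text)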

Solution

Exciting News: AWS Step Functions now introduces the Distributed Map, unlocking even more possibilities for efficiently handling large-scale parallel workloads.

The standard (inline) Map state in Step Functions, while powerful, is limited to 40 concurrent iterations at a time. Scaling up to process thousands or more items in parallel required intricate workarounds.

Enter the Distributed Map state! It is designed to seamlessly coordinate expansive parallel workloads in your serverless applications. Whether iterating through millions of logs, images, or .csv files stored in Amazon S3, the Distributed Map state can now launch up to ten thousand parallel workflows for robust data processing.

You can process the data with any service API supported by Step Functions, but the most common and efficient choice is invoking Lambda functions; in this example, too, the data-processing logic lives inside a Lambda.

In this article, I’ll share an illustrative post-processing task where I applied an operation to every individual row of a CSV file containing textual data. The approach invokes a Lambda function, serving as the post-processing step, for each row item in parallel, with the Distributed Map state iterating over the CSV file residing in an S3 bucket.

Understanding Distributed Maps

Before we step ahead to the actual implementation, let’s understand Distributed Maps a bit. A distributed map applies the same operation to each element in a collection concurrently. This is particularly useful for tasks like processing the rows of a CSV file, analyzing a batch of images, or invoking a Lambda function for multiple inputs simultaneously.

How does a Map function work (in general)?
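Setting AWS aside for a moment, a map is the classic pattern of fanning the same function out over every element of a collection. Here is a minimal local analogy in Python (concurrent.futures stands in for what Step Functions does with child workflows; it is not the Step Functions API):

from concurrent.futures import ThreadPoolExecutor

def process(item: str) -> str:
    # Stand-in for the per-item work, e.g., one row of a CSV
    return item.upper()

rows = ["alpha", "beta", "gamma"]

# map() applies the same operation to every element concurrently;
# a Distributed Map does the same, but each worker is a child workflow.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, rows))

print(results)  # ['ALPHA', 'BETA', 'GAMMA']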

Let’s dive into the key components and steps involved in utilizing Distributed Maps with AWS Step Functions.

Key Components:

1. State Machine

A State Machine in AWS Step Functions is a visual representation of a workflow. It consists of states that define the different steps or actions to be taken. In the context of Distributed Maps, the state machine includes the Map state.

2. Map State

The Map state is the star of the show when it comes to distributed operations. It enables parallel execution of a set of operations on each element in an array, making it efficient for processing large datasets.

3. Input Data

The input data is the collection or array on which the Map state operates. In the case of processing a CSV file, each row would be an element in the array.

4. Task States

Task states represent individual operations or tasks that need to be performed on each element. For example, invoking a Lambda function or processing data.

Steps to Implement Distributed Maps:

1. Lambda Function: Firstly, create a Lambda function that performs the post-processing on an individual row. Ensure this Lambda function accepts a row as input and returns the processed result.

# Lambda function: post-processing function
import json
import re

def lambda_handler(event, context):
    # The distributed map passes one CSV row per invocation;
    # each column is available as a key in the event
    row_data = event['column_name']

    # Strip everything except letters and whitespace
    cleaned_text = re.sub(r'[^A-Za-z\s]', '', row_data)

    # Collapse extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    processed_result = f"Processed data: {cleaned_text}"

    return {
        'statusCode': 200,
        'body': json.dumps(processed_result)
    }
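Before wiring the function into a state machine, you can sanity-check it locally by passing a fake event shaped like one CSV row (column_name matches the placeholder used above):

# Quick local test: simulate the event the distributed map would send
fake_event = {'column_name': 'Hello,   World! 123'}
print(lambda_handler(fake_event, None))
# {'statusCode': 200, 'body': '"Processed data: Hello World"'}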

2. AWS Step Functions State Machine Definition: Now, create an AWS Step Functions state machine to orchestrate the distributed map.

a) I navigate to the Step Functions page and select Create state machine.

b) In the visual editor, I search for the Map component on the left-side pane and drag it to the workflow area. On the right side, I configure the component: I choose Distributed as the Processing mode and Amazon S3 as the Item source.

Distributed maps are natively integrated with S3. I enter the name of the bucket and the prefix path where the CSV file is stored.

In the Runtime settings section, I choose Standard as the Child workflow type. I may also restrict the Concurrency limit, which helps ensure we operate within the concurrency quotas of the downstream services (Lambda, in this case) for the account and Region.

By default, the output of the sub-workflows is aggregated as the state output, up to 256 KB. To handle larger outputs, I can choose to Export map state results to Amazon S3.

c) Finally, I define the post-processing task to run for each row item. In this demo, I want to invoke a Lambda function for each row of the CSV file. The function already exists (Step 1).

I search for and select the Lambda invocation action on the left-side pane and drag it into the distributed map component. Then, in the right-side configuration panel, I select the actual Lambda function to invoke: post-processing-function.

d) I select Next, and select Next again on the Review generated code page (not shown here).

On the Specify state machine settings page, I enter a Name for the state machine and the IAM permissions to run it. Then, I select Create state machine.

e) Now I am ready to start the execution. On the State machines page, I select the new workflow and choose Start execution. I can optionally enter a JSON document to pass to the workflow; in this demo, the workflow doesn’t use the execution input, since the items come from the CSV file configured as the item source. I leave it as-is and select Start execution.
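If you prefer to skip the console, the same execution can be started programmatically. A minimal boto3 sketch (the state machine ARN below is a placeholder):

import json
import boto3

sfn = boto3.client("stepfunctions")

# The items come from the CSV configured as the item source,
# so the execution input itself can be an empty document.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:text-file-processing",
    input=json.dumps({}),
)
print(response["executionArn"])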

Final State Machine and its Definition

{
  "Comment": "A description of my state machine",
  "StartAt": "Text-File-Processing",
  "States": {
    "Text-File-Processing": {
      "Type": "Map",
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "STANDARD"
        },
        "StartAt": "Post-Processing Lambda",
        "States": {
          "Post-Processing Lambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "OutputPath": "$.Payload",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:us-east-1:XXXXXX:function:post-processing-function:$LATEST",
              "Payload.$": "$"
            },
            "Retry": [
              {
                "ErrorEquals": [
                  "Lambda.ServiceException",
                  "Lambda.AWSLambdaException",
                  "Lambda.SdkClientException",
                  "Lambda.TooManyRequestsException"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 3,
                "BackoffRate": 2
              }
            ],
            "End": true
          }
        }
      },
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {
          "InputType": "CSV",
          "CSVHeaderLocation": "FIRST_ROW"
        },
        "Parameters": {
          "Bucket": "bucket-name",
          "Key": "config.csv"
        }
      },
      "Label": "Text-File-Processing",
      "End": true
    }
  }
}

Voila! You’ve just witnessed a glimpse of parallel execution accelerating data transformation. Now, let’s explore a bit more as we continue with this article.

The end-to-end code for this module is saved to my GitHub repository.

Use Cases of Distributed Maps with AWS Step Functions

  1. Data Processing Pipelines: Distribute processing tasks across a large dataset, allowing parallel execution for faster data transformation.
  2. Batch Image/Video Processing: Process and analyze a collection of images or videos concurrently, benefiting applications like computer vision.
  3. Log Analysis: Analyze logs from distributed systems concurrently, identifying patterns or anomalies in real time.
  4. Parallel API Invocations: Invoke external APIs concurrently for tasks like geocoding, weather information retrieval, or language translation.
  5. Database Record Processing: Process and update records in a database concurrently, improving throughput for large-scale operations.

Overall, Distributed Maps are a powerful tool for orchestrating large-scale parallel workloads, but they should be used with care and awareness of the potential disadvantages.

Disadvantages

  1. Limited input and output sizes: Distributed Maps have a limit of 100 million items for input and 10 GB for output. This may be a limitation for some workloads.
  2. Limited downstream service capabilities: The downstream service that processes each item must be able to handle the payload sizes involved. For example, Lambda caps the invocation payload at 6 MB for synchronous requests and responses (and 256 KB for asynchronous invocations). One common mitigation is batching; see the sketch after this list.
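With batching, the Distributed Map hands each child workflow a batch of rows instead of a single row, so one Lambda invocation processes many items within a single payload. A hedged sketch of a batch-aware handler, assuming the batched event delivers the rows under an "Items" key:

import json
import re

def lambda_handler(event, context):
    # With item batching enabled, the map state delivers a list of rows
    # under "Items" rather than one row per invocation (assumed shape)
    rows = event.get("Items", [])

    results = []
    for row in rows:
        cleaned = re.sub(r'[^A-Za-z\s]', '', row.get("column_name", ""))
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()
        results.append(cleaned)

    return {"statusCode": 200, "body": json.dumps(results)}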

In addition to these disadvantages, it is important to note that Distributed Maps are still a relatively new feature in Step Functions, and there is less documentation and community support for them than for other Step Functions states.

Here are some tips for mitigating the disadvantages of Distributed Maps:

  • Use Distributed Maps judiciously: Only use Distributed Maps for workloads that truly require large-scale parallelism. For smaller workloads, consider using other Step Functions states, such as Task or Parallel.
  • Design your workflow for failure: Handle failure cases and retries gracefully. You may want to consider using a state machine pattern such as the retry loop pattern.
  • Optimize your workflow for performance: Consider using features such as the tolerated failure threshold to reduce the performance impact of failures.
  • Estimate the cost of your workflow: Use the Step Functions pricing calculator to estimate the cost of running your workflow with Distributed Maps.
  • Choose the right downstream service: Make sure that the downstream service that you use to process the results of a Distributed Map can handle the input and output sizes.

In conclusion, Distributed Maps introduce a powerful paradigm for orchestrating large-scale parallel workloads with unparalleled ease and efficiency. Whether you are processing extensive datasets, analyzing logs, or orchestrating complex workflows, the distributed map state provides a scalable and cost-effective solution.

The ability to seamlessly scale up to 10,000 parallel executions empowers developers to tackle data-intensive tasks with confidence. Leveraging serverless architecture, this feature optimizes costs by ensuring resources are consumed only during execution, aligning with the pay-as-you-go model.

While the advantages of increased throughput and simplified orchestration are evident, it’s crucial to consider concurrency limits, cold start overhead, and potential complexities in dependency management for certain use cases. Careful design and testing are essential to harness the full potential of Distributed Maps with AWS Step Functions.

Congratulations! 🎉👏🎊

You now understand how to leverage the Distributed Map state and orchestrate large-scale parallel workloads for efficient data processing at an unprecedented scale.

If you enjoyed reading this article, comment “Hell Yes!” in the comments section, and let me know if you have any feedback.

Feel free to follow me on Medium and GitHub, or say hi on LinkedIn. I am excited to discuss topics across AI, ML, NLP, and MLOps!
