
Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments


AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup because of the following advantages:

  • Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business continuity
  • Better security and isolation – Increased isolation between jobs enhances security and simplifies compliance
  • Better scalability – Distributing workloads across clusters enables horizontal scaling to handle peak demands
  • Performance benefits – Minimizing Kubernetes scheduling delays and network bandwidth contention improves job runtimes
  • Increased flexibility – You can enjoy straightforward experimentation and cost optimization through workload segregation across multiple clusters

However, one of the disadvantages of a multi-cluster setup is that there is no straightforward way to distribute workloads and support effective load balancing across multiple clusters. This post proposes a solution to this challenge by introducing the Batch Processing Gateway (BPG), a centralized gateway that automates job management and routing in multi-cluster environments.

Challenges with multi-cluster environments

In a multi-cluster environment, Spark jobs on Amazon EMR on EKS need to be submitted to different clusters from various clients. This architecture introduces several key challenges:

  • Endpoint management – Clients must maintain and update connections for each target cluster
  • Operational overhead – Managing multiple client connections individually increases complexity and operational burden
  • Workload distribution – There is no built-in mechanism for job routing across multiple clusters, which impacts configuration, resource allocation, cost transparency, and resilience
  • Resilience and high availability – Without load balancing, the environment lacks fault tolerance and high availability

BPG addresses these challenges by providing a single point of submission for Spark jobs. BPG automates job routing to the appropriate EMR on EKS clusters, providing effective load balancing, simplified endpoint management, and improved resilience. The proposed solution is particularly beneficial for customers with multi-cluster Amazon EMR on EKS setups using the Spark Kubernetes Operator, with or without the YuniKorn scheduler.

However, although BPG offers significant benefits, it is currently designed to work only with the Spark Kubernetes Operator. Additionally, BPG has not been tested with the Volcano scheduler, and the solution is not applicable in environments using native Amazon EMR on EKS APIs.

Solution overview

Martin Fowler describes a gateway as an object that encapsulates access to an external system or resource. In this case, the resource is the EMR on EKS clusters running Spark. A gateway acts as a single point of access to this resource. Any code or connection interacts only with the gateway's interface. The gateway then translates the incoming API request into the API offered by the resource.

BPG is a gateway specifically designed to provide a seamless interface to Spark on Kubernetes. It is a REST API service that abstracts the underlying Spark on EKS cluster details from users. It runs in its own EKS cluster and communicates with the Kubernetes API servers of the different EKS clusters. Spark users submit an application to BPG through clients, and BPG then routes the application to one of the underlying EKS clusters.

The process for submitting Spark jobs using BPG for Amazon EMR on EKS is as follows:

  1. The user submits a job to BPG using a client.
  2. BPG parses the request, translates it into a custom resource definition (CRD), and submits the CRD to an EMR on EKS cluster according to predefined rules (see the sketch after this list).
  3. The Spark Kubernetes Operator interprets the job specification and initiates the job on the cluster.
  4. The Kubernetes scheduler schedules and manages the running of the jobs.
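For reference, the custom resource consumed by the Spark Kubernetes Operator is a SparkApplication object. The following is a minimal, hand-written sketch of such a resource written out to a file; the metadata names and image URI are placeholders, and the exact CRD that BPG generates may differ:

cat <<'EOF' > /tmp/spark-pi-example.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-example        # placeholder name
  namespace: spark-operator     # placeholder namespace
spec:
  type: Scala
  mode: cluster
  image: <SPARK_IMAGE_URI>      # placeholder image URI
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///usr/lib/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: emr-containers-sa-spark
  executor:
    instances: 1
    cores: 1
    memory: 2g
EOF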

The following figure illustrates the high-level details of BPG. You can read more about BPG in the GitHub README.

Image showing the high-level details of Batch Processing Gateway

The proposed solution involves implementing BPG for multiple underlying EMR on EKS clusters, which effectively resolves the drawbacks discussed earlier. The following diagram illustrates the details of the solution.

Image showing the end-to-end architecture of Batch Processing Gateway

Source Code

You can find the code base in the AWS Samples and Batch Processing Gateway GitHub repositories.

In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Clone the repositories to your local machine

We assume that all repositories are cloned into the home directory (~/). All relative paths provided are based on this assumption. If you have cloned the repositories to a different location, adjust the paths accordingly.

  1. Clone the BPG on EMR on EKS GitHub repo with the following command:
cd ~/
git clone git@github.com:aws-samples/batch-processing-gateway-on-emr-on-eks.git

The BPG repository is currently under active development. To provide a stable deployment experience consistent with the provided instructions, we have pinned the repository to the stable commit hash aa3e5c8be973bee54ac700ada963667e5913c865.

Before cloning the repository, verify any security updates and adhere to your organization's security practices.

  2. Clone the BPG GitHub repo with the following command:
git clone git@github.com:apple/batch-processing-gateway.git
cd batch-processing-gateway
git checkout aa3e5c8be973bee54ac700ada963667e5913c865

Create two EMR on EKS clusters

The creation of EMR on EKS clusters is not the primary focus of this post. For comprehensive instructions, refer to Running Spark jobs with the Spark operator. However, for your convenience, we have included the steps for setting up the EMR on EKS virtual clusters named spark-cluster-a-v and spark-cluster-b-v in the GitHub repo. Follow these steps to create the clusters.

After successfully completing the steps, you should have two EMR on EKS virtual clusters named spark-cluster-a-v and spark-cluster-b-v running on the EKS clusters spark-cluster-a and spark-cluster-b, respectively.

To verify the successful creation of the clusters, open the Amazon EMR console and choose Virtual clusters under EMR on EKS in the navigation pane.

Image showing the Amazon EMR on EKS setup
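If you prefer the AWS CLI over the console, you can list the running virtual clusters instead (this assumes your AWS Region is already configured; we export AWS_REGION in the next section):

aws emr-containers list-virtual-clusters \
--query "virtualClusters[?state=='RUNNING'].{Name:name,Id:id}" \
--output table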

Set up BPG on Amazon EKS

To set up BPG on Amazon EKS, complete the following steps:

  1. Change to the appropriate directory:
cd ~/batch-processing-gateway-on-emr-on-eks/bpg/

  2. Set the AWS Region:
export AWS_REGION="<AWS_REGION>"

  3. Create a key pair. Make sure you follow your organization's best practices for key pair management.
aws ec2 create-key-pair \
--region "$AWS_REGION" \
--key-name ekskp \
--key-type ed25519 \
--key-format pem \
--query "KeyMaterial" \
--output text > ekskp.pem
chmod 400 ekskp.pem
ssh-keygen -y -f ekskp.pem > eks_publickey.pem
chmod 400 eks_publickey.pem

Now you're ready to create the EKS cluster.

By default, eksctl creates an EKS cluster in a dedicated virtual private cloud (VPC). To avoid reaching the default soft limit on the number of VPCs in an account, we use the --vpc-public-subnets parameter to create clusters in an existing VPC. For this post, we use the default VPC for deploying the solution. Modify the following code to deploy the solution in the appropriate VPC in accordance with your organization's best practices. For official guidance, refer to Create a VPC.
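If your organization requires a specific VPC rather than the default one, a possible alternative to the next step is to collect that VPC's public subnet IDs yourself in the same comma-separated form that eksctl expects; the VPC ID below is a placeholder:

# Placeholder VPC ID -- replace with the VPC mandated by your organization
export TARGET_VPC_ID="vpc-0123456789abcdef0"
export DEFAULT_FOR_AZ_SUBNET=$(aws ec2 describe-subnets \
--region "$AWS_REGION" \
--filters "Name=vpc-id,Values=$TARGET_VPC_ID" "Name=map-public-ip-on-launch,Values=true" \
--query "Subnets[*].SubnetId" | jq -r '. | join(",")')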

  4. Get the public subnets for your VPC:
export DEFAULT_FOR_AZ_SUBNET=$(aws ec2 describe-subnets --region "$AWS_REGION" --filters "Name=default-for-az,Values=true" --query "Subnets[?AvailabilityZone != 'us-east-1e'].SubnetId" | jq -r '. | map(tostring) | join(",")')

  5. Create the cluster:
eksctl create cluster \
--name bpg-cluster \
--region "$AWS_REGION" \
--vpc-public-subnets "$DEFAULT_FOR_AZ_SUBNET" \
--with-oidc \
--ssh-access \
--ssh-public-key eks_publickey.pem \
--instance-types=m5.xlarge \
--managed

  6. On the Amazon EKS console, choose Clusters in the navigation pane and check for the successful provisioning of the bpg-cluster.

Image showing the Amazon EKS based BPG cluster setup
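You can also confirm the cluster state from the AWS CLI; a status of ACTIVE means the cluster has finished provisioning:

aws eks describe-cluster \
--name bpg-cluster \
--region "$AWS_REGION" \
--query "cluster.status" \
--output text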

In the next steps, we make modifications to the existing batch-processing-gateway code base. For your convenience, we have provided the updated files in the batch-processing-gateway-on-emr-on-eks repository. You can copy these files into the batch-processing-gateway repository.

  7. Replace the POM XML file:
cp ~/batch-processing-gateway-on-emr-on-eks/bpg/pom.xml ~/batch-processing-gateway/pom.xml

  8. Replace the DAO Java file:
cp ~/batch-processing-gateway-on-emr-on-eks/bpg/LogDao.java ~/batch-processing-gateway/src/main/java/com/apple/spark/core/LogDao.java

  9. Replace the Dockerfile:
cp ~/batch-processing-gateway-on-emr-on-eks/bpg/Dockerfile ~/batch-processing-gateway/Dockerfile

Now you're ready to build your Docker image.

  10. Create a private Amazon Elastic Container Registry (Amazon ECR) repository:
aws ecr create-repository --repository-name bpg --region "$AWS_REGION"

  11. Get the AWS account ID:
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)

  12. Authenticate Docker to your ECR registry:
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID".dkr.ecr."$AWS_REGION".amazonaws.com

  13. Build your Docker image:
cd ~/batch-processing-gateway/
docker build \
--platform linux/amd64 \
--build-arg VERSION="1.0.0" \
--build-arg BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--build-arg GIT_COMMIT=$(git rev-parse HEAD) \
--progress=plain \
--no-cache \
-t bpg:1.0.0 .

  14. Tag your image:
docker tag bpg:1.0.0 "$AWS_ACCOUNT_ID".dkr.ecr."$AWS_REGION".amazonaws.com/bpg:1.0.0

  15. Push the image to your ECR repository:
docker push "$AWS_ACCOUNT_ID".dkr.ecr."$AWS_REGION".amazonaws.com/bpg:1.0.0

The ImagePullPolicy in the batch-processing-gateway GitHub repo is set to IfNotPresent. Update the image tag if you need to update the image.
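For example, if you later change the image, one option is to rebuild and push it under a new tag (the 1.0.1 tag below is illustrative) and then reference that tag in the Helm values later on:

cd ~/batch-processing-gateway/
docker build --platform linux/amd64 -t bpg:1.0.1 .
docker tag bpg:1.0.1 "$AWS_ACCOUNT_ID".dkr.ecr."$AWS_REGION".amazonaws.com/bpg:1.0.1
docker push "$AWS_ACCOUNT_ID".dkr.ecr."$AWS_REGION".amazonaws.com/bpg:1.0.1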

  16. To verify the successful creation and upload of the Docker image, open the Amazon ECR console, choose Repositories under Private registry in the navigation pane, and locate the bpg repository:

Image showing the Amazon ECR setup

Set up an Amazon Aurora MySQL database

Complete the following steps to set up an Amazon Aurora MySQL-Compatible Edition database:

  1. List all default subnets for the given Availability Zone in a specific format:
DEFAULT_FOR_AZ_SUBNET_RFMT=$(aws ec2 describe-subnets --region "$AWS_REGION" --filters "Name=default-for-az,Values=true" --query "Subnets[*].SubnetId" | jq -c '.')

  2. Create a subnet group. Refer to create-db-subnet-group for more details.
aws rds create-db-subnet-group \
--db-subnet-group-name bpg-rds-subnetgroup \
--db-subnet-group-description "BPG Subnet Group for RDS" \
--subnet-ids "$DEFAULT_FOR_AZ_SUBNET_RFMT" \
--region "$AWS_REGION"

  3. List the default VPC:
export DEFAULT_VPC=$(aws ec2 describe-vpcs --region "$AWS_REGION" --filters "Name=isDefault,Values=true" --query "Vpcs[0].VpcId" --output text)

  4. Create a security group:
aws ec2 create-security-group \
--group-name bpg-rds-securitygroup \
--description "BPG Security Group for RDS" \
--vpc-id "$DEFAULT_VPC" \
--region "$AWS_REGION"

  5. List the bpg-rds-securitygroup security group ID:
export BPG_RDS_SG=$(aws ec2 describe-security-groups --filters "Name=group-name,Values=bpg-rds-securitygroup" --query "SecurityGroups[*].GroupId" --output text)

  6. Create the Aurora DB Regional cluster. Refer to create-db-cluster for more details.
aws rds create-db-cluster \
--database-name bpg \
--db-cluster-identifier bpg \
--engine aurora-mysql \
--engine-version 8.0.mysql_aurora.3.06.1 \
--master-username admin \
--manage-master-user-password \
--db-subnet-group-name bpg-rds-subnetgroup \
--vpc-security-group-ids "$BPG_RDS_SG" \
--region "$AWS_REGION"

  7. Create a DB writer instance in the cluster. Refer to create-db-instance for more details.
aws rds create-db-instance \
--db-instance-identifier bpg \
--db-cluster-identifier bpg \
--db-instance-class db.r5.large \
--engine aurora-mysql \
--region "$AWS_REGION"

  8. To verify the successful creation of the RDS Regional cluster and writer instance, on the Amazon RDS console, choose Databases in the navigation pane and check for the bpg database.

Image showing the RDS setup
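You can also poll the cluster and writer instance from the CLI; both should eventually report available:

aws rds describe-db-clusters \
--db-cluster-identifier bpg \
--region "$AWS_REGION" \
--query "DBClusters[0].Status" \
--output text

aws rds describe-db-instances \
--db-instance-identifier bpg \
--region "$AWS_REGION" \
--query "DBInstances[0].DBInstanceStatus" \
--output text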

Set up network connectivity

Security groups for EKS clusters are typically associated with the nodes and the control plane (if using managed nodes). In this section, we configure the networking to allow the node security group of the bpg-cluster to communicate with spark-cluster-a, spark-cluster-b, and the bpg Aurora RDS cluster.

  1. Identify the security groups of bpg-cluster, spark-cluster-a, spark-cluster-b, and the bpg Aurora RDS cluster:
# Identify the node security group of the bpg-cluster
BPG_CLUSTER_NODEGROUP_SG=$(aws ec2 describe-instances \
--filters Name=tag:eks:cluster-name,Values=bpg-cluster \
--query "Reservations[*].Instances[*].SecurityGroups[?contains(GroupName, 'eks-cluster-sg-bpg-cluster-')].GroupId" \
--region "$AWS_REGION" \
--output text | uniq)

# Identify the cluster security group of spark-cluster-a and spark-cluster-b
SPARK_A_CLUSTER_SG=$(aws eks describe-cluster --name spark-cluster-a --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)
SPARK_B_CLUSTER_SG=$(aws eks describe-cluster --name spark-cluster-b --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)

# Identify the security group of the bpg Aurora RDS cluster writer instance
BPG_RDS_WRITER_SG=$(aws ec2 describe-security-groups --filters "Name=group-name,Values=bpg-rds-securitygroup" --query "SecurityGroups[*].GroupId" --output text)

  2. Allow the node security group of the bpg-cluster to communicate with spark-cluster-a, spark-cluster-b, and the bpg Aurora RDS cluster:
# spark-cluster-a
aws ec2 authorize-security-group-ingress --group-id "$SPARK_A_CLUSTER_SG" --protocol tcp --port 443 --source-group "$BPG_CLUSTER_NODEGROUP_SG"

# spark-cluster-b
aws ec2 authorize-security-group-ingress --group-id "$SPARK_B_CLUSTER_SG" --protocol tcp --port 443 --source-group "$BPG_CLUSTER_NODEGROUP_SG"

# bpg-rds
aws ec2 authorize-security-group-ingress --group-id "$BPG_RDS_WRITER_SG" --protocol tcp --port 3306 --source-group "$BPG_CLUSTER_NODEGROUP_SG"

Deploy BPG

We deploy BPG for weight-based cluster selection. spark-cluster-a-v and spark-cluster-b-v are configured with a queue named dev and weight=50. We expect a statistically equal distribution of jobs between the two clusters. For more information, refer to Weight Based Cluster Selection.
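Conceptually, weight-based selection picks a cluster with probability proportional to its weight, so two clusters with weight=50 each receive roughly half the jobs. The following sketch only illustrates that idea; it is not BPG's actual implementation:

# Illustration only -- not BPG code. Picks a cluster with probability
# proportional to its weight (50/50 here means an even split).
WEIGHT_A=50
WEIGHT_B=50
if (( RANDOM % (WEIGHT_A + WEIGHT_B) < WEIGHT_A )); then
  echo "route job to spark-cluster-a-v"
else
  echo "route job to spark-cluster-b-v"
fi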

  1. Get the bpg-cluster context:
BPG_CLUSTER_CONTEXT=$(kubectl config view --output=json | jq -r '.contexts[] | select(.name | contains("bpg-cluster")) | .name')
kubectl config use-context "$BPG_CLUSTER_CONTEXT"

  2. Create a Kubernetes namespace for BPG:
kubectl create namespace bpg

The Helm chart for BPG requires a values.yaml file. This file includes various key-value pairs for each EMR on EKS cluster, EKS cluster, and Aurora cluster. Manually updating the values.yaml file can be cumbersome. To simplify this process, we have automated the creation of the values.yaml file.

  3. Run the following script to generate the values.yaml file:
cd ~/batch-processing-gateway-on-emr-on-eks/bpg
chmod 755 create-bpg-values-yaml.sh
./create-bpg-values-yaml.sh

  4. Use the following code to deploy the Helm chart. Make sure the tag value in both values.template.yaml and values.yaml matches the Docker image tag specified earlier.
cp ~/batch-processing-gateway/helm/batch-processing-gateway/values.yaml ~/batch-processing-gateway/helm/batch-processing-gateway/values.yaml.$(date +'%Y%m%d%H%M%S') \
&& cp ~/batch-processing-gateway-on-emr-on-eks/bpg/values.yaml ~/batch-processing-gateway/helm/batch-processing-gateway/values.yaml \
&& cd ~/batch-processing-gateway/helm/batch-processing-gateway/

kubectl config use-context "$BPG_CLUSTER_CONTEXT"

helm install batch-processing-gateway . --values values.yaml -n bpg

  5. Verify the deployment by listing the pods and viewing the pod logs:
kubectl get pods --namespace bpg
kubectl logs <BPG-PODNAME> --namespace bpg
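If you are scripting these steps, you can optionally wait until the BPG pod reports Ready before continuing (the timeout value here is arbitrary):

kubectl wait pod --all --namespace bpg --for=condition=Ready --timeout=300s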

  6. Exec into the BPG pod and verify the health check:
kubectl exec -it <BPG-PODNAME> -n bpg -- bash
curl -u admin:admin localhost:8080/skatev2/healthcheck/status

We get the following output:

{"standing":"OK"}

BPG is successfully deployed on the EKS cluster.

Test the solution

To test the solution, you can submit multiple Spark jobs by running the following sample code several times. The code submits the SparkPi Spark job to BPG, which in turn submits the jobs to the EMR on EKS clusters based on the set parameters.

  1. Set the kubectl context to the bpg-cluster:
kubectl config get-contexts | awk 'NR==1 || /bpg-cluster/'
kubectl config use-context "<CONTEXT_NAME>"

  2. Identify the bpg pod name:
kubectl get pods --namespace bpg

  3. Exec into the bpg pod:

kubectl exec -it "<BPG-PODNAME>" -n bpg -- bash

  4. Submit multiple Spark jobs using curl. Run the following curl command to submit jobs to spark-cluster-a and spark-cluster-b:
curl -u user:pass localhost:8080/skatev2/spark -i -X POST \
-H 'Content-Type: application/json' \
-d '{
  "applicationName": "SparkPiDemo",
  "queue": "dev",
  "sparkVersion": "3.5.0",
  "mainApplicationFile": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
  "mainClass": "org.apache.spark.examples.SparkPi",
  "driver": {
    "cores": 1,
    "memory": "2g",
    "serviceAccount": "emr-containers-sa-spark",
    "labels": {
      "version": "3.5.0"
    }
  },
  "executor": {
    "instances": 1,
    "cores": 1,
    "memory": "2g",
    "labels": {
      "version": "3.5.0"
    }
  }
}'

After each submission, BPG informs you of the cluster to which the job was submitted. For example:

HTTP/1.1 200 OK
Date: Sat, 10 Aug 2024 16:17:15 GMT
Content-Type: application/json
Content-Length: 67
{"submissionId":"spark-cluster-a-f72a7ddcfde14f4390194d4027c1e1d6"}
{"submissionId":"spark-cluster-a-d1b359190c7646fa9d704122fbf8c580"}
{"submissionId":"spark-cluster-b-7b61d5d512bb4adeb1dd8a9977d605df"}

  5. Verify that the jobs are running in the EMR clusters spark-cluster-a and spark-cluster-b:
kubectl config get-contexts | awk 'NR==1 || /spark-cluster-(a|b)/'
kubectl get pods -n spark-operator --context "<CONTEXT_NAME>"

You can view the Spark driver logs to find the value of Pi as shown below:

kubectl logs <SPARK-DRIVER-POD-NAME> --namespace spark-operator --context "<CONTEXT_NAME>"

After successful completion of the job, you should be able to see the following message in the logs:

Pi is roughly 3.1452757263786317

We have successfully tested the weight-based routing of Spark jobs across multiple clusters.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the EMR on EKS virtual clusters:
VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --region="$AWS_REGION" --query "virtualClusters[?name=='spark-cluster-a-v' && state=='RUNNING'].id" --output text)
aws emr-containers delete-virtual-cluster --region="$AWS_REGION" --id "$VIRTUAL_CLUSTER_ID"
VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --region="$AWS_REGION" --query "virtualClusters[?name=='spark-cluster-b-v' && state=='RUNNING'].id" --output text)
aws emr-containers delete-virtual-cluster --region="$AWS_REGION" --id "$VIRTUAL_CLUSTER_ID"

  2. Delete the AWS Identity and Access Management (IAM) role:
aws iam delete-role-policy --role-name sparkjobrole --policy-name EMR-Spark-Job-Execution
aws iam delete-role --role-name sparkjobrole

  3. Delete the RDS DB instance and DB cluster:
aws rds delete-db-instance \
--db-instance-identifier bpg \
--skip-final-snapshot

aws rds delete-db-cluster \
--db-cluster-identifier bpg \
--skip-final-snapshot

  4. Delete the bpg-rds-securitygroup security group and bpg-rds-subnetgroup subnet group:
BPG_SG=$(aws ec2 describe-security-groups --filters "Name=group-name,Values=bpg-rds-securitygroup" --query "SecurityGroups[*].GroupId" --output text)
aws ec2 delete-security-group --group-id "$BPG_SG"
aws rds delete-db-subnet-group --db-subnet-group-name bpg-rds-subnetgroup

  5. Delete the EKS clusters:
eksctl delete cluster --region="$AWS_REGION" --name=bpg-cluster
eksctl delete cluster --region="$AWS_REGION" --name=spark-cluster-a
eksctl delete cluster --region="$AWS_REGION" --name=spark-cluster-b

  6. Delete the bpg ECR repository:
aws ecr delete-repository --repository-name bpg --region="$AWS_REGION" --force

  7. Delete the key pairs:
aws ec2 delete-key-pair --key-name ekskp
aws ec2 delete-key-pair --key-name emrkp

Conclusion

In this post, we explored the challenges associated with managing workloads on EMR on EKS clusters and demonstrated the advantages of adopting a multi-cluster deployment pattern. We introduced Batch Processing Gateway (BPG) as a solution to these challenges, showcasing how it simplifies job management, enhances resilience, and improves horizontal scalability in multi-cluster environments. By implementing BPG, we illustrated the practical application of the gateway architecture pattern for submitting Spark jobs on Amazon EMR on EKS. This post provides a comprehensive understanding of the problem, the benefits of the gateway architecture, and the steps to implement BPG effectively.

We encourage you to evaluate your existing Spark on Amazon EMR on EKS implementation and consider adopting this solution. It allows users to submit, examine, and delete Spark applications on Kubernetes with intuitive API calls, without needing to worry about the underlying complexities.

For this post, we focused on the implementation details of BPG. As a next step, you can explore integrating BPG with clients such as Apache Airflow, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), or Jupyter notebooks. BPG works well with the Apache YuniKorn scheduler. You can also explore integrating BPG to use YuniKorn queues for job submission.


About the Authors

Image of Author: Umair Nawaz

Umair Nawaz is a Senior DevOps Architect at Amazon Web Services. He works on building secure architectures and advises enterprises on agile software delivery. He is motivated to solve problems strategically by utilizing modern technologies.

Image of Author: Ravikiran Rao

Ravikiran Rao is a Data Architect at Amazon Web Services and is passionate about solving complex data challenges for various customers. Outside of work, he is a theater enthusiast and amateur tennis player.

Image of Author: Sri Potluri

Sri Potluri is a Cloud Infrastructure Architect at Amazon Web Services. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans a range of cloud technologies, ensuring scalable and reliable infrastructure tailored to each project's unique challenges.

Image of Author: Suvojit Dasgupta

Suvojit Dasgupta is a Principal Data Architect at Amazon Web Services. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.
