Attribute Amazon EMR on EC2 costs to your end-users

Amazon EMR on EC2 is a managed service that makes it easy to run big data processing and analytics workloads on AWS. It simplifies the setup and management of popular open source frameworks like Apache Hadoop and Apache Spark, allowing you to focus on extracting insights from large datasets rather than the underlying infrastructure. With Amazon EMR, you can take advantage of the power of these big data tools to process, analyze, and gain valuable business intelligence from vast amounts of data.

Cost optimization is one of the pillars of the Well-Architected Framework. It focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in and out to meet business needs without overspending. An optimized workload maximizes the use of all available resources, delivers the desired outcome at the most cost-effective price point, and meets your functional needs.

The current Amazon EMR pricing page shows the estimated cost of the cluster. You can also use AWS Cost Explorer to get more detailed information about your costs. These views give you an overall picture of your Amazon EMR costs. However, you may need to attribute costs at the individual Spark job level. For example, you might want to know the usage cost in Amazon EMR for the finance business unit. Or, for chargeback purposes, you might need to aggregate the cost of Spark applications by functional area. After you have allocated costs to individual Spark jobs, this data can help you make informed decisions to optimize your costs. For instance, you could choose to restructure your applications to use fewer resources. Alternatively, you might decide to explore different pricing models like Amazon EMR on EKS or Amazon EMR Serverless.

In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. We describe an approach that assigns Amazon EMR costs to different jobs, teams, or lines of business. You can use this capability to distribute costs across various business units, which can assist you in monitoring the return on investment for your Spark-based workloads.

Solution overview

The solution is designed to help you track the cost of your Spark applications running on EMR on EC2. It can help you identify cost optimizations and improve the cost-efficiency of your EMR clusters.

The proposed solution uses a scheduled AWS Lambda function that runs daily. The function captures usage and cost metrics, which are subsequently stored in Amazon Relational Database Service (Amazon RDS) tables. The data stored in the RDS tables is then queried to derive chargeback figures and generate reporting trends using Amazon QuickSight. Using these AWS services incurs additional costs for implementing this solution. Alternatively, you can consider an approach that uses a cron-based agent script installed on your existing EMR cluster, if you want to avoid using additional AWS services and the associated costs of building your chargeback solution. The script stores the relevant metrics in an Amazon Simple Storage Service (Amazon S3) bucket and uses Python Jupyter notebooks to generate chargeback numbers based on the data files stored in Amazon S3, using AWS Glue tables.

The following diagram shows the solution architecture.

Solution architecture diagram

The workflow consists of the following steps:

  1. A Lambda function gets the following parameters from Parameter Store, a capability of AWS Systems Manager:
    {
      "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
      "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
      "tbl_applicationlogs": "public.emr_applications_execution_log",
      "tbl_emrcost": "public.emr_cluster_usage_cost",
      "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
      "emrcluster_id": "j-xxxxxxxxxx",
      "emrcluster_name": "EMR_Cost_Measure",
      "emrcluster_role": "dt-dna-shared",
      "emrcluster_linkedaccount": "xxxxxxxxxxx",
      "postgres_rds": {
        "host": "xxxxxxxxx.amazonaws.com",
        "dbname": "postgres",
        "consumer": "postgresadmin",
        "secretid": "postgressecretid"
      }
    }

  2. The Lambda function extracts Spark application run logs from the EMR cluster using the Resource Manager API. The following metrics are extracted as part of the process: vcore-seconds, memory MB-seconds, and storage GB-seconds.
  3. The Lambda function captures the daily cost of EMR clusters from Cost Explorer.
  4. The Lambda function also extracts EMR On-Demand and Spot Instance usage data using the Amazon Elastic Compute Cloud (Amazon EC2) Boto3 APIs.
  5. The Lambda function loads these datasets into an RDS database.
  6. The cost of running a Spark application is determined by the amount of CPU resources it uses, compared to the total CPU usage of all Spark applications. This information is used to distribute the overall cost among different teams, business lines, or EMR queues.

The extraction process runs daily, extracting the previous day's data and storing it in an Amazon RDS for PostgreSQL table. The historical data in the table needs to be purged based on your use case.
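
The following is a minimal sketch of what the extraction in step 2 could look like, assuming the yarn_url value from Parameter Store points to a ResourceManager that is reachable from the Lambda function's subnet; the vcoreSeconds and memorySeconds fields come from the ResourceManager /ws/v1/cluster/apps endpoint, and the database load is omitted.

import datetime
import requests

yarn_url = "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps"  # value read from Parameter Store

# Yesterday's window in epoch milliseconds, which is what the ResourceManager API expects
today_start = datetime.datetime.combine(datetime.date.today(), datetime.time.min)
yesterday_start = today_start - datetime.timedelta(days=1)
params = {
    "finishedTimeBegin": int(yesterday_start.timestamp() * 1000),
    "finishedTimeEnd": int(today_start.timestamp() * 1000),
}

apps = requests.get(yarn_url, params=params, timeout=30).json().get("apps") or {}
for app in apps.get("app", []):
    record = {
        "app_id": app["id"],
        "app_name": app["name"],
        "queue": app["queue"],
        "job_state": app["state"],
        "job_status": app["finalStatus"],
        "runtime_seconds": app["elapsedTime"] / 1000,
        "vcore_seconds": app["vcoreSeconds"],
        "memory_seconds": app["memorySeconds"],
        "running_containers": app.get("runningContainers", 0),
    }
    print(record)  # the solution inserts rows like this into the application log tables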

The solution is open source and available on GitHub.

You can use the AWS Cloud Development Kit (AWS CDK) to deploy the Lambda function, RDS for PostgreSQL data model tables, and a QuickSight dashboard to track EMR cluster cost at the job, team, or business unit level.

The following schemas show the tables used in the solution, which are queried by QuickSight to populate the dashboard.

  • emr_applications_execution_log_lz or public.emr_applications_execution_log – Stores daily run metrics for all jobs run on the EMR cluster:
    • appdatecollect – Log collection date
    • app_id – Spark job run ID
    • app_name – Run name
    • queue – EMR queue in which the job was run
    • job_state – Job running state
    • job_status – Job run final status (Succeeded or Failed)
    • starttime – Job start time
    • endtime – Job end time
    • runtime_seconds – Runtime in seconds
    • vcore_seconds – Consumed vCore CPU in seconds
    • memory_seconds – Memory consumed
    • running_containers – Containers used
    • rm_clusterid – EMR cluster ID
  • emr_cluster_usage_cost – Captures Amazon EMR and Amazon EC2 daily cost consumption from Cost Explorer and loads the data into the RDS table:
    • costdatecollect – Cost collection date
    • startdate – Cost start date
    • enddate – Cost end date
    • emr_unique_tag – Tag associated with the EMR cluster
    • net_unblendedcost – Total net unblended daily dollar cost
    • unblendedcost – Total unblended daily dollar cost
    • cost_type – Daily cost
    • service_name – AWS service for which the cost was incurred (Amazon EMR and Amazon EC2)
    • emr_clusterid – EMR cluster ID
    • emr_clustername – EMR cluster name
    • loadtime – Table load date/time
  • emr_cluster_instances_usage – Captures the aggregated resource usage (vCores) and allocated resources for each EMR cluster node, and helps identify the idle time of the cluster:
    • instancedatecollect – Instance usage collection date
    • emr_instance_day_run_seconds – EMR instance active seconds in the day
    • emr_region – EMR cluster AWS Region
    • emr_clusterid – EMR cluster ID
    • emr_clustername – EMR cluster name
    • emr_cluster_fleet_type – EMR cluster fleet type
    • emr_node_type – Instance node type
    • emr_market – Market type (On-Demand or Spot)
    • emr_instance_type – Instance size
    • emr_ec2_instance_id – Corresponding EC2 instance ID
    • emr_ec2_status – Running status
    • emr_ec2_default_vcpus – Allocated vCPUs
    • emr_ec2_memory – EC2 instance memory
    • emr_ec2_creation_datetime – EC2 instance creation date/time
    • emr_ec2_end_datetime – EC2 instance end date/time
    • emr_ec2_ready_datetime – EC2 instance ready date/time
    • loadtime – Table load date/time

Prerequisites

You should have the following prerequisites before implementing the solution:

  • An EMR on EC2 cluster.
  • The EMR cluster must have a unique tag value defined. You can assign the tag directly on the Amazon EMR console or using Tag Editor (a Boto3 tagging sketch follows this list). The recommended tag key is cost-center, along with a unique value for your EMR cluster. After you create and apply user-defined tags, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation.
  • Activate the tag in AWS Billing. It takes about 24 hours to activate the tag if not done before. To activate the tag, follow these steps:
    • On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
    • Select the tag key that you want to activate.
    • Choose Activate.
  • The Spark application's name should follow a standardized naming convention. It consists of seven components separated by underscores. These components are used to summarize the resource consumption and cost in the final report. For example: HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD, FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD, or MKT_CAMPAIGN_CRM_CRMDB_TOPRATEDCAMPAIGN_DLY_LD. The application name must be supplied with the spark-submit command using the --name parameter, following the standardized naming convention. If any of these components don't have a value, hardcode the values with the following suggested names:
    • frequency
    • job_type
    • Business_unit
  • The Lambda function should be able to connect to Cost Explorer, connect to the EMR cluster through the Resource Manager APIs, and load data into the RDS for PostgreSQL database. To do this, you need to configure the Lambda function as follows:
    • VPC configuration – The Lambda function should be able to access the EMR cluster, Cost Explorer, AWS Secrets Manager, and Parameter Store. If access isn't in place already, you can do this by creating a virtual private cloud (VPC) that includes the EMR cluster, creating VPC endpoints for Parameter Store and Secrets Manager, and attaching them to the VPC. Because there is no VPC endpoint available for Cost Explorer, a private subnet and a route table are required to send VPC traffic to a public NAT gateway so that Lambda can connect to Cost Explorer. If your EMR cluster is in a public subnet, you must create a private subnet along with a custom route table and a public NAT gateway, which allows the Cost Explorer connection to flow from the VPC private subnet. Refer to How do I set up a NAT gateway for a private subnet in Amazon VPC? for setup instructions, and attach the newly created private subnet to the Lambda function explicitly.
    • IAM role – The Lambda function needs an AWS Identity and Access Management (IAM) role with the following permissions: AmazonEC2ReadOnlyAccess, AWSCostExplorerFullAccess, and AmazonRDSDataFullAccess. This role is created automatically during AWS CDK stack deployment; you don't need to set it up separately.
  • The AWS CDK should be installed in AWS Cloud9 (preferred) or another development environment such as VS Code or PyCharm. For more information, refer to Prerequisites.
  • The RDS for PostgreSQL database (v10 or higher) credentials should be stored in Secrets Manager. For more information, refer to Storing database credentials in AWS Secrets Manager.
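
If you prefer to tag the cluster programmatically rather than on the console, the following is a minimal sketch using Boto3; the cluster ID and tag value are placeholders, and the tag still has to be activated on the Billing console as described above.

import boto3

emr = boto3.client("emr")
emr.add_tags(
    ResourceId="j-xxxxxxxxxx",  # your EMR cluster ID
    Tags=[{"Key": "cost-center", "Value": "emr-chargeback-demo"}],  # unique tag value (placeholder)
)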

Create RDS tables

Create the data model tables defined in emr-cost-rds-tables-ddl.sql by logging in to the RDS for PostgreSQL database manually and running the DDL in the public schema.

Use DBeaver or any compatible SQL client to connect to the RDS instance and validate that the tables were created.
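
The following is a minimal sketch of running the DDL with psycopg2, assuming the connection details from cdk.context.json and a secret that stores the database password under a password key; you can equally run the file from psql or DBeaver.

import json

import boto3
import psycopg2

# Password retrieved from Secrets Manager; assumes the secret JSON has a "password" key
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId="postgressecretid")["SecretString"]
)

conn = psycopg2.connect(
    host="xxxxxxxxx.amazonaws.com",  # RDS endpoint from cdk.context.json
    dbname="postgres",
    user="postgresadmin",
    password=secret["password"],
)
with conn, conn.cursor() as cur:
    cur.execute(open("emr-cost-rds-tables-ddl.sql").read())  # creates the tables in the public schema
conn.close()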

Deploy AWS CDK stacks

Complete the steps in this section to deploy the following resources using the AWS CDK:

  • Parameter Store to store required parameter values
  • IAM role for the Lambda function to help connect to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, and Parameter Store
  • Lambda function
  1. Clone the GitHub repo:
    git clone git@github.com:aws-samples/attribute-amazon-emr-costs-to-your-end-users.git

  2. Update the following environment parameters in cdk.context.json (this file can be found in the main directory):
    1. yarn_url – YARN ResourceManager URL to read job run logs and metrics. This URL should be accessible within the VPC where Lambda will be deployed.
    2. tbl_applicationlogs_lz – RDS temp table to store EMR application run logs.
    3. tbl_applicationlogs – RDS table to store EMR application run logs.
    4. tbl_emrcost – RDS table to capture daily EMR cluster usage cost.
    5. tbl_emrinstance_usage – RDS table to store EMR cluster instance usage information.
    6. emrcluster_id – EMR cluster instance ID.
    7. emrcluster_name – EMR cluster name.
    8. emrcluster_tag – Tag key assigned to the EMR cluster.
    9. emrcluster_tag_value – Unique value for the EMR cluster tag.
    10. emrcluster_role – Service role for Amazon EMR (EMR role).
    11. emrcluster_linkedaccount – Account ID under which the EMR cluster is running.
    12. postgres_rds – RDS for PostgreSQL connection details.
    13. vpc_id – VPC ID in which the EMR cluster is configured and the cost metering Lambda function will be deployed.
    14. vpc_subnets – Comma-separated private subnet IDs associated with the VPC.
    15. sg_id – EMR security group ID.

The following is a sample cdk.context.json file after being populated with the parameters:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMRClusterName",
  "emrcluster_tag": "EMRClusterTag",
  "emrcluster_tag_value": "EMRClusterUniqueTagValue",
  "emrcluster_role": "EMRClusterServiceRole",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "dbname",
    "consumer": "username",
    "secretid": "DatabaseUserSecretID"
  },
  "vpc_id": "xxxxxxxxx",
  "vpc_subnets": "subnet-xxxxxxxxxxx",
  "sg_id": "xxxxxxxxxx"
}

You can choose to deploy the AWS CDK stack using AWS Cloud9 or any other development environment according to your needs. For instructions to set up AWS Cloud9, refer to Getting started: basic tutorials for AWS Cloud9.

  1. Go to AWS Cloud9, choose File, then Upload Local Files, and upload the project folder.
  2. Deploy the AWS CDK stack with the following code:
    cd attribute-amazon-emr-costs-to-your-end-users/
    pip install -r requirements.txt
    cdk deploy --all

The deployed Lambda function requires two external libraries: psycopg2 and requests. The corresponding layers need to be created and assigned to the Lambda function. For instructions to create a Lambda layer for the requests module, refer to Step-by-Step Guide to Creating an AWS Lambda Function Layer.

Creation of the psycopg2 package and layer is tied to the Python runtime version of the Lambda function. Provided that the Lambda function uses the Python 3.9 runtime, complete the following steps to create the corresponding layer package for psycopg2:

  1. Download psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl from https://pypi.org/project/psycopg2-binary/#files.
  2. Unzip the wheel and move its contents into a directory named python, then compress the python directory into a .zip file.

  3. Create a Lambda layer for psycopg2 using the .zip file.
  4. Assign the layer to the Lambda function by choosing Add a layer in the deployed function's properties.
  5. Validate the AWS CDK deployment.

Your Lambda function details should look similar to the following screenshot.

Lambda Function Screenshot

On the Systems Manager console, validate the Parameter Store content for actual values.

The IAM role details should look similar to the following code, which allows the Lambda function access to Amazon EMR and the underlying EC2 instances, Cost Explorer, CloudWatch, Secrets Manager, and Parameter Store:

{
  "Model": "2012-10-17",
  "Assertion": [
    {
      "Action": [
        "ce:GetCostAndUsage",
        "ce:ListCostAllocationTags",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstances",
        "ec2:DescribeNetworkInterfaces",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:List*",
        "ssm:Describe*",
        "ssm:Get*",
        "ssm:List*"
      ],
      "Useful resource": "*",
      "Impact": "Permit"
    },
    {
      "Motion": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Useful resource": "arn:aws:logs:*:*:*",
      "Impact": "Permit"
    },
    {
      "Motion": "secretsmanager:GetSecretValue",
      "Useful resource": "arn:aws:secretsmanager:*:*:*",
      "Impact": "Permit"
    }
  ]
}
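
As an illustration of what this policy permits, the following is a minimal sketch of the kind of Cost Explorer call the Lambda function makes, assuming the cost-center tag key, an activated tag value for the cluster, and example dates; the actual function stores the result in public.emr_cluster_usage_cost.

import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-05-02"},  # one-day window (example dates)
    Granularity="DAILY",
    Metrics=["UnblendedCost", "NetUnblendedCost"],
    Filter={"Tags": {"Key": "cost-center", "Values": ["EMRClusterUniqueTagValue"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],  # splits Amazon EMR and Amazon EC2 cost
)
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])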

Test the solution

To test the solution, you can run a Spark job that combines multiple files in the EMR cluster, and you can do this by creating separate steps within the cluster. Refer to Optimize Amazon EMR costs for legacy and Spark workloads for more details on how to add the jobs as steps to the EMR cluster.

  1. Use the following sample command to submit the Spark job (emr_union_job.py).
    It takes in three arguments:
    1. The Amazon S3 location of the data file that is read in by the Spark job. This path should not be changed. The input_full_path is s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet
    2. The S3 folder where the results are written to.
    3. A numeric value passed to the Spark job. By changing this input, you can make the job run for different amounts of time and also change the number of Spot nodes used.
spark-submit --deploy-mode cluster --name HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<your-bucket>/<output-folder>/ 6

spark-submit --deploy-mode cluster --name FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<your-bucket>/<output-folder>/ 12
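
If you prefer to add the job as a step programmatically instead of from the console, the following is a minimal sketch using Boto3; the cluster ID and output location are placeholders.

import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-xxxxxxxxxx",  # your EMR cluster ID
    Steps=[{
        "Name": "HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "--name", "HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD",
                "s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py",
                "s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet",
                "s3://<your-bucket>/<output-folder>/",  # replace with your output location
                "6",
            ],
        },
    }],
)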

The following screenshot shows the log of the steps run on the Amazon EMR console.

EMR Steps Execution

  1. Run the deployed Lambda function from the Lambda console. This loads the daily application log, EMR dollar usage, and EMR instance usage details into their respective RDS tables.
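
You can also trigger the collection from code. The following is a minimal sketch using Boto3; the function name is a placeholder, so check the name that the AWS CDK stack actually created.

import boto3

lambda_client = boto3.client("lambda")
lambda_client.invoke(
    FunctionName="emr-cost-metering-function",  # placeholder; use the function name created by the CDK stack
    InvocationType="Event",                     # asynchronous invocation
)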

The following screenshot of the Amazon RDS query editor shows the results for public.emr_applications_execution_log.

public.emr_applications_execution_log

The following screenshot shows the results for public.emr_cluster_usage_cost.

public.emr_cluster_usage_cost

The following screenshot shows the results for public.emr_cluster_instances_usage.

public.emr_cluster_instances_usage

Cost can be calculated using the preceding three tables based on your requirements. In the following SQL query, you calculate the cost based on the relative usage of all applications in a day. You first identify the total vcore-seconds of CPU consumed in a day and then find the percentage share for each application. This drives the cost based on the overall cluster cost in a day.

Consider the following example scenario, where 10 applications ran on the cluster for a given day. You would use the following sequence of steps to calculate the chargeback cost:

  1. Calculate the relative percentage usage of each application (vcore-seconds consumed by the application / total vcore-seconds consumed).
  2. Now that you have the relative resource consumption of each application, distribute the cluster cost to each application. Let's assume that the total EMR cluster cost for that date is $400.
app_id             app_name  runtime_seconds  vcore_seconds  % Relative Usage  Amazon EMR Cost ($)
application_00001  app1      10               120            5%                19.83
application_00002  app2      5                60             2%                9.91
application_00003  app3      4                45             2%                7.43
application_00004  app4      70               840            35%               138.79
application_00005  app5      21               300            12%               49.57
application_00006  app6      4                48             2%                7.93
application_00007  app7      12               150            6%                24.78
application_00008  app8      52               620            26%               102.44
application_00009  app9      12               130            5%                21.48
application_00010  app10     9                108            4%                17.84
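
The following is a minimal sketch of the same calculation in Python, using the vcore-seconds values from the preceding table and the assumed $400 daily cluster cost; the solution itself implements this logic in SQL.

daily_cluster_cost = 400.0  # total EMR cluster cost for the day, from emr_cluster_usage_cost
vcore_seconds_by_app = {
    "app1": 120, "app2": 60, "app3": 45, "app4": 840, "app5": 300,
    "app6": 48, "app7": 150, "app8": 620, "app9": 130, "app10": 108,
}

total_vcore_seconds = sum(vcore_seconds_by_app.values())
for app_name, vcore_seconds in vcore_seconds_by_app.items():
    share = vcore_seconds / total_vcore_seconds
    print(f"{app_name}: {share:.1%} of CPU -> ${daily_cluster_cost * share:.2f}")
# app4, for example, works out to about 35% of the CPU and roughly $138.79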

A sample chargeback cost calculation SQL query is available in the GitHub repo.

You can use the SQL query to create a report dashboard and plot multiple charts for insights. The following are two examples created using QuickSight.

The following is a daily bar chart.

Cost Daily Bar Chart

The following shows total dollars consumed.

Cost Pie chart

Solution cost

Let's assume we're calculating for an environment that runs 1,000 jobs daily, and we run this solution daily:

  • Lambda costs – Running the solution once a day requires 30 Lambda function invocations per month.
  • Amazon RDS cost – The total number of records in the public.emr_applications_execution_log table for a 30-day month would be 30,000 records, which translates to 5.72 MB of storage. If we consider the other two smaller tables and storage overhead, the overall monthly storage requirement would be roughly 12 MB.

In summary, the solution cost according to the AWS Pricing Calculator is $34.20/year, which is negligible.

Clean up

To avoid ongoing costs for the resources that you created, complete the following steps:

  • Delete the AWS CDK stacks.
  • Delete the QuickSight report and dashboard, if created.
  • Run the following SQL to drop the tables:
    drop table public.emr_applications_execution_log_lz;
    drop table public.emr_applications_execution_log;
    drop table public.emr_cluster_usage_cost;
    drop table public.emr_cluster_instances_usage;

Conclusion

With this solution, you can deploy a chargeback model to attribute costs to users and groups using the EMR cluster. You can also identify options for optimization, scaling, and separation of workloads to different clusters based on usage and growth needs.

You can collect the metrics for a longer duration to observe trends in the usage of Amazon EMR resources and use that for forecasting purposes.

If you have any thoughts or questions, leave them in the comments section.


About the Authors

Raj Patel is an AWS Lead Consultant for Data Analytics solutions based out of India. He specializes in building and modernizing analytical solutions. His background is in data warehouse/data lake architecture, development, and administration. He has been in the data and analytics field for over 14 years.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Gaurav Jain is a Sr Data Architect with AWS Professional Services, specialized in big data, and helps customers modernize their data platforms on the cloud. He is passionate about building the right analytics solutions to gain timely insights and make critical business decisions. Outside of work, he loves to spend time with his family and likes watching movies and sports.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in software development, architecture, and analytics from industries like finance, telecom, retail, and healthcare.
