Friday, October 25, 2024

Analyze Amazon EMR on Amazon EC2 cluster utilization with Amazon Athena and Amazon QuickSight


Gaining granular visibility into application-level costs on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) clusters presents an opportunity for customers looking for ways to further optimize resource utilization and implement fair cost allocation and chargeback models. By breaking down the usage of individual applications running in your EMR cluster, you can unlock several benefits:

  • Informed workload management – Application-level cost insights empower organizations to prioritize and schedule workloads effectively. Resource allocation decisions can be made with a better understanding of cost implications, potentially improving overall cluster performance and cost-efficiency.
  • Cost optimization – With granular cost attribution, organizations can identify cost-saving opportunities for individual applications. They can right-size underutilized resources or prioritize optimization efforts for applications that are driving high usage and costs.
  • Transparent billing – In multi-tenant environments, organizations can implement fair and transparent cost allocation models based on individual application resource consumption and associated costs. This fosters accountability and enables accurate chargebacks to tenants.

In this post, we guide you through deploying a comprehensive solution in your Amazon Web Services (AWS) environment to analyze Amazon EMR on EC2 cluster usage. By using this solution, you'll gain a deep understanding of the resource consumption and associated costs of individual applications running in your EMR cluster. This will help you optimize costs, implement fair billing practices, and make informed decisions about workload management, ultimately enhancing the overall efficiency and cost-effectiveness of your Amazon EMR environment. This solution has only been tested on Spark workloads running on EMR on EC2 that use YARN as the resource manager. It hasn't been tested on workloads from other frameworks that run on YARN, such as Hive or Tez.

Solution overview

The solution works by running a Python script on the EMR cluster's primary node to collect metrics from the YARN resource manager and correlate them with cost usage details from the AWS Cost and Usage Reports (AWS CUR). The script, activated by a cron job, makes HTTP requests to the YARN resource manager to collect two types of metrics: cluster metrics from the path /ws/v1/cluster/metrics and application metrics from the path /ws/v1/cluster/apps. The cluster metrics contain utilization information of cluster resources, and the application metrics contain utilization information of an application or job. These metrics are stored in an Amazon Simple Storage Service (Amazon S3) bucket.
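
As a rough illustration of the collection step, the following Python sketch requests the two paths named above and unwraps the JSON responses. It is not the solution's own script: the base URL (localhost on port 8088, the usual default for the resource manager web service) and the helper names are assumptions for this example.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the script runs on the primary node and the ResourceManager
# web service listens on its default port 8088.
RM_BASE_URL = "http://localhost:8088"
CLUSTER_METRICS_PATH = "/ws/v1/cluster/metrics"
APPS_PATH = "/ws/v1/cluster/apps"


def fetch_json(path, base_url=RM_BASE_URL, timeout=10):
    """GET a YARN REST endpoint, asking explicitly for a JSON response."""
    request = Request(base_url + path, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_cluster_metrics(payload):
    """Return the dict nested under the 'clusterMetrics' key."""
    return payload.get("clusterMetrics", {})


def extract_apps(payload):
    """Return the list of application records.

    YARN nests them as {"apps": {"app": [...]}} and reports "apps": null
    when there are no applications, so both levels are guarded.
    """
    apps = payload.get("apps") or {}
    return apps.get("app", [])


if __name__ == "__main__":
    # Only works where a ResourceManager is reachable at RM_BASE_URL.
    metrics = extract_cluster_metrics(fetch_json(CLUSTER_METRICS_PATH))
    applications = extract_apps(fetch_json(APPS_PATH))
    print(len(applications), "applications;", metrics.get("totalMB"), "MB total")
```

Keeping the parsing helpers separate from the HTTP call makes them easy to exercise with saved responses before scheduling anything on a cluster.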

There are two YARN metrics that capture the resource utilization information of an application or job.

  • memorySeconds – The memory (in MB) allocated to an application times the number of seconds the application ran
  • vcoreSeconds – The number of YARN vcores allocated to an application times the number of seconds the application ran

The solution uses memorySeconds to derive the cost of running the application or job. It can be modified to use vcoreSeconds instead if necessary.
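
To make the two units concrete, here is a small worked example in Python. The job is hypothetical, and it assumes a constant allocation for the whole run; on a real cluster YARN accumulates these values as containers come and go.

```python
def resource_seconds(allocated, elapsed_seconds):
    """Resource-seconds for a constant allocation held for elapsed_seconds."""
    return allocated * elapsed_seconds


# Hypothetical job: 4 containers, each with 2,048 MB and 1 vcore,
# all held for 300 seconds.
containers = 4
elapsed = 300

memory_seconds = resource_seconds(containers * 2048, elapsed)
vcore_seconds = resource_seconds(containers * 1, elapsed)

print(memory_seconds)  # 2457600 (MB-seconds)
print(vcore_seconds)   # 1200 (vcore-seconds)
```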

The metadata of the YARN metrics collected in Amazon S3 is created, stored, and represented as a database and tables in AWS Glue Data Catalog, which is in turn available to Amazon Athena for further processing. You can then write SQL queries in Athena to correlate the YARN metrics with the cost usage information from AWS CUR to derive the detailed cost breakdown of your EMR cluster by infrastructure and application. This solution creates two corresponding Athena views of the respective cost breakdown that become the data source to Amazon QuickSight for visualization.

The following diagram shows the solution architecture.

[Diagram: EMR Cluster Usage Utility Solution Architecture]

Prerequisites

To implement the solution, you need the following prerequisites:

  1. Confirm that a CUR is created in your AWS account. It needs an S3 bucket to store the report files. Follow the steps described in Creating Cost and Usage Reports to create the CUR on the AWS Management Console. When creating the report, make sure that the following settings are enabled:
    • Include resource IDs
    • Time granularity is set to hourly
    • Report data integration to Athena

It can take up to 24 hours for AWS to start delivering reports to your S3 bucket. Thereafter, your CUR gets updated at least one time a day.

  2. The solution needs Athena to run queries against the data from the CUR using standard SQL. To automate and streamline the integration of Athena with CUR, AWS provides an AWS CloudFormation template, crawler-cfn.yml, which is automatically generated in the same S3 bucket during CUR creation. Follow the instructions in Setting up Athena using AWS CloudFormation templates to integrate Athena with the CUR. This template creates an AWS Glue database that references the CUR, an AWS Lambda event, and an AWS Glue crawler that gets invoked by S3 event notification to update the AWS Glue database whenever the CUR gets updated.
  3. Make sure to activate the AWS generated cost allocation tag, aws:elasticmapreduce:job-flow-id. This enables the field, resource_tags_aws_elasticmapreduce_job_flow_id, in the CUR to be populated with the EMR cluster ID and is used by the SQL queries in the solution. To activate the cost allocation tag from the management console, follow these steps:
    • Sign in to the payer account's AWS Management Console and open the AWS Billing and Cost Management console
    • In the navigation pane, choose Cost Allocation Tags
    • Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag
    • Choose Activate. It can take up to 24 hours for tags to activate.
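
The relationship between the tag key and the CUR field named in step 3 can be sketched as a simple name normalization. The function below is a hypothetical helper that reproduces the mapping for this one AWS generated tag; it is an assumption about the naming pattern, not the documented CUR rule, and other tags (for example, user-defined ones) may be named differently.

```python
import re


def cur_tag_column(tag_key):
    """Approximate the Athena column name for an AWS generated CUR resource tag.

    Assumed pattern: lowercase the key, collapse every run of characters
    other than a-z and 0-9 into one underscore, and prefix 'resource_tags_'.
    """
    normalized = re.sub(r"[^0-9a-z]+", "_", tag_key.lower()).strip("_")
    return "resource_tags_" + normalized


print(cur_tag_column("aws:elasticmapreduce:job-flow-id"))
# resource_tags_aws_elasticmapreduce_job_flow_id
```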

The following screenshot shows an example of the aws:elasticmapreduce:job-flow-id tag being activated.

[Screenshot: CostAllocationTag]

You can now try out this solution on an EMR cluster in a lab environment. If you're not already familiar with EMR, follow the detailed instructions provided in Tutorial: Getting started with Amazon EMR to launch a new EMR cluster and run a sample Spark job.

Deploying the solution

To deploy the solution, follow the steps in the next sections.

Installing scripts to the EMR cluster

Download two scripts from the GitHub repository and save them into an S3 bucket:

  • emr_usage_report.py – Python script that makes the HTTP requests to the YARN Resource Manager
  • emr_install_report.sh – Bash script that creates a cron job to run the Python script every minute

To install the scripts, add a step to the EMR cluster through the console or the AWS Command Line Interface (AWS CLI) using the aws emr add-steps command.

Replace:

  • REGION with the AWS Region where the cluster is running (for example, Europe (Ireland) eu-west-1)
  • MY-BUCKET with the name of the bucket where the scripts are stored (for example, my.artifact.bucket)
  • MY_REPORT_BUCKET with the bucket name where you want to collect YARN metrics (for example, my.report.bucket)

aws emr add-steps \
--cluster-id j-XXXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name="Install YARN reporter",Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://MY-BUCKET/emr-install_reporter.sh,s3://MY-BUCKET/emr_usage_reporter.py,MY_REPORT_BUCKET]
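
If you prefer to submit the step from Python, the same request can be sketched with boto3's add_job_flow_steps. This is only an illustration: it assumes the two scripts sit at the root of the MY-BUCKET bucket under the file names used in the command above, that boto3 and AWS credentials are available, and that CONTINUE is an acceptable failure action. The function names are hypothetical.

```python
def build_install_step(region, script_bucket, report_bucket):
    """Build the step definition that installs the YARN reporter."""
    return {
        "Name": "Install YARN reporter",
        # Assumption: keep the cluster running if the install step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [
                # Assumed object keys: both scripts at the bucket root.
                f"s3://{script_bucket}/emr-install_reporter.sh",
                f"s3://{script_bucket}/emr_usage_reporter.py",
                report_bucket,
            ],
        },
    }


def add_install_step(cluster_id, step):
    """Submit the step to a running cluster (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])


step = build_install_step("eu-west-1", "my.artifact.bucket", "my.report.bucket")
print(step["HadoopJarStep"]["Args"])
```

Calling add_install_step("j-XXXXXXXXXXXXX", step) would then submit the step to the cluster with that ID.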

You can now run some Spark jobs in your EMR cluster to start generating application usage metrics.

Launching the CloudFormation stack

When the prerequisites are met and you have the scripts deployed so that your EMR clusters are sending YARN metrics to an S3 bucket, the rest of the solution can be deployed using CloudFormation.

Before launching the stack, upload a copy of this QuickSight definition file into an S3 bucket required by the CloudFormation template to build the initial analysis in QuickSight. When ready, proceed to launch your stack to provision the remaining resources of the solution.

  1. Select

This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. Make sure you create the stack in your intended Region.

The CloudFormation stack requires a few parameters, as shown in the following screenshot.

[Screenshot: CloudFormationStack]

The following table describes the parameters.

| Parameter | Description |
| --- | --- |
| Stack name | A meaningful name for the stack; for example, EMRUsageReport |
| S3 configuration | |
| YARNS3BucketName | Name of S3 bucket where YARN metrics are stored |
| Cost Usage Report configuration | |
| CURDatabaseName | Name of Cost Usage Report database in AWS Glue |
| CURTableName | Name of Cost Usage Report table in AWS Glue |
| AWS Glue Database configuration | |
| EMRUsageDBName | Name of AWS Glue database to be created for the EMR Cost Usage Report |
| EMRInfraTableName | Name of AWS Glue table to be created for infrastructure usage metrics |
| EMRAppTableName | Name of AWS Glue table to be created for application usage metrics |
| QuickSight configuration | |
| QSUserName | Name of QuickSight user in default namespace to manage the EMR Usage Report resources in QuickSight |
| QSDefinitionsFile | S3 URI of the definition JSON file for the EMR Usage Report |

  2. Enter the parameter values from the preceding table.
  3. Choose Next.
  4. On the next screen, enter any necessary tags, an AWS Identity and Access Management (IAM) role, stack failure, or advanced options if necessary. Otherwise, you can leave them as default.
  5. Choose Next.
  6. Review the details on the final screen and select the check boxes confirming AWS CloudFormation might create IAM resources with custom names or require CAPABILITY_AUTO_EXPAND.
    [Screenshot: CloudFormationCheckbox]
  7. Choose Create.
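
For scripted deployments, the console steps above roughly correspond to a boto3 create_stack call. The sketch below is an assumption-laden illustration rather than the post's method: the template URL is not given in this post and must be supplied by you, every parameter value shown is a placeholder, and the two capabilities are inferred from the check boxes described in the last step.

```python
def build_stack_parameters(values):
    """Convert {name: value} into the list-of-dicts shape CloudFormation expects."""
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


# Parameter names come from the table above; all values are placeholders.
parameters = build_stack_parameters({
    "YARNS3BucketName": "my.report.bucket",
    "CURDatabaseName": "my_cur_database",
    "CURTableName": "my_cur_table",
    "EMRUsageDBName": "emr_usage_report",
    "EMRInfraTableName": "emr_infra_usage",
    "EMRAppTableName": "emr_app_usage",
    "QSUserName": "my-quicksight-user",
    "QSDefinitionsFile": "s3://my.artifact.bucket/definitions.json",
})


def create_stack(stack_name, template_url, stack_parameters):
    """Create the stack (needs boto3, credentials, and a template URL you supply)."""
    import boto3  # imported lazily so the builder works without boto3

    cloudformation = boto3.client("cloudformation")
    return cloudformation.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=stack_parameters,
        # Assumed equivalents of the console acknowledgement check boxes.
        Capabilities=["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
    )


print(len(parameters), "parameters prepared")
```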

The stack takes a few minutes to create the remaining resources for the solution. After the CloudFormation stack is created, you can find the details of the resources created on the Outputs tab.

Reviewing the correlation results

The CloudFormation template creates two Athena views containing the correlated cost breakdown details of the YARN cluster and application metrics with the CUR. The CUR aggregates cost hourly, so the cost of running an application is prorated based on the hourly running cost of the EMR cluster.

The following screenshot shows the Athena view for the correlated cost breakdown details of YARN cluster metrics.

[Screenshot: CorrelationResults]

The following table describes the fields in the Athena view for YARN cluster metrics.

| Field | Type | Description |
| --- | --- | --- |
| cluster_id | string | ID of the cluster. |
| family | string | Resource type of the cluster. Possible values are compute instance, elastic map reduce instance, storage and data transfer. |
| billing_start | timestamp | Start billing hour of the resource. |
| usage_type | string | A specific type or unit of the resource, such as BoxUsage:m5.xlarge of compute instance. |
| cost | string | Cost associated with the resource. |

The following screenshot shows the Athena view for the correlated cost breakdown details of YARN application metrics.

[Screenshot: CostBreakdownYARNAppMetrics]

The following table describes the fields in the Athena view for YARN application metrics.

| Field | Type | Description |
| --- | --- | --- |
| cluster_id | string | ID of the cluster |
| id | string | Unique identifier of the application run |
| user | string | User name |
| name | string | Name of the application |
| queue | string | Queue name from YARN resource manager |
| finalstatus | string | Final status of application |
| applicationtype | string | Type of the application |
| startedtime | timestamp | Start time of the application |
| finishedtime | timestamp | End time of the application |
| elapsed_sec | double | Time taken to run the application |
| memoryseconds | bigint | The memory (in MB) allocated to an application times the number of seconds the application ran |
| vcoreseconds | int | The number of YARN vcores allocated to an application times the number of seconds the application ran |
| total_memory_mb_avg | double | Total amount of memory (in MB) available to the cluster in the hour |
| memory_sec_cost | double | Derived unit cost of memoryseconds |
| application_cost | double | Derived cost associated with the application based on memoryseconds |
| total_cost | double | Total cost of resources associated with the cluster for the hour |
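
One plausible reading of how the derived fields in the table above relate to each other is sketched below in Python. This is an assumption for illustration, not the SQL used by the Athena views: it treats the hour's memory capacity as total_memory_mb_avg times 3,600 seconds, spreads total_cost evenly over that capacity, and ignores applications that span more than one billing hour.

```python
SECONDS_PER_HOUR = 3600


def memory_sec_cost(total_cost, total_memory_mb_avg):
    """Unit cost of one MB-second, spreading the hour's cost over its capacity."""
    capacity_mb_seconds = total_memory_mb_avg * SECONDS_PER_HOUR
    return total_cost / capacity_mb_seconds


def application_cost(memoryseconds, total_cost, total_memory_mb_avg):
    """Prorated cost of one application within a single billing hour."""
    return memoryseconds * memory_sec_cost(total_cost, total_memory_mb_avg)


# Hypothetical hour: the cluster cost 2.00 and averaged 65,536 MB of memory.
# An application that consumed a quarter of the hour's MB-seconds
# (58,982,400) is attributed a quarter of the cost.
cost = application_cost(58_982_400, total_cost=2.00, total_memory_mb_avg=65_536)
print(round(cost, 4))  # 0.5
```

Under this reading, memory that no application used during the hour is left unattributed, so the application costs for an hour add up to less than total_cost on an underutilized cluster.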

Building your own visualization

In QuickSight, the CloudFormation template creates two datasets that reference Athena views as data sources and a sample analysis. The sample analysis has two sheets, EMR Infra Spend and EMR App Spend. They have a prepopulated bar chart and pivot tables to demonstrate how you can use the datasets to build your own visualization to present the cost breakdown details of your EMR clusters.

The EMR Infra Spend sheet references the YARN cluster metrics dataset. There is a filter for date range selection and a filter for cluster ID selection. The sample bar chart shows the consolidated cost breakdown of the resources for each cluster during the period. The pivot table breaks them down further to show their daily expenditure.

The following screenshot shows the EMR Infra Spend sheet from the sample analysis created by the CloudFormation template.

The EMR App Spend sheet references the YARN application metrics dataset. There is a filter for date range selection and a filter for cluster ID selection. The pivot table in this sheet shows how you can use the fields in the dataset to present the cost breakdown details of the cluster by user: the applications that were run, whether they completed successfully, the time and duration of each run, and the derived cost of the run.

The following screenshot shows the EMR App Spend sheet from the sample analysis created by the CloudFormation template.

Cleanup

If you no longer need the resources you created during this walkthrough, delete them to prevent incurring additional charges. To clean up your resources, complete the following steps:

  1. On the CloudFormation console, delete the stack that you created using the template
  2. Terminate the EMR cluster
  3. Empty or delete the S3 bucket used for YARN metrics

Conclusion

In this post, we discussed how to implement a comprehensive cluster usage reporting solution that provides granular visibility into the resource consumption and associated costs of individual applications running in your Amazon EMR on EC2 cluster. By using the power of Athena and QuickSight to correlate YARN metrics with cost usage details from your Cost and Usage Report, this solution empowers organizations to make informed decisions. With these insights, you can optimize resource allocation, implement fair and transparent billing models based on actual application usage, and ultimately achieve greater cost-efficiency in your EMR environments. This solution will help you unlock the full potential of your EMR cluster, driving continuous improvement in your data processing and analytics workflows while maximizing return on investment.


About the authors

Boon Lee Eu is a Senior Technical Account Manager at Amazon Web Services (AWS). He works closely and proactively with Enterprise Support customers to provide advocacy and strategic technical guidance to help plan and achieve operational excellence in their AWS environment based on best practices. Based in Singapore, Boon Lee has over 20 years of experience in the IT & Telecom industries.

Kyara Labrador is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services (AWS) Philippines, specializing in big data and analytics. She helps customers design and implement scalable, secure, and cost-effective data solutions, as well as migrate and modernize their big data and analytics workloads to AWS. She is passionate about empowering organizations to unlock the full potential of their data.

Vikas Omer is the Head of Data & AI Solution Architecture for ASEAN at Amazon Web Services (AWS). With over 15 years of experience in the data and AI space, he is a seasoned leader who leverages his expertise to drive innovation and expansion in the region. Vikas is passionate about helping customers and partners succeed in their digital transformation journeys, focusing on cloud-based solutions and emerging technologies.

Lorenzo Ripani is a Big Data Solution Architect at AWS. He is passionate about distributed systems, open source technologies and security. He spends most of his time working with customers around the world to design, evaluate and optimize scalable and secure data pipelines with Amazon EMR.
