The rapid adoption of software as a service (SaaS) solutions has led to data silos across various platforms, presenting challenges in consolidating insights from diverse sources. Effective data analytics relies on seamlessly integrating data from disparate systems by identifying, gathering, cleansing, and combining relevant data into a unified format. AWS Glue, a serverless data integration service, has simplified this process by offering scalable, efficient, and cost-effective solutions for integrating data from various sources. With AWS Glue, you can streamline data integration, reduce data silos and complexities, and gain agility in managing data pipelines, ultimately unlocking the true potential of your data assets for analytics, data-driven decision-making, and innovation.
This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.
Introducing the Salesforce connector for AWS Glue
To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables users to quickly preview and transfer their customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. With the AWS Glue Salesforce connector, you can ingest and transform your CRM data to any of the AWS Glue supported destinations, including Amazon Simple Storage Service (Amazon S3), in your preferred format, including Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake; data warehouses such as Amazon Redshift and Snowflake; and many more. Reverse ETL use cases are also supported, allowing you to write data back to Salesforce.
The following are key benefits of the Salesforce connector for AWS Glue:
- You can use AWS Glue native capabilities
- It is well tested with AWS Glue capabilities and is production ready for any data integration workload
- It works seamlessly on top of AWS Glue and Apache Spark in a distributed fashion for efficient data processing
Solution overview
For our use case, we want to retrieve the full load of a Salesforce account object in a data lake on Amazon S3 and capture the incremental changes. This solution also allows you to update certain fields of the account object in the data lake and push them back to Salesforce. To achieve this, you create two ETL jobs in AWS Glue with the Salesforce connector, and create a transactional data lake on Amazon S3 using Apache Iceberg.
In the first job, you configure AWS Glue to ingest the account object from Salesforce and save it into a transactional data lake on Amazon S3 in Apache Iceberg format. Then you update the account object data that the first job extracted into the transactional data lake on Amazon S3. Finally, you run the second job to send that change back to Salesforce.
Prerequisites
Complete the following prerequisite steps:
- Create an S3 bucket to store the results.
- Sign up for a Salesforce account, if you don't already have one.
- Create an AWS Identity and Access Management (IAM) role for the AWS Glue ETL job to use. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager. For this post, we name the role AWSGlueServiceRole-SalesforceConnectorJob. Use the following policies:
  - AWS managed policies:
  - Inline policy:
- Create the AWS Glue connection for Salesforce:
  - The Salesforce connector supports two OAuth2 grant types: JWT_BEARER and AUTHORIZATION_CODE. For this post, we use the AUTHORIZATION_CODE grant type.
  - On the Secrets Manager console, create a new secret. Add two keys, ACCESS_TOKEN and REFRESH_TOKEN, and keep their values blank. These will be populated after you enter your Salesforce credentials. (If you prefer to script this step, a sample Boto3 call follows this list.)
  - Configure the Salesforce connection in AWS Glue. Use AWSGlueServiceRole-SalesforceConnectorJob while creating the Salesforce connection. For this post, we name the connection Salesforce_Connection.
  - In the Authorization section, choose Authorization Code and the secret you created in the previous step.
  - Provide your Salesforce credentials when prompted. The ACCESS_TOKEN and REFRESH_TOKEN keys will be populated after you enter your Salesforce credentials.
- Create an AWS Glue database. For this post, we name it glue_etl_salesforce_db.
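If you prefer to script the secret creation, the following is a minimal sketch using the AWS SDK for Python (Boto3); the secret name and Region shown here are placeholders, not values from this post, so adjust them to your environment:

```python
import json

import boto3

# Create the Secrets Manager secret that the Salesforce connection will use.
# Both token keys start blank; AWS Glue populates them after you authorize
# the connection with your Salesforce credentials.
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")  # placeholder Region

secrets_client.create_secret(
    Name="glue-salesforce-connection-secret",  # placeholder secret name
    SecretString=json.dumps({"ACCESS_TOKEN": "", "REFRESH_TOKEN": ""}),
)
```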
Create an ETL job to ingest the account object from Salesforce
Complete the following steps to create a new ETL job in AWS Glue Studio to transfer data from Salesforce to Amazon S3:
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job Salesforce_to_S3_Account_Ingestion.
- On the Script tab, enter the Salesforce_to_S3_Account_Ingestion script.
Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script, as shown in the following code example:
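Here is a minimal sketch of such an ingestion script (not the exact script from this post): it reads the Account object through the Salesforce connection and then creates or merges into the Iceberg table. The Salesforce option names (ENTITY_NAME, API_VERSION), the merge key (Id), and the table location are assumptions to verify against the Salesforce connection options for your AWS Glue version.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "db_name", "s3_bucket_name", "table_name"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Account object from Salesforce. The connectionName value must match
# the AWS Glue connection created earlier (Salesforce_Connection).
salesforce_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "connectionName": "Salesforce_Connection",
        "ENTITY_NAME": "Account",   # Salesforce object to ingest (assumed option name)
        "API_VERSION": "v60.0",     # assumed API version; adjust to your org
    },
    transformation_ctx="salesforce_dyf",
)

account_df = salesforce_dyf.toDF()
account_df.createOrReplaceTempView("account_source")
iceberg_table = f"glue_catalog.{args['db_name']}.{args['table_name']}"


def table_exists(full_name: str) -> bool:
    # Existence check through the configured Iceberg catalog (simple sketch).
    try:
        spark.table(full_name)
        return True
    except Exception:
        return False


if table_exists(iceberg_table):
    # Upsert: merge incoming records into the existing table on the Salesforce Id.
    spark.sql(f"""
        MERGE INTO {iceberg_table} AS target
        USING account_source AS source
        ON target.Id = source.Id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
else:
    # First run: create the Iceberg table and load all records.
    # The LOCATION value is an assumed layout under your S3 bucket.
    spark.sql(f"""
        CREATE TABLE {iceberg_table}
        USING iceberg
        LOCATION 's3://{args['s3_bucket_name']}/{args['table_name']}/'
        AS SELECT * FROM account_source
    """)

job.commit()
```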
The script fetches records from the Salesforce account object. Then it checks if the account table exists in the transactional data lake. If the table doesn't exist, it creates a new table and inserts the records. If the table exists, it performs an upsert operation.
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Under Advanced properties, for Additional network connection, choose the Salesforce connection.
- Set up the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --s3_bucket_name: your S3 bucket
  - --table_name: account
- Save the job and run it.
Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete. After a successful job run, a new table called account is created and populated with Salesforce account information.
- You can use Amazon Athena to query the data:
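For example, a simple preview query against the new table might look like the following (the database and table names match the job parameters used earlier):

```sql
-- Preview a few of the ingested Salesforce account records.
SELECT *
FROM "glue_etl_salesforce_db"."account"
LIMIT 10;
```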
Validate transactional capabilities
You can validate the transactional capabilities supported by Apache Iceberg. For testing, try three operations: insert, update, and delete:
- Create a new account object in Salesforce, rerun the AWS Glue job, then run the query in Athena to validate that the new account is created.
- Delete an account in Salesforce, rerun the AWS Glue job, and validate the deletion using Athena.
- Update an account in Salesforce, rerun the AWS Glue job, and validate the update operation using Athena.
Create an ETL job to send updates back to Salesforce
AWS Glue also allows you to write data back to Salesforce. Complete the following steps to create an ETL job in AWS Glue to get updates from the transactional data lake and write them to Salesforce. In this scenario, you update an account record and push it back to Salesforce.
- On the AWS Glue console, create a new job (with the Script editor option). For this post, we name the job S3_to_Salesforce_Account_Writeback.
- On the Script tab, enter the S3_to_Salesforce_Account_Writeback script.
Make sure that the name you used to create the Salesforce connection is passed as the connectionName parameter value in the script:
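Here is a minimal sketch of such a write-back script (not the exact script from this post): it selects the changed rows from the Iceberg table and writes them to the Salesforce Account object through the same connection. The write options (ENTITY_NAME, API_VERSION, WRITE_OPERATION), the selected columns, and the filter are assumptions to verify against the Salesforce connection options for your AWS Glue version.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "db_name", "table_name"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Select the updated account rows from the Iceberg table. Salesforce needs the
# record Id plus the fields you want to change; the filter here is illustrative.
updates_df = spark.sql(f"""
    SELECT Id, UpsellOpportunity__c
    FROM glue_catalog.{args['db_name']}.{args['table_name']}
    WHERE UpsellOpportunity__c = 'Yes'
""")

updates_dyf = DynamicFrame.fromDF(updates_df, glueContext, "updates_dyf")

# Write the changes back to the Salesforce Account object through the same
# connection. The connectionName must match the connection created earlier.
glueContext.write_dynamic_frame.from_options(
    frame=updates_dyf,
    connection_type="salesforce",
    connection_options={
        "connectionName": "Salesforce_Connection",
        "ENTITY_NAME": "Account",        # assumed option name
        "API_VERSION": "v60.0",          # assumed API version
        "WRITE_OPERATION": "UPDATE",     # assumed option name for update semantics
    },
)

job.commit()
```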
- On the Job details tab, for IAM role, choose AWSGlueServiceRole-SalesforceConnectorJob.
- Configure the job parameters:
  - --conf: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse
  - --datalake-formats: iceberg
  - --db_name: glue_etl_salesforce_db
  - --table_name: account
- Run the update query in Athena to change the value of UpsellOpportunity__c for a Salesforce account to "Yes" (a sample query follows this list).
- Run the S3_to_Salesforce_Account_Writeback AWS Glue job.
Depending on the size of the data in your account object in Salesforce, the job will take a few minutes to complete.
- Validate the object in Salesforce. The value of UpsellOpportunity should change.
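As an illustration, the Athena update against the Iceberg table might look like the following; the WHERE predicate is a placeholder, and column name casing may differ in your catalog:

```sql
-- Example only: set the upsell opportunity flag to "Yes" for one account.
-- The WHERE predicate is a placeholder; target the record you want to change.
UPDATE "glue_etl_salesforce_db"."account"
SET upsellopportunity__c = 'Yes'
WHERE name = 'Example Account';
```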
You have now successfully validated the Salesforce connector.
Considerations
You can set up AWS Glue job triggers to run the ETL jobs on a schedule, so that the data is regularly synchronized between Salesforce and Amazon S3. You can also integrate the ETL jobs with other AWS services, such as AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Lambda, or Amazon EventBridge, to create a more advanced data processing pipeline.
By default, the Salesforce connector doesn't import deleted records from Salesforce objects. However, you can set the IMPORT_DELETED_RECORDS option to "true" to import all records, including the deleted ones. Refer to Salesforce connection options for the different Salesforce connection options.
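For example, continuing the ingestion sketch shown earlier, the option could be passed alongside the other Salesforce read options (the option names remain assumptions to verify for your AWS Glue version):

```python
# Sketch: include soft-deleted Salesforce records in the read.
# glueContext is the GlueContext created in the ingestion script above.
salesforce_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="salesforce",
    connection_options={
        "connectionName": "Salesforce_Connection",
        "ENTITY_NAME": "Account",
        "API_VERSION": "v60.0",
        "IMPORT_DELETED_RECORDS": "true",  # default behavior skips deleted records
    },
    transformation_ctx="salesforce_dyf",
)
```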
Clean up
To avoid incurring charges, clean up the resources used in this post from your AWS account, including the AWS Glue jobs, Salesforce connection, Secrets Manager secret, IAM role, and S3 bucket.
Conclusion
The AWS Glue connector for Salesforce simplifies the analytics pipeline, reduces time to insights, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means there is no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this Salesforce connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.
To learn more about the AWS Glue connector for Salesforce, refer to Connecting to Salesforce in AWS Glue Studio. In this user guide, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.
About the authors
Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and an Amazon MWAA and AWS Glue ETL expert. He's on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!
Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He's passionate about helping customers with cloud adoption, migration, and strategy.