Big Data

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

12 December 2024

One-time and complex queries are two common scenarios in enterprise data analytics. One-time queries are flexible and suitable for quick analysis and exploratory research. Complex queries, on the other hand, refer to large-scale data processing and in-depth analysis based on petabyte-level data warehouses in massive data scenarios. These complex queries typically involve data sources from multiple business systems, requiring multilevel nested SQL or associations with numerous tables for highly sophisticated analytical tasks.

However, combining the data lineage of these two query types presents several challenges:

Diversity of data sources
Varying query complexity
Inconsistent granularity in lineage tracking
Different real-time requirements
Difficulties in cross-system integration

Moreover, maintaining the accuracy and completeness of lineage information while providing system performance and scalability are crucial considerations. Addressing these challenges requires a carefully designed architecture and advanced technical solutions.

Amazon Athena offers serverless, flexible SQL analytics for one-time queries, enabling direct querying of Amazon Simple Storage Service (Amazon S3) data for rapid, cost-effective instant analysis. Amazon Redshift, optimized for complex queries, provides high-performance columnar storage and massively parallel processing (MPP) architecture, supporting large-scale data processing and advanced SQL capabilities. Amazon Neptune, as a graph database, is ideal for data lineage analysis, offering efficient relationship traversal and complex graph algorithms to handle large-scale, intricate data lineage relationships. The combination of these three services provides a powerful, comprehensive solution for end-to-end data lineage analysis.

In the context of comprehensive data governance, Amazon DataZone offers organization-wide data lineage visualization using Amazon Web Services (AWS) services, while dbt provides project-level lineage through model analysis and supports cross-project integration between data lakes and warehouses.

In this post, we use dbt for data modeling on both Amazon Athena and Amazon Redshift. dbt on Athena supports real-time queries, while dbt on Amazon Redshift handles complex queries, unifying the development language and significantly reducing the technical learning curve. Using a single dbt modeling language not only simplifies the development process but also automatically generates consistent data lineage information. This approach offers robust adaptability, easily accommodating changes in data structures.

By integrating the Amazon Neptune graph database to store and analyze complex lineage relationships, combined with AWS Step Functions and AWS Lambda functions, we achieve a fully automated data lineage generation process. This combination promotes consistency and completeness of lineage data while enhancing the efficiency and scalability of the entire process. The result is a powerful and flexible solution for end-to-end data lineage analysis.

Architecture overview

The experiment’s context involves a customer already using Amazon Athena for one-time queries. To better accommodate massive data processing and complex query scenarios, they aim to adopt a unified data modeling language across different data platforms. This led to the implementation of both Athena on dbt and Amazon Redshift on dbt architectures.

An AWS Glue crawler crawls data lake information from Amazon S3, generating a Data Catalog to support dbt on Amazon Athena data modeling. For complex query scenarios, AWS Glue performs extract, transform, and load (ETL) processing, loading data into the petabyte-scale data warehouse, Amazon Redshift. Here, data modeling uses dbt on Amazon Redshift.

The original lineage data files from both parts are loaded into an S3 bucket, providing data support for end-to-end data lineage analysis.

The following image is the architecture diagram for the solution.

Some important considerations:

This experiment uses the following data dictionary:

Source table	Tool	Target table
`imdb.name_basics`	DBT/Athena	`stg_imdb__name_basics`
`imdb.title_akas`	DBT/Athena	`stg_imdb__title_akas`
`imdb.title_basics`	DBT/Athena	`stg_imdb__title_basics`
`imdb.title_crew`	DBT/Athena	`stg_imdb__title_crews`
`imdb.title_episode`	DBT/Athena	`stg_imdb__title_episodes`
`imdb.title_principals`	DBT/Athena	`stg_imdb__title_principals`
`imdb.title_ratings`	DBT/Athena	`stg_imdb__title_ratings`
`stg_imdb__name_basics`	DBT/Redshift	`new_stg_imdb__name_basics`
`stg_imdb__title_akas`	DBT/Redshift	`new_stg_imdb__title_akas`
`stg_imdb__title_basics`	DBT/Redshift	`new_stg_imdb__title_basics`
`stg_imdb__title_crews`	DBT/Redshift	`new_stg_imdb__title_crews`
`stg_imdb__title_episodes`	DBT/Redshift	`new_stg_imdb__title_episodes`
`stg_imdb__title_principals`	DBT/Redshift	`new_stg_imdb__title_principals`
`stg_imdb__title_ratings`	DBT/Redshift	`new_stg_imdb__title_ratings`
`new_stg_imdb__name_basics`	DBT/Redshift	`int_primary_profession_flattened_from_name_basics`
`new_stg_imdb__name_basics`	DBT/Redshift	`int_known_for_titles_flattened_from_name_basics`
`new_stg_imdb__name_basics`	DBT/Redshift	`names`
`new_stg_imdb__title_akas`	DBT/Redshift	`titles`
`new_stg_imdb__title_basics`	DBT/Redshift	`int_genres_flattened_from_title_basics`
`new_stg_imdb__title_basics`	DBT/Redshift	`titles`
`new_stg_imdb__title_crews`	DBT/Redshift	`int_directors_flattened_from_title_crews`
`new_stg_imdb__title_crews`	DBT/Redshift	`int_writers_flattened_from_title_crews`
`new_stg_imdb__title_episodes`	DBT/Redshift	`titles`
`new_stg_imdb__title_principals`	DBT/Redshift	`titles`
`new_stg_imdb__title_ratings`	DBT/Redshift	`titles`
`int_known_for_titles_flattened_from_name_basics`	DBT/Redshift	`titles`
`int_primary_profession_flattened_from_name_basics`	DBT/Redshift
`int_directors_flattened_from_title_crews`	DBT/Redshift	`names`
`int_genres_flattened_from_title_basics`	DBT/Redshift	`genre_titles`
`int_writers_flattened_from_title_crews`	DBT/Redshift	`names`
`genre_titles`	DBT/Redshift
`names`	DBT/Redshift
`titles`	DBT/Redshift

The lineage data generated by dbt on Athena includes partial lineage diagrams, as exemplified in the following images. The first image shows the lineage of name_basics in dbt on Athena. The second image shows the lineage of title_crew in dbt on Athena.

The lineage data generated by dbt on Amazon Redshift includes partial lineage diagrams, as illustrated in the following image.

Referring to the data dictionary and screenshots, it’s evident that the complete data lineage information is highly dispersed, spread across 29 lineage diagrams. Understanding the end-to-end comprehensive view requires significant time. In real-world environments, the situation is often more complex, with complete data lineage potentially distributed across hundreds of files. Consequently, integrating a complete end-to-end data lineage diagram becomes both crucial and challenging.

This experiment provides a detailed introduction to processing and merging data lineage files stored in Amazon S3, as illustrated in the following diagram.

Prerequisites

To perform the solution, you need to have the following prerequisites in place:

The Lambda function for preprocessing lineage files must have permissions to access Amazon S3 and Amazon Redshift.
The Lambda function for constructing the directed acyclic graph (DAG) must have permissions to access Amazon S3 and Amazon Neptune.

Solution walkthrough

To perform the solution, follow the steps in the next sections.

Preprocess raw lineage data for DAG generation using Lambda functions

Use Lambda to preprocess the raw lineage data generated by dbt, converting it into key-value pair JSON files that are easily understood by Neptune: athena_dbt_lineage_map.json and redshift_dbt_lineage_map.json.
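
To make the target format concrete, here is a minimal local sketch of the conversion, with no S3 access. It assumes the manifest exposes a `child_map` keyed by dot-separated dbt unique IDs; the project name `dbt_demo` and the two sample entries are hypothetical. Unlike the Lambda reference code in this post, this sketch appends children when two unique IDs shorten to the same name instead of overwriting them.

```python
import json


def dbt_nodename_format(node_name):
    """Keep only the last dot-separated segment of a dbt unique ID."""
    return node_name.split(".")[-1]


def build_lineage_map(manifest):
    """Convert a manifest's child_map into {"lineage_map": {parent: [children]}}."""
    lineage_map = {}
    for parent, children in manifest["child_map"].items():
        short_children = [dbt_nodename_format(child) for child in children]
        # setdefault + extend keeps the children of both entries if two
        # unique IDs happen to share the same short name (an assumption
        # of this sketch, not behavior of the reference code).
        lineage_map.setdefault(dbt_nodename_format(parent), []).extend(short_children)
    return {"lineage_map": lineage_map}


# Hypothetical excerpt of a manifest; real files contain many more keys.
sample_manifest = {
    "child_map": {
        "source.dbt_demo.imdb.name_basics": ["model.dbt_demo.stg_imdb__name_basics"],
        "model.dbt_demo.stg_imdb__name_basics": [],
    }
}

print(json.dumps(build_lineage_map(sample_manifest), indent=2))
```

Each key in the result is a table or model name, and each value lists the nodes that read from it, which is the parent-to-children shape the later Neptune loading step iterates over.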

To create a new Lambda function in the Lambda console, enter a Function name, select the Runtime (Python in this example), configure the Architecture and Execution role, then choose the “Create function” button.

Open the created Lambda function and on the Configuration tab, in the navigation pane, select Environment variables and choose your configurations. Using Athena on dbt processing as an example, configure the environment variables as follows (the process for Amazon Redshift on dbt is similar):
- INPUT_BUCKET: data-lineage-analysis-24-09-22 (replace with the S3 bucket path storing the original Athena on dbt lineage files)
- INPUT_KEY: athena_manifest.json (the original Athena on dbt lineage file)
- OUTPUT_BUCKET: data-lineage-analysis-24-09-22 (replace with the S3 bucket path for storing the preprocessed output of Athena on dbt lineage files)
- OUTPUT_KEY: athena_dbt_lineage_map.json (the output file after preprocessing the original Athena on dbt lineage file)
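
A missing variable would otherwise surface as a bare `KeyError` inside the handler. The following is a small sketch of an up-front check; the helper name `load_config` is hypothetical and not part of the reference code.

```python
import os

REQUIRED_VARS = ("INPUT_BUCKET", "INPUT_KEY", "OUTPUT_BUCKET", "OUTPUT_KEY")


def load_config(environ=os.environ):
    """Return the four settings as a dict, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real process environment.
config = load_config({
    "INPUT_BUCKET": "data-lineage-analysis-24-09-22",
    "INPUT_KEY": "athena_manifest.json",
    "OUTPUT_BUCKET": "data-lineage-analysis-24-09-22",
    "OUTPUT_KEY": "athena_dbt_lineage_map.json",
})
print(config["OUTPUT_KEY"])
```

Passing the mapping as a parameter keeps the check easy to exercise locally; inside Lambda, calling `load_config()` with no arguments reads the configured environment variables.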

On the Code tab, in the lambda_function.py file, enter the preprocessing code for the raw lineage data. Here’s a code reference using Athena on dbt processing as an example (the process for Amazon Redshift on dbt is similar). The preprocessing code for Athena on dbt’s original lineage file is as follows:

The athena_manifest.json, redshift_manifest.json, and other files used in this experiment can be obtained from the Data Lineage Graph Construction GitHub repository.

import json
import boto3
import os

def lambda_handler(event, context):
    # Set up S3 client
    s3 = boto3.client('s3')

    # Get input and output paths from environment variables
    input_bucket = os.environ['INPUT_BUCKET']
    input_key = os.environ['INPUT_KEY']
    output_bucket = os.environ['OUTPUT_BUCKET']
    output_key = os.environ['OUTPUT_KEY']

    # Define helper function: keep only the last segment of a dbt node ID
    def dbt_nodename_format(node_name):
        return node_name.split(".")[-1]

    # Read input JSON file from S3
    response = s3.get_object(Bucket=input_bucket, Key=input_key)
    file_content = response['Body'].read().decode('utf-8')
    data = json.loads(file_content)
    lineage_map = data["child_map"]
    node_dict = {}
    dbt_lineage_map = {}

    # Process data: shorten every child name and remember each short parent name
    for item in lineage_map:
        lineage_map[item] = [dbt_nodename_format(child) for child in lineage_map[item]]
        node_dict[item] = dbt_nodename_format(item)

    # Update key names
    lineage_map = {node_dict[old]: value for old, value in lineage_map.items()}
    dbt_lineage_map["lineage_map"] = lineage_map

    # Convert result to JSON string
    result_json = json.dumps(dbt_lineage_map)

    # Write JSON string to S3
    s3.put_object(Body=result_json, Bucket=output_bucket, Key=output_key)
    print(f"Data written to s3://{output_bucket}/{output_key}")

    return {
        'statusCode': 200,
        'body': json.dumps('Athena data lineage processing completed successfully')
    }

Merge preprocessed lineage data and write to Neptune using Lambda functions

Before processing data with the Lambda function, create a Lambda layer by uploading the required Gremlin plugin. For detailed steps on creating and configuring Lambda layers, see the AWS Lambda Layers documentation.

Because connecting Lambda to Neptune for constructing a DAG requires the Gremlin plugin, it needs to be uploaded before using Lambda. The Gremlin package can be obtained from the Data Lineage Graph Construction GitHub repository.

Create a new Lambda function. Choose the function to configure. To add the recently created layer, at the bottom of the page, choose Add a layer.

Create another Lambda layer for the requests library, similar to how you created the layer for the Gremlin plugin. This library will be used for HTTP client functionality in the Lambda function.

Choose the recently created Lambda function to configure. Connect to Neptune through Lambda to merge the two datasets and construct a DAG. On the Code tab, the reference code to execute is as follows:

import json
import boto3
import os
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import get_credentials
from botocore.session import Session
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_s3_file(s3_client, bucket, key):
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        data = json.loads(response['Body'].read().decode('utf-8'))
        return data.get("lineage_map", {})
    except Exception as e:
        print(f"Error reading S3 file {bucket}/{key}: {str(e)}")
        raise

def merge_data(athena_data, redshift_data):
    return {**athena_data, **redshift_data}

def sign_request(request):
    credentials = get_credentials(Session())
    auth = SigV4Auth(credentials, 'neptune-db', os.environ['AWS_REGION'])
    auth.add_auth(request)
    return dict(request.headers)

def send_request(url, headers, data):
    try:
        response = requests.post(url, headers=headers, data=data, timeout=30)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request Error: {str(e)}")
        if hasattr(e.response, 'text'):
            print(f"Response content: {e.response.text}")
        raise

def write_to_neptune(data):
    endpoint = "https://<your-neptune-endpoint-name>:8182/gremlin"
    # Replace with your Neptune endpoint name

    # Clear Neptune database
    clear_query = "g.V().drop()"
    request = AWSRequest(method='POST', url=endpoint, data=json.dumps({'gremlin': clear_query}))
    signed_headers = sign_request(request)
    response = send_request(endpoint, signed_headers, json.dumps({'gremlin': clear_query}))
    print(f"Clear database response: {response}")

    # Verify if the database is empty
    verify_query = "g.V().count()"
    request = AWSRequest(method='POST', url=endpoint, data=json.dumps({'gremlin': verify_query}))
    signed_headers = sign_request(request)
    response = send_request(endpoint, signed_headers, json.dumps({'gremlin': verify_query}))
    print(f"Vertex count after clearing: {response}")

    def process_node(node, children):
        # Add node (create it only if it does not exist yet)
        query = f"g.V().has('lineage_node', 'node_name', '{node}').fold().coalesce(unfold(), addV('lineage_node').property('node_name', '{node}'))"
        request = AWSRequest(method='POST', url=endpoint, data=json.dumps({'gremlin': query}))
        signed_headers = sign_request(request)
        response = send_request(endpoint, signed_headers, json.dumps({'gremlin': query}))
        print(f"Add node response for {node}: {response}")

        for child_node in children:
            # Add child node (create it only if it does not exist yet)
            query = f"g.V().has('lineage_node', 'node_name', '{child_node}').fold().coalesce(unfold(), addV('lineage_node').property('node_name', '{child_node}'))"
            request = AWSRequest(method='POST', url=endpoint, data=json.dumps({'gremlin': query}))
            signed_headers = sign_request(request)
            response = send_request(endpoint, signed_headers, json.dumps({'gremlin': query}))
            print(f"Add child node response for {child_node}: {response}")

            # Add edge from node to child node unless it already exists
            query = f"g.V().has('lineage_node', 'node_name', '{node}').as('a').V().has('lineage_node', 'node_name', '{child_node}').coalesce(inE('lineage_edge').where(outV().as('a')), addE('lineage_edge').from('a').property('edge_name', ' '))"
            request = AWSRequest(method='POST', url=endpoint, data=json.dumps({'gremlin': query}))
            signed_headers = sign_request(request)
            response = send_request(endpoint, signed_headers, json.dumps({'gremlin': query}))
            print(f"Add edge response for {node} -> {child_node}: {response}")

    # Process parent nodes concurrently; errors are logged per node
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(process_node, node, children) for node, children in data.items()]
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Error in processing node: {str(e)}")

def lambda_handler(event, context):
    # Initialize S3 client
    s3_client = boto3.client('s3')

    # S3 bucket and file paths
    bucket_name = "data-lineage-analysis"  # Replace with your S3 bucket name
    athena_key = 'athena_dbt_lineage_map.json'  # Replace with the name of your Athena lineage key-value output JSON file
    redshift_key = 'redshift_dbt_lineage_map.json'  # Replace with the name of your Redshift lineage key-value output JSON file

    try:
        # Read Athena lineage data
        athena_data = read_s3_file(s3_client, bucket_name, athena_key)
        print(f"Athena data size: {len(athena_data)}")

        # Read Redshift lineage data
        redshift_data = read_s3_file(s3_client, bucket_name, redshift_key)
        print(f"Redshift data size: {len(redshift_data)}")

        # Merge data
        combined_data = merge_data(athena_data, redshift_data)
        print(f"Combined data size: {len(combined_data)}")

        # Write to Neptune (including clearing the database)
        write_to_neptune(combined_data)

        return {
            'statusCode': 200,
            'body': json.dumps('Data successfully written to Neptune')
        }
    except Exception as e:
        print(f"Error in lambda_handler: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps(f'Error: {str(e)}')
        }
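
The reference code merges the two maps with dictionary unpacking, so when a node name appears as a key in both files, the Redshift entry replaces the Athena entry. The following sketch shows one possible alternative that unions the child lists instead, plus a check that the merged map is still acyclic before it is loaded. The function names are hypothetical, the sample maps are abbreviated from the data dictionary, and `graphlib` assumes a Python 3.9 or later runtime.

```python
from graphlib import CycleError, TopologicalSorter


def merge_lineage_maps(*maps):
    """Union the child lists of every map, preserving first-seen order."""
    merged = {}
    for lineage_map in maps:
        for parent, children in lineage_map.items():
            existing = merged.setdefault(parent, [])
            for child in children:
                if child not in existing:
                    existing.append(child)
    return merged


def is_acyclic(lineage_map):
    """Return True when the parent -> children map contains no cycle."""
    # TopologicalSorter expects node -> predecessors; passing children
    # instead reverses the edges, which does not affect cycle detection.
    sorter = TopologicalSorter(lineage_map)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


athena = {"name_basics": ["stg_imdb__name_basics"], "stg_imdb__name_basics": []}
redshift = {
    "stg_imdb__name_basics": ["new_stg_imdb__name_basics"],
    "new_stg_imdb__name_basics": ["names"],
}

merged = merge_lineage_maps(athena, redshift)
print(merged["stg_imdb__name_basics"])
print(is_acyclic(merged))
```

With the sample maps, the staging table keeps its Redshift child either way; the union only matters when both files list children for the same node.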

Create Step Functions workflow

On the Step Functions console, choose State machines, and then choose Create state machine. On the Choose a template page, select Blank template.

In the Blank template, choose Code to define your state machine. Use the following example code, replacing each FunctionName value with the name of your own Lambda function (the Athena lineage preprocessing function, the Redshift lineage preprocessing function, and the function that loads data into Neptune):

{
  "Comment": "Daily Data Lineage Processing Workflow",
  "StartAt": "Parallel Processing",
  "States": {
    "Parallel Processing": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Process Athena Data",
          "States": {
            "Process Athena Data": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {
                "FunctionName": "athena-data-lineange-process-Lambda",
                "Payload": {
                  "input.$": "$"
                }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Process Redshift Data",
          "States": {
            "Process Redshift Data": {
              "Type": "Task",
              "Resource": "arn:aws:states:::lambda:invoke",
              "Parameters": {
                "FunctionName": "redshift-data-lineange-process-Lambda",
                "Payload": {
                  "input.$": "$"
                }
              },
              "End": true
            }
          }
        }
      ],
      "Next": "Load Data to Neptune"
    },
    "Load Data to Neptune": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "data-lineage-analysis-lambda"
      },
      "End": true
    }
  }
}

After completing the configuration, choose the Design tab to view the workflow shown in the following diagram.
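
Because a state machine definition has to be plain JSON, one way to substitute your own function names is to generate the definition from a short script. The following is a sketch under that assumption; the helper names `lambda_task` and `build_definition` are hypothetical, and the three function names passed in are the placeholders from this post.

```python
import json


def lambda_task(function_name, next_state=None, pass_input=True):
    """Build one Task state that invokes a Lambda function."""
    parameters = {"FunctionName": function_name}
    if pass_input:
        parameters["Payload"] = {"input.$": "$"}
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": parameters,
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_definition(athena_fn, redshift_fn, neptune_fn):
    """Two parallel preprocessing branches followed by the Neptune load."""
    def branch(name, function_name):
        return {"StartAt": name, "States": {name: lambda_task(function_name)}}

    return {
        "Comment": "Daily Data Lineage Processing Workflow",
        "StartAt": "Parallel Processing",
        "States": {
            "Parallel Processing": {
                "Type": "Parallel",
                "Branches": [
                    branch("Process Athena Data", athena_fn),
                    branch("Process Redshift Data", redshift_fn),
                ],
                "Next": "Load Data to Neptune",
            },
            "Load Data to Neptune": lambda_task(neptune_fn, pass_input=False),
        },
    }


definition = build_definition(
    "athena-data-lineange-process-Lambda",
    "redshift-data-lineange-process-Lambda",
    "data-lineage-analysis-lambda",
)
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Code view. Generating it this way also guarantees that the state named in "Next" exists under "States".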

Create scheduling rules with Amazon EventBridge

Configure Amazon EventBridge to generate lineage data daily during off-peak business hours. To do this:

Create a new rule in the EventBridge console with a descriptive name.
Set the rule type to “Schedule” and configure it to run once daily (using either a fixed rate or the Cron expression “0 0 * * ? *”).
Select the AWS Step Functions state machine as the target and specify the state machine you created earlier.
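
The same three console steps can be sketched with the AWS SDK for Python. This is an illustration rather than part of the original walkthrough: the function name, the target ID, and the IAM role (which EventBridge would assume to start executions) are assumptions, and the client is passed in so the logic can be exercised without AWS access.

```python
def schedule_daily_run(events_client, rule_name, state_machine_arn, role_arn):
    """Create or update a daily schedule rule and point it at the state machine."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 0 * * ? *)",  # 00:00 UTC every day
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[
            {
                "Id": "lineage-state-machine",  # hypothetical target ID
                "Arn": state_machine_arn,
                "RoleArn": role_arn,  # assumed role allowed to start executions
            }
        ],
    )


# Usage inside an AWS environment (all names below are placeholders):
#   import boto3
#   schedule_daily_run(
#       boto3.client("events"),
#       "daily-data-lineage",
#       "<your-state-machine-arn>",
#       "<your-eventbridge-role-arn>",
#   )
```

EventBridge evaluates cron expressions in UTC, so adjust the hour field if your off-peak window is defined in local time.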

Query results in Neptune

On the Neptune console, select Notebooks. Open an existing notebook or create a new one.

In the notebook, create a new code cell to perform a query. The following code example shows the query statement and its results:

%%gremlin -d node_name -de edge_name
g.V().hasLabel('lineage_node').outE('lineage_edge').inV().hasLabel('lineage_node').path().by(elementMap())

You can now see the end-to-end data lineage graph information for both dbt on Athena and dbt on Amazon Redshift. The following image shows the merged DAG data lineage graph in Neptune.

You can query the generated data lineage graph for data related to a specific table, such as title_crew.

The sample query statement and its results are shown in the following code example:

%%gremlin -d node_name -de edge_name
g.V().has('lineage_node', 'node_name', 'title_crew')
  .repeat(
    union(
      __.inE('lineage_edge').outV(),
      __.outE('lineage_edge').inV()
    )
  )
  .until(
    __.has('node_name', within('names', 'genre_titles', 'titles'))
    .or()
    .loops().is(gt(10))
  )
  .path()
  .by(elementMap())

The following image shows the filtered results based on the title_crew table in Neptune.
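
To sanity-check what such a traversal should return, the same kind of walk can be sketched locally on a parent-to-children lineage map. This sketch is downstream-only, whereas the Gremlin query above follows edges in both directions; the function name is hypothetical and the sample map is abbreviated from the data dictionary.

```python
from collections import deque


def downstream_paths(lineage_map, start, max_depth=10):
    """Return every path from start to a node with no unvisited children."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        # Skip children already on this path so a cycle cannot loop forever.
        next_nodes = [c for c in lineage_map.get(path[-1], []) if c not in path]
        if not next_nodes or len(path) > max_depth:
            paths.append(path)
            continue
        for child in next_nodes:
            queue.append(path + [child])
    return paths


sample = {
    "title_crew": ["stg_imdb__title_crews"],
    "stg_imdb__title_crews": ["new_stg_imdb__title_crews"],
    "new_stg_imdb__title_crews": [
        "int_directors_flattened_from_title_crews",
        "int_writers_flattened_from_title_crews",
    ],
    "int_directors_flattened_from_title_crews": ["names"],
    "int_writers_flattened_from_title_crews": ["names"],
    "names": [],
}

for path in downstream_paths(sample, "title_crew"):
    print(" -> ".join(path))
```

For this sample, both paths start at title_crew and end at names, one through the directors model and one through the writers model.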

Clean up

To clean up your resources, complete the following steps:

Delete EventBridge rules

# Stop new events from triggering while removing dependencies
aws events disable-rule --name 
# Break connections between rule and targets (like Lambda functions)
aws events remove-targets --rule  --ids 
# Remove the rule completely from EventBridge
aws events delete-rule --name

Delete Step Functions state machine

# Stop all running executions
aws stepfunctions stop-execution --execution-arn 
# Delete the state machine
aws stepfunctions delete-state-machine --state-machine-arn

Delete Lambda functions

# Delete Lambda function
aws lambda delete-function --function-name 
# Delete Lambda layers (if used)
aws lambda delete-layer-version --layer-name  --version-number

Clean up the Neptune database

# Delete all snapshots
aws neptune delete-db-cluster-snapshot --db-cluster-snapshot-identifier 
# Delete database instance
aws neptune delete-db-instance --db-instance-identifier  --skip-final-snapshot
# Delete database cluster
aws neptune delete-db-cluster --db-cluster-identifier  --skip-final-snapshot

Follow the instructions at Deleting a single object to clean up the S3 buckets.

Conclusion

In this post, we demonstrated how dbt enables unified data modeling across Amazon Athena and Amazon Redshift, integrating data lineage from both one-time and complex queries. By using Amazon Neptune, this solution provides comprehensive end-to-end lineage analysis. The architecture uses AWS serverless computing and managed services, including Step Functions, Lambda, and EventBridge, providing a highly flexible and scalable design.

This approach significantly lowers the learning curve through a unified data modeling method while enhancing development efficiency. The end-to-end data lineage graph visualization and analysis not only strengthen data governance capabilities but also offer deep insights for decision-making.

The solution’s flexible and scalable architecture effectively optimizes operational costs and improves business responsiveness. This comprehensive approach balances technical innovation, data governance, operational efficiency, and cost-effectiveness, thus supporting long-term business growth with the adaptability to meet evolving enterprise needs.

With OpenLineage-compatible data lineage now generally available in Amazon DataZone, we plan to explore integration possibilities to further enhance the system’s capability to handle complex data lineage analysis scenarios.

If you have any questions, please feel free to leave a comment in the comments section.

About the authors

Nancy Wu is a Solutions Architect at AWS, responsible for cloud computing architecture consulting and design for multinational enterprise customers. Nancy has many years of experience in big data, enterprise digital transformation research and development, consulting, and project management across telecommunications, entertainment, and financial industries.

Xu Feng is a Senior Industry Solution Architect at AWS, responsible for designing, building, and promoting industry solutions for the Media & Entertainment and Advertising sectors, such as intelligent customer service and business intelligence. With 20 years of software industry experience, Xu Feng is currently focused on researching and implementing generative AI and AI-powered data solutions.

Xu Da is an Amazon Web Services (AWS) Partner Solutions Architect based out of Shanghai, China. He has more than 25 years of experience in the IT industry, software development, and solution architecture. He’s passionate about collaborative learning, knowledge sharing, and guiding the community in their cloud technologies journey.