Big Data

RAG-to-SQL on Google Cloud

19 October 2024

Textual content-to-SQL applied sciences ceaselessly wrestle to seize the entire context and which means of a person’s request, leading to queries that don’t precisely match the meant. Whereas builders work exhausting to boost these methods, it’s price questioning if there’s a higher methodology.

Enter RAG-to-SQL—a brand new method that mixes pure language understanding with highly effective knowledge retrieval to generate correct SQL queries. By mixing the perfect of pure language processing and knowledge retrieval, RAG-to-SQL gives a extra dependable approach to flip on a regular basis language into significant insights out of your database.

On this article, we’ll discover how RAG-to-SQL can rework the best way we work together with databases, particularly utilizing Google Cloud companies similar to BigQuery and Vertex AI.

Studying Aims

Establish the constraints of Textual content-to-SQL methods in precisely capturing person intent.
Perceive some great benefits of RAG-to-SQL as a brand new paradigm for producing extra dependable SQL queries.
Implement the RAG-to-SQL method utilizing Google Cloud companies like BigQuery and Vertex AI.
Learn to combine and make the most of a number of Google Cloud instruments for RAG-to-SQL implementation.

This text was revealed as part of the Information Science Blogathon.

Limitations of Conventional Textual content-to-SQL Approaches

The primary concept behind textual content to sql fashions of LLM was to allow individuals who have no idea about SQL to work together with database and acquire info utilizing pure language as an alternative. The prevailing textual content 2 sql framework depends primarily in LLM information to have the ability to convert pure language question to sql question. This will result in incorrect or invalid formulation of SQL queries. That is the place the brand new method RAG to SQL involves our rescue which is defined in subsequent part.

What’s RAG-to-SQL ?

With the intention to overcome the shortcomings of Textual content to SQL we will use the progressive method of RAG to SQL. The combination of area details about the database is the key challenge that every text-to-SQL software program faces. The RAG2SQL structure addresses this issue by including contextual knowledge (metadata, DDL, queries, and extra). This knowledge is then “skilled” and made obtainable for utilization.
Moreover, the “retriever” evaluates and forwards probably the most related context to reply to the Consumer Question. The top result’s significantly improved precision.

Setting Up RAG-to-SQL with Google Cloud: A Step-by-Step Information

Comply with an in depth information to implement RAG-to-SQL utilizing Google Cloud companies similar to BigQuery and Vertex AI.

Pre-requisites for Code

With the intention to comply with and run this code you will want to setup your GCP (google cloud account with Fee info). Initially they supply free 300$ trial for 90 days so no prices shall be incurred. Element for account setup : Hyperlink

Code Flowchart

Under is the code flowchart which describes at a better stage the varied blocks of code. We are able to refer it to comply with alongside as we proceed.

The code implementation will be divided into 3 main blocks :

SQL Question Chain: This chain is liable for producing applicable sql question based mostly on person query and related schema of desk fetched from Vector DB.
Interpret Chain: This chain takes the SQL question from the earlier chain, runs it in BigQuery, after which makes use of the outcomes to generate a response with an applicable immediate.
Agent Chain: That is the ultimate chain which encapsulates the above two chains. At any time when a person query is available in it’ll determine whether or not to name sql question instrument or reply the query instantly. It routes person queries to varied instruments based mostly on the duty required to reply the query.

Step 1: Putting in the Required Libraries

In colab pocket book we have now to put in the beneath libraries required for this implementation.

! pip set up langchain==0.0.340 --quiet
! pip set up chromadb==0.4.13 --quiet
! pip set up google-cloud-bigquery[pandas] --quiet
! pip set up google-cloud-aiplatform --quiet#import csv

Step 2: Configuring Your Google Cloud Undertaking and Credentials

Now we have now to declare some variables to initialise our GCP venture and Huge Question Datasets . Utilizing this variables we will entry the tables in Huge Question withing GCP in our pocket book.

You possibly can view this particulars in your GCP cloud console. In BigQuery you may create a dataset and inside dataset you may add or add a desk for particulars see Create Dataset and Create Desk.

VERTEX_PROJECT = "Your GCP Undertaking ID" # @param{kind: "string"}
VERTEX_REGION = "us-central1" # @param{kind: "string"}

BIGQUERY_DATASET = "Huge Question Dataset Title" # @param{kind: "string"}
BIGQUERY_PROJECT = "Vertex Undertaking ID" # @param{kind: "string"}

Now authenticate and login to your GCP Vertex AI out of your pocket book utilizing beneath code in colab.

from google.colab import auth
auth.authenticate_user()

import vertexai
vertexai.init(venture=VERTEX_PROJECT, location=VERTEX_REGION)

Step 3: Constructing a Vector Database for Desk Schema Storage

Now we have now to create a vector db which can include schema of varied tables current in our dataset and we are going to create a retriever on high of this vector db in order that we will incorporate RAG in our workflow.

Connecting to Huge Question utilizing BQ consumer in python and fetching schema of tables.

from google.cloud import bigquery
import json

#Fetching Schemas of Tables

bq_client = bigquery.Shopper(venture=VERTEX_PROJECT)
bq_tables = bq_client.list_tables(dataset=f"{BIGQUERY_PROJECT}.{BIGQUERY_DATASET}")
schemas = []
for bq_table in bq_tables:
   t = bq_client.get_table(f"{BIGQUERY_PROJECT}.{BIGQUERY_DATASET}.{bq_table.table_id}")
   schema_fields = [f.to_api_repr() for f in t.schema]
   schema = f"The schema for desk {bq_table.table_id} is the next: n```{json.dumps(schema_fields, indent=1)}```"
   schemas.append(schema)

print(f"Discovered {len(schemas)} tables in dataset {BIGQUERY_PROJECT}:{BIGQUERY_DATASET}")#import csv

Storing the schemas in Vector Db similar to Chroma DB. We have to create a folder referred to as “knowledge”

from langchain.embeddings import VertexAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = VertexAIEmbeddings()
attempt: # Keep away from duplicated paperwork
  vector_store.delete_collection()
besides:
  print("No want to scrub the vector retailer")
vector_store = Chroma.from_texts(schemas, embedding=embeddings,persist_directory='./knowledge')
n_docs = len(vector_store.get()['ids'])
retriever = vector_store.as_retriever(search_kwargs={'okay': 2})
print(f"The vector retailer has {n_docs} paperwork")

Step 4: Instantiating LLM Fashions for SQL Question, Interpretation, and Agent Chains

We’ll instantiate the three LLM fashions for the three totally different chains.

First mannequin is Question Mannequin which is liable for producing SQL question based mostly on person query and desk schema retrieved from vector db much like person query. For this we’re utilizing “codechat-bison” mannequin . This mannequin makes a speciality of producing code in numerous coding languages and therefore, is suitable for our use case.

Different 2 fashions are default LLM fashions in ChatVertexAI which is “gemini-1.5-flash-001” that is the most recent gemini mannequin optimized for chat and fast response.

from langchain.chat_models import ChatVertexAI
from langchain.llms import VertexAI

query_model = ChatVertexAI(model_name="codechat-bison", max_output_tokens=1000)
interpret_data_model = ChatVertexAI(max_output_tokens=1000)
agent_model = ChatVertexAI(max_output_tokens=1024)

Step 5: Developing the SQL Question Chain

Under is the SQL immediate used to generate the SQL question for the enter person query.

SQL_PROMPT = """You're a SQL and BigQuery professional.

Your job is to create a question for BigQuery in SQL.

The next paragraph accommodates the schema of the desk used for a question. It's encoded in JSON format.

{context}

Create a BigQuery SQL question for the next person enter, utilizing the above desk.
And Use solely columns talked about in schema for the SQL question

The person and the agent have accomplished this dialog up to now:
{chat_history}

Comply with these restrictions strictly:
- Solely return the SQL code.
- Don't add backticks or any markup. Solely write the question as output. NOTHING ELSE.
- In FROM, all the time use the total desk path, utilizing `{venture}` as venture and `{dataset}` as dataset.
- At all times rework nation names to full uppercase. As an illustration, if the nation is Japan, you need to use JAPAN within the question.

Consumer enter: {query}

SQL question:
"""

Now we are going to outline a operate which can retrieve related paperwork i.e schemas for the person query enter.

from langchain.schema.vectorstore import VectorStoreRetriever
def get_documents(retriever: VectorStoreRetriever, query: str) -> str:
  # Return solely the primary doc
  output = ""
  for d in retriever.get_relevant_documents(query):
    output += d.page_content
    output += "n"
    return output

Then we outline the LLM chain utilizing Langchain expression language syntax. Be aware we outline immediate with 5 placeholder variables and later we outline a partial immediate by filling within the 2 placeholder variables venture and dataset. The remainder of the variables will get populated with incoming request dictionary consisting of enter, chat historical past and the context variable is populated kind the operate we outlined above get_documents.

from operator import itemgetter
from langchain.prompts import PromptTemplate
from langchain.schema import StrOutputParser

prompt_template = PromptTemplate(
    input_variables=["context", "chat_history", "question", "project", "dataset"],
    template=SQL_PROMPT)

partial_prompt = prompt_template.partial(venture=BIGQUERY_PROJECT,
                                         dataset=BIGQUERY_DATASET)

# Enter shall be like {"enter": "SOME_QUESTION", "chat_history": "HISTORY"}
docs = {"context": lambda x: get_documents(retriever, x['input'])}
query = {"query": itemgetter("enter")}
chat_history = {"chat_history": itemgetter("chat_history")}
query_chain = docs | query | chat_history | partial_prompt | query_model
question = query_chain | StrOutputParser()

Allow us to take a look at our chain utilizing CallBack Handler of Langchain which can present every steps of chain execution intimately.

from langchain.callbacks.tracers import ConsoleCallbackHandler
# Instance
x = {"enter": "Highest period of journey the place begin station was from Atlantic Ave & Fort Greene Pl ", "chat_history": ""}
print(question.invoke(x, config={'callbacks': [ConsoleCallbackHandler()]}))

Final SQL query output — Remaining SQL question output

Step 6: Refining the SQL Chain Output for Interpretation

We have to refine the above sql chain output so that it’s going to embrace different variables too in its outp which shall be then handed on to second chain – interpret chain.

from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.schema.runnable import RunnableLambda

#Refine the Chain output to incorporate different variables in output in dictionary format
def _dict_to_json(x: dict) -> str:
  return "```n" + json.dumps(x) + "n```"

query_response_schema = [
    ResponseSchema(name="query", description="SQL query to solve the user question."),
    ResponseSchema(name="question", description="Question asked by the user."),
    ResponseSchema(name="context", description="Documents retrieved from the vector store.")
  ]
query_output_parser = StructuredOutputParser.from_response_schemas(query_response_schema)
query_output_json = docs | query | {"question": question} | RunnableLambda(_dict_to_json) | StrOutputParser()
query_output = query_output_json | query_output_parser

Lets attempt to execute this chain.

# Instance
x = {"enter": "Give me high 2 begin stations the place journey period was highest?", "chat_history": ""}
output = query_output.invoke(x)  # Output is now a dictionary, enter for the subsequent chain

Above we will see the output of the refined chain is an sql question.

Step 7: Constructing the Interpret Chain for Question Outcomes

Now we have now to construct the subsequent chain which can take output of SQL question chain outlined above. This chain will take the sql question from earlier chain and run it in Huge Question and its outcomes are then used to generate a response utilizing applicable immediate.

INTERPRET_PROMPT = """You're a BigQuery professional. You're additionally professional in extracting knowledge from CSV.

The next paragraph describes the schema of the desk used for a question. It's encoded in JSON format.

{context}

A person requested this query:
{query}

To search out the reply, the next SQL question was run in BigQuery:
```
{question}
```

The results of that question was the next desk in CSV format:
```
{consequence}
```

Primarily based on these outcomes, present a quick reply to the person query.

Comply with these restrictions strictly:
- Don't add any clarification about how the reply is obtained, simply write the reply.
- Extract any worth associated to the reply solely from the results of the question. Don't use every other knowledge supply.
- Simply write the reply, omit the query out of your reply, this can be a chat, simply present the reply.
- If you happen to can't discover the reply within the consequence, don't make up any knowledge, simply say "I can't discover the reply"
"""

from google.cloud import bigquery
def get_bq_csv(bq_client: bigquery.Shopper, question: str) -> str:
  cleaned_query = clean_query(question)
  df = bq_client.question(cleaned_query, location="US").to_dataframe()
  return df.to_csv(index=False)


def clean_query(question: str):
  question = question.substitute("```sql","")
  cleaned_query = question.substitute("```","")

  return cleaned_query

We’ll outline two operate one is clean_query – this can clear the sql question of apostrophes and different pointless characters and different is get_bq_csv – this can run the cleaned sql question in Huge Question and get the output desk in csv format.

# Get the output of the earlier chain


question = {"question": itemgetter("question")}
context = {"context": itemgetter("context")}
query = {"query": itemgetter("query")}
#cleaned_query = {"consequence": lambda x: clean_query(x["query"])}
query_result = {"consequence": lambda x: get_bq_csv(bq_client, x["query"])}

immediate = PromptTemplate(
    input_variables=["question", "query", "result", "context"],
    template=INTERPRET_PROMPT)

run_bq_chain = context | query | question | query_result | immediate
run_bq_result = run_bq_chain | interpret_data_model | StrOutputParser()

Let’s execute the chain and take a look at it.

# Instance
x = {"enter": "Give me high 2 begin stations the place journey period was highest?", "chat_history": ""}
final_response = run_bq_result.invoke(query_output.invoke(x))
print(final_response)

Step 8: Implementing the Agent Chain for Dynamic Question Routing

Now we are going to construct the ultimate chain which is the Agent Chain . When a person asks a query, it decides whether or not to utilise the SQL question instrument or to reply it instantly. Principally, it sends person queries to varied instruments in accordance on the work that should be accomplished in an effort to reply the person’s inquiry.

We outline an agent_memory, agent immediate, instrument funtion.

from langchain.reminiscence import ConversationBufferWindowMemory

agent_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    okay=10,
    return_messages=True)

AGENT_PROMPT = """You're a very highly effective assistant that may reply questions utilizing BigQuery.

You possibly can invoke the instrument user_question_tool to reply questions utilizing BigQuery.

At all times use the instruments to attempt to reply the questions. Use the chat historical past for context. By no means attempt to use every other exterior info.

Assume that the person could write with misspellings, repair the spelling of the person earlier than passing the query to any instrument.

Do not point out what instrument you might have utilized in your reply.
"""

from langchain.instruments import instrument
from langchain.callbacks.tracers import ConsoleCallbackHandler

@instrument
def user_question_tool(query) -> str:
  """Helpful to reply pure language questions from customers utilizing BigQuery."""
  config={'callbacks': [ConsoleCallbackHandler()]}
  config = {}
  reminiscence = agent_memory.buffer_as_str.strip()
  query = {"enter": query, "chat_history": reminiscence}
  question = query_output.invoke(query, config=config)
  print("nn******************nn")
  print(question['query'])
  print("nn******************nn")
  consequence = run_bq_result.invoke(question, config=config)
  return consequence.strip()

We now deliver collectively all the principle parts of agent and initialize the agent.

from langchain.brokers import AgentType, initialize_agent, AgentExecutor

agent_kwgards = {"system_message": AGENT_PROMPT}
agent_tools = [user_question_tool]

agent_memory.clear()

agent = initialize_agent(
    instruments=agent_tools,
    llm=agent_model,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    reminiscence=agent_memory,
    agent_kwgards=agent_kwgards,
    max_iterations=5,
    early_stopping_method='generate',
    verbose=True)

Lets run the agent now.

q = "Give me high 2 begin stations the place journey period was highest?"
agent.invoke(q)

Comply with-up query to agent.

q = "What's the capability for  every of those station identify?"
agent.invoke(q)

Observations

The agent was precisely capable of course of the complicated query and in addition generate right solutions for comply with -up query based mostly on chat historical past after which it utilised one other desk to get capability info of citi bikes.

Conclusion

The RAG-to-SQL method represents a major development in addressing the constraints of conventional Textual content-to-SQL fashions by incorporating contextual knowledge and leveraging retrieval strategies. This system enhances question accuracy by retrieving related schema info from vector databases, permitting for extra exact SQL era. Implementing RAG-to-SQL inside Google Cloud companies like BigQuery and Vertex AI demonstrates its scalability and effectiveness in real-world purposes. By automating the decision-making course of in question dealing with, RAG-to-SQL opens new prospects for non-technical customers to work together seamlessly with databases whereas sustaining excessive precision.

Key Takeaways

Overcomes Textual content-to-SQL Limitations addresses the frequent pitfalls of conventional Textual content-to-SQL fashions by integrating metadata.
The agent-based system effectively decides learn how to course of person queries, enhancing usability.
RAG-to-SQL permits non-technical customers to generate complicated SQL queries with pure language inputs.
The method is efficiently carried out utilizing companies like BigQuery and Vertex AI.

Continuously Requested Questions

Q1. Is GCP Vertex AI entry free?

A. No, however you may get a trial interval of 90 days with 300$ credit in case you register first time and also you solely want to offer a card particulars for getting entry. No prices are deducted from card and even in case you use any companies which is consuming past 300$ credit then Google will ask you to allow cost account so as to use the service. So there isn’t any computerized deduction of quantity.

Q2. What’s the key advantage of utilizing Rag to SQL?

A. This enables us to automate the desk schema which is to be fed to the LLM if we’re utilizing a number of tables we don’t must feed all desk schemas directly . Primarily based on person question the related desk schema will be fetched from the RAG. Thus, rising effectivity over standard Textual content to SQL methods.

Q3. How can brokers be helpful for this use case?

A. If we’re constructing a holistic chatbot it would require lot of different instruments other than SQL question instrument . So we will leverage the agent and supply it with a number of instruments similar to net search , database sql question instrument, different rag instruments or operate calling api instruments. This may allow to deal with various kinds of person queries based mostly on the duty that must be completed to reply to the person question.

The media proven on this article just isn’t owned by Analytics Vidhya and is used on the Creator’s discretion.

I’m knowledgeable working as knowledge scientist after ending my MBA in Enterprise Analytics and Finance. A eager learner who likes to discover and perceive and simplify stuff! I’m at the moment studying about superior ML and NLP strategies and studying up on numerous subjects associated to it together with analysis papers .