
Evaluating Text-to-SQL LLMs: Key Metrics and Their Significance

Mr. Data Bugger

Text-to-SQL LLMs bridge the gap between natural language queries and structured databases by translating user queries into SQL statements. To ensure accuracy, we need robust evaluation metrics at each stage. This blog explores the key metrics used to assess keyword extraction, retrieval, LLM response quality, and final SQL query generation.


1. Keyword Extraction Metrics


Extracting the right keywords ensures the retrieval step fetches the most relevant database metadata.


Out-of-Vocabulary (OOV)

Definition: Measures the percentage of extracted keywords that are not present in the predefined vocabulary. Example: If the query is "Find top 10 employees by salary" and the extracted keywords are {"Find", "top", "salary"}, but "Find" is not in the vocabulary, the OOV rate is 33%.
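
A minimal Python sketch of the computation, assuming the vocabulary is simply a set of known schema terms and entities (the vocabulary contents below are hypothetical):

```python
# Hypothetical vocabulary of schema terms and entities known to the system.
vocabulary = {"top", "employees", "salary", "department"}

def oov_rate(keywords: set[str]) -> float:
    """Fraction of extracted keywords not found in the vocabulary."""
    if not keywords:
        return 0.0
    missing = {kw for kw in keywords if kw.lower() not in vocabulary}
    return len(missing) / len(keywords)

print(oov_rate({"Find", "top", "salary"}))  # ~0.33 -> "Find" is OOV
```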


Average Edit Distance

Definition: Computes the average Levenshtein distance (number of edits required to match words) between extracted and ground-truth keywords. Example: If "emploees" is extracted instead of "employees", the edit distance is 1.
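
A quick sketch using the standard dynamic-programming Levenshtein algorithm:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

pairs = [("emploees", "employees"), ("salary", "salary")]
print(sum(levenshtein(x, y) for x, y in pairs) / len(pairs))  # 0.5
```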


Cosine Similarity

Definition: Measures vector similarity between extracted and expected keywords. Example: If "salary" and "income" are close in an embedding space, cosine similarity would be high.
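
A sketch using NumPy, with toy 3-dimensional vectors standing in for real embedding-model outputs:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings; in practice these come from an embedding model.
salary = np.array([0.9, 0.1, 0.3])
income = np.array([0.8, 0.2, 0.35])
print(round(cosine_similarity(salary, income), 2))  # ~0.99
```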


Lexical Overlap

Definition: Calculates the overlap between extracted and reference keywords, here taken as the share of reference keywords that are also extracted. Example: Extracted keywords: {"employees", "salary"}; Ground truth: {"salary", "wages"}. Overlap = |{"salary"}| / |{"salary", "wages"}| = 1/2 (50%).
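
The same example in a short sketch:

```python
def lexical_overlap(extracted: set[str], reference: set[str]) -> float:
    """Share of reference keywords that also appear in the extracted set."""
    if not reference:
        return 0.0
    return len(extracted & reference) / len(reference)

print(lexical_overlap({"employees", "salary"}, {"salary", "wages"}))  # 0.5
```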


2. Retrieval Metrics

Once keywords are extracted, relevant metadata (table/column names) and few-shot examples must be retrieved.


Cosine Similarity

Same as in keyword extraction but applied to retrieved database schema elements.


Cumulative Gain (CG)

Definition: Measures the total relevance of items in a retrieved list, without considering their positions.


Discounted Cumulative Gain (DCG)

Definition: Improves upon CG by accounting for item positions, reducing the contribution of later items.


Normalized Discounted Cumulative Gain (NDCG)

Definition: Normalizes DCG by comparing it to the ideal DCG (IDCG), ensuring fair comparisons.
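
The three gain metrics build on one another; a compact sketch (the graded relevance scores are toy data, listed in retrieval order):

```python
import math

def cg(relevances: list[float]) -> float:
    """Cumulative gain: total relevance, ignoring position."""
    return sum(relevances)

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: rank r contributes rel / log2(r + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

rels = [3, 1, 2, 0]  # graded relevance of retrieved schema elements
print(cg(rels), round(dcg(rels), 3), round(ndcg(rels), 3))
```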


Precision@k

Definition: Measures the proportion of relevant results in the top-k retrieved items.


Recall@k

Definition: Measures the proportion of relevant items retrieved out of all possible relevant items.
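
A sketch of both, using hypothetical table names as the retrieved items:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(item in relevant for item in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k."""
    return sum(item in relevant for item in retrieved[:k]) / len(relevant)

retrieved = ["employees", "payroll", "salary_grades", "offices"]
relevant = {"employees", "salary_grades", "bonuses"}
print(precision_at_k(retrieved, relevant, 3))  # 2/3
print(recall_at_k(retrieved, relevant, 3))     # 2/3
```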


Mean Reciprocal Rank (MRR)

Definition: Averages, over a set of queries, the reciprocal rank (1/rank) of the first relevant result for each query.


Mean Average Precision (MAP)

Definition: Averages, across queries, the average precision of each ranking — the mean of Precision@k taken at every rank k where a relevant item appears.
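
A sketch that averages both over a small set of hypothetical queries:

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of Precision@k at each rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

queries = [  # (retrieved list, relevant set) per query -- toy data
    (["payroll", "employees", "offices"], {"employees"}),
    (["salary_grades", "bonuses"], {"salary_grades", "bonuses"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
map_ = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(mrr, map_)  # 0.75 0.75
```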


3. LLM Evaluation Metrics


LLM-generated SQL queries should be contextually relevant and faithful to the input information.


Faithfulness


Definition: Measures whether the LLM output is supported by the given context without adding extraneous information.

Calculation: Faithfulness is determined by extracting statements from the generated output and checking whether each is verifiable by the provided context. An LLM assists in extracting statements and identifying supported vs. unsupported claims.
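
A sketch of the final ratio, assuming the LLM judge has already produced per-statement verdicts (the verdicts below are hypothetical):

```python
# Hypothetical verdicts from an LLM judge on statements extracted
# from the generated output.
verdicts = [
    {"statement": "The query reads from the employees table", "supported": True},
    {"statement": "Results are limited to 10 rows", "supported": True},
    {"statement": "Salaries are converted to EUR", "supported": False},
]

faithfulness = sum(v["supported"] for v in verdicts) / len(verdicts)
print(round(faithfulness, 2))  # 0.67
```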


Context Relevance

Definition: The contextual relevancy metric measures the quality of your RAG pipeline's retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input.

Calculation: Contextual Relevancy = Number of Relevant Statements / Total Number of Statements. An LLM extracts the statements in the retrieval_context and classifies each as relevant or irrelevant to the input.

Answer Relevance

Definition: The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input.

Calculation: Answer Relevancy = Number of Relevant Statements / Total Number of Statements.

The metric first uses an LLM to extract all statements made in the actual_output; the same LLM then classifies whether each statement is relevant to the input query.
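
These definitions mirror the DeepEval library's RAG metrics; with that library, a measurement looks roughly like the sketch below (the test-case contents are hypothetical, and running it calls an LLM judge behind the scenes):

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Find top 10 employees by salary",
    actual_output="SELECT name, salary FROM employees "
                  "ORDER BY salary DESC LIMIT 10;",
)

metric = AnswerRelevancyMetric(threshold=0.7)  # LLM-judged, needs an API key
metric.measure(test_case)
print(metric.score, metric.reason)
```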



G-Eval (Custom Criteria)

Definition: An LLM-as-a-judge metric scored against custom evaluation criteria, which lets you encode domain-specific SQL evaluation needs (e.g., dialect conventions or style rules).


Hallucination Rate

Definition: The hallucination metric determines whether your LLM generates factually correct information by comparing the actual_output to the provided context.

Calculation: Hallucination = Number of Contradicted Contexts / Total Number of Contexts. An LLM checks, for each context passage, whether the actual_output agrees with it or contradicts it.




4. Text-to-SQL Metrics


The final SQL output needs to be syntactically and semantically correct.


Exact Match

Definition: Checks if the generated SQL exactly matches a ground-truth SQL statement.
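
A sketch with light normalization so that trivial whitespace and casing differences don't count as mismatches (real harnesses treat string literals more carefully, since lowercasing changes them too):

```python
import re

def normalize(sql: str) -> str:
    """Collapse whitespace and lowercase the query before comparing."""
    return re.sub(r"\s+", " ", sql.strip()).lower()

def exact_match(predicted: str, gold: str) -> bool:
    return normalize(predicted) == normalize(gold)

print(exact_match(
    "SELECT name FROM employees  ORDER BY salary DESC LIMIT 10",
    "select name from employees order by salary desc limit 10",
))  # True
```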


ROUGE Score

Definition: Measures n-gram overlap between generated and reference SQL queries.


BLEU Score

Definition: Computes precision-based n-gram similarity between generated and reference SQL, with a brevity penalty for overly short outputs.
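
A sketch using NLTK's sentence-level BLEU on whitespace-tokenized SQL (ROUGE can be computed analogously with a package such as rouge-score):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "SELECT name FROM employees ORDER BY salary DESC LIMIT 10".split()
candidate = "SELECT name FROM employees ORDER BY salary LIMIT 10".split()

# Smoothing avoids zero scores when some higher-order n-grams are missing.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```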


Semantic Correctness

Definition: Ensures the SQL query retrieves the expected results even if its structure differs from the reference; this is often measured as execution accuracy, by running both queries and comparing their result sets.
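
A sketch of execution-based comparison against SQLite (company.db below is a hypothetical database file):

```python
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """Run both queries and compare result sets, ignoring row order."""
    with sqlite3.connect(db_path) as conn:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    # repr-keyed sorting sidesteps comparisons between mixed column types
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Hypothetical usage:
# execution_match(generated_sql, ground_truth_sql, "company.db")
```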


Syntax Correctness

Definition: Checks if the SQL query is valid and executable.
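
One lightweight check is to ask the database engine to plan the query without running it; a SQLite sketch (EXPLAIN also validates table and column names against the schema, so this is slightly stricter than a pure syntax check):

```python
import sqlite3

def is_valid_sql(sql: str, db_path: str) -> bool:
    """True if the engine can compile a query plan for the statement."""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False
```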


Query Execution Time

Definition: Measures how long the generated query takes to execute, impacting performance.
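
A simple wall-clock sketch (db_path is again a hypothetical SQLite file; production setups would average repeated runs and warm caches first):

```python
import sqlite3
import time

def query_seconds(sql: str, db_path: str) -> float:
    """Wall-clock time to execute the query and fetch all rows."""
    with sqlite3.connect(db_path) as conn:
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start
```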


Evaluating a text-to-SQL pipeline requires a comprehensive approach covering keyword extraction, retrieval, LLM responses, and SQL correctness. Using these metrics ensures robustness, efficiency, and accuracy in generating SQL queries from natural language.

