
Using DuckDB for better Text to SQL

Building a codebase map is useful. All of the coding agents do this (Cursor, Aider, Greptile; see Greptile's post on RAG). With code there is an obvious chunking and search unit: functions and classes. For databases the obvious analogue is pre-labeled queries. But what if you don't have them? Most text-to-SQL agents and systems use "scratch queries" to seed and search over, which reduces the problem significantly: just look for a query that already exists. With no pre-existing queries, the problem is much larger. How does this company define "active user"? Are columns mislabeled? When two columns look redundant, which one is the source of truth? Example queries give us a good starting point.

Solution 1: Create synthetic scratch queries

We tried this first. It is roughly analogous to indexing a large site: you start at the entry point (the landing page), discover the pages it links to, and visit those recursively. To "discover" queries, we start with the most obvious ones: we ask an LLM for basic queries that would answer broad, high-level questions. Then, based on those results, we recursively ask for queries that answer more niche or complex questions. This produces many synthetic queries we can use for RAG search. The last step is to write a natural language description of each query and add it to a vector database. A sketch of the discovery loop is below.
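Here is a minimal sketch of that loop in Python, assuming a hypothetical `llm` callable (prompt string in, completion string out); the prompt wording and the semicolon-delimited output format are illustrative stand-ins, not a prescribed interface.

```python
import duckdb

def generate_queries(llm, schema: str, prior_results: list[str], n: int = 5) -> list[str]:
    """Ask the LLM for n new SQL queries, conditioned on what earlier queries returned.
    `llm` is a hypothetical callable: prompt string in, completion string out."""
    prompt = (
        f"Schema:\n{schema}\n\n"
        f"Results of earlier queries:\n{prior_results}\n\n"
        f"Write {n} new SQL queries, separated by semicolons, that answer "
        "more niche or complex questions than the ones above."
    )
    return [q.strip() for q in llm(prompt).split(";") if q.strip()]

def build_query_bank(llm, con: duckdb.DuckDBPyConnection,
                     schema: str, rounds: int = 3) -> list[str]:
    """Recursively discover queries, keeping only the ones that actually run."""
    bank: list[str] = []
    prior_results: list[str] = []
    for _ in range(rounds):
        for sql in generate_queries(llm, schema, prior_results):
            try:
                rows = con.execute(sql).fetchmany(20)  # small sample to ground the next round
            except duckdb.Error:
                continue  # discard queries that fail against the real schema
            bank.append(sql)
            prior_results.append(f"{sql} -> {rows}")
    return bank
```

The final step (not shown) is one more LLM call per surviving query to produce its natural language description, whose embedding goes into the vector database.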

This database pre-processing is slow and expensive, and it requires very capable LLMs to find the nuances in the data. It is not very scalable today, but it will likely become more feasible as the cost of intelligence approaches zero.

In our experiments we find that this improves performance by X.

Solution 2: Add annotations to database schemas

Instead of pre-processing every database, what if we gave the LLM richer context about the database up front? For example, what if it knew the distribution of values in each column before querying? We should also let the agent iterate on its query before returning the result to the user, so it can fix any issues it finds at runtime. DuckDB's SUMMARIZE command produces a good statistical overview of a table or query result. This improves performance without costing nearly as much time or tokens as the pre-processing approach.
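SUMMARIZE is a real DuckDB command; here is how its output might be captured and fed to the model as context (the `analytics.db` file and `users` table are made-up examples):

```python
import duckdb

con = duckdb.connect("analytics.db")  # hypothetical database file

# SUMMARIZE returns per-column statistics: min, max, approx_unique,
# avg, std, quartiles, count, and null percentage.
schema_stats = con.sql("SUMMARIZE users").fetchall()

# It also works on arbitrary query results, which is useful for letting
# the agent inspect intermediate results while it iterates on a query.
sample_stats = con.sql(
    "SUMMARIZE SELECT * FROM users WHERE signup_date > DATE '2024-01-01'"
).fetchall()

# Either result can be serialized into the prompt as extra context
# before the agent commits to a final query.
```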

Solution 3: A mix of both

Instead of pre-processing up front, what if we could "index" good queries as we discover them? Let the agent search the bank first; when it has to write a new query that works, add it to the bank of known-good queries. A sketch follows.
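A minimal sketch of that incremental indexing, assuming a hypothetical `embed` function (text in, vector out) and an in-memory store; any embedding model and vector database could be substituted.

```python
import numpy as np

class QueryBank:
    """Grows a bank of known-good queries as the agent discovers them."""

    def __init__(self, embed):
        self.embed = embed  # hypothetical: str -> np.ndarray
        self.vectors: list[np.ndarray] = []
        self.entries: list[dict] = []

    def search(self, question: str, k: int = 3) -> list[dict]:
        """Return the k stored queries whose descriptions best match the question."""
        if not self.entries:
            return []
        q = self.embed(question)
        sims = np.array([
            v @ q / (np.linalg.norm(v) * np.linalg.norm(q))  # cosine similarity
            for v in self.vectors
        ])
        return [self.entries[i] for i in np.argsort(-sims)[:k]]

    def add(self, description: str, sql: str) -> None:
        """Index a newly verified query so future searches can reuse it."""
        self.vectors.append(self.embed(description))
        self.entries.append({"description": description, "sql": sql})
```

At answer time the agent calls `search` first; only if nothing relevant comes back does it write a fresh query, verify it (e.g. by inspecting a SUMMARIZE of its result), and `add` it to the bank.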

Miguel Salinas