Why “Chat with Your Data” Usually Disappoints — and How to Make It Enterprise-Grade

Most organizations already have the data. What they lack is reliable, fast, and accountable access to it. Ticket queues, stale dashboards, schema sprawl, and ambiguous ask-to-SQL translation conspire to slow decisions and erode trust. This essay distills what we’ve learned from shipping production text-to-SQL across hundreds of tables and thousands of columns, and connects those lessons to what top papers actually found.

Typical Points of Failure

1) Natural language is ambiguous; enterprise schemas are not

The leap from “orders by product type last quarter in EMEA” to the right facts, date roles, and product grain is precisely the “schema generalization” problem that benchmark designers targeted. Spider deliberately forces models to generalize to unseen databases, making this pain explicit. As the authors put it, Spider is “a complex and cross-domain… task so that different… databases appear in train and test sets.”¹

RAT-SQL showed that explicit, relation-aware schema linking matters. Its core observation: “any text-to-SQL parsing model must encode the schema” so the decoder can select correct tables/columns under domain shift.² (This is exactly the day-to-day challenge in warehouses with hundreds of tables.)

2) Models happily emit invalid or non-executable SQL

A line of work tackles this with decoding constraints and execution feedback. PICARD “helps to find valid output sequences by rejecting inadmissible tokens at each decoding step,” taking T5-style models from “passable” to state-of-the-art on Spider/CoSQL.³ (Concretely: it parses as it decodes, and simply refuses illegal next tokens.)
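
To make the mechanism concrete, here is a toy sketch of the idea (not PICARD’s actual implementation): a stand-in incremental checker decides, for each candidate token, whether the extended prefix can still become a valid statement. The vocabulary and grammar are invented purely for illustration.

```python
# Toy sketch of PICARD-style constrained decoding. A stand-in
# incremental checker plays the role of PICARD's real SQL parser.
VOCAB = ["SELECT", "col_a", "col_b", "FROM", "orders", ";"]

def is_viable_prefix(tokens: list[str]) -> bool:
    """Toy stand-in for an incremental parser: accept prefixes of
    `SELECT <col> FROM orders ;` and nothing else."""
    expected = [{"SELECT"}, {"col_a", "col_b"}, {"FROM"}, {"orders"}, {";"}]
    if len(tokens) > len(expected):
        return False
    return all(tok in allowed for tok, allowed in zip(tokens, expected))

def admissible_next_tokens(prefix: list[str]) -> list[str]:
    # The essence of the method: reject any token that would make the
    # prefix unparseable, instead of validating only the finished query.
    return [t for t in VOCAB if is_viable_prefix(prefix + [t])]

print(admissible_next_tokens(["SELECT"]))           # ['col_a', 'col_b']
print(admissible_next_tokens(["SELECT", "col_a"]))  # ['FROM']
```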

Execution-guided decoding (EG) goes after semantic errors: partially executes candidates and prunes ones that fail (wrong types, broken joins). Wang et al. reported consistent gains across datasets by “conditioning on the execution of partially generated” programs.⁴
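
A minimal sketch of the same pattern at the SQL level, using an in-memory SQLite database as a stand-in for a capped dry-run against the warehouse; the table and candidate queries are invented:

```python
# Minimal sketch of execution-guided filtering: probe each candidate
# cheaply and discard the ones that fail. A real system would use
# EXPLAIN / dry-runs with strict timeouts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")

def survives_execution(sql: str) -> bool:
    try:
        # LIMIT 0 exercises syntax, table/column resolution, and types
        # without materializing any rows.
        conn.execute(f"SELECT * FROM ({sql.rstrip(';')}) LIMIT 0")
        return True
    except sqlite3.Error:
        return False

candidates = [
    "SELECT region, SUM(amount) FROM orders GROUP BY region",
    "SELECT region, SUM(amount) FROM order GROUP BY region",  # bad table
    "SELECT region SUM(amount) FROM orders",                  # syntax error
]
print([c for c in candidates if survives_execution(c)])
# Only the first candidate survives.
```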

3) Benchmark scores overstate production reliability

Two threads are crucial here:

  • Robustness under perturbations. Dr.Spider introduced 17 perturbation sets to stress synonyms, column swaps, and other realistic changes; many systems degraded significantly when surface forms shifted.⁵

  • Data-model drift and business semantics. A 2025 EDBT study argued that “systematic exploration of robustness towards different data models in a real-world… scenario is missing,” then evaluated multi-year deployments, highlighting brittleness when schemas evolve.⁶

4) Evaluation itself is tricky—and sometimes misleading

A careful review contrasts “exact-set match” with “execution accuracy,” noting that equivalence testing is hard and that exact-match can penalize benign rewrites while execution accuracy can pass wrong-but-lucky queries.⁷ A practical takeaway: measure more than one metric and add operational checks (e.g., date-role consistency, distinct-count grain).
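
A small illustration of why: with an invented two-row table, a benign rewrite fails exact match while a wrong-but-lucky query passes execution accuracy on that particular database state.

```python
# Why one metric is not enough: exact match penalizes a benign
# rewrite, while execution accuracy passes a wrong-but-lucky query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "EMEA"), (2, "EMEA")])

gold = "SELECT COUNT(*) FROM orders WHERE region = 'EMEA'"
benign_rewrite = "SELECT COUNT(id) FROM orders WHERE region = 'EMEA'"
lucky_wrong = "SELECT COUNT(*) FROM orders"  # silently drops the filter

def execution_match(a: str, b: str) -> bool:
    return conn.execute(a).fetchall() == conn.execute(b).fetchall()

print(benign_rewrite == gold)                 # False: exact match penalizes it
print(execution_match(benign_rewrite, gold))  # True
print(execution_match(lucky_wrong, gold))     # True on THIS data: a false pass
```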

5) Security and governance cannot live in prompts

In production, row-level security (RLS) belongs in the database. SQL Server enforces this via predicate functions and security policies (“FILTER” vs “BLOCK” predicates), so access rules apply universally to every ad-hoc query.⁸ ⁹ ¹⁰ (Community write-ups also note caveats: cost and potential bypasses if misconfigured.)¹¹
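
A sketch of what this looks like in SQL Server, carried in a Python deployment snippet; the schema, function, and table names (Security.fn_territory_predicate, dbo.Sales, dbo.UserTerritory) are hypothetical:

```python
# Hypothetical names throughout: database-native RLS in SQL Server,
# where a predicate function plus a security policy applies to every
# query -- including LLM-generated ones -- with no prompt involved.
PREDICATE_DDL = """
CREATE FUNCTION Security.fn_territory_predicate (@TerritoryId int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    -- A row is visible only when the caller is mapped to its territory.
    SELECT 1 AS allowed
    FROM dbo.UserTerritory ut
    WHERE ut.TerritoryId = @TerritoryId
      AND ut.UserName = USER_NAME();
"""

POLICY_DDL = """
CREATE SECURITY POLICY Security.SalesPolicy
    -- FILTER silently hides non-matching rows on SELECT/UPDATE/DELETE.
    ADD FILTER PREDICATE Security.fn_territory_predicate(TerritoryId)
        ON dbo.Sales,
    -- BLOCK rejects writes that would violate the predicate.
    ADD BLOCK PREDICATE Security.fn_territory_predicate(TerritoryId)
        ON dbo.Sales AFTER INSERT
WITH (STATE = ON);
"""

# Applied once by a migration (each batch executed separately, e.g.
# via pyodbc); every ad-hoc query the text-to-SQL layer emits is then
# filtered server-side, regardless of what the prompt said.
```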

What Works in Production (and why)

A. Router-Enhanced RAG to tame schema sprawl

Agentic systems that classify intent (e.g., internet vs reseller sales; default order date), then bias retrieval toward a minimal anchor set (core fact + key dims) consistently reduce hallucinations and wrong table picks. This operationalizes RAT-SQL’s insight—link to the right schema elements—but achieves it with retrieval and intelligent routing instead of a single monolithic encoder (e.g., Zhang et al., 2025).² ¹⁴
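
A minimal sketch of the routing idea, with invented intents and AdventureWorks-style table names; a real router would be an LLM or a trained classifier rather than a keyword match:

```python
# Router-enhanced retrieval: the routed intent pins a small anchor set
# (core fact + key dimensions); free-form retrieval only competes for
# the remaining context budget.
ANCHOR_SETS = {
    "internet_sales": ["FactInternetSales", "DimDate", "DimProduct", "DimCustomer"],
    "reseller_sales": ["FactResellerSales", "DimDate", "DimProduct", "DimReseller"],
}

def route_intent(question: str) -> str:
    # Stand-in for an LLM / classifier call; keyword routing keeps the
    # sketch self-contained.
    return "reseller_sales" if "reseller" in question.lower() else "internet_sales"

def build_schema_context(question: str, retrieved: list[str],
                         budget: int = 6) -> list[str]:
    anchors = ANCHOR_SETS[route_intent(question)]
    # Anchors are always in; retrieval fills the remaining slots, which
    # is what suppresses wrong table picks.
    extras = [t for t in retrieved if t not in anchors]
    return anchors + extras[: budget - len(anchors)]

print(build_schema_context(
    "reseller orders by product type last quarter in EMEA",
    retrieved=["DimSalesTerritory", "DimPromotion", "DimCurrency"]))
# ['FactResellerSales', 'DimDate', 'DimProduct', 'DimReseller',
#  'DimSalesTerritory', 'DimPromotion']
```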

B. Constrain generation, then use the database as a judge

Combine grammar-constrained generation with execution-guided decoding. On open-weight deployments, constrain at the token level (logit masking or grammars, PICARD-style); on hosted models, use the provider’s “structured outputs.” Together, the pair catches most syntax errors and many semantic errors early, cutting a wide class of invalid or nonsensical outputs before they’re persisted or consumed by downstream systems (e.g., vLLM’s structured-output backends for guided decoding).³ ⁴ ¹⁵
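
Sketched as two small gates, reusing the shape of the earlier sketches (the helper signatures here are our own):

```python
import math

# Token-level control: set the logit of any grammar-inadmissible token
# to -inf so it can never be sampled.
def mask_inadmissible(logits: dict[str, float], prefix: list[str],
                      is_viable_prefix) -> dict[str, float]:
    return {tok: (score if is_viable_prefix(prefix + [tok]) else -math.inf)
            for tok, score in logits.items()}

# Database as judge: a candidate only leaves the system if a cheap,
# capped execution probe succeeds; otherwise decline with a reason.
def gate(candidate_sql: str, survives_execution) -> tuple[bool, str]:
    if survives_execution(candidate_sql):
        return True, candidate_sql
    return False, "declined: candidate failed the execution probe"
```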

C. Bake in domain-specific rules of the road

Most downstream errors are business errors: wrong date roles (ship vs order), broken grouping grain (counting distinct orders at SKU while labeling at family), mis-attributing country by customer domicile instead of sales territory, etc. Surveys from 2024–2025 call for lifecycle-level evaluation and error taxonomies that explicitly target these categories.¹² ¹³ Operationalize them as post-hoc checks with hard failures and human-readable reasons.
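
A sketch of one such check, with hypothetical column names and crude string matching standing in for a real SQL-AST analysis:

```python
# Post-hoc business checks: hard failures with human-readable reasons.
def check_business_rules(sql: str) -> list[str]:
    failures = []
    s = sql.lower()
    # Grain check: distinct order counts at SKU grain must not be
    # labeled at product-family grain.
    if "count(distinct" in s and "group by" in s:
        grain = s.split("group by", 1)[1]
        if "sku" in grain and "product_family" in s:
            failures.append(
                "Grain mismatch: distinct counts taken at SKU grain "
                "but labeled at product-family grain.")
    # Date-role check: the house default for "orders by period" is
    # order_date, not ship_date.
    if "ship_date" in s and "order_date" not in s:
        failures.append(
            "Date role: ship_date used, but the default role for "
            "'orders by period' questions is order_date.")
    return failures  # non-empty => decline, and show the reasons

print(check_business_rules(
    "SELECT product_family, COUNT(DISTINCT order_id) "
    "FROM sales GROUP BY sku"))
```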

D. Treat drift as a first-class surface

When new columns/tables appear, regenerate embeddings and refresh low-cardinality value catalogs (enums, status codes). The EDBT study’s central point is that data-model robustness can only be earned by tracking and reacting to change.⁶
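
A self-contained sketch of the loop, using SQLite’s catalog as a stand-in for DDL-event feeds or information_schema polling; the embedding refresh is a hypothetical placeholder:

```python
# Drift handling: snapshot the schema, diff against the last snapshot,
# and refresh only the affected artifacts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, status TEXT)")

def schema_snapshot(conn) -> dict[str, str]:
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'")
    return dict(rows.fetchall())

before = schema_snapshot(conn)
conn.execute("ALTER TABLE orders ADD COLUMN channel TEXT")  # drift happens
after = schema_snapshot(conn)

changed = [t for t in after if after[t] != before.get(t)]
for table in changed:
    print(f"refresh embeddings + value catalog for {table}")
    # e.g., re-embed table/column docs, re-enumerate low-cardinality
    # values such as status codes, and invalidate cached contexts.
```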

E. Observe everything

Log routes, prompts, SQL, and masked row samples; track token/latency budgets; and attach compact explanations to each answer. This isn’t “nice to have” — it’s how you debug, govern, and build trust.
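
A minimal sketch of what one such record might look like; all field names here are our own:

```python
# Per-answer telemetry: one structured record per question makes
# routes, SQL, and costs greppable; masked row samples keep the logs
# shareable while preserving shape for debugging.
import json, time, hashlib

def log_answer(question, route, sql, rows, tokens, started):
    record = {
        "ts": time.time(),
        "latency_s": round(time.time() - started, 3),
        "route": route,
        "question_sha": hashlib.sha256(question.encode()).hexdigest()[:12],
        "sql": sql,
        "tokens": tokens,
        # Mask values; keep keys so failures remain debuggable.
        "row_sample": [{k: "***" for k in r} for r in rows[:3]],
    }
    print(json.dumps(record))

t0 = time.time()
log_answer("orders by product type last quarter in EMEA",
           route="internet_sales",
           sql="SELECT ...",
           rows=[{"product_type": "Bikes", "orders": 1234}],
           tokens={"prompt": 1800, "completion": 120},
           started=t0)
```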

A Brief Note on Petalytics

Petalytics (Request Early Access) implements the blueprint above so stakeholders can actually use it:

  • An agentic, cohesive, multi-layer pipeline where each component has a clear purpose, fits the whole, receives tailored context, and reasons over your business semantics and database schema.

  • An intent router classifies each prompt into a hierarchical taxonomy, then steers retrieval so that category-conditioned RAG injects the right context at each stage of the workflow.

  • Guardrails end-to-end. Decoding is constrained (PICARD-style on open-weight deployments or schema-validated structured outputs on hosted models); candidates are screened via execution; post-hoc business checks enforce date-role and grain consistency; suspicious SQL is declined rather than executed.

  • Scale-aware context. Designed for hundreds of tables / thousands of columns; context packs refresh as schemas evolve.

  • Security where it belongs. We rely on database-native RLS (FILTER/BLOCK policies).

  • Explainability. Every answer includes a crisp scope and assumptions (channel, date role, product grain) so business users can trust and challenge results.

Implementation Checklist (production-oriented)

  1. Codify semantics once. Define canonical meanings (“orders,” “product type,” “GM%”) and default date roles; store them in retrievable docs the router can bias toward (see the sketch after this list). (This is the practical side of schema-linking.)²

  2. Constrain + execute. Use PICARD-style decoding or structured-output constraints for syntax, then execution guidance (plan/dry-run, capped probes) for semantics; sandbox and cap runtime.³ ⁴

  3. Enforce business grain. Add automated checks for distinct counts at the correct grain; reject queries that mix SKU-level counts with family-level labels.¹² ⁷

  4. Instrument drift. Watch DDL events; regenerate embeddings and value catalogs as schemas evolve; alert when docs are stale.⁶

  5. Push RLS into the database. Use SQL Server security policies (FILTER/BLOCK) and predicate functions; don’t reinvent access control in prompts.⁸ ⁹

  6. Evaluate like an operator. Track exact-match and execution accuracy, but also operational KPIs: time-to-first-answer, correction rate, and reasoned declines (guardrails doing their job).⁵ ⁷ ¹²
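
As promised in item 1, a sketch of a codified-semantics registry; the terms and defaults are illustrative, not a schema we prescribe:

```python
# A small registry the router can retrieve from and quote verbatim
# into prompts, instead of re-deriving definitions per question.
SEMANTICS = {
    "orders": {
        "definition": "distinct sales order headers, not order lines",
        "grain": "order_id",
        "default_date_role": "order_date",  # not ship_date
    },
    "product type": {
        "definition": "the product subcategory level of the hierarchy",
        "maps_to": "DimProduct.Subcategory",
    },
    "GM%": {
        "definition": "(revenue - cogs) / revenue",
        "format": "percentage, one decimal",
    },
}

def semantics_snippet(terms: list[str]) -> str:
    # The retrieved snippet the router biases toward for matched terms.
    return "\n".join(
        f"{t}: {SEMANTICS[t]['definition']}" for t in terms if t in SEMANTICS)

print(semantics_snippet(["orders", "GM%"]))
```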

References

  1. Yu, Tao, et al. “Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task.” Proceedings of EMNLP, 2018. https://aclanthology.org/D18-1425/ (accessed September 5, 2025).

  2. Wang, Bailin, et al. “RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers.” Proceedings of ACL, 2020. https://aclanthology.org/2020.acl-main.677/ (accessed September 5, 2025).

  3. Scholak, Torsten, Nathan Schucher, and Dzmitry Bahdanau. “PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models.” Proceedings of EMNLP, 2021. https://aclanthology.org/2021.emnlp-main.779/ (accessed September 5, 2025).

  4. Wang, Chenglong, et al. “Robust Text-to-SQL Generation with Execution-Guided Decoding.” arXiv preprint, 2018. https://arxiv.org/abs/1807.03100 (accessed September 5, 2025).

  5. Chang, Shuaichen, et al. “Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness.” arXiv preprint, 2023. https://arxiv.org/abs/2301.08881 (accessed September 5, 2025).

  6. Fürst, Josef, et al. “Evaluating the Data Model Robustness of Text-to-SQL Systems in Practice.” Proceedings of EDBT, 2025. https://openproceedings.org/2025/conf/edbt/paper-18.pdf (accessed September 5, 2025).

  7. Pourreza, Mohammad, et al. “On the Evaluation of Text-to-SQL Systems: Exact-Match versus Execution Accuracy.” arXiv preprint, 2023. https://arxiv.org/pdf/2310.18538 (accessed September 5, 2025).

  8. Microsoft. “Row-Level Security (RLS) — SQL Server.” Microsoft Learn Documentation, updated November 22, 2024. https://learn.microsoft.com/en-us/sql/relational-databases/security/row-level-security (accessed September 5, 2025).

  9. Microsoft. “CREATE SECURITY POLICY (Transact-SQL).” Microsoft Learn Documentation, updated November 22, 2024. https://learn.microsoft.com/en-us/sql/t-sql/statements/create-security-policy-transact-sql (accessed September 5, 2025).

  10. Microsoft. “ALTER SECURITY POLICY (Transact-SQL).” Microsoft Learn Documentation, updated July 9, 2025. https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-security-policy-transact-sql (accessed September 5, 2025).

  11. Redgate. “SQL Server Row Level Security Deep Dive (Attacks and Caveats).” Simple Talk, September 25, 2023. https://www.red-gate.com/simple-talk/blogs/sql-server-row-level-security-deep-dive-part-5-rls-attacks/ (accessed September 5, 2025).

  12. Li, Guoliang, et al. “A Survey of Text-to-SQL in the Era of LLMs.” IEEE Transactions on Knowledge and Data Engineering (early access), 2025. https://dbgroup.cs.tsinghua.edu.cn/ligl/papers/TKDE25-NL2SQL.pdf (accessed September 5, 2025).

  13. Liu, Zhi, et al. “A Survey of LLM-Based Text-to-SQL.” arXiv preprint, 2025. https://arxiv.org/html/2406.08426v4 (accessed September 5, 2025).

  14. Zhang, Jiarui, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, and Guihai Chen. “Query Routing for Retrieval-Augmented Language Models.” arXiv preprint, 2025. https://arxiv.org/abs/2505.23052 (accessed September 5, 2025).

  15. vLLM Contributors. “Structured Outputs.” vLLM Documentation. https://docs.vllm.ai (accessed September 8, 2025).