Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing SQL for data observability. Query history of a data warehouse is a rich source of information to glean how data is used in your organization. Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can extract: Popular tables and columns Unused tables and columns Column-level lineage Freshness These statistics also help to automate common data engineering tasks like: Backup and Disaster Recovery Triage Data Quality issues Track sensitive data and how they are used. Challenges and Approaches SQL language is an ISO/IEC standard and the latest version is . However, every database implements the standard differently, uses different function names for the same operation, and has extensions to access specific custom features. Therefore, there isn’t one SQL parser for dialects of all popular databases and data warehouses. SQL2016 Regular expressions is a popular approach to extract information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, windows clauses, aliases and quotes. is a popular Python package that uses regular expressions to parse SQL. sqlparse An alternate approach is to implement the SQL grammar using parser generators like . There are similar open-source parser generators in other popular languages. There are multiple projects that maintain parsers for popular open-source databases like MySQL and Postgres. For other open-source databases, the grammar can be extracted from the open-source project. For commercial databases, the only option is to reverse engineer the complete grammar. There are SQL parser/optimizer platforms like that help to reduce the effort to implement the SQL dialect of your choice. ANTLR Apache Calcite Open Source Parsers Some popular open-source databases and data warehouses are: MySQL/MariaDB parser is a MySQL parser in Go. Pingcap in phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect. SQL Parser Postgres extracts the parser (written in C) from the Postgres project and packages it as a stand-alone library. This library is wrapped in other languages by other projects like: libpg_query : Python pglast : Ruby pg_query : Golang pg_query_go : in Node and in the browser JS psql-parser pg-query-emscripten : Rust pg_query.rs Multiple Engines implements Apache Hive, Presto/Trino and Vertica dialects. queryparser implements BigQuery, Spanner, and Dataflow dialects. zetasql Generic Parsers : Python sqlparse : Rust sqlparser-rs : Python mo-sql-parseing Platforms Parser/Optimizer platforms implement the common SQL language features and allow customization as a first-class feature of the platform. Two popular open-source projects are: is a popular parser/optimizer that is used in popular databases and query engines like , , and . Apache Calcite Apache Hive BlazingSQL many others can parse multiple SQL dialects like MySQL, Postgres and Oracle. The grammar can be modified to support other SQL dialects. JSQLParser Apache Calcite allows customizations at various points of the parsing process. Parser rules can be to support custom syntax. changed Conventions such as quotes vs double quotes, case sensitivity. Add optimizer rules. Apache Calcite also provides visitors for traversing the SQL execution plan. Visitor pattern is an algorithm to traverse a SQL plan. Practical tips to Getting Started There are many abandoned open-source SQL parsers. The first filter is to use a project that will be supported in the future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming languages. What if there is no parser for your database? Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom SQL syntax. AWS Redshift, Vertica, and DuckDB are examples. Use a Postgres SQL parser to parse the query history of these databases to parse the majority of the queries. Many queries will fail to parse such as UNLOAD in AWS Redshift. If it is important to also parse the variants, consider modifying the projects to accept the custom grammar OR use a platform like Apache Calcite. Conclusion There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good open-source projects. However, there is a steep learning curve to use these projects and in many cases, a project may not fit your specific requirements. Struggling with parsing query history? Get in touch Also published . on Tokern