Jun 18, 2024

What SQL could learn from Elasticsearch Query DSL


by Jacek Migdal

SQL is the most popular database query language by a wide margin. It has been around since the 1970s and is ageing well. Over the past 50 years, it has survived many disruption attempts from a vibrant software ecosystem [1]. However, most technologies, such as the recent Friendly SQL extensions in DuckDB, have innovated with and around it rather than replacing it.

Only two widely successful query languages differ substantially from SQL: the Elasticsearch Query Domain Specific Language (DSL) and the MongoDB query language. Each gave birth to a public company with over a billion dollars of annual revenue (ESTC, MDB), and both rank in the top 10 of database popularity rankings [2].

MongoDB and Elasticsearch Query DSL received a fair amount of critique from traditional database gurus and were sometimes prematurely dismissed [3]. Despite that, both languages excelled in developer experience. MongoDB inspired SQL databases (for example, PostgreSQL) to adopt new column types [4]. However, Query DSL has not influenced SQL, hence my motivation to write this article.

Why now?

The SQL ecosystem can learn from Query DSL. Query DSL started as a way to do Google-style, relevance-ranked top-10 searches. It earned most of its excellent reputation in monitoring and troubleshooting observability use cases. Even though you may not use it directly, it sets the standard for UI and interactions in many observability and SIEM tools, such as Kibana or Wazuh.

A new frontier has opened with AI large language models (LLMs). Many companies adopting them can leverage excellent out-of-the-box models (OpenAI, Anthropic, Gemini, Mistral), but in most cases AI needs context data through Retrieval-Augmented Generation (RAG) to be useful. The main reason users may prefer an application's LLM chat over general-purpose ChatGPT is that it has access to their data, context, and application actions.

Query DSL seems to excel in this use case. It can mix vector, keyword, and other conditions in a short query. It is very helpful for quick data exploration, so an LLM can obtain the shape of any given value. Moreover, it can augment search results with relevance scoring, which is harder in pure vector search.
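For illustration, a single bool query can combine a keyword match with a structured filter in a few lines (the field names here are hypothetical; on recent Elasticsearch versions a knn clause can be mixed in the same way):

```json
{
  "query": {
    "bool": {
      "must":   { "match": { "description": "database gateway" } },
      "filter": { "range": { "timestamp": { "gte": "now-7d" } } }
    }
  }
}
```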

Query DSL - the good parts

Flexible full-text search

The Query DSL allows full-text search across all datasets and fields:

curl "http://localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "Quesma"
    }
  }
}'

SQL, by contrast, requires explicit specification of tables and columns, so even a partial full-text search looks very verbose:

SELECT * FROM people WHERE description ILIKE '%Quesma%' OR
  title ILIKE '%Quesma%' OR
  ... -- and so on, for every other text column

There are tricks with UNION ALL or creating full-text indexes, but they have rough edges. Sometimes, the best way to debug search is to dump data into SQL statements and run grep on top of that.
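The UNION ALL trick mentioned above might look like this sketch (table and column names are hypothetical), and it still has to be kept in sync with the schema by hand:

```sql
SELECT 'people' AS source, id FROM people WHERE description ILIKE '%Quesma%'
UNION ALL
SELECT 'orders' AS source, id FROM orders WHERE note ILIKE '%Quesma%';
```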

It would be nicer if I could query SQL:

SELECT * FROM * WHERE * ILIKE '%Quesma%'

Sane default limits

The above query would return just ten results and a hit count capped at 10,000. This makes data exploration fast and queries cheap. Of course, you can easily tweak these defaults, but the number of hits returned by one query is still limited to 10,000.
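Both defaults are easy to override per query; for example, this sketch raises the page size and asks for an exact total hit count:

```json
{
  "size": 50,
  "track_total_hits": true,
  "query": { "query_string": { "query": "Quesma" } }
}
```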

SQL has no default limits, so even a trivial query can be heavy:

SELECT * FROM analytical_table_with_many_tb

A similar SQL query can cost $14k in BigQuery. Many databases would have to implement limits and circuit breakers, which come with tradeoffs. Today, such prevention has to be implemented at the client/application level.
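As a client-side guard, some warehouses let you cap cost per job; for instance, BigQuery's bq CLI has a --maximum_bytes_billed flag. Note that in columnar warehouses a LIMIT clause usually does not reduce the bytes scanned, so it is no protection by itself:

```shell
# Fail the job instead of billing more than ~1 GB of scanned data
bq query --maximum_bytes_billed=1000000000 \
  'SELECT * FROM analytical_table_with_many_tb LIMIT 10'
```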

Composable queries 

This Query DSL query finds the top 10 cities by count and the total number of city values. It also operates on a sample of 5,000 documents per shard (e.g. 5k records from each day):

{
  "query": {...},
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 5000 },
      "aggs": {
        "analyzed_city_count": { "value_count": { "field": "city" } },
        "top_10_cities": { "terms": { "field": "city", "size": 10 } }
      }
    }
  }
}

The aggregations are named and can be nested or combined with filters. One Query DSL query can return multiple results in a single JSON response, while a SQL result is always a single table. So, a single Query DSL query can return results that would require multiple SQL statements.
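For comparison, reproducing the aggregation above in SQL takes two separate statements (assuming a hypothetical hits table):

```sql
-- Count of non-NULL city values (the value_count aggregation)
SELECT COUNT(city) FROM hits;

-- Top 10 cities by document count (the terms aggregation)
SELECT city, COUNT(*) AS doc_count
FROM hits
GROUP BY city
ORDER BY doc_count DESC
LIMIT 10;
```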

Composable JSON makes it easier to write your own query-building library, test it, and have confidence in it.

Unfortunately, in SQL a single word can change the whole semantics of a query. Many SQL dialects also differ in semantics (e.g., even date functions are non-standard) and add many extensions, making uniform parsing hard and often incomplete.

Fast and partial results

Query DSL is ideal for powering interactive troubleshooting. Many Query DSL queries are quick, and even more can be sped up with a simple sampler. Even heavy queries can run asynchronously, and you can see partial results as the query progresses. So, an engineer who woke up at 2 am can quickly see that the system is flooded with errors and dig deeper, even if accurate results would take minutes.
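Elasticsearch exposes this through its async search API; here is a sketch (the endpoint and the wait_for_completion_timeout parameter are real, the index name and query body are illustrative):

```shell
# Start an async search; return whatever is ready after one second
curl "http://localhost:9200/logs/_async_search?wait_for_completion_timeout=1s" \
  -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "level": "error" } }
}'
# The response includes an "id" that can be polled later
# via GET /_async_search/<id> for updated partial results
```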

This feels refreshing compared to traditional business intelligence and SQL warehouses, where queries can take minutes. You can solve that by throwing money at the cloud or hiring data engineers, yet even the richest companies limit data warehouse access to specialised teams. Tools like Kibana, however, can power entire operations teams, including customer support, sales, and field engineering.

Although the underlying database behind Query DSL is not the fastest, the language makes it easy to ask quick questions and see results in near real time.

Learnings from Query DSL

Full-text search and default limits go in tandem

Making everything full-text searchable, interactive, quick, and easy to play with is an exciting alternative to SQL. The closest thing on the market is SQLite FTS5 or similar extensions:

CREATE VIRTUAL TABLE email USING fts5(sender, title, body);

INSERT INTO email (sender, title, body) SELECT sender, title, body FROM orig_emails;

SELECT * FROM email('Quesma');

Still, it would be nicer if this were an HTAP-style overlay on tables instead of an object that needs manual data syncing.
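FTS5 gets partway there with external-content tables, which read row data from the original table instead of storing a copy, though you still have to keep the index updated yourself (a sketch; the content= and content_rowid= options are real FTS5 features):

```sql
-- The index reads sender/title/body directly from orig_emails
CREATE VIRTUAL TABLE email USING fts5(
  sender, title, body,
  content='orig_emails', content_rowid='rowid'
);

-- Still needed: triggers (one each for INSERT/UPDATE/DELETE)
-- to keep the full-text index in sync with the base table
CREATE TRIGGER orig_emails_ai AFTER INSERT ON orig_emails BEGIN
  INSERT INTO email(rowid, sender, title, body)
  VALUES (new.rowid, new.sender, new.title, new.body);
END;
```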

Once you open that door, you are prone to performance issues. Even if finding matches is fast, copying the results can use many resources, particularly memory, even if you only look at the top 10 results.

SELECT * FROM email('Quesma'); -- Returns with an implicit LIMIT 10
SELECT * FROM email('Quesma') NO LIMIT; -- Returns all rows

So default limits are desperately needed, but adding them is hard without breaking backward compatibility or making weird exceptions for full-text tables. It is a hard sell to break a 50-year tradition.

It is also about wire protocol

SQL is not just a text language; it also comes with common ways to retrieve results. The wire protocols and interfaces such as JDBC/ODBC/ADBC assume that all results fit in one table. So, while it would be easy to add such statements to SQL itself (as shown in What if an SQL statement returned a database?), every other part of the ecosystem would not play well with such a change.

Even though redefining protocols is hard, partial results are a hard requirement for streaming data. So, there are several implementations of streaming SQL; one example:

SELECT STREAM CEIL(rowtime TO HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY CEIL(rowtime TO HOUR), productId;

  rowtime | productId |       c | units
----------+-----------+---------+-------
 11:00:00 |        30 |       2 |    24
 11:00:00 |        10 |       1 |     1
 11:00:00 |        20 |       1 |     7
 12:00:00 |        10 |       3 |    11
 12:00:00 |        40 |       1 |    12

Can Query DSL become a protocol? 

New standards might make sense in the abstract, but the road to defining them is messy. Some believe that consortium-defined specs are the best way to start, though the track record of that approach is debatable (e.g. OpenCL vs. CUDA).

In contrast, many excellent protocols started with a great implementation that later evolved into a protocol (e.g. Kafka, PostgreSQL wire protocol, S3 API).

Query DSL has similar potential but is untapped in the vast SQL database arena. At Quesma, we believe that introducing a database gateway is the best way to add Query DSL power and flexibility to the existing SQL warehouses. This is an exciting new way to explore and troubleshoot using data for a wider audience, especially in the brave new world where LLMs will rule.


Footnotes:

[1] Extrapolating the Lindy Effect from Nassim Taleb's Antifragile, SQL has one of the best shots at being used for over 100 years, since it is already halfway there.

[2] Honourable mentions include the query languages of Spark, Redis, and Neo4j, and observability pipeline languages (Splunk SPL, Sumo Logic, ES|QL, PPL).

[3] Call me maybe: MongoDB stale reads (the article is from 2013, title was renamed and problems were fixed)

[4] MongoDB 1.0 was released in February 2009, and JSONB shipped in PostgreSQL 9.4 in December 2014
