Full-text search X times faster: Inverted index vs. SQL OLAP

by Jacek Migdal

Using a sledgehammer to crack a nut is all too common in the data infrastructure. Computers are highly performant, and scaling out cloud resources is a breeze. But, when the time comes to pay a scary bill, it's a prime time to reflect on the fundamentals.

In the SQL ecosystem, there is a good understanding of the difference between transactional (OLTP) and analytical (OLAP) workloads. Many can do both, but some databases, such as MySQL or PostgreSQL, specialise in transactions, while others, such as Snowflake, in analytics.

The full-text search workloads are way less understood. Several companies (e.g. Clickhouse Clickbench 7-24x, Hydrolix 18x, and Altinity Clickhouse 4-10x) claim to be 10+ times more cost-efficient per TB than Elasticsearch in the observability use case. Considering Elastic's popularity and excellent implementation, the difference of such magnitude is explained in data structures and architectural choices.

How we got there: The rise of observability full-text search

As software keeps growing in complexity, bugs are unavoidable. With any mission-critical software, any failure is costly, yet companies must release new versions daily.

Many engineers experienced painful late-night outages, which were a murder mystery-style puzzle. Something broke, but the time to identify and understand was too long. Meanwhile, executives dread revenue loss with every minute of the outage.

Out of this frustration, a new generation of observability tools was born. Many were built on the same full-text infrastructure that powers the consumer web. There were more urgent problems to be solved instead of wondering whether inverted indexes still make sense.

A must-have for any observability tool is the excellent developer experience as they need to troubleshoot 2 am outages quickly. That is why many tools started by building a vast ecosystem of integrations. For example, log management tools need hundreds of integrations to collect, understand, display and alert on log lines. Distributed tracing needs rich instrumentation to correlate user actions like a click with a chain of microservice calls.

Most of this tooling is vertically coupled with storage and querying. This works great for vendors and produces many multibillion-dollar companies in that category. The gravity of data will keep you with a vendor for a long time. As the volume of observability rises, the cost does, too, and overages are frequent.

How Elasticsearch full-text search works

Elasticsearch is based on the library Apache Lucene for information storage and retrieval. The primary data primitives are inverted indexes.

To create an index you need to:

Tokenization: Split sentences into words/tokens.
Stemming and lemmatization: Unify some tokens by mapping them together (e.g. slept -> sleep, cars -> car).
Store a lot of metadata along the way for scoring to be able to identify top results.

The indexing is very resource-intensive; you need many IO operations per log line.

Much research went into understanding the most relevant results, working with typos and finding semantic meaning. Many consumer websites run at a massive scale with new autocompletion on every keystroke. At the same time, they have to return the top results within milliseconds.

This made full-text search incredibly complex, but the effort was worth it. The consumer loves the experience. A trillion-dollar company, Google, was built by creating the world's biggest inverted index with exceptional scoring called PageRank on the internet.

How analytical full-text is different

The observability data is noisy, and the value of each log line is low. Sometimes, the data is merely an insurance policy, primarily unanalysed, but occasionally saves your life during an outage, security audit or security incident. Many companies use it as a Swiss Army tool for all back-office integrations.

We can maximize the value of the system by optimizing for the following principles.

Real-time with a recency bias

Observability and security is time intensive. New data flows every minute, and the last 15 minutes are vital. Most searches are about current issues, and historical searches are less common.

Huge ingest and big dataset

The data can be noisy, with TBs of new data daily, even for small systems. The machines produce magnitudes more data than humans can type.

A small amount of expensive searches with spikes

There are benefits of aggregation and histogram searches, which require more processing than the top 10 styles. Usually, developers, DevOps, or security analysts run those searches. Frequently, there are massive spikes during incidents, audits or debugging sessions.

How to be 10 times faster in analytical search

Don’t use inverted indexes. Do raw search with skip indexes

The search-per-indexing ratio is entirely different. It is not worth spending so many resources on indexing, which brings a lot of costs, complexities, and potential lag.

Moreover, many full-text features, such as tokenisations, are antipatterns in logs. For example, DevOps searches are grep style:
InvalidUser
It wants to find logs like:
InvalidUserException

When after tokenisation, it is unclear if you find them. Usually not. What was a feature for consumer search is an antipattern for a technical audience. The inverted indexes are no longer helpful. A simple character match is more reliable.

Raw full-text scans beat fancy indexing. Moreover, You can optimise raw search with skip indexes. For every chunk of data (e.g. 50MB), store probabilistic data structures like bloomfilter, which can help avoid scanning whole blocks. It has a small overhead (1-10%), but can speed up finding a needle in a haystack.

Use column base storage with tailored compression

Most logs are repetitive, with a lot of metadata of their origin. Standard columns such as host.name or k8.pod can be extracted and compressed. Moreover, many codecs, such as dictionary compression, can allow searching without prior decompression.

Use all analytical database tricks

There are a lot of papers that made analytical databases performant (e.g., MonetDB/X100). The state-of-the-art is vectorised with just-in-time query execution engines. Moreover, a finite-state machine can match multiple strings and regular expressions in one pass.

SQL analytical data warehouse for observability and security data

Instead of one more tool, it would make sense to unify to one data warehouse.

However, there is a big gap in tooling. The interactive troubleshooting requirements are different: you need incremental results and be able to do quick drill-downs. Many tools are built on Kibana, OpenSearch dashboards, Grafana, or your observability SaaS vendor, such as DataDog. Each troubleshooting click maps to many SQL queries (e.g. show my last log lines, how many over time). Moreover, there are pre-made integrations with many popular components or SaaS. A typical organisation can manage hundreds of log types, each needing different ingestion, dashboards, and alerts. Building it all in-house is costly, and reusing a common library is a must-have.

Last but not least, many organisations are deeply attached to their tooling. E.g. Uber got 10k Kibana dashboards when they moved from Elasticsearch to Clickhouse. They don’t migrate them but provide backward compatibility through API.

Though some benefited tremendously from one-way migration like Zomato is saving 1 million per year by moving to Clickhouse, this is not the correct answer for everybody. The cost of migration is substantial, yet maintenance and upgrade expenses loom even larger. Instead of tapping into the tooling ecosystem, where most issues find available solutions online, you are on your own.

Best tool for the job,

However, if all you have is a hammer, everything looks like a nail.

The Elastic ecosystem is fantastic, and so are SQL analytical data warehouses. Inverted indexes make sense for consumer searches, but they have challenges at scale in observability use cases. The new generation of observability companies, such as SigNoz, BetterStack or EdgeDelta, are all based on Clickhouse.

You need access to the rich ecosystem of integrations while having the most efficient engine possible. Be able to join data from all sources to find better insights.

We are developing Quesma to bridge those worlds and use the best data tool for the job.

Please reach out if you have a lot of observability and security data. We are here to help.

Table of Content

Title