May 15, 2024

Elasticsearch and OpenSearch schema rough edges

by Jacek Migdal

Elasticsearch and its open-source sibling, OpenSearch, are the most popular full-text search engines, yet their schema approach is widely misunderstood. At Quesma we are building a database gateway that helps people choose the best tool for the job, and we wanted to share some of our experience building support for the Elasticsearch and OpenSearch database engines.

Most databases (like PostgreSQL) define their schema upfront or accept any JSON (for example, MongoDB), but Elasticsearch/OpenSearch is different.

Feel free to follow the instructions using the local Docker container:

# Start and verify it works
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "DISABLE_SECURITY_PLUGIN=true" -e "DISABLE_INSTALL_DEMO_CONFIG=true" opensearchproject/opensearch:2.12.0
curl localhost:9200
# You should see nice JSON

Schema defined by first JSON

There's a common misconception that Elasticsearch/OpenSearch are schema-less. This is only partially true: you can implicitly create a new index, and its schema, just by inserting a single JSON document:

# Create new index by inserting one json
curl -X POST "localhost:9200/sample/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2038-01-19T03:13:09",
  "message": "User 'jacek@quesma.com' has logged in successfully!",
  "user": {
    "id": 42
  }
}
'
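You can peek at the schema this single document just defined. Here is a sketch of what dynamic mapping typically infers; the exact types may vary with your version and settings:

# Inspect the mapping that was created implicitly
curl "localhost:9200/sample/_mapping?pretty"
# user.id was inferred as long, message as text (with a .keyword sub-field),
# and @timestamp as a date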

It feels fine until you get an HTTP 400 error when trying to insert:

curl -X POST "localhost:9200/sample/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2038-01-19T03:13:09",
  "message": "User 'jacek@quesma.com' has logged in successfully!",
  "user": {
    "id": "109c01ea-9264-4edc-9d30-9e71780ecfb6"
  }
}
'
# Output:
# {
#  "error" : {
#    "root_cause" : [
#      {
#        "type" : "mapper_parsing_exception",
# ...

You get a “mapper_parsing_exception” error with the cause “failed to parse field [user.id] of type [long] in document…”.

The same JSON sent to a brand-new index would work and would define a different schema.
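You can check this yourself; here is a sketch, where sample2 is just an arbitrary fresh index name:

# The document rejected above is perfectly fine for a brand-new index
curl -X POST "localhost:9200/sample2/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2038-01-19T03:13:09",
  "message": "User jacek@quesma.com has logged in successfully!",
  "user": {
    "id": "109c01ea-9264-4edc-9d30-9e71780ecfb6"
  }
}
'
# In sample2, user.id is now a string type instead of long
curl "localhost:9200/sample2/_mapping?pretty"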

You usually won't notice this behavior for a while. For performance reasons, most inserts/ingestion go through the “sample/_bulk” API.

Then, even when your data is dropped, you still get lovely HTTP 200 responses; the errors are buried in the response JSON.

curl -X POST "localhost:9200/sample/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{}}
{"@timestamp": "2038-01-19T03:13:09", "message": "User jacek@quesma.com has logged in successfully!", "user": {"id": 42}}
{"index":{}}
{"@timestamp": "2038-01-19T03:13:15", "message": "User pawel@quesma.com has logged in successfully!", "user": {"id": "109c01ea-9264-4edc-9d30-9e71780ecfb6"}}
{"index":{}}
{"@timestamp": "2038-01-19T03:13:42", "message": "User test@quesma.com has logged in successfully!", "user": {"id": "MAGIC_ID" }}
'

HTTP/1.1 200 OK, yet two documents were dropped.
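In practice you have to dig the failures out of the response yourself. One way to surface them is the filter_path query parameter; a sketch, re-sending just the rejected document:

curl -s -X POST "localhost:9200/sample/_bulk?filter_path=errors,items.*.error" -H 'Content-Type: application/json' -d'
{"index":{}}
{"@timestamp": "2038-01-19T03:13:15", "message": "User pawel@quesma.com has logged in successfully!", "user": {"id": "109c01ea-9264-4edc-9d30-9e71780ecfb6"}}
'
# The trimmed response shows "errors": true plus the per-item mapper_parsing_exception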

Countless Elasticsearch users are confused by this and express their pain online. Elasticsearch's antidote to the problem is to drop only the offending fields by default rather than whole documents.

Defining a schema through mappings can help, yet there is no easy way to keep all of your data while still taking advantage of types. Moreover, an existing field's mapping cannot be changed on a live index; it requires reindexing the whole document collection. The community works around this flaw by defining mappings for everything upfront.
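For reference, here is a sketch of an explicit mapping created upfront (sample_typed is a hypothetical index name; note that index.mapping.ignore_malformed keeps a document by silently not indexing the offending field, so it is still a trade-off):

curl -X PUT "localhost:9200/sample_typed" -H 'Content-Type: application/json' -d'
{
  "settings": { "index.mapping.ignore_malformed": true },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" },
      "user":       { "properties": { "id": { "type": "keyword" } } }
    }
  }
}'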

It’s not OK for a database to drop your data, especially in observability and security contexts. In many of those setups, you have limited control over what data you receive. You may miss the root cause of an outage or security breach.

I beg you not to create too many fields

You may also be tempted to use many fields, such as one per customer.

Here is what that looks like, though please don't run it against a real cluster:

# Insert 1,000 documents, each introducing a brand-new customer_<i> field
for i in {1..1000}; do
  timestamp=$(date -u +"%Y-%m-%dT%H:%M:%S")
  json_body="{\"@timestamp\": \"$timestamp\", \"customer_$i\": \"Waiting for Quesma\"}"

  curl -X POST "localhost:9200/kibana_sample_data_logs/_doc" -H 'Content-Type: application/json' -d "$json_body"
done

This would not run to completion: at some point you hit the default maximum of 1,000 fields per index, and each JSON key here adds two fields (a string key gets both a text field and a keyword sub-field). You can raise the limit, but that only makes the situation worse.
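The limit in question is the index.mapping.total_fields.limit setting, and raising it is trivially easy, which is exactly the trap:

# Raise the field limit on a live index (not recommended; it only postpones the pain)
curl -X PUT "localhost:9200/kibana_sample_data_logs/_settings" -H 'Content-Type: application/json' -d'
{ "index.mapping.total_fields.limit": 5000 }'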

First of all, too many fields can damage your UI and dashboard experience.

Before, you could rely on the field list: a left-hand menu with nice names and types.

Afterwards, it becomes polluted with the extra fields, which makes your filtering experience far worse; the left-hand menu becomes unusable.

Secondly, it can reduce your search performance.

Many folks fall into that trap by mistake, but it is also a way to troll or to burn bridges. I would not recommend it: both search performance and UX suffer, and most ELK deployments are prone to it.

Though new fields are helpful, there is no easy way to manage them (e.g. allow a user to create only 10 new ones), no rules to control them, and no easy way to clean up once you land in that mess.
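The closest built-in rule is a strict dynamic mapping, but it rejects whole documents that contain unknown fields, so once again data gets dropped rather than kept. A sketch (sample_strict is a hypothetical index):

curl -X PUT "localhost:9200/sample_strict" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" }
    }
  }
}'
# Indexing a document with any field other than @timestamp/message now fails outright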

Another way to hit this limit is a multi-tenant SaaS app that allows custom attributes. Even with a limit of 20 attributes per customer, the total can explode, and you end up implementing the mapping outside of Elasticsearch.
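One common mitigation is to map the whole attribute bag as a single field, e.g. OpenSearch's flat_object type (Elasticsearch has a similar flattened type). A sketch, with tenants as a hypothetical index; the trade-off is that leaf values are treated as keywords, so you lose numeric and date semantics:

curl -X PUT "localhost:9200/tenants" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "custom_attributes": { "type": "flat_object" }
    }
  }
}'
# Arbitrary keys under custom_attributes no longer add new mapped fields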

The documents look like JSON, but they are not

Suppose that in your systems, there are multiple ways to authenticate.

How about storing an array:

curl -X POST "localhost:9200/sample/_doc/leetdoc" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2038-01-19T03:13:09",
  "message": "Users bot-42, 'jacek' and 'adam@quesma.com' has started the game!",
  "users": [{"id": 42}, {"name": "Jacek"}, {"email": "adam@quesma.com"}]
}'

It feels like it works:

curl localhost:9200/sample/_doc/leetdoc
{"_index":"sample","_id":"gU2ScY8BN7HWbsntlxXY","_version":1,"_seq_no":0,"_primary_term":1,"found":true,"_source":
{
  "@timestamp": "2038-01-19T03:13:09",
  "message": "Users bot-42, jacek and adam@quesma.com has started the game!",
  "users": [{"id": 42}, {"name": "Jacek"}, {"email": "adam@quesma.com"}]
}
}

However, it doesn't behave that way in search.

Let's create another document:

curl -X POST "localhost:9200/sample/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2038-01-19T03:13:10",
  "message": "Users bot-42, 'jacek' and 'adam@quesma.com' has started the game!",
  "users": [{"id": 42, "name": "Jacek", "email": "adam@quesma.com"}]
}'

Both documents appear identical in search results.
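For example, a query that requires both users.id 42 and users.name "Jacek" matches both documents, even though in the first one those values belong to different users (a sketch):

curl -X POST "localhost:9200/sample/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "users.id": 42 } },
        { "match": { "users.name": "Jacek" } }
      ]
    }
  }
}'
# Both documents come back as hits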

So, although search returns the exact JSON you inserted via _source, internally each document is a collection of arrays, one per field. A more accurate representation would be:

  "fields": {
    "users.email": [
      "adam@quesma.com"
    ],
    "@timestamp": [
      "2038-01-19T03:13:10.000Z"
    ],
    "users.name": [
      "Jacek"
    ],
    "users.id": [
      42
    ],
    "message": [
      "Users bot-42, jacek and adam@quesma.com has started the game!"
    ]
  }
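You can see this flattened view yourself by asking search for fields instead of _source (a sketch):

curl -X POST "localhost:9200/sample/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": false,
  "fields": ["*"]
}'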

This has many confusing implications that go beyond this article. Unsurprisingly, many overrides exist, such as the nested and flattened types, as well as runtime fields and strict mappings.
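For instance, the nested type keeps object boundaries inside arrays, at the cost of a more expensive index layout and a dedicated nested query syntax. A sketch (sample_nested is a hypothetical index):

curl -X PUT "localhost:9200/sample_nested" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "users": { "type": "nested" }
    }
  }
}'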

Is there hope?

These are just three rough edges, but unfortunately, there are way more. Every successful technology has its strengths and pitfalls. 

Elasticsearch/OpenSearch feel easy to get started with, and most of the time you won't hit these issues until you reach some scale. They are worth remembering, especially if you plan to build, or are already building, your own do-it-yourself observability or security application. If you are not ready for this, consider a hosted option.

Fixing underlying issues without breaking backward compatibility is tough, especially in databases. That’s the problem we are looking to solve at Quesma. 

Customers would benefit from:

  • A way to experiment and a safe, step-by-step migration to a better schema: double-write to both the old and the new schema and be able to tweak things.

  • An implicit schema created after 1000+ JSONs, not a single one. Similarly, new top-level fields should only be created if they appear regularly.

  • No data is left behind. A queryable catch-all should be available for parts of the data that don’t match the schema.

Traditionally, database vendors are very conservative, but Quesma, as a gateway, has the opportunity to innovate and show a better way while preserving the good parts of a full-text search database.

Stay tuned. A better world is coming.

Thank you, Ivan Brusic and Quesma Team for providing feedback on the draft.
