11 May 2026
Exploring the Variant type in Parquet
Let's get shredding
I’ve had observability on my mind for quite some time now. Two years ago I was trying to expand my metrics setup with structured logs emitted by some services. The ensuing exploration only yielded the obvious: Storing and efficiently querying arbitrary data is hard. DuckDB was all the rage back then, so naturally I had to see how it handles analytics workloads. Finally, a year later, I checked out ClickHouse and found it amazingly fast for my purposes.
Ever since, I’ve run a ClickHouse instance to store arbitrary JSON payloads fed by GPS location data, sensor readings and tracing data. But a couple of days ago that nagging feeling came back: What if this can be simplified? ClickHouse is by far the heaviest service I operate on my server. It randomly fills up my disk with logging tables, too. Look, ClickHouse is an engineering marvel, but it’s not my marvel. It’s someone else’s marvel. And I have a need to create my own. Thing, not marvel, I mean. So here we are.
The hard part is still how to store arbitrary JSON payloads in such a way that aggregations on
fields are performant. My current solution heavily relies on the excellent Map type in ClickHouse.
To replace it, some storage engine is going to have to do some heavy lifting. My hope is that DuckDB
can take that role if we massage the data just a bit. Since I last used DuckDB, quite a bit has
changed. Particularly interesting for me: Parquet got a Variant type and DuckDB supports it.
But before we get into the weeds, let me set the scene. Let’s say I have a service that sends structured logs (or wide events) to my observability system. Today they might look like this:
{
  "timestamp": "2026-05-05T08:00:00Z",
  "service": "marending-dev",
  "severity": "error",
  "message": "User Bobby Tables doesn't exist"
}
But then I want to track more information.
{
  "timestamp": "2026-05-05T08:00:00Z",
  "service": "marending-dev",
  "severity": "error",
  "message": "User Bobby Tables doesn't exist",
  "user_id": "null",
  "ip": "85.0.1.42",
  "path": "/user/login"
}
You get the picture. The question is, how can we store these JSON blobs such that we can then
efficiently compute, say, error rate grouped by user_id? We can’t have a static schema, since we
might want to add properties to this event at any time. We can chuck this as a JSON string into
any database, file format or what have you. But then any query that attempts to read from properties
of that JSON string is going to incur the significant cost of extracting the appropriate fields for
every row of interest. One way to get reasonable performance here is to ensure severity and user_id
are each contiguously laid out so that they can be efficiently aggregated.
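To make that concrete, here is the kind of aggregation I want to be cheap, sketched in DuckDB SQL against a hypothetical events table whose data column holds the raw JSON payload from above (the table and its schema are just for illustration):
-- Hypothetical: error rate per user_id over raw JSON payloads.
SELECT
    json_extract_string(data, '$.user_id') AS user_id,
    count(*) FILTER (WHERE json_extract_string(data, '$.severity') = 'error')
        / count(*) AS error_rate
FROM events
GROUP BY user_id;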
Parquet and Variant
That brings us to Apache Parquet. Parquet is a column-oriented file format. While DuckDB has its own file format for storing data, one of its strengths is its support for a wide variety of other formats, including Parquet. So you can write and query Parquet files natively with SQL within DuckDB.
Recently Parquet added the Variant type, a binary encoding of primitives, arrays or objects of arbitrary types. It allows representing structured data in such a way that it can be queried more efficiently than if the same data was serialized and stored as a string type. But the real kicker is “Variant shredding”. It’s the process of extracting some of the properties within the Variant into their own columns. This should then give us maximum performance when aggregating on those columns. And DuckDB apparently supports this!
Once I learned about this, a path towards my dream observability system emerged. Step 1: Write a service that accepts arbitrary JSON payloads, buffers them in memory and writes them to Parquet files with the payload being of type Variant every 10 seconds or so. Step 2: Every hour, read all the small Parquet files and shred them into a larger hourly file. Step 3: Possibly aggregate the hourly files into even broader files, but I’ll see when I get to it.
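A minimal sketch of that hourly compaction (step 2) in DuckDB, with made-up paths for the buffered and hourly files:
-- Hypothetical hourly compaction: fold the small buffered files into one file.
-- DuckDB's autoshredder then decides which Variant fields become real columns.
COPY (
    SELECT * FROM 'buffer/2026-05-11T08-*.parquet'
) TO 'hourly/2026-05-11T08.parquet' (FORMAT parquet);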
Now at query time, I will have at most an hour’s worth of data in inefficient, unshredded form. That should be easy enough to deal with. Data older than that will be in highly efficient form. And the best part? DuckDB should be able to query across many Parquet files using glob syntax, even when some have concrete, shredded columns and some don’t. But we’ll see this for ourselves.
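At read time that could look roughly like this, again with made-up paths, scanning shredded hourly files and fresh unshredded ones together:
-- Hypothetical read path across shredded and unshredded files.
SELECT
    count(*) FILTER (WHERE CAST(data.severity AS VARCHAR) = 'error') AS errors,
    count(*) AS total
FROM read_parquet(['hourly/*.parquet', 'buffer/*.parquet']);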
Exploration
Let’s start with the example given in DuckDB’s documentation.
CREATE TABLE events (id INTEGER, data VARIANT);
INSERT INTO events VALUES
    (1, 42::VARIANT),
    (2, 'hello world'::VARIANT),
    (3, [1, 2, 3]::VARIANT),
    (4, {'name': 'Alice', 'age': 30}::VARIANT);
Querying the table presents the expected result.
SELECT * FROM events;
┌───────┬────────────────────────────┐
│ id │ data │
│ int32 │ variant │
├───────┼────────────────────────────┤
│ 1 │ 42 │
│ 2 │ hello world │
│ 3 │ [1, 2, 3] │
│ 4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘
SELECT id, data, variant_typeof(data) AS vtype FROM events;
┌───────┬────────────────────────────┬───────────────────┐
│ id │ data │ vtype │
│ int32 │ variant │ varchar │
├───────┼────────────────────────────┼───────────────────┤
│ 1 │ 42 │ INT32 │
│ 2 │ hello world │ VARCHAR │
│ 3 │ [1, 2, 3] │ ARRAY(3) │
│ 4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘
We can now also query nested fields in a Variant.
SELECT data.name FROM events;
┌─────────┐
│ name │
│ variant │
├─────────┤
│ NULL │
│ NULL │
│ NULL │
│ Alice │
└─────────┘
Writing this data as-is to a Parquet file is now as simple as
COPY events TO 'events.parquet' (FORMAT parquet);. And indeed, having DuckDB describe the Parquet
file’s shape yields the expected result.
DESCRIBE SELECT * FROM 'events.parquet';
┌────────────────┐
│ events.parquet │
│ │
│ id integer │
│ data variant │
└────────────────┘
But hold on. According to the docs, writing to a Parquet file should automatically shred the variant type. Shouldn’t I see the individual columns then? Turns out shredding is only a storage detail, not a logical schema change. The type stays as is. To see how the data is actually stored, we have to inspect the file’s metadata.
SELECT name, type, converted_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Well, here is how to read this: id is of type INT32, while the data variant itself doesn’t really
have a type of its own. Instead it comprises three parts: metadata, value and typed_value. Of
interest are value and typed_value. For each row, the logical data column either has the (fallback)
byte-array value representation set, or the shredded typed_value column. In this case, DuckDB’s
autoshredder apparently decided that row 1, with the integer variant, deserves to have its
representation shredded, but not the other rows.
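One way to corroborate this without leaving DuckDB is the parquet_metadata function, which exposes per-column-chunk statistics such as null counts:
-- The typed_value column should only be non-NULL for the shredded row.
SELECT path_in_schema, num_values, stats_null_count
FROM parquet_metadata('events.parquet');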
If, instead, we instruct DuckDB to shred data to a struct like so
COPY events TO 'events.parquet' (
    FORMAT parquet,
    SHREDDING {'data': 'STRUCT(name VARCHAR, age INTEGER)'}
);
we get this
SELECT name, type, converted_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Now we see data with its metadata, value and typed_value properties and also a value and
typed_value logical column for both name and age. For these the shredded types are UTF8 and
INT32 respectively.
To get an intuitive understanding, it is interesting to look at the output of
parquet-tools meta events.parquet from this tool:
Output
{
"NumRowGroups": 1,
"RowGroups": [
{
"NumRows": 4,
"TotalByteSize": 725,
"Columns": [
{
"PathInSchema": ["id"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 41,
"UncompressedSize": 39,
"NumValues": 4,
"NullCount": 0,
"MaxValue": 4,
"MinValue": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "metadata"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 52,
"UncompressedSize": 60,
"NumValues": 4,
"NullCount": 0,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 81,
"UncompressedSize": 137,
"NumValues": 4,
"NullCount": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "typed_value"],
"Type": "BYTE_ARRAY",
"ConvertedType": "convertedtype=UTF8",
"Encodings": ["PLAIN"],
"CompressedSize": 44,
"UncompressedSize": 128,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "typed_value"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 39,
"UncompressedSize": 123,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
}
]
}
]
}
Let’s take the entry ["data", "typed_value", "name", "typed_value"] as an example. A row in our
dataset is going to populate this logical column when the name property in the variant is a UTF8
string, otherwise it will fall back to the BYTE_ARRAY data.value. And indeed we see that 3 out of
4 values in this column are NULL, since row 4 is the only one that can populate this typed_value.
Conversely, the ["data", "value"] entry only has a single NULL value: only row 4 does not populate
the fallback column, populating the typed values instead.
If we look at ["data", "typed_value", "name", "value"] we can see that not a single non-NULL value
is here. I suppose this would be populated if we had a row with an object that contains the name
property, but with a type other than string.
I’ll just leave this blog post from the official Parquet website here for further reading. It has some nice visualizations that further explain how this works.
Autoshredding
So far, so good. But we don’t want to have to specify how the values should be shredded. The whole point would be for DuckDB to figure it out from the data, but somehow in our first attempt it just chose to “materialize” the integer and not our object.
Well, it looks like DuckDB decides what to shred based on how often a specific type occurs in the table. If I add two more rows that have the same shape as the object in row 4
INSERT INTO events VALUES (5, {'name': 'Bob', 'age': 27}::VARIANT);
INSERT INTO events VALUES (6, {'name': 'Flo', 'age': 31}::VARIANT);
COPY events TO 'events-more-structs.parquet' (FORMAT parquet);
I get our familiar shredding, this time without having to specify the type.
SELECT name, type, converted_type FROM parquet_schema('events-more-structs.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
If I now add a bunch more rows whose payloads are plain strings (a rough sketch follows below), I
can get the autoshredder to shred only the UTF8 type.
Initially I would have expected that it should be possible to have both shredded age and name
values as well as a fallback not to an untyped BYTE_ARRAY but to a UTF8 string. But thinking about
this, I guess the typed_value for data is now an object of name and age (although it’s not
readily visible in the above representation), so it can’t be a string simultaneously.
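For reference, the plain-string experiment from the start of this paragraph looked roughly like this (illustrative rows, not the exact values):
INSERT INTO events VALUES
    (7, 'error'::VARIANT),
    (8, 'warning'::VARIANT),
    (9, 'info'::VARIANT);
COPY events TO 'events-strings.parquet' (FORMAT parquet);
-- With enough plain strings in the table, parquet_schema('events-strings.parquet')
-- should report data.typed_value as BYTE_ARRAY / UTF8 instead of the name/age object.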
It should be possible, however, if those string values lived in a new property inside an object. Let me give this a shot.
INSERT INTO events VALUES (10, {'message': 'Hello world'}::VARIANT);
INSERT INTO events VALUES (11, {'message': 'message'}::VARIANT);
INSERT INTO events VALUES (12, {'message': '42'}::VARIANT);
COPY events TO 'events-message.parquet' (FORMAT parquet);
SELECT name, type, converted_type FROM parquet_schema('events-message.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ message │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
Indeed we can see that it now materialized age, name and message. Neat!
Real-world data
With my mental model starting to get less foggy, I think it’s time to start digging into some
real-world data from my existing observability setup. Luckily ClickHouse makes it really easy to get
my data out: SELECT * FROM events.otel_traces INTO OUTFILE './otel_traces.parquet' FORMAT Parquet.
This dumps all of the data into a single Parquet file with just over 10 million rows, weighing in at
800 MB.
The shape of this file is determined by the standard schema the OpenTelemetry collector created when I started ingesting data. In ClickHouse it looks like this:
CREATE TABLE default.otel_traces
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`ParentSpanId` String CODEC(ZSTD(1)),
`TraceState` String CODEC(ZSTD(1)),
`SpanName` LowCardinality(String) CODEC(ZSTD(1)),
`SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` String CODEC(ZSTD(1)),
`SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`Duration` Int64 CODEC(ZSTD(1)),
`StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
`StatusMessage` String CODEC(ZSTD(1)),
`Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
`Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
`Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
`Links.TraceId` Array(String) CODEC(ZSTD(1)),
`Links.SpanId` Array(String) CODEC(ZSTD(1)),
`Links.TraceState` Array(String) CODEC(ZSTD(1)),
`Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(3)
SETTINGS ttl_only_drop_parts = 1
Once dumped to the Parquet file, we can see the shape as read by DuckDB here:
DESCRIBE SELECT * FROM 'otel_traces.parquet';
┌───────────────────────────────────────────────┐
│ otel_traces.parquet │
│ │
│ Timestamp timestamp with time zone │
│ TraceId varchar │
│ SpanId varchar │
│ ParentSpanId varchar │
│ TraceState varchar │
│ SpanName varchar │
│ SpanKind varchar │
│ ServiceName varchar │
│ ResourceAttributes map(varchar, varchar) │
│ ScopeName varchar │
│ ScopeVersion varchar │
│ SpanAttributes map(varchar, varchar) │
│ Duration ubigint │
│ StatusCode varchar │
│ StatusMessage varchar │
│ Events.Timestamp timestamp with time zone[] │
│ Events.Name varchar[] │
│ Events.Attributes map(varchar, varchar)[] │
│ Links.TraceId varchar[] │
│ Links.SpanId varchar[] │
│ Links.TraceState varchar[] │
│ Links.Attributes map(varchar, varchar)[] │
└───────────────────────────────────────────────┘
You can see a clear correspondence between the Parquet and ClickHouse types. Notably, this file is not in the shape we want yet. For this we’ll need to slice and dice it into different forms to benchmark how queries behave.
Benchmarks
To get an overview of the performance of the Variant type, I’m going to wrangle the payload into the following types to see how the different approaches stack up:
- JSON
- Map / Array
- Variant
- Fully materialized
For simplicity, I’m going to focus on two types of data I have stored in my system, GPS location data and access logs to this website. For both datasets I’ll run a query that can benefit from prefiltering data down to the relevant rows and another one that has to run expensive aggregations across the whole dataset.
For the GPS data, the first query should get distinct days I’ve visited a particular location. The second query will get average speed and altitude hourly across the whole timespan.
For the access logs, the first query will count error responses per URI, and the second one will aggregate daily traffic per browser.
Danger zone! The queries compared here were generated / translated by Claude and may not be the pinnacle of optimization.
GPS details
JSON
This nifty query gets the relevant data from my GPS entries and stores it into a new Parquet file with the data stored as JSON.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(SpanAttributes AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_json.parquet';
┌────────────────────────────────────┐
│ gps_location_json.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘
The nested structures inside the data field are unfortunately not parsed into JSON as well, but rather stay as string values of the top-level properties. But since the DuckDB JSON type is just backed by a VARCHAR anyway, I don’t expect this to have much of an impact.
Now here is the query to list distinct days where I’ve visited a particular location. Pasting these coordinates into a map is left as an exercise to the reader.
SELECT DISTINCT
DATE_TRUNC('day',
CAST(json_extract_string(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.properties.timestamp'
) AS TIMESTAMP)
) AS day
FROM 'gps_location_json.parquet'
WHERE CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[0]'
) AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[1]'
) AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
The second query aggregates the data into hourly buckets.
WITH g AS (
SELECT
CAST(json_extract_string(data, '$.location') AS JSON) AS loc
FROM 'gps_location_json.parquet'
)
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP)
) AS hour,
AVG(CAST(json_extract(loc, '$.properties.speed') AS DOUBLE)) AS avg_speed,
AVG(CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE)) AS avg_altitude,
MAX(CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM g
GROUP BY hour
ORDER BY hour;
Note how both of these queries use the nested timestamp from within the payload instead of the
top-level Timestamp. That’s because the top-level timestamps got messed up when I migrated data
from an even older system into the one I use now. If what I’m trying here comes to fruition, I’ll
fix that up.
The benchmark results will follow at the end.
Map
For the Map type I could just leave the data as-is, since the original Parquet file already has the
SpanAttributes property as map(varchar, varchar). But due to the nesting of the data I care
about, this would actually end up being the JSON type just with a detour. Instead, I’m flattening
the structure into a proper Map below. Of course, this kind of defeats the purpose of not having to
care about the exact schema of the data coming my way. But for benchmarking the data types it will
do.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
MAP {
'location.type': json_extract_string(loc, '$.type'),
'location.geometry.type': json_extract_string(loc, '$.geometry.type'),
'location.geometry.coordinates[0]': json_extract_string(loc, '$.geometry.coordinates[0]'),
'location.geometry.coordinates[1]': json_extract_string(loc, '$.geometry.coordinates[1]'),
'location.properties.altitude': json_extract_string(loc, '$.properties.altitude'),
'location.properties.battery_level': json_extract_string(loc, '$.properties.battery_level'),
'location.properties.battery_state': json_extract_string(loc, '$.properties.battery_state'),
'location.properties.course': json_extract_string(loc, '$.properties.course'),
'location.properties.course_accuracy': json_extract_string(loc, '$.properties.course_accuracy'),
'location.properties.horizontal_accuracy': json_extract_string(loc, '$.properties.horizontal_accuracy'),
'location.properties.motion[0]': json_extract_string(loc, '$.properties.motion[0]'),
'location.properties.speed': json_extract_string(loc, '$.properties.speed'),
'location.properties.speed_accuracy': json_extract_string(loc, '$.properties.speed_accuracy'),
'location.properties.timestamp': json_extract_string(loc, '$.properties.timestamp'),
'location.properties.vertical_accuracy': json_extract_string(loc, '$.properties.vertical_accuracy'),
'location.properties.wifi': json_extract_string(loc, '$.properties.wifi')
} AS data
FROM src
) TO 'gps_location_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_map.parquet';
┌────────────────────────────────────┐
│ gps_location_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar) │
└────────────────────────────────────┘
SELECT DISTINCT
DATE_TRUNC('day', CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS day
FROM 'gps_location_map.parquet'
WHERE CAST(data['location.geometry.coordinates[0]'] AS DOUBLE)
BETWEEN 8.486986 AND 8.493896
AND CAST(data['location.geometry.coordinates[1]'] AS DOUBLE)
BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS hour,
AVG(CAST(data['location.properties.speed'] AS DOUBLE)) AS avg_speed,
AVG(CAST(data['location.properties.altitude'] AS DOUBLE)) AS avg_altitude,
MAX(CAST(data['location.properties.horizontal_accuracy'] AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_map.parquet'
GROUP BY hour
ORDER BY hour;
Variant
Finally, let’s get to the real meat of the story. Here we store the GPS location data into a Parquet file where the relevant data is stored as a Variant.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(CAST(SpanAttributes['location'] AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_variant.parquet';
┌────────────────────────────────────┐
│ gps_location_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘
SELECT DISTINCT
DATE_TRUNC('day', CAST(data.properties.timestamp AS TIMESTAMP)) AS day
FROM 'gps_location_variant.parquet'
WHERE CAST(data.geometry.coordinates[1] AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(data.geometry.coordinates[2] AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data.properties.timestamp AS TIMESTAMP)) AS hour,
AVG(CAST(data.properties.speed AS DOUBLE)) AS avg_speed,
AVG(CAST(data.properties.altitude AS DOUBLE)) AS avg_altitude,
MAX(CAST(data.properties.horizontal_accuracy AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_variant.parquet'
GROUP BY hour
ORDER BY hour;
The double cast to JSON and then to VARIANT apparently does the heavy lifting of making sure nested values end up properly typed as well. I can confirm that everything is nicely shredded after saving this file.
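One way to check is the same parquet_schema inspection as before (the output is long, so I’ll spare you; the nested GPS properties show up with their own typed_value columns):
SELECT name, type, converted_type
FROM parquet_schema('gps_location_variant.parquet');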
Materialized
This one again is not going to be an actual option for my use case as I need to know the schema up front, but it serves as a comparison.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
json_extract_string(loc, '$.type') AS location_type,
json_extract_string(loc, '$.geometry.type') AS geometry_type,
CAST(json_extract(loc, '$.geometry.coordinates[0]') AS DOUBLE) AS lon,
CAST(json_extract(loc, '$.geometry.coordinates[1]') AS DOUBLE) AS lat,
CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE) AS altitude,
CAST(json_extract(loc, '$.properties.battery_level') AS DOUBLE) AS battery_level,
json_extract_string(loc, '$.properties.battery_state') AS battery_state,
CAST(json_extract(loc, '$.properties.course') AS DOUBLE) AS course,
CAST(json_extract(loc, '$.properties.course_accuracy') AS DOUBLE) AS course_accuracy,
CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE) AS horizontal_accuracy,
json_extract_string(loc, '$.properties.motion[0]') AS motion,
CAST(json_extract(loc, '$.properties.speed') AS DOUBLE) AS speed,
CAST(json_extract(loc, '$.properties.speed_accuracy') AS DOUBLE) AS speed_accuracy,
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP) AS gps_timestamp,
CAST(json_extract(loc, '$.properties.vertical_accuracy') AS DOUBLE) AS vertical_accuracy,
json_extract_string(loc, '$.properties.wifi') AS wifi
FROM src
) TO 'gps_location_materialized.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_materialized.parquet';
┌──────────────────────────────────────────────┐
│ gps_location_materialized.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ location_type varchar │
│ geometry_type varchar │
│ lon double │
│ lat double │
│ altitude double │
│ battery_level double │
│ battery_state varchar │
│ course double │
│ course_accuracy double │
│ horizontal_accuracy double │
│ motion varchar │
│ speed double │
│ speed_accuracy double │
│ gps_timestamp timestamp │
│ vertical_accuracy double │
│ wifi varchar │
└──────────────────────────────────────────────┘
The queries here are very straightforward.
SELECT DISTINCT DATE_TRUNC('day', gps_timestamp) AS day
FROM 'gps_location_materialized.parquet'
WHERE lon BETWEEN 8.486986 AND 8.493896
AND lat BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour', gps_timestamp) AS hour,
AVG(speed) AS avg_speed,
AVG(altitude) AS avg_altitude,
MAX(horizontal_accuracy) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_materialized.parquet'
GROUP BY hour
ORDER BY hour;
Access logs details
Now let’s look at another dataset I have in my observability system: Access logs to this website. I build this page into a static bundle which I then serve using a Rust service. The only reason I have this service and reverse proxy via Caddy instead of just letting Caddy serve the static page directly is exactly these access logs. Caddy doesn’t give you insights per domain, but only aggregated metrics. So this code sets up an OpenTelemetry exporter and then logs events on request and on response.
Given the way OpenTelemetry ingests this data, the Events.Attributes property takes the following
form. The two objects correspond to the two info! invocations in the code.
[
  {
    "code.filepath": "src/main.rs",
    "code.lineno": "67",
    "code.namespace": "marending_dev",
    "level": "INFO",
    "referrer": "",
    "request": "HEAD",
    "target": "marending_dev",
    "uri": "/",
    "user_agent": "updown.io daemon 2.11"
  },
  {
    "code.filepath": "src/main.rs",
    "code.lineno": "85",
    "code.namespace": "marending_dev",
    "latency": "424164",
    "level": "INFO",
    "status": "200",
    "target": "marending_dev"
  }
]
JSON
Now, going through the same spiel again, we extract the data into a Parquet file with the relevant fields cast to JSON.
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST("Events.Attributes" AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_json.parquet';
┌────────────────────────────────────┐
│ marending_json.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘
Here’s the first query that counts error responses by URI. The top result is /robots.txt if you
care to know. .env is not much further down the list though.
SELECT
json_extract_string(data, '$[0].uri') AS uri,
COUNT(*) AS error_count
FROM 'marending_json.parquet'
WHERE json_extract_string(data, '$[1].status') LIKE '4%'
OR json_extract_string(data, '$[1].status') LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
And the second query to get the daily traffic by browser:
WITH r AS (
SELECT
Timestamp,
json_extract_string(data, '$[0].user_agent') AS user_agent,
CAST(json_extract(data, '$[1].latency') AS DOUBLE) AS latency_us
FROM 'marending_json.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Chrome%' AND user_agent NOT LIKE '%Edg%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' AND user_agent NOT LIKE '%Chrome%' THEN 'Safari'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%bot%' OR user_agent LIKE '%Bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Map
COPY (
SELECT
Timestamp,
'marending' AS service,
"Events.Attributes" AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_map.parquet';
┌────────────────────────────────────┐
│ marending_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar)[] │
└────────────────────────────────────┘
SELECT
data[1]['uri'] AS uri,
COUNT(*) AS error_count
FROM 'marending_map.parquet'
WHERE data[2]['status'] LIKE '4%'
OR data[2]['status'] LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
WITH r AS (
SELECT
Timestamp,
data[1]['user_agent'] AS user_agent,
CAST(data[2]['latency'] AS DOUBLE) AS latency_us
FROM 'marending_map.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Variant
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST(CAST("Events.Attributes" AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_variant.parquet';
┌────────────────────────────────────┐
│ marending_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘
SELECT
CAST(data[1].uri AS VARCHAR) AS uri,
COUNT(*) AS error_count
FROM 'marending_variant.parquet'
WHERE CAST(data[2].status AS VARCHAR) LIKE '4%'
OR CAST(data[2].status AS VARCHAR) LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
WITH r AS (
SELECT
Timestamp,
CAST(data[1].user_agent AS VARCHAR) AS user_agent,
CAST(data[2].latency AS DOUBLE) AS latency_us
FROM 'marending_variant.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Materialized
COPY (
SELECT
Timestamp,
'marending' AS service,
-- Request event (index 1)
"Events.Attributes"[1]['code.filepath'] AS request_code_filepath,
"Events.Attributes"[1]['code.lineno'] AS request_code_lineno,
"Events.Attributes"[1]['code.namespace'] AS request_code_namespace,
"Events.Attributes"[1]['level'] AS request_level,
"Events.Attributes"[1]['referrer'] AS referrer,
"Events.Attributes"[1]['request'] AS request_method,
"Events.Attributes"[1]['target'] AS request_target,
"Events.Attributes"[1]['uri'] AS uri,
"Events.Attributes"[1]['user_agent'] AS user_agent,
-- Response event (index 2)
"Events.Attributes"[2]['code.filepath'] AS response_code_filepath,
"Events.Attributes"[2]['code.lineno'] AS response_code_lineno,
"Events.Attributes"[2]['code.namespace'] AS response_code_namespace,
CAST("Events.Attributes"[2]['latency'] AS BIGINT) AS latency_us,
"Events.Attributes"[2]['level'] AS response_level,
CAST("Events.Attributes"[2]['status'] AS SMALLINT) AS status,
"Events.Attributes"[2]['target'] AS response_target
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_materialized.parquet' (FORMAT PARQUET);
SELECT uri, COUNT(*) AS error_count
FROM 'marending_materialized.parquet'
WHERE status >= 400
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM 'marending_materialized.parquet'
GROUP BY day, browser
ORDER BY day, requests DESC;
Results
JSON, Map and materialized columns behave pretty much exactly as I would have expected. Only the
one I actually cared about, Variant, is disappointing. I might be hitting this issue, although at
the time of writing I still get similar performance on a nightly build of DuckDB, even though it’s
supposedly fixed on main. I think this is simply a limitation that will be addressed at some point.
Looking at the query plan with EXPLAIN ANALYZE, it’s clear that a lot of variant extraction is
happening, even though, as I understand shredding, it shouldn’t have to.
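For reference, that is just a trimmed-down version of the hourly aggregation from above with EXPLAIN ANALYZE in front:
EXPLAIN ANALYZE
SELECT
    TIME_BUCKET(INTERVAL '1 hour',
        CAST(data.properties.timestamp AS TIMESTAMP)) AS hour,
    AVG(CAST(data.properties.speed AS DOUBLE)) AS avg_speed
FROM 'gps_location_variant.parquet'
GROUP BY hour;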
In general, I’m impressed with the performance of DuckDB here. There are just over 30’000 rows in this dataset and it churns through that in no time even with JSON types.
Also interesting are the file sizes of the different Parquet files; one can clearly see that the faster formats also compress the data better.
- JSON: 2.2MB
- Map: 1.3MB
- Variant: 1.1MB
- Materialized: 1.0MB
For the access logs, it looks similar, although here this DuckDB limitation around Variant hits even harder. We must be hitting a highly unoptimized code path to be that much slower than even the JSON handling.
ClickHouse comparison
I don’t want to get too deep into this, as the goal here is primarily to establish the feasibility
of querying my data from a Parquet file. But just to get a feel for the landscape of query engines,
I’m going to translate the four queries I used here to ClickHouse equivalents and run them against
the live system. Note that this is very much an apples-to-oranges comparison: The numbers thus far
have been achieved on an M1 MacBook Air, the ClickHouse ones on a Hetzner CAX21. And the Parquet
files have first been isolated to their respective datasets, while ClickHouse has an assortment of
data in its otel_traces table. Not to mention that the Hetzner server is a noisy shared VPS.
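As one example of those translations, the access-log error count could look roughly like this in ClickHouse (a direct, untuned translation; the dotted column names need backticks):
SELECT
    `Events.Attributes`[1]['uri'] AS uri,
    count(*) AS error_count
FROM otel_traces
WHERE ServiceName = 'marending'
  AND (`Events.Attributes`[2]['status'] LIKE '4%'
    OR `Events.Attributes`[2]['status'] LIKE '5%')
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;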
Of the flavours we’ve looked at so far, Map is the closest equivalent to the current ClickHouse
schema. Honestly, I’m quite surprised by the result. It goes to show that while you might want to
get to those single-digit millisecond latencies, in real-world usage these fractions of a second
have been just fine. I would suspect we could go considerably faster by using ClickHouse’s own
JSON type, which, in my understanding, behaves similarly to Variant.
Verdict
I think I’m reaching a different conclusion than what I set out to see, but I’m still not complaining. I was expecting to see subpar performance on JSON strings, with Variant saving the day and making my undertaking feasible. But I got the opposite: JSON handling is surprisingly fast in DuckDB and the Variant support is unfortunately too young for real-world use.
But the bigger take-away is that the general pattern of writing and querying Parquet files could
actually be quite elegant. While I would be very nervous trying to alter the otel_traces table in
a running system right now, rewriting Parquet files to use Variant down the line should be a walk in
the park (save rewriting queries, of course). In a follow-up note I’ll have to home in on Parquet
file handling, e.g. is there an impact when I have a single file that mixes data sources? And how to
simultaneously write files periodically and query across different files?
While I’m terrible at leaving my side-projects be, I do enjoy that the iteration tends to result in leaner and simpler systems.