11 May 2026
Exploring the Variant type in Parquet
Let's get shredding
I’ve had observability on my mind for quite some time now. Two years ago I was trying to expand my metrics setup with structured logs emitted by some services. The ensuing exploration only yielded the obvious: Storing and efficiently querying arbitrary data is hard. DuckDB was all the rage back then, so naturally I had to see how it handles analytics workloads. Finally, a year later, I checked out ClickHouse and found it amazingly fast for my purposes.
Ever since, I’ve run a ClickHouse instance to store arbitrary JSON payloads fed by GPS location data, sensor readings and tracing data. But a couple of days ago that nagging feeling came back: What if this can be simplified? ClickHouse is by far the heaviest service I operate on my server. It randomly fills up my disk with logging tables, too. Look, ClickHouse is an engineering marvel, but it’s not my marvel. It’s someone else’s marvel. And I have a need to create my own. Thing, not marvel, I mean. So here we are.
The hard part is still how to store arbitrary JSON payloads in such a way that aggregations on
fields are performant. My current solution heavily relies on the excellent Map type in ClickHouse.
To replace it, some storage engine is going to have to do some heavy lifting. My hope is that DuckDB
can take that role if we massage the data just a bit. Since I last used DuckDB, quite a bit has
changed. Particularly interesting for me: Parquet got a Variant type and DuckDB supports it.
But before we get into the weeds, let me set the scene. Let’s say I have a service that sends structured logs (or wide events) to my observability system. Today they might look like this:
{
  "timestamp": "2026-05-05T08:00:00Z",
  "service": "marending-dev",
  "severity": "error",
  "message": "User Bobby Tables doesn't exist"
}
But then I want to track more information.
{
  "timestamp": "2026-05-05T08:00:00Z",
  "service": "marending-dev",
  "severity": "error",
  "message": "User Bobby Tables doesn't exist",
  "user_id": "null",
  "ip": "85.0.1.42",
  "path": "/user/login"
}
You get the picture. The question is, how can we store these JSON blobs such that we can then
efficiently compute, say, error rate grouped by user_id? We can’t have a static schema, since we
might want to add properties to this event at any time. We can chuck this as a JSON string into
any database, file format or what have you. But then any query that attempts to read from properties
of that JSON string is going to incur the significant cost of extracting the appropriate fields for
every row of interest. One way to get reasonable performance here is to ensure severity and user_id
are each contiguously laid out so that they can be efficiently aggregated.
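To make that concrete, here is the kind of aggregation I want to be cheap, sketched in DuckDB SQL against a hypothetical events table whose data column holds the raw JSON payload from above (the table and its schema are just for illustration):
-- Hypothetical: error rate per user_id over raw JSON payloads.
SELECT
    json_extract_string(data, '$.user_id') AS user_id,
    count(*) FILTER (WHERE json_extract_string(data, '$.severity') = 'error')
        / count(*) AS error_rate
FROM events
GROUP BY user_id;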
Parquet and Variant
That brings us to Apache Parquet. Parquet is a column-oriented file format. While DuckDB has its own file format for storing data, one of its strengths is its support for a wide variety of other formats, including Parquet. So you can write and query Parquet files natively with SQL within DuckDB.
Recently Parquet added the Variant type, a binary encoding of primitives, arrays or objects of arbitrary types. It allows representing structured data in such a way that it can be queried more efficiently than if the same data was serialized and stored as a string type. But the real kicker is “Variant shredding”. It’s the process of extracting some of the properties within the Variant into their own columns. This should then give us maximum performance when aggregating on those columns. And DuckDB apparently supports this!
Once I learned about this, a path towards my dream observability system emerged. Step 1: Write a service that accepts arbitrary JSON payloads, buffers them in memory and writes them to Parquet files with the payload being of type Variant every 10 seconds or so. Step 2: Every hour, read all the small Parquet files and shred them into a larger hourly file. Step 3: Possibly aggregate the hourly files into even broader files, but I’ll see when I get to it.
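A minimal sketch of that hourly compaction (step 2) in DuckDB, with made-up paths for the buffered and hourly files:
-- Hypothetical hourly compaction: fold the small buffered files into one file.
-- DuckDB's autoshredder then decides which Variant fields become real columns.
COPY (
    SELECT * FROM 'buffer/2026-05-11T08-*.parquet'
) TO 'hourly/2026-05-11T08.parquet' (FORMAT parquet);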
Now at query time, I will have at most an hour’s worth of data in inefficient, unshredded form. That should be easy enough to deal with. Data older than that will be in highly efficient form. And the best part? DuckDB should be able to query across many Parquet files using glob syntax, even when some have concrete, shredded columns and some don’t. But we’ll see this for ourselves.
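At read time that could look roughly like this, again with made-up paths, scanning shredded hourly files and fresh unshredded ones together:
-- Hypothetical read path across shredded and unshredded files.
SELECT
    count(*) FILTER (WHERE CAST(data.severity AS VARCHAR) = 'error') AS errors,
    count(*) AS total
FROM read_parquet(['hourly/*.parquet', 'buffer/*.parquet']);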
Exploration
Let’s start with the example given in DuckDB’s documentation.
CREATE TABLE events (id INTEGER, data VARIANT);
INSERT INTO events VALUES
    (1, 42::VARIANT),
    (2, 'hello world'::VARIANT),
    (3, [1, 2, 3]::VARIANT),
    (4, {'name': 'Alice', 'age': 30}::VARIANT);
Querying the table presents the expected result.
SELECT * FROM events;
┌───────┬────────────────────────────┐
│ id │ data │
│ int32 │ variant │
├───────┼────────────────────────────┤
│ 1 │ 42 │
│ 2 │ hello world │
│ 3 │ [1, 2, 3] │
│ 4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘
SELECT id, data, variant_typeof(data) AS vtype FROM events;
┌───────┬────────────────────────────┬───────────────────┐
│ id │ data │ vtype │
│ int32 │ variant │ varchar │
├───────┼────────────────────────────┼───────────────────┤
│ 1 │ 42 │ INT32 │
│ 2 │ hello world │ VARCHAR │
│ 3 │ [1, 2, 3] │ ARRAY(3) │
│ 4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘
We can now also query nested fields in a Variant.
SELECT data.name FROM events;
┌─────────┐
│ name │
│ variant │
├─────────┤
│ NULL │
│ NULL │
│ NULL │
│ Alice │
└─────────┘
Writing this data as-is to a Parquet file is now as simple as
COPY events TO 'events.parquet' (FORMAT parquet);. And indeed, having DuckDB describe the Parquet
file’s shape yields the expected result.
DESCRIBE SELECT * FROM 'events.parquet';
┌────────────────┐
│ events.parquet │
│ │
│ id integer │
│ data variant │
└────────────────┘
But hold on. According to the docs, writing to a Parquet file should automatically shred the variant type. Shouldn’t I see the individual columns then? Turns out shredding is only a storage detail, not a logical schema change. The type stays as is. To see how the data is actually stored, we have to inspect the file’s metadata.
SELECT name, type, converted_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Well, here is how to read this: id is of type INT32, while the data variant itself doesn’t really
have a type of its own. Instead it comprises three parts: metadata, value and typed_value. Of
interest are value and typed_value. For each row, the logical data column either has the (fallback)
byte-array value representation set, or the shredded typed_value column. In this case, DuckDB’s
autoshredder apparently decided that row 1, with the integer variant, deserves to have its
representation shredded, but not the other rows.
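One way to corroborate this without leaving DuckDB is the parquet_metadata function, which exposes per-column-chunk statistics such as null counts:
-- The typed_value column should only be non-NULL for the shredded row.
SELECT path_in_schema, num_values, stats_null_count
FROM parquet_metadata('events.parquet');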
If, instead, we instruct DuckDB to shred data to a struct like so
COPY events TO 'events.parquet' (
    FORMAT parquet,
    SHREDDING {'data': 'STRUCT(name VARCHAR, age INTEGER)'}
);
we get this
SELECT name, type, converted_type FROM parquet_schema('events.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
└───────────────┴────────────┴────────────────┘
Now we see data with its metadata, value and typed_value properties and also a value and
typed_value logical column for both name and age. For these the shredded types are UTF8 and
INT32 respectively.
To get an intuitive understanding, it is interesting to look at the output of
parquet-tools meta events.parquet from this tool:
Output
{
"NumRowGroups": 1,
"RowGroups": [
{
"NumRows": 4,
"TotalByteSize": 725,
"Columns": [
{
"PathInSchema": ["id"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 41,
"UncompressedSize": 39,
"NumValues": 4,
"NullCount": 0,
"MaxValue": 4,
"MinValue": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "metadata"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 52,
"UncompressedSize": 60,
"NumValues": 4,
"NullCount": 0,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 81,
"UncompressedSize": 137,
"NumValues": 4,
"NullCount": 1,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "name", "typed_value"],
"Type": "BYTE_ARRAY",
"ConvertedType": "convertedtype=UTF8",
"Encodings": ["PLAIN"],
"CompressedSize": 44,
"UncompressedSize": 128,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "value"],
"Type": "BYTE_ARRAY",
"Encodings": ["PLAIN"],
"CompressedSize": 34,
"UncompressedSize": 119,
"NumValues": 4,
"NullCount": 4,
"CompressionCodec": "SNAPPY"
},
{
"PathInSchema": ["data", "typed_value", "age", "typed_value"],
"Type": "INT32",
"ConvertedType": "convertedtype=INT_32",
"Encodings": ["PLAIN"],
"CompressedSize": 39,
"UncompressedSize": 123,
"NumValues": 4,
"NullCount": 3,
"CompressionCodec": "SNAPPY"
}
]
}
]
}
Let’s take the entry ["data", "typed_value", "name", "typed_value"] as an example. A row in our
dataset is going to populate this logical column when the name property in the variant is a UTF8
string, otherwise it will fall back to the BYTE_ARRAY data.value. And indeed we see that 3 out of
4 values in this column are NULL, since row 4 is the only one that can populate this typed_value.
Conversely, the ["data", "value"] entry only has a single NULL value: only row 4 does not populate
the fallback column, populating the typed values instead.
If we look at ["data", "typed_value", "name", "value"] we can see that not a single non-NULL value
is here. I suppose this would be populated if we had a row with an object that contains the name
property, but with a type other than string.
I’ll just leave this blog post from the official Parquet website here for further reading. It has some nice visualizations that further explain how this works.
Autoshredding
So far, so good. But we don’t want to have to specify how the values should be shredded. The whole point would be for DuckDB to figure it out from the data, but somehow in our first attempt it just chose to “materialize” the integer and not our object.
Well, it looks like DuckDB decides what to shred based on how often a specific type occurs in the table. If I add two more rows that have the same shape as the object in row 4
INSERT INTO events VALUES (5, {'name': 'Bob', 'age': 27}::VARIANT);
INSERT INTO events VALUES (6, {'name': 'Flo', 'age': 31}::VARIANT);
COPY events TO 'events-more-structs.parquet' (FORMAT parquet);
I get our familiar shredding, this time without having to specify the type.
SELECT name, type, converted_type FROM parquet_schema('events-more-structs.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
If I now add a bunch more rows whose payloads are plain strings (a rough sketch follows below), I
can get the autoshredder to shred only the UTF8 type.
Initially I would have expected that it should be possible to have both shredded age and name
values as well as a fallback not to an untyped BYTE_ARRAY but to a UTF8 string. But thinking about
this, I guess the typed_value for data is now an object of name and age (although it’s not
readily visible in the above representation), so it can’t be a string simultaneously.
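For reference, the plain-string experiment from the start of this paragraph looked roughly like this (illustrative rows, not the exact values):
INSERT INTO events VALUES
    (7, 'error'::VARIANT),
    (8, 'warning'::VARIANT),
    (9, 'info'::VARIANT);
COPY events TO 'events-strings.parquet' (FORMAT parquet);
-- With enough plain strings in the table, parquet_schema('events-strings.parquet')
-- should report data.typed_value as BYTE_ARRAY / UTF8 instead of the name/age object.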
It should be possible, however, if those string values lived in a new property inside an object. Let me give this a shot.
INSERT INTO events VALUES (10, {'message': 'Hello world'}::VARIANT);
INSERT INTO events VALUES (11, {'message': 'message'}::VARIANT);
INSERT INTO events VALUES (12, {'message': '42'}::VARIANT);
COPY events TO 'events-message.parquet' (FORMAT parquet);
SELECT name, type, converted_type FROM parquet_schema('events-message.parquet');
┌───────────────┬────────────┬────────────────┐
│ name │ type │ converted_type │
│ varchar │ varchar │ varchar │
├───────────────┼────────────┼────────────────┤
│ duckdb_schema │ NULL │ NULL │
│ id │ INT32 │ INT_32 │
│ data │ NULL │ NULL │
│ metadata │ BYTE_ARRAY │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ NULL │ NULL │
│ message │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
│ age │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ INT32 │ INT_32 │
│ name │ NULL │ NULL │
│ value │ BYTE_ARRAY │ NULL │
│ typed_value │ BYTE_ARRAY │ UTF8 │
└───────────────┴────────────┴────────────────┘
Indeed we can see that it now materialized age, name and message. Neat!
Real-world data
With my mental model starting to get less foggy, I think it’s time to start digging into some
real-world data from my existing observability setup. Luckily ClickHouse makes it really easy to get
my data out: SELECT * FROM events.otel_traces INTO OUTFILE './otel_traces.parquet' FORMAT Parquet.
This dumps all of the data into a single Parquet file with just over 10 million rows, weighing in at
800 MB.
The shape of this file is determined by the standard schema the OpenTelemetry collector created when I started ingesting data. In ClickHouse it looks like this:
CREATE TABLE default.otel_traces
(
`Timestamp` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
`TraceId` String CODEC(ZSTD(1)),
`SpanId` String CODEC(ZSTD(1)),
`ParentSpanId` String CODEC(ZSTD(1)),
`TraceState` String CODEC(ZSTD(1)),
`SpanName` LowCardinality(String) CODEC(ZSTD(1)),
`SpanKind` LowCardinality(String) CODEC(ZSTD(1)),
`ServiceName` LowCardinality(String) CODEC(ZSTD(1)),
`ResourceAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`ScopeName` String CODEC(ZSTD(1)),
`ScopeVersion` String CODEC(ZSTD(1)),
`SpanAttributes` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`Duration` Int64 CODEC(ZSTD(1)),
`StatusCode` LowCardinality(String) CODEC(ZSTD(1)),
`StatusMessage` String CODEC(ZSTD(1)),
`Events.Timestamp` Array(DateTime64(9)) CODEC(ZSTD(1)),
`Events.Name` Array(LowCardinality(String)) CODEC(ZSTD(1)),
`Events.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
`Links.TraceId` Array(String) CODEC(ZSTD(1)),
`Links.SpanId` Array(String) CODEC(ZSTD(1)),
`Links.TraceState` Array(String) CODEC(ZSTD(1)),
`Links.Attributes` Array(Map(LowCardinality(String), String)) CODEC(ZSTD(1)),
INDEX idx_trace_id TraceId TYPE bloom_filter(0.001) GRANULARITY 1,
INDEX idx_res_attr_key mapKeys(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_res_attr_value mapValues(ResourceAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_key mapKeys(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_attr_value mapValues(SpanAttributes) TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_duration Duration TYPE minmax GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toDate(Timestamp)
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp), TraceId)
TTL toDateTime(Timestamp) + toIntervalDay(3)
SETTINGS ttl_only_drop_parts = 1
Once dumped to the Parquet file, we can see the shape as read by DuckDB here:
DESCRIBE SELECT * FROM 'otel_traces.parquet';
┌───────────────────────────────────────────────┐
│ otel_traces.parquet │
│ │
│ Timestamp timestamp with time zone │
│ TraceId varchar │
│ SpanId varchar │
│ ParentSpanId varchar │
│ TraceState varchar │
│ SpanName varchar │
│ SpanKind varchar │
│ ServiceName varchar │
│ ResourceAttributes map(varchar, varchar) │
│ ScopeName varchar │
│ ScopeVersion varchar │
│ SpanAttributes map(varchar, varchar) │
│ Duration ubigint │
│ StatusCode varchar │
│ StatusMessage varchar │
│ Events.Timestamp timestamp with time zone[] │
│ Events.Name varchar[] │
│ Events.Attributes map(varchar, varchar)[] │
│ Links.TraceId varchar[] │
│ Links.SpanId varchar[] │
│ Links.TraceState varchar[] │
│ Links.Attributes map(varchar, varchar)[] │
└───────────────────────────────────────────────┘
You can see a clear correspondence between the Parquet and ClickHouse types. Notably, this file is not in the shape we want yet. For this we’ll need to slice and dice it into different forms to benchmark how queries behave.
Benchmarks
To get an overview of the performance of the Variant type, I’m going to wrangle the payload into the following types to see how the different approaches stack up:
- JSON
- Map / Array
- Variant
- Fully materialized
For simplicity, I’m going to focus on two types of data I have stored in my system, GPS location data and access logs to this website. For both datasets I’ll run a query that can benefit from prefiltering data down to the relevant rows and another one that has to run expensive aggregations across the whole dataset.
For the GPS data, the first query should get distinct days I’ve visited a particular location. The second query will get average speed and altitude hourly across the whole timespan.
For the access logs, the first query will count error responses per URI, and the second one will aggregate daily traffic per browser.
Danger zone! The queries compared here were generated / translated by Claude and may not be the pinnacle of optimization.
GPS details
JSON
This nifty query gets the relevant data from my GPS entries and stores it into a new Parquet file with the data stored as JSON.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(SpanAttributes AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_json.parquet';
┌────────────────────────────────────┐
│ gps_location_json.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘
The nested structures inside the data field are unfortunately not parsed into JSON as well, but rather stay as string values of the top-level properties. But since the DuckDB JSON type is just backed by a VARCHAR anyway, I don’t expect this to have much of an impact.
Now here is the query to list distinct days where I’ve visited a particular location. Pasting these coordinates into a map is left as an exercise to the reader.
SELECT DISTINCT
DATE_TRUNC('day',
CAST(json_extract_string(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.properties.timestamp'
) AS TIMESTAMP)
) AS day
FROM 'gps_location_json.parquet'
WHERE CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[0]'
) AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(json_extract(
CAST(json_extract_string(data, '$.location') AS JSON),
'$.geometry.coordinates[1]'
) AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
The second query aggregates the data into hourly buckets.
WITH g AS (
SELECT
CAST(json_extract_string(data, '$.location') AS JSON) AS loc
FROM 'gps_location_json.parquet'
)
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP)
) AS hour,
AVG(CAST(json_extract(loc, '$.properties.speed') AS DOUBLE)) AS avg_speed,
AVG(CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE)) AS avg_altitude,
MAX(CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM g
GROUP BY hour
ORDER BY hour;
Note how both of these queries use the nested timestamp from within the payload instead of the
top-level Timestamp. That’s because the top-level timestamps got messed up when I migrated data
from an even older system into the one I use now. If what I’m trying here comes to fruition, I’ll
fix that up.
The benchmark results will follow at the end.
Map
For the Map type I could just leave the data as-is, since the original Parquet file already has the
SpanAttributes property as map(varchar, varchar). But due to the nesting of the data I care
about, this would actually end up being the JSON type just with a detour. Instead, I’m flattening
the structure into a proper Map below. Of course, this kind of defeats the purpose of not having to
care about the exact schema of the data coming my way. But for benchmarking the data types it will
do.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
MAP {
'location.type': json_extract_string(loc, '$.type'),
'location.geometry.type': json_extract_string(loc, '$.geometry.type'),
'location.geometry.coordinates[0]': json_extract_string(loc, '$.geometry.coordinates[0]'),
'location.geometry.coordinates[1]': json_extract_string(loc, '$.geometry.coordinates[1]'),
'location.properties.altitude': json_extract_string(loc, '$.properties.altitude'),
'location.properties.battery_level': json_extract_string(loc, '$.properties.battery_level'),
'location.properties.battery_state': json_extract_string(loc, '$.properties.battery_state'),
'location.properties.course': json_extract_string(loc, '$.properties.course'),
'location.properties.course_accuracy': json_extract_string(loc, '$.properties.course_accuracy'),
'location.properties.horizontal_accuracy': json_extract_string(loc, '$.properties.horizontal_accuracy'),
'location.properties.motion[0]': json_extract_string(loc, '$.properties.motion[0]'),
'location.properties.speed': json_extract_string(loc, '$.properties.speed'),
'location.properties.speed_accuracy': json_extract_string(loc, '$.properties.speed_accuracy'),
'location.properties.timestamp': json_extract_string(loc, '$.properties.timestamp'),
'location.properties.vertical_accuracy': json_extract_string(loc, '$.properties.vertical_accuracy'),
'location.properties.wifi': json_extract_string(loc, '$.properties.wifi')
} AS data
FROM src
) TO 'gps_location_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_map.parquet';
┌────────────────────────────────────┐
│ gps_location_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar) │
└────────────────────────────────────┘
SELECT DISTINCT
DATE_TRUNC('day', CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS day
FROM 'gps_location_map.parquet'
WHERE CAST(data['location.geometry.coordinates[0]'] AS DOUBLE)
BETWEEN 8.486986 AND 8.493896
AND CAST(data['location.geometry.coordinates[1]'] AS DOUBLE)
BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data['location.properties.timestamp'] AS TIMESTAMP)) AS hour,
AVG(CAST(data['location.properties.speed'] AS DOUBLE)) AS avg_speed,
AVG(CAST(data['location.properties.altitude'] AS DOUBLE)) AS avg_altitude,
MAX(CAST(data['location.properties.horizontal_accuracy'] AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_map.parquet'
GROUP BY hour
ORDER BY hour;
Variant
Finally, let’s get to the real meat of the story. Here we store the GPS location data into a Parquet file where the relevant data is stored as a Variant.
COPY (
SELECT
Timestamp,
'gps-location' AS service,
CAST(CAST(SpanAttributes['location'] AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
) TO 'gps_location_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_variant.parquet';
┌────────────────────────────────────┐
│ gps_location_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘
SELECT DISTINCT
DATE_TRUNC('day', CAST(data.properties.timestamp AS TIMESTAMP)) AS day
FROM 'gps_location_variant.parquet'
WHERE CAST(data.geometry.coordinates[1] AS DOUBLE) BETWEEN 8.486986 AND 8.493896
AND CAST(data.geometry.coordinates[2] AS DOUBLE) BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour',
CAST(data.properties.timestamp AS TIMESTAMP)) AS hour,
AVG(CAST(data.properties.speed AS DOUBLE)) AS avg_speed,
AVG(CAST(data.properties.altitude AS DOUBLE)) AS avg_altitude,
MAX(CAST(data.properties.horizontal_accuracy AS DOUBLE)) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_variant.parquet'
GROUP BY hour
ORDER BY hour;
The double cast to JSON and then to VARIANT apparently does the heavy lifting of making sure nested values end up properly typed as well. I can confirm that everything is nicely shredded after saving this file.
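One way to check is the same parquet_schema inspection as before (the output is long, so I’ll spare you; the nested GPS properties show up with their own typed_value columns):
SELECT name, type, converted_type
FROM parquet_schema('gps_location_variant.parquet');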
Materialized
This one again is not going to be an actual option for my use case as I need to know the schema up front, but it serves as a comparison.
COPY (
WITH src AS (
SELECT
Timestamp,
CAST(SpanAttributes['location'] AS JSON) AS loc
FROM 'otel_traces.parquet'
WHERE SpanName = 'gps-location'
)
SELECT
Timestamp,
'gps-location' AS service,
json_extract_string(loc, '$.type') AS location_type,
json_extract_string(loc, '$.geometry.type') AS geometry_type,
CAST(json_extract(loc, '$.geometry.coordinates[0]') AS DOUBLE) AS lon,
CAST(json_extract(loc, '$.geometry.coordinates[1]') AS DOUBLE) AS lat,
CAST(json_extract(loc, '$.properties.altitude') AS DOUBLE) AS altitude,
CAST(json_extract(loc, '$.properties.battery_level') AS DOUBLE) AS battery_level,
json_extract_string(loc, '$.properties.battery_state') AS battery_state,
CAST(json_extract(loc, '$.properties.course') AS DOUBLE) AS course,
CAST(json_extract(loc, '$.properties.course_accuracy') AS DOUBLE) AS course_accuracy,
CAST(json_extract(loc, '$.properties.horizontal_accuracy') AS DOUBLE) AS horizontal_accuracy,
json_extract_string(loc, '$.properties.motion[0]') AS motion,
CAST(json_extract(loc, '$.properties.speed') AS DOUBLE) AS speed,
CAST(json_extract(loc, '$.properties.speed_accuracy') AS DOUBLE) AS speed_accuracy,
CAST(json_extract_string(loc, '$.properties.timestamp') AS TIMESTAMP) AS gps_timestamp,
CAST(json_extract(loc, '$.properties.vertical_accuracy') AS DOUBLE) AS vertical_accuracy,
json_extract_string(loc, '$.properties.wifi') AS wifi
FROM src
) TO 'gps_location_materialized.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'gps_location_materialized.parquet';
┌──────────────────────────────────────────────┐
│ gps_location_materialized.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ location_type varchar │
│ geometry_type varchar │
│ lon double │
│ lat double │
│ altitude double │
│ battery_level double │
│ battery_state varchar │
│ course double │
│ course_accuracy double │
│ horizontal_accuracy double │
│ motion varchar │
│ speed double │
│ speed_accuracy double │
│ gps_timestamp timestamp │
│ vertical_accuracy double │
│ wifi varchar │
└──────────────────────────────────────────────┘
The queries here are very straightforward.
SELECT DISTINCT DATE_TRUNC('day', gps_timestamp) AS day
FROM 'gps_location_materialized.parquet'
WHERE lon BETWEEN 8.486986 AND 8.493896
AND lat BETWEEN 47.348284 AND 47.353111
ORDER BY day DESC;
SELECT
TIME_BUCKET(INTERVAL '1 hour', gps_timestamp) AS hour,
AVG(speed) AS avg_speed,
AVG(altitude) AS avg_altitude,
MAX(horizontal_accuracy) AS worst_accuracy,
COUNT(*) AS points
FROM 'gps_location_materialized.parquet'
GROUP BY hour
ORDER BY hour;
Access logs details
Now let’s look at another dataset I have in my observability system: Access logs to this website. I build this page into a static bundle which I then serve using a Rust service. The only reason I have this service and reverse proxy via Caddy instead of just letting Caddy serve the static page directly is exactly these access logs. Caddy doesn’t give you insights per domain, but only aggregated metrics. So this code sets up an OpenTelemetry exporter and then logs events on request and on response.
Given the way OpenTelemetry ingests this data, the Events.Attributes property takes the following
form. The two objects correspond to the two info! invocations in the code.
[
  {
    "code.filepath": "src/main.rs",
    "code.lineno": "67",
    "code.namespace": "marending_dev",
    "level": "INFO",
    "referrer": "",
    "request": "HEAD",
    "target": "marending_dev",
    "uri": "/",
    "user_agent": "updown.io daemon 2.11"
  },
  {
    "code.filepath": "src/main.rs",
    "code.lineno": "85",
    "code.namespace": "marending_dev",
    "latency": "424164",
    "level": "INFO",
    "status": "200",
    "target": "marending_dev"
  }
]
JSON
Now, going through the same spiel again, we extract the data into a Parquet file with the relevant fields cast to JSON.
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST("Events.Attributes" AS JSON) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_json.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_json.parquet';
┌────────────────────────────────────┐
│ marending_json.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data json │
└────────────────────────────────────┘
Here’s the first query that counts error responses by URI. The top result is /robots.txt if you
care to know. .env is not much further down the list though.
SELECT
json_extract_string(data, '$[0].uri') AS uri,
COUNT(*) AS error_count
FROM 'marending_json.parquet'
WHERE json_extract_string(data, '$[1].status') LIKE '4%'
OR json_extract_string(data, '$[1].status') LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
And the second query to get the daily traffic by browser:
WITH r AS (
SELECT
Timestamp,
json_extract_string(data, '$[0].user_agent') AS user_agent,
CAST(json_extract(data, '$[1].latency') AS DOUBLE) AS latency_us
FROM 'marending_json.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Chrome%' AND user_agent NOT LIKE '%Edg%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' AND user_agent NOT LIKE '%Chrome%' THEN 'Safari'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%bot%' OR user_agent LIKE '%Bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Map
COPY (
SELECT
Timestamp,
'marending' AS service,
"Events.Attributes" AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_map.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_map.parquet';
┌────────────────────────────────────┐
│ marending_map.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data map(varchar, varchar)[] │
└────────────────────────────────────┘
SELECT
data[1]['uri'] AS uri,
COUNT(*) AS error_count
FROM 'marending_map.parquet'
WHERE data[2]['status'] LIKE '4%'
OR data[2]['status'] LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
WITH r AS (
SELECT
Timestamp,
data[1]['user_agent'] AS user_agent,
CAST(data[2]['latency'] AS DOUBLE) AS latency_us
FROM 'marending_map.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Variant
COPY (
SELECT
Timestamp,
'marending' AS service,
CAST(CAST("Events.Attributes" AS JSON) AS VARIANT) AS data
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_variant.parquet' (FORMAT PARQUET);
DESCRIBE SELECT * FROM 'marending_variant.parquet';
┌────────────────────────────────────┐
│ marending_variant.parquet │
│ │
│ Timestamp timestamp with time zone │
│ service varchar │
│ data variant │
└────────────────────────────────────┘
SELECT
CAST(data[1].uri AS VARCHAR) AS uri,
COUNT(*) AS error_count
FROM 'marending_variant.parquet'
WHERE CAST(data[2].status AS VARCHAR) LIKE '4%'
OR CAST(data[2].status AS VARCHAR) LIKE '5%'
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
WITH r AS (
SELECT
Timestamp,
CAST(data[1].user_agent AS VARCHAR) AS user_agent,
CAST(data[2].latency AS DOUBLE) AS latency_us
FROM 'marending_variant.parquet'
)
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM r
GROUP BY day, browser
ORDER BY day, requests DESC;
Materialized
COPY (
SELECT
Timestamp,
'marending' AS service,
-- Request event (index 1)
"Events.Attributes"[1]['code.filepath'] AS request_code_filepath,
"Events.Attributes"[1]['code.lineno'] AS request_code_lineno,
"Events.Attributes"[1]['code.namespace'] AS request_code_namespace,
"Events.Attributes"[1]['level'] AS request_level,
"Events.Attributes"[1]['referrer'] AS referrer,
"Events.Attributes"[1]['request'] AS request_method,
"Events.Attributes"[1]['target'] AS request_target,
"Events.Attributes"[1]['uri'] AS uri,
"Events.Attributes"[1]['user_agent'] AS user_agent,
-- Response event (index 2)
"Events.Attributes"[2]['code.filepath'] AS response_code_filepath,
"Events.Attributes"[2]['code.lineno'] AS response_code_lineno,
"Events.Attributes"[2]['code.namespace'] AS response_code_namespace,
CAST("Events.Attributes"[2]['latency'] AS BIGINT) AS latency_us,
"Events.Attributes"[2]['level'] AS response_level,
CAST("Events.Attributes"[2]['status'] AS SMALLINT) AS status,
"Events.Attributes"[2]['target'] AS response_target
FROM 'otel_traces.parquet'
WHERE ServiceName = 'marending'
) TO 'marending_materialized.parquet' (FORMAT PARQUET);
SELECT uri, COUNT(*) AS error_count
FROM 'marending_materialized.parquet'
WHERE status >= 400
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;
SELECT
DATE_TRUNC('day', Timestamp) AS day,
CASE
WHEN user_agent LIKE '%Firefox%' THEN 'Firefox'
WHEN user_agent LIKE '%Edg%' THEN 'Edge'
WHEN user_agent LIKE '%Chrome%' THEN 'Chrome'
WHEN user_agent LIKE '%Safari%' THEN 'Safari'
WHEN user_agent ILIKE '%bot%' THEN 'Bot'
ELSE 'Other'
END AS browser,
COUNT(*) AS requests,
AVG(latency_us) AS avg_latency_us
FROM 'marending_materialized.parquet'
GROUP BY day, browser
ORDER BY day, requests DESC;
Results
JSON, Map and materialized columns behave pretty much exactly as I would have expected. Only the
one I actually cared about, Variant, is disappointing. I might be hitting this issue, although at
the time of writing I still get similar performance on a nightly build of DuckDB, even though it’s
supposedly fixed on main. I think this is simply a limitation that will be addressed at some point.
Looking at the query plan with EXPLAIN ANALYZE, it’s clear that a lot of variant extraction is
happening, even though, as I understand shredding, it shouldn’t have to.
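For reference, that is just a trimmed-down version of the hourly aggregation from above with EXPLAIN ANALYZE in front:
EXPLAIN ANALYZE
SELECT
    TIME_BUCKET(INTERVAL '1 hour',
        CAST(data.properties.timestamp AS TIMESTAMP)) AS hour,
    AVG(CAST(data.properties.speed AS DOUBLE)) AS avg_speed
FROM 'gps_location_variant.parquet'
GROUP BY hour;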
In general, I’m impressed with the performance of DuckDB here. There are just over 30’000 rows in this dataset and it churns through that in no time even with JSON types.
Also interesting are the file sizes of the different Parquet files; one can clearly see that the faster formats also compress the data better.
- JSON: 2.2MB
- Map: 1.3MB
- Variant: 1.1MB
- Materialized: 1.0MB
For the access logs, it looks similar, although here this DuckDB limitation around Variant hits even harder. We must be hitting a highly unoptimized code path to be that much slower than even the JSON handling.
ClickHouse comparison
I don’t want to get too deep into this, as the goal here is primarily to establish the feasibility
of querying my data from a Parquet file. But just to get a feel for the landscape of query engines,
I’m going to translate the four queries I used here to ClickHouse equivalents and run them against
the live system. Note that this is very much an apples-to-oranges comparison: The numbers thus far
have been achieved on an M1 MacBook Air, the ClickHouse ones on a Hetzner CAX21. And the Parquet
files have first been isolated to their respective datasets, while ClickHouse has an assortment of
data in its otel_traces table. Not to mention that the Hetzner server is a noisy shared VPS.
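As one example of those translations, the access-log error count could look roughly like this in ClickHouse (a direct, untuned translation; the dotted column names need backticks):
SELECT
    `Events.Attributes`[1]['uri'] AS uri,
    count(*) AS error_count
FROM otel_traces
WHERE ServiceName = 'marending'
  AND (`Events.Attributes`[2]['status'] LIKE '4%'
    OR `Events.Attributes`[2]['status'] LIKE '5%')
GROUP BY uri
ORDER BY error_count DESC
LIMIT 50;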
Of the flavours we’ve looked at so far, Map is the closest equivalent to the current ClickHouse
schema. Honestly, I’m quite surprised by the result. It goes to show that while you might want to
get to those single-digit millisecond latencies, in real-world usage these fractions of a second
have been just fine. I would suspect we could go considerably faster by using ClickHouse’s own
JSON type, which, in my understanding, behaves similarly to Variant.
Verdict
I think I’m reaching a different conclusion than what I set out to see, but I’m still not complaining. I was expecting to see subpar performance on JSON strings, with Variant saving the day and making my undertaking feasible. But I got the opposite: JSON handling is surprisingly fast in DuckDB and the Variant support is unfortunately too young for real-world use.
But the bigger take-away is that the general pattern of writing and querying Parquet files could
actually be quite elegant. While I would be very nervous trying to alter the otel_traces table in
a running system right now, rewriting Parquet files to use Variant down the line should be a walk in
the park (save rewriting queries, of course). In a follow-up note I’ll have to home in on Parquet
file handling, e.g. is there an impact when I have a single file that mixes data sources? And how to
simultaneously write files periodically and query across different files?
While I’m terrible at leaving my side-projects be, I do enjoy that the iteration tends to result in leaner and simpler systems.