Files
The files warehouse driver writes session data to either local filesystem or object storage (S3/MinIO or GCS). It does not require a running database.
Data is written continuously to an active file per stream. When a file reaches a size or age threshold it is sealed and uploaded. d8a recovers any undelivered segments automatically on restart.
What you need
- A local spool directory with sufficient disk space for buffering
- One of: a destination directory (local), an S3/MinIO bucket, or a GCS bucket
Configuration
Full configuration reference is available here.
Add the following to your config.yaml file:
storage:
spool_enabled: true # required by the files warehouse driver
spool_directory: ./spool # where active/sealed segments are staged
warehouse:
driver: files
files:
format: csv
storage: filesystem # filesystem, s3, or gcs
filesystem:
path: /data/warehouse
Storage destinations
Filesystem
Files are moved to the configured directory once sealed. Useful for local pipelines or when another process picks them up from disk.
warehouse:
driver: files
files:
storage: filesystem
filesystem:
path: /data/warehouse
S3 / MinIO
warehouse:
driver: files
files:
storage: s3
s3:
host: s3.amazonaws.com
bucket: my-bucket
access_key: AKIAIOSFODNN7EXAMPLE
secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
region: us-east-1
protocol: https
GCS
warehouse:
driver: files
files:
storage: gcs
gcs:
bucket: my-gcs-bucket
creds_json: |
{ "type": "service_account", ... }
Leave creds_json empty to use Application Default Credentials (ADC).
Segment tuning
Segments are sealed when either threshold is crossed first.
| Option | Default | Description |
|---|---|---|
warehouse.files.max_segment_size | 1073741824 (1 GiB) | Seal when the active file reaches this size |
warehouse.files.max_segment_age | 1h | Seal when the active file is this old |
warehouse.files.seal_check_interval | 15s | How often to evaluate sealing triggers |
Path template
The files warehouse writes data to paths generated from a configurable template. See --warehouse-files-path-template for the default value and customization options.
Available variables:
| Variable | Type | Description |
|---|---|---|
Table | string | Escaped table name |
Schema | string | 16-character schema fingerprint |
SegmentID | string | Segment identifier (unixSeconds_uuid) |
Extension | string | File extension (csv or csv.gz) |
Year | int | Year (e.g., 2026) |
Month | int | Month number (1-12) |
MonthPadded | string | Month with leading zero (01-12) |
Day | int | Day of month (1-31) |
DayPadded | string | Day with leading zero (01-31) |
Example: table={{.Table}}/year={{.Year}}/month={{.MonthPadded}}/day={{.DayPadded}}/{{.SegmentID}}.{{.Extension}}
Important notes
- Spool required:
storage.spool_enabledmust betrue. The files warehouse uses the spool directory to stage segments before upload. - Schema migrations:
CreateTableandAddColumnare no-ops. The files warehouse does not create or alter tables. Schema evolution is handled by the consumer of the files. - Crash recovery: On startup d8a scans the spool directory, moves any interrupted uploads back to sealed, and retries them.
- Upload retries: A segment that fails to upload is retried up to 3 times. After 3 failures it is moved to a quarantine directory (
streams/<table>/<fingerprint>/failed/) and will not be retried until manually addressed.
Verifying your setup
After configuring the files warehouse, start d8a and check the logs. You should see messages indicating segments being sealed and uploaded to your configured destination.
Querying your files
The files written by this driver can be consumed directly by most analytics warehouses without an external pipeline.
-
Snowflake — Automate ingestion using Snowpipe to continuously load new CSV/CSV.GZ files from an external stage into a target table as soon as they arrive. This completely eliminates the need for an external pipeline, though querying a configured External Table with auto-refresh is an alternative if you prefer zero data movement. See the Snowflake Snowpipe documentation to set this up.
-
Amazon Redshift — Use Redshift Spectrum to create an external table and query the files directly from S3, leveraging partition pruning if you update your prefixes to the standard
year=YYYY/month=MM/day=DDformat. While Spectrum requires no pipeline for direct querying, Redshift's native auto-copy feature can automatically ingest new files into managed storage without external orchestrators. Check out the Redshift Spectrum documentation for implementation steps. -
Databricks SQL — Use Auto Loader to incrementally and automatically process new CSV files as they land in your cloud storage. This serves as a native, pipeline-free ingestion method into Delta tables, but renaming your prefixes to standard Hive partitioning will significantly optimize the initial directory discovery. Read the Databricks Auto Loader documentation for configuration.
-
Azure Synapse Analytics — Query object storage directly using Serverless SQL pools by creating an external table or using the
OPENROWSETfunction on your bucket path. This requires no pipeline for basic querying, though for optimal performance on large datasets you would need Azure Data Factory/Synapse Pipelines to copy the data into Dedicated SQL pools. Explore the Azure Synapse Serverless SQL documentation to get started. -
Starburst / Trino — Connect a Hive or Iceberg catalog to your bucket and define an external table directly over your CSV directory. Trino is a federated query engine, so it inherently requires no ingestion pipeline to query the files in-place, but changing your directory structure to
y=YYYY/m=MM/d=DDis highly recommended for efficient partition pruning. Visit the Trino Hive Connector documentation for setup details. -
Apache Druid — Continuously ingest the CSVs using Druid's native batch or streaming ingestion by pointing a supervisor spec to the object storage path. No external pipeline is needed because Druid manages the ingestion tasks internally, but you must ensure your CSV contains a timestamp column for Druid's mandatory primary time partitioning. Review the Apache Druid Ingestion documentation to configure your spec.
-
Firebolt — Create an external table pointing to your bucket and then use standard
INSERT INTO ... SELECTstatements to load data into a fact table. While external tables allow direct querying without a pipeline, moving data to native Firebolt tables requires some external orchestration (such as Airflow) to regularly trigger the ingestion of new files. More information is available in the Firebolt External Tables documentation. -
Teradata Vantage — Use the Native Object Store (NOS) feature to create a foreign table directly over your CSV/CSV.GZ files in cloud storage. This eliminates the need for an external pipeline for direct querying, though using standard Hive naming conventions (
$path/$var=...) will allow NOS to filter partitions efficiently without scanning every file. The Teradata Native Object Store documentation outlines the syntax required. -
Google BigQuery — For the best experience, use the dedicated BigQuery warehouse driver instead. If you prefer to consume the files directly, you can query them by creating an external table over the bucket, ideally updating your prefix to Hive partitioning (
y=YYYY/m=MM/d=DD) for better query performance. This works entirely without a pipeline, but for faster querying on frequent appends you should use native scheduled queries to move data into partitioned native tables. Refer to the BigQuery External Tables documentation for more details. -
ClickHouse — For the best experience, use the dedicated ClickHouse warehouse driver instead. If you prefer to consume the files directly, you can use the
S3table engine or function to query them using glob patterns likey/m/d/*.csv.gz. A pipeline isn't strictly necessary as you can set up a Materialized View over the S3 engine to automatically ingest new data into a native MergeTree table, but adding they=YYYY/Hive format improves partition filtering. Learn more in the ClickHouse S3 Engine documentation. -
DuckDB — Query the files directly using
read_csvor theread_csv_autofunction with glob patterns (e.g.,SELECT * FROM read_csv_auto('/data/warehouse/events/**/*.csv')). DuckDB requires no pipeline and no server — it runs in-process, making it ideal for ad-hoc analysis or lightweight local setups. For S3-hosted files, install thehttpfsextension and configure your credentials withSET s3_region,SET s3_access_key_id, andSET s3_secret_access_keybefore querying. See the DuckDB CSV import documentation for details.