Amazon S3
Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve extensive amounts of data from anywhere on the web.
ingestr supports Amazon S3 as both a data source and destination.
URI Format
The URI for connecting to Amazon S3 is structured as follows:
s3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>URI Parameters:
access_key_id: Your AWS access key ID.secret_access_key: Your AWS secret access key.region: AWS region for S3 and, unlessathena_regionis set, Athena.endpoint_url: URL of an S3-Compatible API Server (optional, for S3-compatible storage like Minio)file_discovery: File discovery backend for S3 sources. Supported values arelist(default) andathena_inventory.athena_inventory_table: Qualified Athena table name for S3 Inventory, e.g.inventory_db.inventory_table. Required whenfile_discovery=athena_inventory.athena_results_location: S3 location for Athena query results, e.g.s3://my-query-results/ingestr/. Required whenfile_discovery=athena_inventory.athena_workgroup: Athena workgroup to use for inventory queries (optional).athena_region: Athena region if different fromregion(optional).athena_inventory_bucket_column: Bucket column in the inventory table. Defaults tobucket.athena_inventory_key_column: Object key column in the inventory table. Defaults tokey.athena_inventory_modified_column: Object timestamp column in the inventory table used for source file timestamp filtering. Defaults tolast_modified_date.layout: Layout template (optional, destination only)
These credentials are required to authenticate and authorize access to your S3 buckets.
The --source-table parameter specifies the S3 bucket and file pattern using the following format:
<bucket-name>/<file-glob-pattern>Setting up an S3 Integration
To integrate ingestr with Amazon S3, you need an access_key_id and a secret_access_key.
Step 1: Create an IAM User (if needed)
- Go to the AWS IAM Console
- Click Users → Add users
- Enter a username (e.g., "s3-integration")
- Select Access key - Programmatic access
- Click Next: Permissions
Step 2: Assign S3 Permissions
Attach a policy that grants S3 access. You can use managed policies or create a custom one:
- Read-only access: Use
AmazonS3ReadOnlyAccessmanaged policy - Read/write access: Use
AmazonS3FullAccessmanaged policy - Specific bucket access: Create a custom policy like:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}Step 3: Get Access Keys
- After creating the user, you'll see the Access key ID and Secret access key
- Download or copy these credentials securely (the secret key is shown only once)
If you already have an IAM user:
- Go to IAM → Users → select your user
- Click the Security credentials tab
- Click Create access key
Once you have your credentials, you can configure the S3 URI. The bucket_name and path_to_files (file glob pattern) are specified in the --source-table argument.
Example: Loading data from S3
Let's assume the following details:
access_key_id:AKC3YOW7Esecret_access_key:XCtkpL5B- S3 bucket name:
my_bucket - Path to files within the bucket:
students/students_details.csv
The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/students/students_details.csv' \
--dest-uri duckdb:///s3_data.duckdb \
--dest-table 'processed_students.student_details'This command will create a table named student_details within the processed_students schema (or equivalent grouping) in the DuckDB database file located at s3_data.duckdb.
Example: Uploading data to S3
For this, example we'll assume that:
records.dbis a duckdb database.- has a table called
public.users. - the S3 credentials are the same as the example above.
The following command demonstrates how to copy data from a local duckdb database to S3:
ingestr ingest \
--source-uri 'duckdb:///records.db' \
--source-table 'public.users' \
--dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--dest-table 'my_bucket/records'This will result in a file structure like the following:
my_bucket/
└── records
├── _dlt_loads
├── _dlt_pipeline_state
├── _dlt_version
└── users
└── <load_id>.<file_id>.parquetThe value of load_id and file_id is determined at runtime. The default layout creates a folder with the same table name as the source and places the data inside a parquet file. This layout is configurable using the layout parameter.
For example, if you would like to create a parquet file with the same name as the source table (as opposed to a folder) you can set layout to {table_name}.{ext} in the commandline above:
ingestr ingest \
--source-uri 'duckdb:///records.db' \
--source-table 'public.users' \
--dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--dest-table 'my_bucket/records'Result:
my_bucket/
└── records
├── _dlt_loads
├── _dlt_pipeline_state
├── _dlt_version
└── users.parquetAvailable Layout Placeholders
The following placeholders can be used in the layout parameter:
| Placeholder | Description |
|---|---|
{table_name} | Name of the table being written |
{load_id} | Unique identifier for the current load operation |
{file_id} | Unique identifier for each file within a load |
{ext} | File extension (parquet) |
Example layouts:
{load_id}.{file_id}.{ext}- Default layout{table_name}/{load_id}.{file_id}.{ext}- Organized by table name{table_name}.{ext}- Single file per table
Working with S3-Compatible object stores
ingestr supports S3-compatible storage services like Minio, Digital Ocean Spaces and Cloudflare R2. You can set the endpoint_url in your URI to read from or write to these object stores.
For example, if you're running Minio on localhost:9000, you can read data from it:
ingestr ingest \
--source-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=minioadmin&secret_access_key=minioadmin' \
--source-table 'my_bucket/data.csv' \
--dest-uri 'duckdb:///local.duckdb' \
--dest-table 'public.my_data'Or write data to it:
ingestr ingest \
--source-uri 'duckdb:///records.db' \
--source-table 'public.users' \
--dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--dest-table 'my_bucket/records'File Glob Pattern Examples:
WARNING
Glob patterns only apply when loading data from S3 as source.
The <file-glob-pattern> in the --source-table argument allows for flexible file selection. Here are some common patterns and their descriptions:
| Pattern | Description |
|---|---|
bucket/**/*.csv | Retrieves all CSV files recursively from s3://bucket. |
bucket/*.csv | Retrieves all CSV files located at the root level of s3://bucket. |
bucket/myFolder/**/*.jsonl | Retrieves all JSONL files recursively from the myFolder directory and its subdirectories in s3://bucket. |
bucket/myFolder/mySubFolder/users.parquet | Retrieves the specific users.parquet file from the myFolder/mySubFolder/ path in s3://bucket. |
bucket/employees.jsonl | Retrieves the employees.jsonl file located at the root level of the s3://bucket. |
Working with compressed files
ingestr automatically detects and handles gzipped files in your S3 bucket. You can load data from compressed files with the .gz extension without any additional configuration.
For example, to load data from a gzipped CSV file:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/logs/event-data.csv.gz' \
--dest-uri duckdb:///compressed_data.duckdb \
--dest-table 'logs.events'You can also use glob patterns to load multiple compressed files:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/logs/**/*.csv.gz' \
--dest-uri duckdb:///compressed_data.duckdb \
--dest-table 'logs.events'File type hinting
If your files are properly encoded but lack the correct file extension (CSV, JSONL, or Parquet), you can provide a file type hint to inform ingestr about the format of the files. This is done by appending a fragment identifier (#format) to the end of the path in your --source-table parameter.
For example, if you have JSONL-formatted log files stored in S3 with a non-standard extension:
--source-table "my_bucket/logs/event-data#jsonl"This tells ingestr to process the files as JSONL, regardless of their actual extension.
Supported format hints include:
#csv- For comma-separated values files with headers#csv_headless- For CSV files without headers#jsonl- For line-delimited JSON files#parquet- For Parquet format files
TIP
File type hinting works with gzip compressed files as well.
Incremental loading by source file timestamps
When S3 is used as a source, ingestr can use the S3 object timestamp as the incremental key. This filters matching objects before they are downloaded, so only files inside the requested timestamp interval are read.
This mode is opt-in for backward compatibility. Existing S3 ingestions without --incremental-key _ingestr_source_file_modified_at or --incremental-key _ingestr_source_file_created_at keep the previous schema and file-listing behavior.
When enabled, the source adds the selected timestamp column and the source file path to every emitted row:
_ingestr_source_file_modified_at: the S3 objectLastModifiedtimestamp in UTC._ingestr_source_file_created_at: the S3 object timestamp in UTC. S3 listing and standard S3 Inventory expose the objectLastModifiedtimestamp, so this uses the same timestamp as_ingestr_source_file_modified_at._ingestr_source_file_path: the fulls3://bucket/keypath for the source object.
Use --incremental-key _ingestr_source_file_modified_at or --incremental-key _ingestr_source_file_created_at to select this mode. Other --incremental-key values are treated as regular columns from the file data for destination write strategies. They do not enable S3 object timestamp filtering, add file metadata columns, or filter rows while reading the files.
Use --interval-start and optionally --interval-end to load only objects whose selected timestamp is in that window:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/logs/**/*.jsonl' \
--dest-uri duckdb:///logs.duckdb \
--dest-table 'logs.events' \
--incremental-strategy append \
--incremental-key _ingestr_source_file_modified_at \
--interval-start '2026-01-01T00:00:00Z'The interval bounds are compared to the selected S3 object timestamp, not row values inside the files. The interval is half-open: --interval-start is inclusive, and --interval-end is exclusive. If no interval is provided, ingestr reads all files matching the source-table pattern and still emits the selected metadata columns.
The _ingestr_source_file_modified_at, _ingestr_source_file_created_at, and _ingestr_source_file_path column names must not already exist in the files when they are emitted. Use --exclude-columns only if you intentionally want to suppress one of these emitted metadata columns; excluding the selected timestamp column also means it will not be available to destination strategies as an incremental key.
Large buckets: S3 Inventory with Athena
By default, ingestr discovers S3 source files with the S3 ListObjectsV2 API. S3 can filter that listing by prefix, but not by object LastModified, so broad patterns such as my_bucket/**/*.jsonl may require many S3 list requests.
For large buckets, you can opt into Athena-backed file discovery over an S3 Inventory table:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B®ion=us-east-1&file_discovery=athena_inventory&athena_inventory_table=s3_inventory.my_bucket_inventory&athena_results_location=s3://my-athena-results/ingestr/&athena_workgroup=primary' \
--source-table 'my_bucket/logs/**/*.jsonl' \
--dest-uri duckdb:///logs.duckdb \
--dest-table 'logs.events' \
--incremental-strategy append \
--incremental-key _ingestr_source_file_modified_at \
--interval-start '2026-01-01T00:00:00Z'In this mode, Athena is used only to discover candidate files. For a source table such as my_bucket/logs/**/*.jsonl, ingestr derives the static prefix logs/ and queries the inventory table by bucket, prefix, and selected source file timestamp interval. It then applies the full glob pattern locally before downloading the selected files. For both _ingestr_source_file_modified_at and _ingestr_source_file_created_at, the Athena query uses the configured athena_inventory_modified_column.
S3 Inventory is usually generated on a schedule, so Athena-backed discovery reflects the latest available inventory snapshot rather than real-time object changes. Use the default file_discovery=list mode when you need to see newly written files immediately.
CSV files without headers
For CSV files that don't have a header row, use the #csv_headless format hint. You can optionally provide column names using the --columns flag:
# With custom column names
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/data/raw-data.csv#csv_headless' \
--columns "id:bigint,name:text,value:double" \
--dest-uri duckdb:///local.duckdb \
--dest-table 'public.raw_data'If no column names are provided, columns will be automatically named unknown_col_0, unknown_col_1, etc.:
# Without column names (auto-generated)
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/data/raw-data.csv#csv_headless' \
--dest-uri duckdb:///local.duckdb \
--dest-table 'public.raw_data'Character encoding
ingestr handles UTF-8 (with or without BOM) and UTF-16 LE/BE (with BOM) automatically. If your CSV is in a different encoding and you see garbled characters (� or wrong letters) in the destination, declare the encoding via a #encoding= hint on --source-table:
--source-table 'my_bucket/data/file.csv#encoding=windows-1252'It can be combined with a format hint, comma-separated, in any order:
--source-table 'my_bucket/data/file.dat#csv,encoding=windows-1252'To check what encoding your file actually is, inspect the first bytes:
aws s3 cp s3://my_bucket/data/file.csv - | head -c 4 | xxd| First bytes | Encoding | Action |
|---|---|---|
efbb bf | UTF-8 with BOM | None (auto-detected) |
fffe | UTF-16 LE | None (auto-detected) |
feff | UTF-16 BE | None (auto-detected) |
fffe 0000 | UTF-32 LE | #encoding=utf-32le |
0000 feff | UTF-32 BE | #encoding=utf-32be |
| ASCII-like bytes | UTF-8 or some 8-bit encoding | Try without hint; if you see �, declare it (most often windows-1252) |
Common encoding values: windows-1252 / cp1252, iso-8859-1 / latin1, windows-1250, windows-1251, shift_jis, gb18030, euc-kr. Names follow the WHATWG Encoding Standard (case-insensitive). Unknown names return an error.