Amazon S3

Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve extensive amounts of data from anywhere on the web.

ingestr supports Amazon S3 as both a data source and destination.

URI Format

The URI for connecting to Amazon S3 is structured as follows:

plaintext

s3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>

URI Parameters:

access_key_id: Your AWS access key ID.
secret_access_key: Your AWS secret access key.
endpoint_url: URL of an S3-Compatiable API Server (optional, destination only)
layout: Layout template (optional, destination only)

These credentials are required to authenticate and authorize access to your S3 buckets.

The --source-table parameter specifies the S3 bucket and file pattern using the following format:

<bucket-name>/<file-glob-pattern>

Setting up an S3 Integration

To integrate ingestr with Amazon S3, you need an access_key_id and a secret_access_key. For guidance on obtaining these credentials, refer to the dltHub documentation on AWS credentials.

Once you have your credentials, you can configure the S3 URI. The bucket_name and path_to_files (file glob pattern) are specified in the --source-table argument.

Example: Loading data from S3

Let's assume the following details:

access_key_id: AKC3YOW7E
secret_access_key: XCtkpL5B
S3 bucket name: my_bucket
Path to files within the bucket: students/students_details.csv

The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:

ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.student_details'

This command will create a table named student_details within the processed_students schema (or equivalent grouping) in the DuckDB database file located at s3_data.duckdb.

Example: Uploading data to S3

For this, example we'll assume that:

records.db is a duckdb database.
has a table called public.users.
the S3 credentials are the same as the example above.

The following command demonstrates how to copy data from a local duckdb database to S3:

ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'

This will result in a file structure like the following:

my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users
        └── <load_id>.<file_id>.parquet

The value of load_id and file_id is determined at runtime. The default layout creates a folder with the same table name as the source and places the data inside a parquet file. This layout is configurable using the layout parameter.

For example, if you would like to create a parquet file with the same name as the source table (as opposed to a folder) you can set layout to {table_name}.{ext} in the commandline above:

ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \ 
    --dest-table 'my_bucket/records'

Result:

my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users.parquet

List of available Layout variables is available here

Working with S3-Compatiable object stores

ingestr support S3 compatiable storage services like Minio, Digital Ocean spaces and Cloudflare R2. You can set the endpoint_url in your destination URI to write data to these object stores.

For example, if you're running minio on localhost:9000, you can write the same data as the example above by running:

ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'

NOTE

S3-Compatiable object stores are currently only supported as destinations.

File Glob Pattern Examples:

WARNING

Glob patterns only apply when loading data from S3 as source.

The <file-glob-pattern> in the --source-table argument allows for flexible file selection. Here are some common patterns and their descriptions:

Pattern	Description
`bucket/*/.csv`	Retrieves all CSV files recursively from `s3://bucket`.
`bucket/*.csv`	Retrieves all CSV files located at the root level of `s3://bucket`.
`bucket/myFolder/*/.jsonl`	Retrieves all JSONL files recursively from the `myFolder` directory and its subdirectories in `s3://bucket`.
`bucket/myFolder/mySubFolder/users.parquet`	Retrieves the specific `users.parquet` file from the `myFolder/mySubFolder/` path in `s3://bucket`.
`bucket/employees.jsonl`	Retrieves the `employees.jsonl` file located at the root level of the `s3://bucket`.

Working with compressed files

ingestr automatically detects and handles gzipped files in your S3 bucket. You can load data from compressed files with the .gz extension without any additional configuration.

For example, to load data from a gzipped CSV file:

ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/event-data.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'

You can also use glob patterns to load multiple compressed files:

ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/logs/**/*.csv.gz' \
    --dest-uri duckdb:///compressed_data.duckdb \
    --dest-table 'logs.events'

File type hinting

If your files are properly encoded but lack the correct file extension (CSV, JSONL, or Parquet), you can provide a file type hint to inform ingestr about the format of the files. This is done by appending a fragment identifier (#format) to the end of the path in your --source-table parameter.

For example, if you have JSONL-formatted log files stored in S3 with a non-standard extension:

--source-table "my_bucket/logs/event-data#jsonl"

This tells ingestr to process the files as JSONL, regardless of their actual extension.

Supported format hints include:

#csv - For comma-separated values files
#jsonl - For line-delimited JSON files
#parquet - For Parquet format files

TIP

File type hinting works with gzip compressed files as well.

Amazon S3 ​

URI Format ​

Setting up an S3 Integration ​

Example: Loading data from S3 ​

Example: Uploading data to S3 ​

Working with S3-Compatiable object stores ​

File Glob Pattern Examples: ​

Working with compressed files ​

File type hinting ​