# Amazon S3
Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve extensive amounts of data from anywhere on the web.
`ingestr` supports Amazon S3 as both a data source and a destination.
## URI Format
The URI for connecting to Amazon S3 is structured as follows:
```
s3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>
```
URI Parameters:
- `access_key_id`: Your AWS access key ID.
- `secret_access_key`: Your AWS secret access key.
- `endpoint_url`: URL of an S3-compatible API server (optional, destination only).
- `layout`: Layout template (optional, destination only).
These credentials are required to authenticate and authorize access to your S3 buckets.
The `--source-table` parameter specifies the S3 bucket and file pattern using the following format:

```
<bucket-name>/<file-glob-pattern>
```
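As a quick sketch of how these pieces fit together (the credentials, bucket name, path, and destination below are placeholders; full walkthroughs follow in the examples):

```bash
ingestr ingest \
    --source-uri 's3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>' \
    --source-table 'my_bucket/data/*.csv' \
    --dest-uri duckdb:///example.duckdb \
    --dest-table 'raw.example_table'
```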
## Setting up an S3 Integration
To integrate `ingestr` with Amazon S3, you need an `access_key_id` and a `secret_access_key`. For guidance on obtaining these credentials, refer to the dltHub documentation on AWS credentials.
Once you have your credentials, you can configure the S3 URI. The bucket name and path to files (file glob pattern) are specified in the `--source-table` argument.
### Example: Loading data from S3
Let's assume the following details:
- `access_key_id`: `AKC3YOW7E`
- `secret_access_key`: `XCtkpL5B`
- S3 bucket name: `my_bucket`
- Path to files within the bucket: `students/students_details.csv`
The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:
```bash
ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.student_details'
```
This command will create a table named `student_details` within the `processed_students` schema of the DuckDB database file located at `s3_data.duckdb`.
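If you have the DuckDB CLI installed, you can spot-check the result by passing a query directly to it (a quick verification sketch, not part of `ingestr` itself):

```bash
# Count the rows loaded into the table created by the command above.
duckdb s3_data.duckdb "SELECT count(*) FROM processed_students.student_details;"
```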
### Example: Uploading data to S3
For this example, we'll assume that:
- `records.db` is a DuckDB database with a table called `public.users`.
- The S3 credentials are the same as in the example above.
The following command demonstrates how to copy data from a local DuckDB database to S3:
```bash
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```
This will result in a file structure like the following:
```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users
        └── <load_id>.<file_id>.parquet
```
The values of `load_id` and `file_id` are determined at runtime. The default layout creates a folder named after the source table and places the data inside it as a Parquet file. This layout is configurable using the `layout` parameter.
For example, if you would like to create a Parquet file with the same name as the source table (as opposed to a folder), you can set `layout` to `{table_name}.{ext}` in the command above:
```bash
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```
Result:
```
my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users.parquet
```
The list of available layout variables can be found here.
## Working with S3-Compatible object stores
`ingestr` supports S3-compatible storage services such as MinIO, DigitalOcean Spaces, and Cloudflare R2. You can set the `endpoint_url` in your destination URI to write data to these object stores.
For example, if you're running MinIO on `localhost:9000`, you can write the same data as in the example above by running:
```bash
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
```
NOTE
S3-compatible object stores are currently only supported as destinations.
## File Glob Pattern Examples
WARNING
Glob patterns only apply when loading data from S3 as a source.
The `<file-glob-pattern>` in the `--source-table` argument allows for flexible file selection. Here are some common patterns and their descriptions:
| Pattern | Description |
|---------|-------------|
| `bucket/**/*.csv` | Retrieves all CSV files recursively from `s3://bucket`. |
| `bucket/*.csv` | Retrieves all CSV files located at the root level of `s3://bucket`. |
| `bucket/myFolder/**/*.jsonl` | Retrieves all JSONL files recursively from the `myFolder` directory and its subdirectories in `s3://bucket`. |
| `bucket/myFolder/mySubFolder/users.parquet` | Retrieves the specific `users.parquet` file from the `myFolder/mySubFolder/` path in `s3://bucket`. |
| `bucket/employees.jsonl` | Retrieves the `employees.jsonl` file located at the root level of `s3://bucket`. |
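For instance, to load every CSV file in a bucket recursively into a single DuckDB table, you could combine a recursive glob with the source URI from the earlier examples (the bucket name, credentials, and destination table below are placeholders):

```bash
ingestr ingest \
    --source-uri 's3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>' \
    --source-table 'my_bucket/**/*.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'raw.all_csv_files'
```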