S3
S3 is a bucket for storing data in Amazon's Simple Storage Service, a cloud-based storage solution provided by AWS. S3 buckets allow users to store and retrieve data at any time from anywhere on the web.
ingestr supports S3 as a source.
URI format
The URI format for S3 is as follows:
s3://?access_key_id=<access_key_id>&secret_access_key=<secret_access_key>
URI parameters:
access_key_id
andsecret_access_key
: Used for accessing S3 bucket.
The --source-table
must be in the format:
{bucket name}/{file glob}
Setting up a S3 Integration
S3 requires an access_key_id
and a secret_access_key
to access the bucket. Please follow the guide on dltHub to obtain credentials. Once you've completed the guide, you should have an access_key_id
and secret_access_key
. From the S3 URI, you can extract the bucket_name
and path_to_files
For example, if your access_key_id
is AKC3YOW7E
, secret_access_key
is XCtkpL5B
, bucket name is my_bucket
, and path_to_files
is students/students_details.csv
, here's a sample command that will copy the data from the S3 bucket into a DuckDB database:
ingestr ingest \
--source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
--source-table 'my_bucket/students/students_details.csv' \
--dest-uri duckdb:///s3.duckdb \
--dest-table 'dest.students_details'
The result of this command will be a table in the DuckDB database in the path s3.duckdb
.
Below are some examples of path patterns, each path pattern is a glob you can specify after the bucket name:
**/*.csv
: Retrieves all the CSV files, regardless of how deep they are within the folder structure.*.csv
: Retrieves all the CSV files from the first level of a folder.myFolder/**/*.jsonl
: Retrieves all the JSONL files from anywhere undermyFolder
.myFolder/mySubFolder/users.parquet
: Retrieves theusers.parquet
file frommySubFolder
.employees.jsonl
: Retrieves theemployees.jsonl
file from the root level of the bucket.