Amazon S3

Amazon Simple Storage Service (S3) is a scalable cloud storage service offered by Amazon Web Services (AWS). It allows users to store and retrieve extensive amounts of data from anywhere on the web.

ingestr supports Amazon S3 as both a data source and destination.

URI Format

The URI for connecting to Amazon S3 is structured as follows:

plaintext
s3://?access_key_id=<your_access_key_id>&secret_access_key=<your_secret_access_key>

URI Parameters:

  • access_key_id: Your AWS access key ID.
  • secret_access_key: Your AWS secret access key.
  • endpoint_url: URL of an S3-compatible API server (optional, destination only)
  • layout: Layout template (optional, destination only)

These credentials are required to authenticate and authorize access to your S3 buckets.
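
For example, a full destination URI that targets a custom endpoint and overrides the default layout might look like the following (with placeholder credentials; both optional parameters are shown in context later on this page):

plaintext
s3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B&endpoint_url=http://localhost:9000&layout={table_name}.{ext}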

The --source-table parameter specifies the S3 bucket and file pattern using the following format:

<bucket-name>/<file-glob-pattern>
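
For example, to read every CSV file under the students/ prefix of a bucket named my_bucket, you could pass:

plaintext
my_bucket/students/*.csv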

Setting up an S3 Integration

To integrate ingestr with Amazon S3, you need an access_key_id and a secret_access_key. For guidance on obtaining these credentials, refer to the dltHub documentation on AWS credentials.

Once you have your credentials, you can configure the S3 URI. The bucket_name and path_to_files (file glob pattern) are specified in the --source-table argument.

Example: Loading data from S3

Let's assume the following details:

  • access_key_id: AKC3YOW7E
  • secret_access_key: XCtkpL5B
  • S3 bucket name: my_bucket
  • Path to files within the bucket: students/students_details.csv

The following command demonstrates how to copy data from the specified S3 location to a DuckDB database:

sh
ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.student_details'

This command will create a table named student_details within the processed_students schema in the DuckDB database file located at s3_data.duckdb.
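
To spot-check the result, you can query the new table with the DuckDB CLI (assuming it is installed):

sh
duckdb s3_data.duckdb "SELECT * FROM processed_students.student_details LIMIT 5;"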

Example: Uploading data to S3

For this example, we'll assume that:

  • records.db is a DuckDB database.
  • It has a table called public.users.
  • The S3 credentials are the same as in the example above.

The following command demonstrates how to copy data from a local DuckDB database to S3:

sh
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'

This will result in a file structure like the following:

my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users
        └── <load_id>.<file_id>.parquet

The values of load_id and file_id are determined at runtime. The default layout creates a folder named after the source table and places the data inside a parquet file. This layout is configurable using the layout parameter.

For example, if you would like to create a parquet file with the same name as the source table (as opposed to a folder), you can set layout to {table_name}.{ext} in the command above:

sh
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?layout={table_name}.{ext}&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'

Result:

my_bucket/
└── records
    ├── _dlt_loads
    ├── _dlt_pipeline_state
    ├── _dlt_version
    └── users.parquet

The list of available layout variables can be found here

Working with S3-compatible object stores

ingestr supports S3-compatible storage services such as MinIO, DigitalOcean Spaces, and Cloudflare R2. You can set the endpoint_url parameter in your destination URI to write data to these object stores.

For example, if you're running MinIO on localhost:9000, you can write the same data as in the example above by running:

sh
ingestr ingest \
    --source-uri 'duckdb:///records.db' \
    --source-table 'public.users' \
    --dest-uri 's3://?endpoint_url=http://localhost:9000&access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --dest-table 'my_bucket/records'
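
To confirm that the objects landed in MinIO, you can list them with any S3-compatible client; for example, the AWS CLI pointed at the same endpoint (assuming it is configured with the same credentials):

sh
aws s3 ls s3://my_bucket/records/ --recursive --endpoint-url http://localhost:9000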

NOTE

S3-compatible object stores are currently only supported as destinations.

File Glob Pattern Examples

WARNING

Glob patterns only apply when loading data from S3 as a source.

The <file-glob-pattern> in the --source-table argument allows for flexible file selection. Here are some common patterns and their descriptions:

| Pattern | Description |
|---------|-------------|
| bucket/**/*.csv | Retrieves all CSV files recursively from s3://bucket. |
| bucket/*.csv | Retrieves all CSV files located at the root level of s3://bucket. |
| bucket/myFolder/**/*.jsonl | Retrieves all JSONL files recursively from the myFolder directory and its subdirectories in s3://bucket. |
| bucket/myFolder/mySubFolder/users.parquet | Retrieves the specific users.parquet file from the myFolder/mySubFolder/ path in s3://bucket. |
| bucket/employees.jsonl | Retrieves the employees.jsonl file located at the root level of s3://bucket. |
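
For instance, to load every CSV file in my_bucket recursively into DuckDB, you could combine the recursive pattern with the credentials from earlier (the destination table name here is only illustrative):

sh
ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/**/*.csv' \
    --dest-uri duckdb:///s3_data.duckdb \
    --dest-table 'processed_students.all_records'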