Skip to content

S3

S3 is a bucket for storing data in Amazon's Simple Storage Service, a cloud-based storage solution provided by AWS. S3 buckets allow users to store and retrieve data at any time from anywhere on the web.

ingestr supports S3 as a source.

URI format

The URI format for S3 is as follows:

plaintext
s3://?access_key_id=<access_key_id>&secret_access_key=<secret_access_key>

URI parameters:

  • access_key_id and secret_access_key : Used for accessing S3 bucket.

The --source-table must be in the format:

{bucket name}/{file glob}

Setting up a S3 Integration

S3 requires an access_key_id and a secret_access_key to access the bucket. Please follow the guide on dltHub to obtain credentials. Once you've completed the guide, you should have an access_key_id and secret_access_key. From the S3 URI, you can extract the bucket_name and path_to_files

For example, if your access_key_id is AKC3YOW7E, secret_access_key is XCtkpL5B, bucket name is my_bucket, and path_to_files is students/students_details.csv, here's a sample command that will copy the data from the S3 bucket into a DuckDB database:

sh
ingestr ingest \
    --source-uri 's3://?access_key_id=AKC3YOW7E&secret_access_key=XCtkpL5B' \
    --source-table 'my_bucket/students/students_details.csv' \
    --dest-uri duckdb:///s3.duckdb \
    --dest-table 'dest.students_details'

The result of this command will be a table in the DuckDB database in the path s3.duckdb.

Below are some examples of path patterns, each path pattern is a glob you can specify after the bucket name:

  • **/*.csv: Retrieves all the CSV files, regardless of how deep they are within the folder structure.
  • *.csv: Retrieves all the CSV files from the first level of a folder.
  • myFolder/**/*.jsonl: Retrieves all the JSONL files from anywhere under myFolder.
  • myFolder/mySubFolder/users.parquet: Retrieves the users.parquet file from mySubFolder.
  • employees.jsonl: Retrieves the employees.jsonl file from the root level of the bucket.