Google Cloud Storage
Google Cloud Storage is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities. It is an Infrastructure as a Service (IaaS), comparable to Amazon S3.
URI format
The URI format for Google Cloud Storage is as follows:
gs://<bucket_name>?credentials_path=/path/to/service-account.json
URI parameters:
- bucket_name: The name of the bucket.
- credentials_path: Path to the file containing your Google Cloud Service Account key.
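As a minimal sketch, the URI can be assembled from its two parameters in a shell script. The variable names BUCKET_NAME, CREDENTIALS_PATH and SOURCE_URI are illustrative and not part of ingestr, and the bucket name and key path are placeholder values.

```shell
# Placeholder values; replace them with your own bucket and key file.
BUCKET_NAME="my-org-bucket"
CREDENTIALS_PATH="$PWD/service_account.json"

# Assemble the source URI in the format described above.
SOURCE_URI="gs://${BUCKET_NAME}?credentials_path=${CREDENTIALS_PATH}"
echo "$SOURCE_URI"
```

Keeping the two parts in separate variables makes it easier to switch buckets or key files without retyping the whole URI.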
Setting up a GCS Integration
To use the Google Cloud Storage source in ingestr, you will need:
- A Google Cloud Project.
- A Service Account with at least the roles/storage.objectUser IAM role.
- A Service Account key file for the corresponding service account.
For more information on how to create a Service Account or its keys, see Create service accounts and Create or delete service account keys in the Google Cloud docs.
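The prerequisites above could be set up with the gcloud CLI along the following lines. This is only a sketch: it assumes gcloud is installed and authenticated with enough rights to manage IAM, it grants the role at the project level for simplicity, and the function name create_gcs_service_account and its arguments are hypothetical.

```shell
# Hypothetical helper: creates a service account, grants it the
# roles/storage.objectUser role on the project, and writes a key file.
# Arguments: <project_id> <account_name> <key_file>
create_gcs_service_account() {
  local project_id="$1"
  local account_name="$2"
  local key_file="$3"
  local account_email="${account_name}@${project_id}.iam.gserviceaccount.com"

  # Each step runs only if the previous one succeeded.
  gcloud iam service-accounts create "$account_name" --project="$project_id" &&
    gcloud projects add-iam-policy-binding "$project_id" \
      --member="serviceAccount:${account_email}" \
      --role="roles/storage.objectUser" &&
    gcloud iam service-accounts keys create "$key_file" \
      --iam-account="$account_email"
}
```

A call such as create_gcs_service_account my-project ingestr-reader service_account.json (both names are placeholders) would leave the key file in the current directory, ready to be referenced through credentials_path.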
Example
Let's assume that:
- The service account key is available in the current directory, under the filename service_account.json.
- The bucket you want to load data from is called my-org-bucket.
- The source file is available at /data/latest/dump.csv.
- The data needs to be saved in a DuckDB database called local.db.
- The destination table name will be public.latest_dump.
You can run the following command line to achieve this:
ingestr ingest \
--source-uri "gs://my-org-bucket?credentials_path=$PWD/service_account.json" \
--source-table "/data/latest/dump.csv" \
--dest-uri "duckdb:///local.db" \
--dest-table "public.latest_dump"
Supported File Formats
The gs source only supports loading files in the following formats:
- csv: Comma Separated Values (Tab Separated Values are supported as well).
- parquet: Apache Parquet storage format.
- jsonl: Line-delimited JSON. See https://jsonlines.org/
File Pattern
ingestr supports glob-like pattern matching for the gs source. This lets you specify multiple files in a single --source-table.
Below are some examples of path patterns. Each pattern is resolved from the root of the bucket:
- **/*.csv: Retrieves all the CSV files, regardless of how deep they are within the folder structure.
- *.csv: Retrieves all the CSV files from the first level of a folder.
- myFolder/**/*.jsonl: Retrieves all the JSONL files from anywhere under myFolder.
- myFolder/mySubFolder/users.parquet: Retrieves the users.parquet file from mySubFolder.
- employees.jsonl: Retrieves the employees.jsonl file from the root level of the bucket.
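A pattern is passed through --source-table like any other path. The sketch below wraps the command from the earlier example in a hypothetical shell function, ingest_gcs_pattern, mainly to highlight one detail: the pattern should be quoted so that the local shell hands it to ingestr unexpanded. The function name and its arguments are illustrative, and the bucket, key file and database are the placeholder values used in the example.

```shell
# Hypothetical helper: loads every object matching a glob pattern from
# my-org-bucket into a table of the local DuckDB database.
# Arguments: <pattern> <dest_table>
ingest_gcs_pattern() {
  local pattern="$1"
  local dest_table="$2"

  # "$pattern" stays in double quotes so the local shell does not expand
  # * or ** against local files before ingestr receives the pattern.
  ingestr ingest \
    --source-uri "gs://my-org-bucket?credentials_path=$PWD/service_account.json" \
    --source-table "$pattern" \
    --dest-uri "duckdb:///local.db" \
    --dest-table "$dest_table"
}
```

For instance, ingest_gcs_pattern 'myFolder/**/*.jsonl' 'public.my_folder_events' would request all JSONL files under myFolder; the destination table name here is made up for illustration.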