# Redshift

The Redshift driver depends on psycopg2 and can be installed with:

```bash
pip install "sayn[redshift]"
```
The Redshift connector looks for the following parameters:
| Parameter | Description | Default |
| --- | --- | --- |
| `host` | Host name or public IP of the cluster | Required for a standard user/password connection |
| `port` | Connection port | 5439 |
| `user` | User name used to connect | Required |
| `password` | Password for that user | Required for a standard user/password connection |
| `cluster_id` | Cluster id as registered in AWS | |
| `dbname` | Database in use upon connection | Required |
For advanced configurations, SAYN will pass any other parameters to `create_engine`, so check the SQLAlchemy psycopg2 dialect documentation for extra parameters.
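These parameters combine into a SQLAlchemy connection URL. The sketch below is purely illustrative of how such a URL is formed; the `redshift+psycopg2` driver name and the helper function are assumptions for this example, not SAYN's actual internals:

```python
from urllib.parse import quote_plus

def redshift_url(user, password, host, dbname, port=5439):
    """Build a SQLAlchemy-style connection URL for Redshift via psycopg2.

    Illustrative only: SAYN constructs its engine internally; the
    'redshift+psycopg2' driver name is an assumption for this sketch.
    """
    # quote_plus escapes special characters such as '$' in passwords
    return f"redshift+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{dbname}"

url = redshift_url(
    "awsuser",
    "Pas$w0rd",
    "my-redshift-cluster.adhfjlasdljfd.eu-west-1.redshift.amazonaws.com",
    "models",
)
```

Note how the `$` in the password must be URL-escaped; this is also why quoting the password in `settings.yaml` matters.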
## Connection Types

SAYN supports two connection models for Redshift: standard user/password and IAM based.

### Standard User/Password Connection

If you have a user name and password for Redshift, use the standard model and ensure `host` and `password` are specified:
`settings.yaml`

```yaml
credentials:
  redshift-conn:
    type: redshift
    host: my-redshift-cluster.adhfjlasdljfd.eu-west-1.redshift.amazonaws.com
    port: 5439
    user: awsuser
    password: 'Pas$w0rd' # use quotes to avoid conflicts with special characters
    dbname: models
```
### Connecting With IAM

With an IAM based connection, SAYN uses the AWS API to obtain a temporary password to establish the connection, so only `user`, `dbname` and `cluster_id` are required:
`settings.yaml`

```yaml
credentials:
  redshift-conn:
    type: redshift
    cluster_id: my-redshift-cluster
    user: awsuser
    dbname: models
```
For this connection type to work:

- boto3 needs to be installed in the project virtual environment: `pip install boto3`.
- The AWS CLI needs to be set up.
- `user` and `dbname` still need to be specified (use the database user, not the `IAM:user`).
- `host` and `port` can be skipped; these values will be obtained using boto3's `redshift describe-clusters`.
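The endpoint lookup works off the response of `boto3.client("redshift").describe_clusters(ClusterIdentifier=...)`. The sketch below shows how host and port can be read from that response; the response structure follows the Redshift API, but the sample values and the helper function are made up for illustration:

```python
# Trimmed-down sample of a describe_clusters response; the values are
# made up for illustration, the nesting follows the Redshift API.
sample_response = {
    "Clusters": [
        {
            "ClusterIdentifier": "my-redshift-cluster",
            "Endpoint": {
                "Address": "my-redshift-cluster.adhfjlasdljfd.eu-west-1.redshift.amazonaws.com",
                "Port": 5439,
            },
        }
    ]
}

def endpoint_from_response(response):
    """Extract (host, port) from a describe_clusters response,
    the way a connector would fill in the skipped values."""
    endpoint = response["Clusters"][0]["Endpoint"]
    return endpoint["Address"], endpoint["Port"]

host, port = endpoint_from_response(sample_response)
```

In a real run the response would come from an authenticated boto3 call, which is why the AWS CLI needs to be configured first.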
## Redshift Specific DDL

### Indexes

Redshift doesn't support index definitions, so autosql and copy tasks forbid them in the `ddl` entry of the task definition.
### Sorting

Table sorting can be specified under the `ddl` entry in the task definition:
`tasks/base.yaml`

```yaml
tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      sorting:
        columns:
          - arena_name
          - fighter1_name
```
With the above example, the table `f_battles` will be sorted by `arena_name` and `fighter1_name` using a compound key (the Redshift default). The type of sorting can be changed to interleaved:
`tasks/base.yaml`

```yaml
tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      sorting:
        type: interleaved
        columns:
          - arena_name
          - fighter1_name
```
For more information, read the latest docs about SORTKEY.
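In Redshift terms, the two sorting configurations above correspond to DDL along these lines (a sketch only: the column definitions are illustrative and the exact statement SAYN generates may differ):

```sql
-- sorting with no type: compound sort key (the Redshift default)
CREATE TABLE f_battles (
    arena_name    VARCHAR(256),
    fighter1_name VARCHAR(256)
)
COMPOUND SORTKEY (arena_name, fighter1_name);

-- sorting with type: interleaved
CREATE TABLE f_battles (
    arena_name    VARCHAR(256),
    fighter1_name VARCHAR(256)
)
INTERLEAVED SORTKEY (arena_name, fighter1_name);
```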
### Distribution

We can also specify the type of distribution: even, all, or key based. If not specified, the Redshift default is even distribution.
`tasks/base.yaml`

```yaml
tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      distribution: all
```
To distribute the table by a given column, use the following:
`tasks/base.yaml`

```yaml
tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      distribution: key(tournament_name)
```
For more information, read the latest docs about DISTKEY.
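The two distribution configurations above correspond to Redshift DDL along these lines (a sketch only: the column definitions are illustrative and the exact statement SAYN generates may differ):

```sql
-- distribution: all
CREATE TABLE f_battles (
    tournament_name VARCHAR(256)
)
DISTSTYLE ALL;

-- distribution: key(tournament_name)
CREATE TABLE f_battles (
    tournament_name VARCHAR(256)
)
DISTSTYLE KEY DISTKEY (tournament_name);
```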