Skip to content

Redshift

The Redshift driver depends on psycopg2 and can be installed with:

pip install "sayn[redshift]"

The Redshift connector looks for the following parameters:

Parameter Description Default
host Host name or public IP of the cluster Required on standard user/password connection
port Connection port 5439
user User name used to connect Required
password Password for that user Required on standard user/password connection
cluster_id Cluster id as registered in AWS
dbname Database in use upon connection Required

For advanced configurations, SAYN will pass other parameters to create_engine, so check the sqlalchemy psycopg2 dialect for extra parameters.

Connection Types

SAYN supports 2 connection models for Redshift: standard user/password connection and IAM based.

Standard User/Password Connection

If you have a user name and password for redshift use the first model and ensure host and password are specified.

settings.yaml

credentials:
  redshift-conn:
    type: redshift
    host: my-redshift-cluster.adhfjlasdljfd.eu-west-1.redshift.amazonaws.com
    port: 5439
    user: awsuser
    password: 'Pas$w0rd' #use quotes to avoid conflict with special characters
    dbname: models

Connecting With IAM

With an IAM based connection SAYN uses the AWS API to obtain a temporary password to stablish the connection, so only user, dbname and cluster_id are required.

settings.yaml

credentials:
  redshift-conn:
    type: redshift
    cluster_id: my-redshift-cluster
    user: awsuser
    dbname: models

For this connection type to work:

  • boto3 needs to be installed in the project virtual environment pip install boto3.
  • The AWS cli need to be setup.
  • The user and dbname still need to be specified (use the database user, not the IAM:user).
  • host and port can be skipped and these values will be obtained using boto3's redshift describe-clusters.

Redshift Specific DDL

Indexes

Redshift doesn't support index definitions, and so autosql and copy tasks will forbid its definition in the ddl entry in the task definition.

Sorting

Table sorting can be specified under the ddl entry in the task definition

tasks/base.yaml

tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      sorting:
        columns:
          - arena_name
          - fighter1_name

With the above example, the table f_battles will be sorted by arena_name and fighter1_name using a compound key (Redshift default). The type of sorting can be changed to interleaved.

tasks/base.yaml

tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      sorting:
        type: interleaved
        columns:
          - arena_name
          - fighter1_name

For more information, read the latest docs about SORTKEY.

Distribution

We can also specify the type of distribution: even, all or key based. If not specified, the Redshift default is even distribution.

tasks/base.yaml

tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      distribution: all

If we want to distribute the table by a given column use the following:

tasks/base.yaml

tasks:
  f_battles:
    type: autosql
    file_name: f_battles.sql
    materialisation: table
    destination:
      table: f_battles
    table_properties:
      distribution: key(tournament_name)

For more information, read the latest docs about DISTKEY.