Tutorial Part 2: Data Modelling With SAYN¶

In the first part of the tutorial, we executed our first SAYN run and went through the core components of a SAYN project. We will now see how to use SAYN for data modelling, a major process of analytics warehousing.

SQL Tasks With SAYN¶

SAYN can execute two main types of SQL tasks: * autosql: these tasks take a SELECT statement and create a table or view using it. All the processes are automated by SAYN in the background. * sql: these tasks take your SQL statement as is and execute it. These are not covered in this tutorial.

Defining The Task Group¶

In order to execute your tasks, you will need to define a task group. This is done in the project.yaml file, under the groups entry. This is the models group we have defined in our project:

project.yaml

...

groups:
  models:
    type: autosql
    file_name: "{{ task.group }}/*.sql"
    materialisation: table
    destination:
      table: "{{ task.name }}"

This group effectively does the following:

It creates a task group called models.
This task group is defined to be autosql tasks.
Each file with a .sql extension in the sql/models folder will be turned into an autosql task. Do not worry about the {{ task.group }} notation for now, this is simple Jinja templating to dynamically pick the models name of the group.
All tasks from the group will model the output as tables.
The tables will be named as per the task name. This task name is automatically generated from your file name, excluding the .sql extension.

Writing Your Models¶

A Simple Model¶

As explained in the previous section, each file with a .sql extension in the sql/models folder will be turned into an autosql task following our models group definition.

For example, the dim_arenas.sql file in the sql/models folder will be turned into a dim_arenas task. This is the SQL code of this file:

sql/dim_arenas.sql

SELECT l.arena_id
     , l.arena_name
  FROM logs_arenas l

When executed, this task will create a table called dim_arenas using this SQL code. This is the dim_arenas you can find in the dev.db SQLite database at the root level of the project folder. You can execute this task only by running the command sayn run -t dim_arenas, where dim_arenas is our task name.

A Model With Dependency¶

Now, let's have a look at a model which depends on another model. The file f_battles.sql is a good example and actually depends on multiple other models. This is how the SQL query is written:

sql/f_battles.sql

SELECT t.tournament_name
     , t.tournament_name || '-' || CAST(b.battle_id AS VARCHAR) AS battle_id
     , a.arena_name
     , f1.fighter_name AS fighter1_name
     , f2.fighter_name AS fighter2_name
     , w.fighter_name AS winner_name

  FROM logs_battles b

  LEFT JOIN {{ src('dim_tournaments') }} t
    ON b.tournament_id = t.tournament_id

  LEFT JOIN {{ src('dim_arenas') }} a
    ON b.arena_id = a.arena_id

  LEFT JOIN {{ src('dim_fighters') }} f1
    ON b.fighter1_id = f1.fighter_id

  LEFT JOIN {{ src('dim_fighters') }} f2
    ON b.fighter2_id = f2.fighter_id

  LEFT JOIN {{ src('dim_fighters') }} w
    ON b.winner_id = w.fighter_id

As you can see, this query uses another Jinja templating notation, the src function which is core to using SAYN efficiently. You pass this function the name of a table, and SAYN will automatically build the dependencies between your tasks! For example, this f_battles task sources the table dim_tournaments (amongst others) with the src function. As a result, SAYN will look for the task that produces this dim_tournaments table (which is the dim_tournaments task) and set this task as a parent of the f_battles task. Therefore, dim_tournaments will always execute before f_battles. From the above code, you can see that many tasks will be set as parents of the f_battles task.

Changing Your Model Materialisation¶

You can easily amend the configuration of a single task when necessary with SAYN. For example, the task f_rankings, generated by f_rankings.sql, uses the following query:

sql/f_rankings.sql

{{ config(
    materialisation='view'
   )
}}

SELECT fr.fighter_name
     , CAST(SUM(fr.is_winner) AS FLOAT) / COUNT(DISTINCT fr.battle_id) AS win_rate
  FROM {{ src('f_fighter_results') }} fr
 GROUP BY 1
 ORDER BY 2 DESC

As you can see, this task uses the config function, which will in this case overwrite the materialisation of the task's output to a view instead of a table. This config function can be really useful when you want to overwrite some attributes of specific tasks within your group.

What Next?¶

You now know how to build data models with SAYN. The next section of this tutorial will now go through how to use SAYN for Python processes. This will enable you to leverage SAYN for end-to-end ELT processes or data science tasks!