Tutorial Part 3: Using Python With SAYN¶
The previous section of this tutorial showed you how to use SAYN for data modelling. We will now show you how to use Python with SAYN, which enables you to write end-to-end ELT processes and data science tasks. Let's dive in!
Adding Your Python Task Group¶
As we did for the autosql tasks, we will need to add a task group for our Python tasks. To do so, add the following to your project.yaml:
project.yaml
...

groups:
  models:
    ...

  say_hello:
    type: python
    module: say_hello
This will do the following:

- Create a task group called say_hello.
- All tasks in this group will be of type python.
- All functions using the task decorator in the file python/say_hello.py will be transformed into Python tasks, as illustrated after this list. This file should already exist in your python folder and defines one task: say_hello. All python tasks should be stored in the python folder, where an __init__.py file must exist.
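Because each decorated function becomes its own task, named after the function, a single module can define several tasks. Here is a quick sketch of that idea; the module and function names below are hypothetical:

python/example.py (hypothetical)
from sayn import task

@task
def first_task(context):
    context.info('This function becomes a task called first_task')

@task
def second_task(context):
    context.info('This function becomes a task called second_task')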
Writing Your Python Tasks¶
A Simple Python Task¶
Our tutorial project has two python tasks. The first is a simple Python task that interacts with the task's context: the say_hello task, in the python/say_hello.py file, defined as follows:
python/say_hello.py
from sayn import task

@task
def say_hello(context):
    context.info('Hello!')
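With the group added to project.yaml, you can execute this task on its own with sayn run -t say_hello; the -t flag limits the run to the tasks listed (see the SAYN documentation on the command line interface for details).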
Here are the core concepts to know for running Python tasks with SAYN:

- You should import the task decorator from sayn, which you can then use to define your tasks. There is a more advanced way to define python tasks with classes, which you can find in the documentation related to python tasks; a sketch is shown after this list.
- Use the @task decorator and then simply define your function. The task name will be the name of the function, in this case say_hello.
- We pass the context to our function, which can then be used in our code to access task-related information and control the logger. In our case, we log Hello! as information.
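For reference, the class-based alternative mentioned in the first point looks roughly like the sketch below. It is based on the pattern described in the SAYN documentation on python tasks (a PythonTask subclass with setup and run methods that report their outcome); treat it as an outline and check that documentation for the exact API.

python/say_hello_class.py (hypothetical)
from sayn import PythonTask

class SayHello(PythonTask):
    def setup(self):
        # Called before run; report the outcome of the step
        return self.success()

    def run(self):
        self.info('Hello!')
        return self.success()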
Creating Data Logs¶
The second task, in the python/load_data.py module, does something more interesting: it creates some random logs, which is the data you initially had in the logs tables of dev.db. First, let's add this task to project.yaml by adding a task group:
project.yaml
...

groups:
  models:
    ...

  logs:
    type: python
    module: load_data
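As before, this creates a task group, here called logs, whose tasks are the decorated functions of the python/load_data.py module.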
Let's look at the whole code from the python/load_data.py file:
python/load_data.py
import random
from uuid import uuid4

from sayn import task


@task()
def say_hello(context):
    context.info('Hello!')


@task(outputs=[
    'logs_arenas',
    'logs_tournaments',
    'logs_battles',
    'logs_fighters']
)
def load_data(context, warehouse):
    fighters = ["Son Goku", "John", "Lucy", "Dr. x", "Carlos", "Dr. y?"]
    arenas = ["Earth Canyon", "World Edge", "Namek", "Volcanic Crater", "Underwater"]
    tournaments = ["World Championships", "Tenka-ichi Budokai", "King of the Mountain"]

    context.set_run_steps(
        [
            "Generate Dimensions",
            "Generate Battles",
            "Load fighters",
            "Load arenas",
            "Load tournaments",
            "Load battles",
        ]
    )

    with context.step("Generate Dimensions"):
        # Add ids to the dimensions
        fighters = [
            {"fighter_id": str(uuid4()), "fighter_name": val}
            for id, val in enumerate(fighters)
        ]
        arenas = [
            {"arena_id": str(uuid4()), "arena_name": val}
            for id, val in enumerate(arenas)
        ]
        tournaments = [
            {"tournament_id": str(uuid4()), "tournament_name": val}
            for id, val in enumerate(tournaments)
        ]

    with context.step("Generate Battles"):
        battles = list()
        for tournament in tournaments:
            tournament_id = tournament["tournament_id"]
            # Randomly select a number of battles to generate for each tournament
            n_battles = random.choice([10, 20, 30])
            for _ in range(n_battles):
                battle_id = str(uuid4())
                # Randomly choose fighters and arena
                fighter1_id = random.choice(fighters)["fighter_id"]
                fighter2_id = random.choice(
                    [f for f in fighters if f["fighter_id"] != fighter1_id]
                )["fighter_id"]
                arena_id = random.choice(arenas)["arena_id"]
                # Pick a winner
                winner_id = (
                    fighter1_id if random.uniform(0, 1) <= 0.5 else fighter2_id
                )
                battles.append(
                    {
                        "event_id": str(uuid4()),
                        "tournament_id": tournament_id,
                        "battle_id": battle_id,
                        "arena_id": arena_id,
                        "fighter1_id": fighter1_id,
                        "fighter2_id": fighter2_id,
                        "winner_id": winner_id,
                    }
                )

    data_to_load = {
        "fighters": fighters,
        "arenas": arenas,
        "tournaments": tournaments,
        "battles": battles,
    }

    # Load logs
    for log_type, log_data in data_to_load.items():
        with context.step(f"Load {log_type}"):
            warehouse.load_data(f"logs_{log_type}", log_data, replace=True)
The second task defined in this module is the load_data one. It uses some further features:

- The load_data task produces outputs, which are defined in the decorator. This will enable you to refer to these outputs with the src function in autosql tasks and automatically set dependencies to the load_data task.
- warehouse is also passed as a parameter to the function. This enables you to easily access the warehouse connection in your task. You can notably see this at the end of the script, with the call to warehouse.load_data. A condensed sketch of both features is shown after this list.
Setting Dependencies With load_data¶
As mentioned, we now have a python task that produces the logs we want our autosql tasks to use for the data modelling process. As a result, we should ensure that the load_data task is always executed first. Because our load_data task produces outputs, we can refer to these with the src function in autosql tasks and automatically create dependencies. For example, the SQL query of the dim_arenas task should be changed from:
sql/dim_arenas.sql
SELECT l.arena_id
, l.arena_name
FROM logs_arenas l
To:
sql/dim_arenas.sql amended
SELECT l.arena_id
, l.arena_name
FROM {{ src('logs_arenas') }} l
This indicates that the dim_arenas task sources the logs_arenas table, which is an output of the load_data task. SAYN will therefore automatically make load_data a parent of the dim_arenas task and always execute it first. You can do the same for all the other logs tables used in the other autosql tasks.
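Once every query references its logs table through src, a plain sayn run should execute load_data before any of the autosql tasks, without any explicit ordering in project.yaml.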
What Next?¶
This is it for our tutorial. You should now have a good understanding of the true power of SAYN! Our documentation has more extensive details about all the SAYN core concepts.
Enjoy SAYN and happy ELT-ing! :)