Tutorial Part 3: Using Python With SAYN¶
The previous section of this tutorial showed you how to use SAYN for data modelling. We will now show you how to use Python with SAYN, which lets you write end-to-end ELT processes and data science tasks. Let's dive in!
Adding Your Python Task Group¶
As we did for the autosql tasks, we will need to add a task group for our Python tasks. To do so, add the following to your project.yaml:
project.yaml
...
groups:
  models:
    ...

  say_hello:
    type: python
    module: say_hello
This will do the following:
- Create a task group called say_hello.
- All tasks in this group will be of type python.
- All functions using the task decorator in the file python/say_hello.py will be transformed into Python tasks. This file should already exist in your python folder and defines one task: say_hello. All python tasks should be stored in the python folder, where an __init__.py file must exist.
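With the group in place, you can check that SAYN picks the task up by running it on its own. Assuming SAYN's standard task filter flag, a quick check from the project root looks like this:
sayn run -t say_hello
If everything is wired up correctly, the run should log Hello! (see the task definition in the next section).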
Writing Your Python Tasks¶
A Simple Python Task¶
Our tutorial project has two python tasks. It starts with a simple Python task that interacts with the task's context. This is the say_hello task in the python/say_hello.py file, defined as follows:
python/say_hello.py
from sayn import task

@task
def say_hello(context):
    context.info('Hello!')
Here are the core concepts to know for running Python tasks with SAYN:
- Import the task decorator from sayn, which you can then use to define your tasks. There is a more advanced way to define python tasks with classes, which you can find in the documentation on python tasks.
- Use the @task decorator and then simply define your function. The task name will be the name of the function, in this case say_hello.
- We pass the context to our function, which can then be used in our code to access task-related information and control the logger. In our case, we log Hello! at the info level. A second example of the same pattern follows this list.
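To make these concepts concrete, here is a minimal sketch of a second decorated task. It uses only the task decorator and context logger shown above; the say_goodbye name and message are hypothetical, purely for illustration:

from sayn import task

@task
def say_goodbye(context):
    # The function name becomes the task name: say_goodbye
    context.info('Goodbye!')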
Creating Data Logs¶
The second task in the python/load_data.py
module actually does something more interesting. It creates some random logs, which is the data you initially had in the logs table of dev.db
. First, let's add this task to project.yaml
by adding a task group:
project.yaml
...
groups:
  models:
    ...

  logs:
    type: python
    module: load_data
Let's look at the whole code from the python/load_data.py file:
python/load_data.py
import random
from uuid import uuid4

from sayn import task


@task()
def say_hello(context):
    context.info('Hello!')


@task(outputs=[
    'logs_arenas',
    'logs_tournaments',
    'logs_battles',
    'logs_fighters']
)
def load_data(context, warehouse):
    fighters = ["Son Goku", "John", "Lucy", "Dr. x", "Carlos", "Dr. y?"]
    arenas = ["Earth Canyon", "World Edge", "Namek", "Volcanic Crater", "Underwater"]
    tournaments = ["World Championships", "Tenka-ichi Budokai", "King of the Mountain"]

    context.set_run_steps(
        [
            "Generate Dimensions",
            "Generate Battles",
            "Load fighters",
            "Load arenas",
            "Load tournaments",
            "Load battles",
        ]
    )

    with context.step("Generate Dimensions"):
        # Add ids to the dimensions
        fighters = [
            {"fighter_id": str(uuid4()), "fighter_name": val}
            for val in fighters
        ]

        arenas = [
            {"arena_id": str(uuid4()), "arena_name": val}
            for val in arenas
        ]

        tournaments = [
            {"tournament_id": str(uuid4()), "tournament_name": val}
            for val in tournaments
        ]

    with context.step("Generate Battles"):
        battles = list()

        for tournament in tournaments:
            tournament_id = tournament["tournament_id"]

            # Randomly select a number of battles to generate for each tournament
            n_battles = random.choice([10, 20, 30])

            for _ in range(n_battles):
                battle_id = str(uuid4())

                # Randomly choose fighters and arena
                fighter1_id = random.choice(fighters)["fighter_id"]
                fighter2_id = random.choice(
                    [f for f in fighters if f["fighter_id"] != fighter1_id]
                )["fighter_id"]
                arena_id = random.choice(arenas)["arena_id"]

                # Pick a winner
                winner_id = (
                    fighter1_id if random.uniform(0, 1) <= 0.5 else fighter2_id
                )

                battles.append(
                    {
                        "event_id": str(uuid4()),
                        "tournament_id": tournament_id,
                        "battle_id": battle_id,
                        "arena_id": arena_id,
                        "fighter1_id": fighter1_id,
                        "fighter2_id": fighter2_id,
                        "winner_id": winner_id,
                    }
                )

    data_to_load = {
        "fighters": fighters,
        "arenas": arenas,
        "tournaments": tournaments,
        "battles": battles,
    }

    # Load logs
    for log_type, log_data in data_to_load.items():
        with context.step(f"Load {log_type}"):
            warehouse.load_data(f"logs_{log_type}", log_data, replace=True)
The second task defined in this module is the load_data one. It uses some further features:
- The load_data task produces outputs, which are defined in the decorator. This enables you to refer to these outputs with the src function in autosql tasks and automatically set dependencies on the load_data task.
- warehouse is also passed as a parameter to the function. This gives you easy access to the warehouse connection in your task; you can see this at the end of the script with the call to warehouse.load_data. A stripped-down sketch of this pattern follows this list.
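Removing the data generation, the pattern described above is quite small. Here is a minimal sketch, reusing the decorator and warehouse behaviour from the tutorial code; the logs_example table and its single row are made up for illustration:

from sayn import task

@task(outputs=['logs_example'])
def load_example(context, warehouse):
    context.info('Loading one example row')
    # warehouse is the project's database connection; load_data
    # creates or replaces the target table and inserts the rows
    warehouse.load_data('logs_example', [{'id': 1, 'name': 'demo'}], replace=True)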
Setting Dependencies With load_data¶
As mentioned, we now have a python task which produces some logs that we want our autosql tasks to use for the data modelling process. As a result, we should ensure that the load_data task is always executed first. Because our load_data task produces outputs, we can refer to these with the src function in autosql tasks and automatically create dependencies. For example, the SQL query of the dim_arenas task should be changed from:
sql/dim_arenas.sql
SELECT l.arena_id
, l.arena_name
FROM logs_arenas l
To:
sql/dim_arenas.sql amended
SELECT l.arena_id
, l.arena_name
FROM {{ src('logs_arenas') }} l
This now indicates that the dim_arenas task sources the logs_arenas table, which is an output of the load_data task. SAYN will automatically make load_data a parent of the dim_arenas task and therefore always execute it first. You can do the same for all the other logs tables used in the other autosql tasks, as in the example below.
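For instance, assuming your project also has a dim_fighters task reading from the logs_fighters output (the column names are taken from the load_data code above), the amended query would follow the same pattern:
sql/dim_fighters.sql amended
SELECT l.fighter_id
     , l.fighter_name
  FROM {{ src('logs_fighters') }} l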
What Next?¶
This is it for our tutorial. You should now have a good understanding of the true power of SAYN! Our documentation has more extensive details about all the SAYN core concepts.
Enjoy SAYN and happy ELT-ing! :)