# `python` Task

## About
The `python` task allows you to run Python scripts, so these tasks can do anything Python can do. They are extremely useful for data extraction or data science models.

There are two models for specifying `python` tasks in SAYN: a simple one using decorators and a more advanced one based on classes.
## Simple Definition of `python` Tasks
You can define `python` tasks in SAYN very simply by using decorators. This lets you write a Python function and turn that function into a task. First, you need to add a group in `project.yaml` pointing to the `.py` file where the task code lives:
project.yaml

```yaml
groups:
  decorator_tasks:
    type: python
    module: decorator_tasks
    parameters:
      param1: some_value
```
Now all tasks defined in `python/decorator_tasks.py` will be added to the DAG. The `module` property expects a python path relative to the `python` folder, in the same way you would import a module in python. For example, if our task definition lives in `python/example_mod/decorator_tasks.py`, the value of `module` would be `example_mod.decorator_tasks`.
python/decorator_tasks.py

```python
from sayn import task

@task(outputs='logs.api_table', sources='logs.another_table')
def example_task(context, warehouse, param1):
    src_table = context.src('logs.another_table')
    out_table = context.out('logs.api_table')
    warehouse.execute(f'CREATE OR REPLACE TABLE {out_table} AS SELECT * FROM {src_table}')
```
The above example showcases the key elements of a `python` task:

- `task`: we import SAYN's `task` decorator, which is used to turn functions into SAYN tasks added to the DAG.
- parameters to `task`: we can pass the parameters `sources`, `outputs` and `parents`, which are either lists of table names or a single table name. This allows SAYN to define the task dependencies. We can also pass a value for `on_fail` and `tags`.
- function name: the name of the function (`example_task` here) will be the name of the task. We can use this name with `-t` to execute only this task, for example.
- function parameters: arguments to the function have special meaning, so the names need to be respected (see the sketch after this list):
    - `context`: an object granting access to functionality like project parameters, connections and other functions described further down.
    - `warehouse`: connection names (`required_credentials` in `project.yaml`) automatically provide the object of that connection. You can specify any number of connections here.
    - `param1`: the rest of the function arguments are matched against task parameters, i.e. the values defined in the `parameters` property of the group.
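To see how these elements fit together, here is a hedged sketch of another decorator task; the table names, the `daily` tag and the event-type filter are hypothetical, and it assumes the `decorator_tasks` group above passes `param1` through as a plain string:

```python
from sayn import task

# Hypothetical example: `param1` comes from the group's `parameters` section
# and `warehouse` resolves to the connection with that name.
@task(sources='logs.raw_events', outputs='logs.filtered_events', tags='daily')
def filter_events(context, warehouse, param1):
    src_table = context.src('logs.raw_events')
    out_table = context.out('logs.filtered_events')
    # Use the task parameter to decide which rows to copy over
    warehouse.execute(
        f"CREATE OR REPLACE TABLE {out_table} AS "
        f"SELECT * FROM {src_table} WHERE event_type = '{param1}'"
    )
```

You could then run just this task with `sayn run -t filter_events`.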
Python decorators

Decorators in python are used to modify the behaviour of a function. They can be a bit daunting when first encountered, but for the purposes of SAYN all you need to know is that `@task` turns a standard python function into a SAYN task which can access useful properties via its arguments. There are many resources online describing how decorators work, for example this one.
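If decorators are new to you, this generic (non-SAYN) snippet shows the basic mechanics: a decorator is simply a function that receives a function and returns a replacement for it.

```python
def shout(func):
    # The decorator receives a function and returns a wrapped version of it
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout  # equivalent to: greet = shout(greet)
def greet(name):
    return f"hello {name}"

print(greet("sayn"))  # prints "HELLO SAYN"
```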
Given the code above, this task will:

- Depend on (execute after) the tasks that produce `logs.another_table`, since we added the `sources` argument to the decorator.
- Be the parent of (execute before) any task reading from `logs.api_table`, since we added the `outputs` argument to the decorator.
- Get the compiled values of `logs.another_table` and `logs.api_table` and keep them in two variables. For details on database object compilation, make sure you check the database objects page.
- Execute a `CREATE TABLE` statement using the tables above on the database called `warehouse` in the project.
## Advanced `python` Task Definition With Classes
The second model for defining python tasks is through classes. This model gives us the opportunity to:

- do validation before the task is executed by overloading the `setup` method, which is useful as a way to flag early in the execution that something is incorrectly defined, rather than waiting for the task to fail.
- define more complex dependencies than `sources` and `outputs` by overloading the `config` method.
- implement code for the `compile` stage, allowing for earlier indication of problems.
A `python` task using classes is defined as follows:
tasks/base.yaml

```yaml
task_python:
  type: python
  class: file_name.ClassName
```
Where `class` is a python path to the Python class implementing the task. This code should be stored in the `python` folder of your project, which is itself a python module that's dynamically loaded, so it needs an empty `__init__.py` file in the folder. The class then needs to be defined as follows:
python/file_name.py

```python
from sayn import PythonTask

class ClassName(PythonTask):
    def config(self):
        self.src('logs.source_table')
        self.out('logs.output_table')

    def setup(self):
        # Do some validation of the parameters
        return self.success()

    def run(self):
        # Do something useful
        return self.success()
```
In this example:

- We create a new class inheriting from SAYN's `PythonTask`.
- We set some dependencies by calling `self.src` and `self.out`.
- We define a `setup` method to do some sanity checks (see the sketch after this list). This method can be skipped, but it's useful for checking the validity of project parameters or doing some initial setup.
- We define the actual process to execute during `sayn run` with the `run` method.
- Both `setup` and `run` report the task status as successful with `return self.success()`; however, we can indicate a task failure to SAYN with `return self.fail()`. Failing a python task forces child tasks to be skipped.
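To make the benefits of `config` and `setup` more concrete, here is a hedged sketch; the `table` task parameter and the table names are hypothetical, and it assumes `self.parameters` behaves like a dictionary and that the database objects expose an `execute` method as in the decorator example above:

```python
from sayn import PythonTask

class LoadTable(PythonTask):
    def config(self):
        # Dependencies are computed from a (hypothetical) task parameter
        # rather than listed statically
        table = self.parameters['table']
        self.src(f'logs.raw_{table}')
        self.out(f'logs.{table}')

    def setup(self):
        # Flag a misconfigured task early instead of letting the run fail later
        if self.parameters['table'] not in ('battles', 'fighters'):
            return self.fail(f"unexpected table: {self.parameters['table']}")
        return self.success()

    def run(self):
        table = self.parameters['table']
        src = self.src(f'logs.raw_{table}')
        out = self.out(f'logs.{table}')
        self.default_db.execute(f'CREATE OR REPLACE TABLE {out} AS SELECT * FROM {src}')
        return self.success()
```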
Attention

Python tasks can return `self.success()` or `self.fail()` to indicate the result of the execution, but it's not mandatory. If the code throws a python exception, the task will be considered failed.
## Using the SAYN API
When defining your `python` task, you will often want to access parts of the SAYN infrastructure like parameters and connections. When using the decorator model, we access this functionality by including the `context` argument in the function; when using the class model, the more standard `self` is used. Both give access to the same functionality. The properties available through `self` and `context` are:
- `parameters`: accesses project and task parameters. For more details on `parameters`, see the Parameters section.
- `run_arguments`: provides access to the arguments passed to the `sayn run` command, like the incremental values (`full_load`, `start_dt` and `end_dt`). These are used in the sketch after this list.
- `connections`: a dictionary containing the databases and other custom API credentials. API connections appear as simple python dictionaries, while databases are SAYN Database objects.
- `default_db`: provides access to the `default_db` database object specified in the `project.yaml` file.
- `src`: the `src` macro that translates database object names as described in database objects. Bear in mind that this function also adds dependencies to the task, but only when called from the `config` method of a python task defined with the class model.
- `out`: the `out` macro that translates database object names as described in database objects. Bear in mind that this function also creates dependencies between tasks, but only when called from the `config` method of a python task defined with the class model.
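Putting several of these properties together, here is a hedged sketch of a decorator task; the table and column names are made up, and it assumes `run_arguments` behaves like a dictionary holding the incremental values listed above:

```python
from sayn import task

@task(sources='logs.raw_events', outputs='logs.daily_events')
def incremental_copy(context):
    src_table = context.src('logs.raw_events')
    out_table = context.out('logs.daily_events')

    # run_arguments carries the flags passed to `sayn run`
    if context.run_arguments['full_load']:
        where = ''
    else:
        start = context.run_arguments['start_dt']
        end = context.run_arguments['end_dt']
        where = f"WHERE event_dt BETWEEN '{start}' AND '{end}'"

    # default_db is the database marked as default_db in project.yaml
    context.default_db.execute(
        f'CREATE OR REPLACE TABLE {out_table} AS SELECT * FROM {src_table} {where}'
    )
```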
Tip

You can use `self.default_db` to easily perform operations on the default database, such as reading or loading data; if using decorators, simply include an argument with the name of the connection. See the methods available on the Database API.
Tip

We all love `pandas`! If you want to load a pandas dataframe you can use one of these options (a fuller sketch follows this tip):

- with the `pandas.DataFrame.to_sql` method: `df.to_sql('table', self.default_db.engine)`.
- with the `self.default_db.load_data` method: `self.default_db.load_data('table', df.to_dict('records'))`.
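For instance, a minimal sketch of a class-based task loading a small dataframe (the table name and data are made up) could look like this:

```python
import pandas as pd

from sayn import PythonTask

class LoadFighters(PythonTask):
    def run(self):
        # Build (or fetch) a dataframe and push it to the default database
        df = pd.DataFrame({'fighter': ['Son Goku', 'Vegeta'], 'wins': [10, 8]})
        self.default_db.load_data('fighters', df.to_dict('records'))
        return self.success()
```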
## Logging For `python` Tasks With The SAYN API
The unit of process within a task in SAYN is the step. Using steps is useful for indicating the current progress of execution and also for debugging purposes. The tutorial is a good example of usage, as we define the `load_data` task as having 5 steps:
python/load_data.py

```python
context.set_run_steps(
    [
        "Generate Data",
        "Load fighters",
        "Load arenas",
        "Load tournaments",
        "Load battles",
    ]
)
```
This code defines which steps form the task. Then we can mark the start and end of each step with:
python/load_data.py

```python
with context.step('Generate Data'):
    data_to_load = get_data(tournament_battles)
```
Which will output the following on the screen:

CLI output

```
[1/7] load_data (started at 15:25): Step [1/5] Generate Data
```
The default CLI presentation shows only the current step being executed, which in the case of the tutorial project goes by very quickly. However, we can persist these messages by passing the debug flag to the CLI, `sayn run -d`, which gives you this:
CLI output

```
[1/7] load_data (started at 15:29)
Run Steps: Generate Data, Load fighters, Load arenas, Load tournaments, Load battles
ℹ [1/5] [15:29] Executing Generate Data
✔ [1/5] [15:29] Generate Data (19.5ms)
ℹ [2/5] [15:29] Executing Load fighters
✔ [2/5] [15:29] Load fighters (16.9ms)
ℹ [3/5] [15:29] Executing Load arenas
✔ [3/5] [15:29] Load arenas (12.3ms)
ℹ [4/5] [15:29] Executing Load tournaments
✔ [4/5] [15:29] Load tournaments (10.9ms)
ℹ [5/5] [15:29] Executing Load battles
✔ [5/5] [15:29] Load battles (210.3ms)
✔ Took (273ms)
```
So you can see the time it takes to perform each step.
Sometimes it's useful to output some extra text beyond steps. In those cases, the API provides some methods for a more ad hoc logging model (a short sketch after this list combines them with steps):

- `self.debug(text)`: debug log to console and file. Not printed unless `-d` is used.
- `self.info(text)`: info log to console and file. Not persisted on the screen if `-d` is not specified.
- `self.warning(text)`: warning log to console and file. Remains on the screen after the task finishes (look for yellow lines).
- `self.error(text)`: error log to console and file. Remains on the screen after the task finishes (look for red lines).
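Here is a hedged sketch combining steps with these logging methods inside a class-based task; the step names and records are made up, and it assumes `set_run_steps` and `step` are available on `self` just as they are on `context`:

```python
from sayn import PythonTask

class LoadEvents(PythonTask):
    def run(self):
        # Declare the steps up front so progress is visible in the CLI
        self.set_run_steps(['Generate data', 'Load data'])

        with self.step('Generate data'):
            records = [{'event': 'battle', 'winner': 'Son Goku'}]
            self.debug(f'Generated {len(records)} records')  # only printed with -d

        with self.step('Load data'):
            if not records:
                self.warning('Nothing to load')  # stays on screen in yellow
            self.default_db.load_data('events', records)

        return self.success()
```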
Note

`self.error` doesn't abort the execution of the task, nor does it set the final status to failed. To indicate that a python task has failed, use `return self.fail(text)`, where `text` is an optional message string that will be shown on the screen.
For more details on the SAYN API, check the API reference page.