python Task¶
About¶
The python task allows you to run python scripts. Therefore, those tasks can do anything Python can
do. They are extremely useful for data extraction or data science models.
There are two models for specifying python tasks in SAYN: a simple way through using decorators and a more advanced way which is class based.
Simple Definition of python Tasks¶
You can define python tasks in SAYN very simply by using decorators. This will let you write a Python function and turn that function into a task. First, you need to add a group in project.yaml pointing to the .py file where the task code lives:
project.yaml
groups:
decorator_tasks:
type: python
module: decorator_tasks
parameters:
param1: some_value
Now all tasks defined in python/decorator_tasks.py will be added to the DAG. The module property expects a python path from the python folder in a similar way as you would import a module in python. For example, if our task definition exists in python/example_mod/decorator_tasks.py the value in module would be example_mod.decorator_tasks.
python/decorator_tasks.py
from sayn import task
@task(outputs='logs.api_table', sources='logs.another_table')
def example_task(context, warehouse, param1):
src_table = context.src('logs.another_table')
out_table = context.out('logs.api_table')
warehouse.execute(f'CREATE OR REPLACE TABLE {out_table} AS SELECT * from {src_table}')
The above example showcases the key elements to a python task:
task: we import SAYN'staskdecorator which is used to turn functions into SAYN tasks added to the DAG.- parameters to
task: we can pass parameterssources,outputsandparentswhich are either lists of table names or a single table name. This allows SAYN define the task dependencies. We can also pass a value foron_failandtags. - function name: the name of the function (
example_taskhere) will be the name of the task. We can use this name with-tto execute this task only for example. - function parameters: arguments to the function have special meaning and so the names need to be respected:
context: is an object granting access to some functionality like project parameters, connections and other functions as seen further down.warehouse: connection names (required_credentialsinproject.yaml) will automatically provide the object of that connection. You can specify any number of connections here.- param1: the rest of the function arguments are matched against task parameters, these are values defined in the
parameterproperty in the group.
Python decorators
Decorators in python are used to modify the behaviour of a function. It can be a bit daunting to understand when we first encounter them but for the purpose of SAYN all you need to know is that @task turns a standard python
function into a SAYN task which can assess useful properties via arguments. There are many resources online describing how decorators work,
for example this.
Given the code above, this task will:
- Depend on (execute after) the tasks that produce
logs.another_tablesince we added thesourcesargument to the decorator. - Be the parent of (execute before) any task reading from
logs.api_tablesince we added theoutputsargument to the decorator. - Get the compiled value of
logs.another_tableandlogs.api_tableand keep in 2 variables. For details on database objects compilation make sure you check the database objects page. - Execute a create table statement using the tables above on the database called
warehousein the project.
Advanced python Task Definition With Classes¶
The second model for defining python tasks is through classes. When using this model we get an opportunity to:
- do validation before the task is executed by overloading the
setupmethod, which is useful as a way to alert early during the execution that something is incorrectly defined rather than waiting for the task to fail. - define more complex dependencies than
sourcesandoutputsby overloading theconfigmethod. - implement code for the
compilestage allowing for more early stage indication of problems.
A python task using classes is defined as follows:
tasks/base.yaml
task_python:
type: python
class: file_name.ClassName
Where class is a python path to the Python class implementing the task. This code should be stored in the python folder of your project, which in itself is a python module that's dynamically loaded, so it needs an empty __init__.py file in the folder. The class then needs to be defined as follows:
python/file_name.py
from sayn import PythonTask
class ClassName(PythonTask):
def config(self):
self.src('logs.source_table')
self.out('logs.output_table')
def setup(self):
# Do some validation of the parameters
return self.success()
def run(self):
# Do something useful
return self.success()
In this example:
- We create a new class inheriting from SAYN's PythonTask.
- We set some dependencies by calling
self.srcandself.out. - We define a setup method to do some sanity checks. This method can be skipped, but it's useful to check the validity of project parameters or so some initial setup.
- We define the actual process to execute during
sayn runwith therunmethod. - Both
setupandrunreturn the task status as successfulreturn self.success(), however we can indicate a task failure to sayn withreturn self.fail(). Failing a python task forces child tasks to be skipped.
Attention
Python tasks can return self.success() or self.fail() to indicate the result of the execution, but it's not mandatory. If the code throws a python exception, the task will be considered as failed.
Using the SAYN API¶
When defining our python task, you would want to access parts of the SAYN infrastructure like
parameters and connections. When using the decorator model, to access this functionality we need to include the context argument in the function, when using the class model the more standard self is used, and both give access to the same functionality. The list of available properties through self and context is:
parameters: accesses project and task parameters. For more details onparameters, see the Parameters section.run_arguments: provides access to the arguments passed to thesayn runcommand like the incremental values (full_load,start_dtandend_dt).connections: dictionary containing the databases and other custom API credentials. API connections appear as simple python dictionaries, while databases are SAYN's Database objects.default_db: provides access to thedefault_dbdatabase object specified in theproject.yamlfile.src: thesrcmacro that translates database object names as described in database objects. Bear in mind that using this function also adds dependencies to the task, but only when called from theconfigmethod of a python task defined with the class model.out: theoutmacro that translates database object names as described in database objects. Bear in mind that using this function also creates dependencies between tasks, but only when called from theconfigmethod of a python task defined with the class model.
Tip
You can use self.default_db to easily perform some operations on the default database such as reading or
loading data or if using decorators simply include an argument with the name of the connection. See the
methods available on the Database API.
Tip
We all love pandas! If you want to load a pandas dataframe you can use one of these options:
- with the
pandas.DataFrame.to_sqlmethod:df.to_sql(self.default_db.engine, 'table'). - with the
self.default_db.load_datamethod:self.default_db.load_data('table', df.to_dict('records')).
Logging For python Tasks With The SAYN API¶
The unit of process within a task in SAYN is the step. Using steps is useful to indicate current
progress of execution but also for debugging purposes. The tutorial is a good example of usage, as we define the load_data task as having 5 steps:
python/load_data.py
context.set_run_steps(
[
"Generate Data",
"Load fighters",
"Load arenas",
"Load tournaments",
"Load battles",
]
)
This code defines which steps form the task. Then we can define the start and end of that step with:
python/load_data.py
with context.step('Generate Data'):
data_to_load = get_data(tournament_battles)
Which will output the following on the screen:
CLI output
[1/7] load_data (started at 15:25): Step [1/5] Generate Data
The default cli presentation will show only the current step being executed, which in the
case of the tutorial project goes very quickly. However we can persist these messages using
the debug flag to the cli sayn run -d giving you this:
CLI ouput
[1/7] load_data (started at 15:29)
Run Steps: Generate Data, Load fighters, Load arenas, Load tournaments, Load battles
ℹ [1/5] [15:29] Executing Generate Data
✔ [1/5] [15:29] Generate Data (19.5ms)
ℹ [2/5] [15:29] Executing Load fighters
✔ [2/5] [15:29] Load fighters (16.9ms)
ℹ [3/5] [15:29] Executing Load arenas
✔ [3/5] [15:29] Load arenas (12.3ms)
ℹ [4/5] [15:29] Executing Load tournaments
✔ [4/5] [15:29] Load tournaments (10.9ms)
ℹ [5/5] [15:29] Executing Load battles
✔ [5/5] [15:29] Load battles (210.3ms)
✔ Took (273ms)
So you can see the time it takes to perform each step.
Sometimes it's useful to output some extra text beyond steps. In those cases, the API provides some methods for a more adhoc logging model:
self.debug(text): debug log to console and file. Not printed unless-dis used.self.info(text): info log to console and file. Not persisted to the screen if-dis not specified.self.warning(text): warning log to console and file. Remains on the screen after the task finishes (look for yellow lines).self.error(text): error log to console and file. Remains on the screen after the task finishes (look for red lines).
Note
self.error doesn't abort the execution of the task, nor it sets the final status to being failed.
To indicate a python task has failed, use this construct: return self.fail(text) where text
is an optional message string that will be showed on the screen.
For more details on the SAYN API, check the API reference page.