Pipeline YAML Files¶
Two YAML-format configuration files are needed to run a pipeline.
The first describes which steps to run in the pipeline, the overall inputs for it, execution information, and directories for outputs; it is described on this page. It includes the path to the second file (see Config below), which is described in more depth on the page Config YAML files.
Here is an example, from test/test.yml. The different pieces are described below.
# There are currently three defined launchers:
# mini, parsl, and cwl
launcher:
    name: mini
    interval: 0.5

# and three sites:
# local, cori-batch, and cori-interactive
site:
    name: local
    max_threads: 2

# The list of stages to run and the number of processes
# to use for each.
stages:
    - name: WLGCSummaryStatistic
      nprocess: 1
    - name: SysMapMaker
      nprocess: 1
    - name: shearMeasurementPipe
      nprocess: 1
    - name: PZEstimationPipe
      nprocess: 1
    - name: WLGCRandoms
      nprocess: 1
    - name: WLGCSelector
      nprocess: 1
    - name: SourceSummarizer
      nprocess: 1
    - name: WLGCTwoPoint
      nprocess: 1
    - name: WLGCCov
      nprocess: 1

# Definitions of where to find inputs for the overall pipeline.
# Any input required by a pipeline stage that is not generated by
# a previous stage must be defined here. They are listed by tag.
inputs:
    DM: ./test/inputs/dm.txt
    fiducial_cosmology: ./test/inputs/fiducial_cosmology.txt

# Overall configuration file
config: ./test/config.yml

# If all the outputs for a stage already exist then do not re-run that stage
resume: False

# Put all the output files in this directory:
output_dir: ./test/outputs

# Put the logs from the individual stages in this directory:
log_dir: ./test/logs

# These will be run before and after the pipeline respectively
pre_script: ""
post_script: ""
Modules¶
The modules option, which is a string, consists of the names of python modules to import and search for pipeline stages, separated by spaces. Each module is imported at the start of the pipeline. For a stage to be found, it should be imported somewhere in the chain of imports under __init__.py in one of the packages listed here. You can specify subpackages, like module.submodule, in this list after module if you need to.
The python_paths option can be set to a single string or a list of strings, and gives paths to add to python's sys.path before attempting the imports above.
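As an illustration, the two options might be combined like this (the module and directory names here are hypothetical):

```yaml
# Search these modules, and everything imported under them,
# for pipeline stage classes
modules: my_pipeline_package my_extra_stages.stages

# Add these directories to sys.path before importing the modules above
python_paths:
    - ./my_extra_stages
```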
Stages¶
The stages parameter should be a list of dictionaries. Each element in the list is one pipeline stage to be run. You don't have to put the stages in order - ceci will figure that out for you.
Each dictionary represents one stage, and has these options, with the defaults as shown:
- name: NameOfClass        # required
  nprocess: 1              # optional
  threads_per_process: 1   # optional
  nodes: 1                 # optional
threads_per_process is the number of threads, and therefore also the number of cores, to assign to each process. OpenMP is the usual threading method used for our jobs, so OMP_NUM_THREADS is set to this value for the job.
nodes is the number of nodes to assign to the job. The processes are spread evenly across nodes.
nprocess is the total number of processes (across all nodes, not per node). Process-level parallelism is currently implemented only using MPI; if you need other approaches please open an issue.
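For example, a stage that should run in parallel could be configured like this (the counts are illustrative; the stage name is taken from the example above):

```yaml
stages:
    # 8 MPI processes spread evenly over 2 nodes (4 per node),
    # each running 2 OpenMP threads (OMP_NUM_THREADS is set to 2)
    - name: WLGCTwoPoint
      nodes: 2
      nprocess: 8
      threads_per_process: 2
```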
Launcher¶
The launcher parameter should be a dictionary that configures the workflow manager used to launch the jobs.
The name item in the dictionary sets which launcher is used. These options are currently allowed: mini, parsl, and cwl.
See the Launchers page for information on these launchers and the other options they take.
Site¶
The site parameter should be a dictionary that configures the machine on which you are running the pipeline.
The name item in the dictionary sets which site is used. These options are currently allowed: local, cori-batch, and cori-interactive.
See the Sites page for information on these sites and the other options they take.
Inputs¶
The inputs parameter is required, and should be set to a dictionary. It must describe any files that are overall inputs to the pipeline and are not generated internally by it. Files that are made inside the pipeline must not be listed.
The keys are tags: strings from the inputs attribute on the classes that represent the pipeline stages. They should map to values giving the paths where those inputs can be found.
Config¶
The parameter config is required, and should be set to the path to another input YAML config file.
See the Config YAML files page for what that file should contain.
Resume¶
The parameter resume is required, and should be set to True or False.
If the parameter is True, then any pipeline stage whose outputs all exist already will be skipped and not run.
In the current implementation, re-running an earlier stage does not force its "downstream" stages to be re-run as well - e.g. if the final stage in your pipeline has all its outputs present it will not be re-run, even if earlier stages are re-run because their outputs had been removed.
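For example, to pick up a partially completed run, assuming the earlier outputs are still in the output directory from the example above:

```yaml
# Skip any stage whose outputs already exist in output_dir
resume: True
output_dir: ./test/outputs
```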
Directories¶
The parameter output_dir is required, and should be set to a directory where all the outputs from the pipeline will be saved. If the directory does not exist it will be created.
If the resume parameter is set to True, then this is the directory that will be checked for existing outputs.
The parameter log_dir is required, and should be set to a directory where the printed output of the stages will be saved, one file per stage.
Scripts¶
Two parameters can be set to run additional scripts before or after the pipeline. You can use them to perform checks or process results.
Any executable specified by pre_script will be run before the pipeline. If it returns a non-zero status then the pipeline will not be run and an exception will be raised.
Any executable specified by post_script will be run after the pipeline, but only if the pipeline completes successfully. If the post_script returns a non-zero status then that status will be returned as the ceci exit code, but no exception will be raised.
Both scripts are called with the same arguments as the original executable was called with.
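A minimal sketch of how these might be set, assuming hypothetical script paths:

```yaml
# Verify that the raw inputs exist before launching anything;
# a non-zero exit status aborts the pipeline with an exception.
pre_script: ./scripts/check_inputs.sh

# Summarize the outputs after a successful run; a non-zero exit
# status becomes the ceci exit code, but raises no exception.
post_script: ./scripts/summarize_outputs.py
```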