Pipeline YAML Files

Two YAML-format configuration files are needed to run a pipeline.

The first describes which steps to run in a pipeline, the overall inputs for it, execution information, and directories for outputs. It is described on this page. It includes the path to the second file (see Config below); that file is described in more depth on the Config YAML files page.

Here is an example, from test/test.yml. The different pieces are described below.

# There are currently three defined launchers
# mini, parsl, and cwl
launcher:
    name: mini
    interval: 0.5
# and three sites:
# local, cori-batch, and cori-interactive
site:
    name: local
    max_threads: 2


# The list of stages to run and the number of processors
# to use for each.
stages:
    - name: WLGCSummaryStatistic
      nprocess: 1
    - name: SysMapMaker
      nprocess: 1
    - name: shearMeasurementPipe
      nprocess: 1
    - name: PZEstimationPipe
      nprocess: 1
    - name: WLGCRandoms
      nprocess: 1
    - name: WLGCSelector
      nprocess: 1
    - name: SourceSummarizer
      nprocess: 1
    - name: WLGCTwoPoint
      nprocess: 1
    - name: WLGCCov
      nprocess: 1

# Definitions of where to find inputs for the overall pipeline.
# Any input required by a pipeline stage that is not generated by
# a previous stage must be defined here.  They are listed by tag.
inputs:
    DM: ./test/inputs/dm.txt
    fiducial_cosmology: ./test/inputs/fiducial_cosmology.txt

# Overall configuration file
config: ./test/config.yml

# If all the outputs for a stage already exist then do not re-run that stage
resume: False

# Put all the output files in this directory:
output_dir: ./test/outputs

# Put the logs from the individual stages in this directory:
log_dir: ./test/logs

# These will be run before and after the pipeline respectively
pre_script: ""
post_script: ""

Modules

The modules option is a single string listing the names of the python modules to import and search for pipeline stages, separated by spaces.

Each module is imported at the start of the pipeline. For a stage to be found, it should be imported somewhere in the chain of imports under __init__.py in one of the packages listed here. You can specify subpackages, like module.submodule, in this list after module if you need to.

The python_paths option can be set to a single string or list of strings, and gives paths to add to python’s sys.path before attempting the import above.
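For example, a pipeline file might point ceci at a package and one of its subpackages, first adding a local directory to the import path (the module names and path below are hypothetical):

# Hypothetical package names: both are imported and searched for stages
modules: my_stages my_stages.extra_stages

# Hypothetical local directory added to sys.path before the imports above
python_paths:
    - ./my_code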

Stages

The stages parameter should be a list of dictionaries. Each element in the list is one pipeline stage to be run. You don’t have to put the stages in order - ceci will figure that out for you.

Each dictionary represents one stage, and has these options, with the defaults as shown:

- name: NameOfClass       # required
  nprocess: 1             # optional
  threads_per_process: 1  # optional
  nodes: 1                # optional

threads_per_process is the number of threads, and therefore also the number of cores to assign to each process. OpenMP is the usual threading method used for our jobs, so OMP_NUM_THREADS is set to this value for the job.

nodes is the number of nodes to assign to the job. The processes are spread evenly across nodes.

nprocess is the total number of processes (across all nodes, not per-node). Process-level parallelism is currently implemented only using MPI, but if you need other approaches please open an issue.
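For example, a stage run with several MPI processes spread over two nodes, each process using two OpenMP threads, might be configured like this (the resource numbers are purely illustrative):

stages:
    - name: WLGCTwoPoint
      nprocess: 4             # 4 MPI processes in total, across all nodes
      nodes: 2                # processes are spread evenly, so 2 per node
      threads_per_process: 2  # OMP_NUM_THREADS is set to 2 for each process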

Launcher

The launcher parameter should be a dictionary that configures the workflow manager used to launch the jobs.

The name item in the dictionary sets which launcher is used. These options are currently allowed: mini, parsl, and cwl.

See the Launchers page for information on these launchers, and the other options they take.
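For example, to use the parsl launcher instead of mini:

launcher:
    name: parsl
    # any launcher-specific options are described on the Launchers page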

Site

The site parameter should be a dictionary that configures the machine on which you are running the pipeline.

The name item in the dictionary sets which site is used. These options are currently allowed: local, cori-batch, and cori-interactive.

See the Sites page for information on these sites, and the other options they take.
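For example, to run inside an interactive job on Cori rather than on a local machine:

site:
    name: cori-interactive
    # any site-specific options are described on the Sites page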

Inputs

The inputs parameter is required, and should be set to a dictionary. It must describe any files that are overall inputs to the pipeline, and are not generated internally by it. Files that are made inside the pipeline must not be listed.

The keys are tags: the strings listed in the inputs attribute of the classes that represent the pipeline stages. Each tag should map to the path where the corresponding input file can be found.
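For example, if a stage's inputs attribute includes the tags DM and fiducial_cosmology and no earlier stage produces files with those tags, the pipeline file must supply both paths, as in the test/test.yml example above:

inputs:
    # key = tag from a stage's inputs attribute, value = path to the file
    DM: ./test/inputs/dm.txt
    fiducial_cosmology: ./test/inputs/fiducial_cosmology.txt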

Config

The parameter config is required, and should be set to a path to another input YAML config file.

See the Config YAML files page for what that file should contain.

Resume

The parameter resume is required, and should be set to True or False.

If the parameter is True, then any pipeline stages whose outputs all exist already will be skipped and not run.

In the current implementation, re-running an earlier stage does not cause its “downstream” stages to be re-run as well: for example, if the final stage in your pipeline has all its outputs present it will not be re-run, even if earlier stages are re-run because their outputs had been removed.

Directories

The parameter output_dir is required, and should be set to a directory where all the outputs from the pipeline will be saved. If the directory does not exist it will be created.

If the resume parameter is set to True, then this is the directory that will be checked for existing outputs.

The parameter log_dir is required, and should be set to a directory where the printed output of the stages will be saved, in one file per stage.
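Putting the resume and directory options together, a typical configuration looks like this (reusing the paths from the example above):

# Skip any stage whose outputs already exist in output_dir
resume: True
# All pipeline outputs go here; the directory is created if it does not exist
output_dir: ./test/outputs
# One log file per stage is written here
log_dir: ./test/logs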

Scripts

Two parameters can be set to run additional scripts before and after the pipeline. You can use them to perform checks or process results.

Any executable specified by pre_script will be run before the pipeline. If it returns a non-zero status then the pipeline will not be run and an exception will be raised.

Any executable specified by post_script will be run after the pipeline, but only if the pipeline completes successfully. If the post_script returns a non-zero status then it will be returned as the ceci exit code, but no exception will be raised.

Both scripts are called with the same command-line arguments that ceci itself was called with.
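For example, the scripts could be used to validate the inputs before the run and to archive the outputs afterwards (the script paths below are hypothetical):

# Run before the pipeline; a non-zero exit status aborts the run
pre_script: ./scripts/check_inputs.sh
# Run after a successful pipeline; a non-zero exit status becomes ceci's exit code
post_script: ./scripts/archive_outputs.sh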