Pipeline YAML Files
===================

Two YAML-format configuration files are needed to run a pipeline.

The first describes which steps to run in a pipeline, its overall inputs, how it should be executed, and the directories for its outputs.  It is described on this page.  It includes the path to the second file (see `Config`_ below), which is described in more depth on the page :ref:`config2`.

Here is an example, from ``test/test.yml``.  The different pieces are described below.


.. code-block:: yaml

  # There are currently two defined launchers
  # mini and parsl
  launcher:
      name: mini
      interval: 0.5
  # and three sites:
  # local, nersc-batch, and nersc-interactive
  site:
      name: local
      max_threads: 2


  # The list of stages to run and the number of processors
  # to use for each.
  stages:
      - name: WLGCSummaryStatistic
        nprocess: 1
      - name: SysMapMaker
        nprocess: 1
      - name: shearMeasurementPipe
        nprocess: 1
      - name: PZEstimationPipe
        nprocess: 1
      - name: WLGCRandoms
        nprocess: 1
      - name: WLGCSelector
        nprocess: 1
      - name: SourceSummarizer
        nprocess: 1
      - name: WLGCTwoPoint
        nprocess: 1
      - name: WLGCCov
        nprocess: 1

  # Definitions of where to find inputs for the overall pipeline.
  # Any input required by a pipeline stage that is not generated by
  # a previous stage must be defined here.  They are listed by tag.
  inputs:
      DM: ./test/inputs/dm.txt
      fiducial_cosmology: ./test/inputs/fiducial_cosmology.txt

  # Overall configuration file
  config: ./test/config.yml

  # If all the outputs for a stage already exist then do not re-run that stage
  resume: False

  # Put all the output files in this directory:
  output_dir: ./test/outputs

  # Put the logs from the individual stages in this directory:
  log_dir: ./test/logs

  # These will be run before and after the pipeline respectively
  pre_script: ""
  post_script: ""

Modules
-------

The ``modules`` option is a string containing the names of the Python modules to import and search for pipeline stages, separated by spaces.

Each module is imported at the start of the pipeline.  For a stage to be found, it must be imported somewhere in the chain of imports under ``__init__.py`` in one of the packages listed here.  If you need to, you can specify subpackages, such as ``module.submodule``, in this list after ``module``.

The ``python_paths`` option can be set to a single string or a list of strings, and gives paths to add to Python's ``sys.path`` before attempting the imports above.
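
As a minimal sketch, the two options might be set as follows.  The package name ``my_package`` and the path ``./my_code`` are hypothetical placeholders, not names defined by ceci:

.. code-block:: yaml

  # Hypothetical names, for illustration only.
  # Space-separated modules to import and search for stages;
  # the subpackage is listed after its parent package.
  modules: my_package my_package.extra_stages

  # Paths added to sys.path before the imports are attempted.
  # A single string is also accepted.
  python_paths:
      - ./my_code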

Stages
------

The ``stages`` parameter should be a list of dictionaries.  Each element in the list is one pipeline stage to be run.  You don't have to put the stages in order - ceci will figure that out for you.

Each dictionary represents one stage, and has these options, with the defaults as shown:


.. code-block:: yaml

  - name: NameOfClass       # required
    nprocess: 1             # optional
    threads_per_process: 1  # optional
    nodes: 1                # optional


``threads_per_process`` is the number of threads, and therefore also the number of cores, to assign to each process.  OpenMP is the usual threading method used for our jobs, so ``OMP_NUM_THREADS`` is set to this value for the job.

``nodes`` is the number of nodes to assign to the job.  The processes are spread evenly across nodes.

``nprocess`` is the total number of processes (across all nodes, not per node).  Process-level parallelism is currently implemented only using MPI, but if you need other approaches please open an issue.
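
To illustrate how these three options combine, the sketch below requests eight processes spread over two nodes, with four threads each.  The numbers are arbitrary and chosen only for the example:

.. code-block:: yaml

  stages:
      - name: WLGCTwoPoint
        nprocess: 8             # 8 processes in total, across all nodes
        nodes: 2                # spread evenly, so 4 processes per node
        threads_per_process: 4  # OMP_NUM_THREADS is set to 4

Following the descriptions above, this stage would occupy 4 x 4 = 16 cores on each node, or 32 cores in total.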


Launcher
--------

The ``launcher`` parameter should be a dictionary that configures the workflow manager used to launch the jobs.

The ``name`` item in the dictionary sets which launcher is used.  These options are currently allowed: ``mini`` or ``parsl``.

See the :ref:`launchers` page for information on these launchers, and the other options they take.


Site
----

The ``site`` parameter should be a dictionary that configures the machine on which you are running the pipeline.

The ``name`` item in the dictionary sets which site is used.  These options are currently allowed: ``local``, ``nersc-batch``, and ``nersc-interactive``.

See the :ref:`sites` page for information on these sites, and the other options they take.


Inputs
------

The ``inputs`` parameter is required, and should be set to a dictionary.  It must describe any files that are overall inputs to the pipeline, and are not generated internally by it.  Files that are made inside the pipeline must not be listed.

The keys are tags: strings from the ``inputs`` attribute on the classes that represent the pipeline stages.  Each tag should map to the path where that input can be found.

Config
------

The parameter ``config`` is required, and should be set to a path to another input YAML config file.

See the :ref:`config2` page for what that file should contain.

Resume
------

The parameter ``resume`` is required, and should be set to ``True`` or ``False``.

If the parameter is ``True``, then any pipeline stages whose outputs all exist already will be skipped and not run.

In the current implementation, re-running a stage whose outputs are missing does not cause "downstream" stages to be re-run as well.  For example, if the final stage in your pipeline has all its outputs present it will *not* be re-run, even if earlier stages *are* re-run because their outputs had been removed.

Directories
-----------

The parameter ``output_dir`` is required, and should be set to a directory where all the outputs from the pipeline will be saved.  If the directory does not exist it will be created.

If the ``resume`` parameter is set to ``True``, then this is the directory that will be checked for existing outputs.

The parameter ``log_dir`` is required, and should be set to a directory where the printed output of the stages will be saved, in one file per stage.

Scripts
-------

Two parameters can be set to run additional scripts before or after the pipeline.  You can use them to perform checks or process results.

Any executable specified by ``pre_script`` will be run before the pipeline.  If it returns a non-zero status then the pipeline will not be run and an exception will be raised.

Any executable specified by ``post_script`` will be run after the pipeline, but only if the pipeline completes successfully.  If the ``post_script`` returns a non-zero status then it will be returned as the ceci exit code, but no exception will be raised.

Both scripts are called with the same arguments as the original executable was called with.
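
For example, the two parameters might be set like this.  Both script paths are hypothetical; any executable can be given:

.. code-block:: yaml

  # Hypothetical scripts, for illustration only.
  # Run before the pipeline; a non-zero exit status stops the run.
  pre_script: ./scripts/check_inputs.sh

  # Run after the pipeline, only if it completed successfully.
  post_script: ./scripts/summarize_outputs.sh

Because each script receives the same arguments as the original command, it can, for instance, read the pipeline file path from its own argument list.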


Templates
---------

You can use `Jinja2 <https://jinja.palletsprojects.com>`_ *templates* in your pipeline YAML files.

This allows you to set variables on the ceci command line which are then used to
modify the pipeline text. This lets you use a single parameter file for a set of
runs. In TXPipe we use this to have a single pipeline file that can process one of several
different fields depending on the parameter choice.
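
As a rough sketch of what such a file could contain, the fragment below uses standard Jinja2 ``{{ ... }}`` substitution to choose directories.  The variable name ``field`` and the directory layout are assumptions made for this example, not taken from TXPipe:

.. code-block:: yaml

  # "field" is a hypothetical template variable, supplied at run time
  inputs:
      DM: ./test/inputs/{{ field }}/dm.txt
  output_dir: ./test/outputs/{{ field }}
  log_dir: ./test/logs/{{ field }}

Each occurrence of ``{{ field }}`` is replaced by the value given for ``field`` before the file is read as YAML.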

See the `Jinja2 <https://jinja.palletsprojects.com>`_ documentation for information on the
template syntax. To set template variables, use the ``-t`` flag on the ceci command line, in the form:

.. code-block:: bash

  ceci -t variable1=value1 variable2=value2   pipeline.yml


There's not much point using templates in interactive code - you should probably
just set up your pipeline files programmatically.