Welcome to daops’s documentation!

Quick Guide

daops - data-aware operations

The daops library (pronounced “day-ops”) provides a Python interface to a set of operations suitable for working with climate simulation outputs. It is typically used with ESGF data sets that are described in NetCDF files. daops is unique in that it accesses a store of fixes defined for data sets that are irregular when compared with others in their population.

When a daops operation, such as subset, is requested, the library will look up a database of known fixes before performing any calculations or transformations. The data will be loaded and fixed using the xarray library before any actual operations are sent to its sister library clisops.

Features

The package has the following features:

  • Ability to run data-reduction operations on large climate data sets.

  • Knowledge of irregularities/anomalies in some climate data sets.

  • Ability to apply fixes to those data sets before operating on them. This process is called normalisation of the data sets.

Credits

This package was created with Cookiecutter and the cedadev/cookiecutter-pypackage project template.

Installation

Stable release

Warning

daops requires the libspatialindex-dev and libudunits2-dev system libraries to be installed before installing it with pip.
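
On Debian or Ubuntu systems, for example, the packages can be installed with:

$ sudo apt-get install libspatialindex-dev libudunits2-dev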

To install daops, run this command in your terminal:

$ pip install daops

This is the preferred method to install daops, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can help you through the process.

From sources

The sources for daops can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/roocs/daops

Get the submodules with test data:

$ git submodule update --init

Create a Conda environment named daops:

$ conda env create -f environment.yml
$ source activate daops

Install daops in development mode:

$ pip install -r requirements.txt
$ pip install -r requirements_dev.txt
$ python setup.py develop

Run tests:

$ pytest -v tests/

Usage

To use daops in a project:

import daops
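
For example, a minimal subset call might look like the following sketch. The dataset id is illustrative only (it is the one used in the Examples section later), and the call mirrors the documented signature:

from daops.ops.subset import subset

result = subset(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
    time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
    output_dir=None,
    output_type="xarray",
)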

API

Subset operation

daops.ops.subset.subset(collection, time=None, area=None, level=None, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]

Subsets the input dataset according to the given parameters. The data can be subsetted by level, area and time.

Parameters
  • collection (Collection of datasets to process, sequence or string of comma separated dataset identifiers.)

  • time (Time period - Time range to subset over, as a sequence of two time values or a string of two time values separated by /)

  • area (Area to subset over, sequence or string of comma separated lat and lon bounds. Must contain 4 values.)

  • level (Level range - Level values to subset over, as a sequence of two level values or a string of two level values separated by /)

  • output_dir (str or path like object describing output directory for output files.)

  • output_type ({"netcdf", "nc", "zarr", "xarray"})

  • split_method ({"time:auto"})

  • file_namer ({"standard", "simple"})

Returns

A list of outputs in the selected type: xarray Datasets or file paths.

Examples

collection: ("cmip6.ukesm1.r1.gn.tasmax.v20200101",)
time: ("1999-01-01T00:00:00", "2100-12-30T00:00:00")
area: (-5.,49.,10.,65)
level: (1000.,)
output_type: "netcdf"
output_dir: "/cache/wps/procs/req0111"
split_method: "time:decade"
file_namer: "facet_namer"
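
Written as a call, these example values look like this (taken from the listing above; note that "time:decade" and "facet_namer" go beyond the options listed in the signature):

from daops.ops.subset import subset

result = subset(
    collection=("cmip6.ukesm1.r1.gn.tasmax.v20200101",),
    time=("1999-01-01T00:00:00", "2100-12-30T00:00:00"),
    area=(-5., 49., 10., 65),
    level=(1000.,),
    output_type="netcdf",
    output_dir="/cache/wps/procs/req0111",
    split_method="time:decade",
    file_namer="facet_namer",
)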

Utilities

daops.utils.consolidate.consolidate(collection, **kwargs)[source]

Finds the file paths relating to each input dataset. If a time range has been supplied then only the files relating to this time range are recorded.

Parameters
  • collection – (roocs_utils.CollectionParameter) The collection of datasets to process.

  • kwargs – Arguments of the operation taking place e.g. subset, average, or re-grid.

Returns

An ordered dictionary of each dataset from the collection argument and the file paths relating to it.
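
A sketch of a call, assuming that a plain dataset id string is accepted for the collection argument and that operation arguments such as time are passed as keyword arguments (both assumptions, based on the log output in the Examples section):

from daops.utils.consolidate import consolidate

# Assumption: a plain ds id is coerced to a CollectionParameter internally.
files_by_ds = consolidate(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
    time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
)
# files_by_ds: OrderedDict mapping each ds id to its matching file paths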

daops.utils.consolidate.convert_to_ds_id(dset)[source]

Converts the input dataset to a drs id form to use with the elasticsearch index.

Parameters

dset – Dataset to process. Formats currently accepted are file paths and paths to directories.

Returns

The ds id for the input dataset.
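
For example (the directory path is illustrative, taken from the Examples section):

from daops.utils.consolidate import convert_to_ds_id

ds_id = convert_to_ds_id(
    "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga"
)
# e.g. "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"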

daops.utils.core.is_characterised(collection, require_all=False)[source]

Takes in a collection (an individual data reference or a sequence of them). Returns an ordered dictionary of dataset ids, with a boolean value for each stating whether the dataset has been characterised.

If require_all is True: return a single Boolean value.

Parameters
  • collection – one or more data references

  • require_all – Boolean to require that all must be characterised

Returns

Ordered Dictionary OR Boolean (if require_all is True)
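
A sketch of both modes (the ds id is illustrative and the lookup requires access to the characterisation store):

from daops.utils.core import is_characterised

ds_id = "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"

flags = is_characterised([ds_id])                     # OrderedDict of id -> bool
all_ok = is_characterised([ds_id], require_all=True)  # single boolean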

daops.utils.core.is_dataref_characterised(dset)[source]
daops.utils.core.open_dataset(ds_id, file_paths)[source]

Opens an xarray Dataset and applies fixes if required. Fixes are applied to the data either before or after the dataset is opened. Whether a fix is a ‘pre-processor’ or ‘post-processor’ is defined in the fix itself.

Parameters
  • ds_id – Dataset identifier in the form of a drs id e.g. cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga

  • file_paths – (list) The file paths corresponding to the ds id.

Returns

xarray Dataset with fixes applied to the data.
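
For example, using the dataset and file from the Examples section (the file must exist locally, and fix lookup requires the elasticsearch index):

from daops.utils.core import open_dataset

ds = open_dataset(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
    [
        "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/"
        "r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc"
    ],
)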

class daops.utils.fixer.Fixer(ds_id)[source]

Bases: object

Fixer class to look up fixes to apply to input dataset from the elastic search index. Gathers fixes into pre and post processors. Pre-process fixes are chained together to allow them to be executed with one call.
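
A minimal sketch of use (requires access to the elasticsearch index; the ds id is the one from the Examples section):

from daops.utils.fixer import Fixer

fixer = Fixer("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga")
# any fixes found in the index are gathered into pre- and post-processors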

class daops.utils.fixer.FuncChainer(funcs)[source]

Bases: object

Chains functions together to allow them to be executed in one call.
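
The chaining idea, as a minimal standalone sketch (illustrative, not the daops source):

class Chainer:
    """Illustrative stand-in for FuncChainer."""

    def __init__(self, funcs):
        self.funcs = funcs

    def __call__(self, value):
        # apply each function in turn, feeding the result forward
        for func in self.funcs:
            value = func(value)
        return value

chain = Chainer([str.strip, str.upper])
assert chain("  fix me  ") == "FIX ME"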

class daops.utils.normalise.ResultSet(inputs=None)[source]

Bases: object

A class to hold the results from an operation, e.g. subset.

add(dset, result)[source]

Adds outputs to an ordered dictionary with the ds id as the key. If the output is a file path this is also added to the file_paths variable so a list of file paths can be accessed independently.

daops.utils.normalise.normalise(collection)[source]

Takes an ordered dictionary of dataset ids and their file paths, then opens and fixes the datasets they make up.

Parameters

collection – Ordered dictionary of ds ids and their related file paths.

Returns

An ordered dictionary of ds ids and their fixed xarray Dataset.
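
A sketch of a call, using the dictionary shape shown in the log output of the Examples section (the file must exist locally):

from collections import OrderedDict

from daops.utils.normalise import normalise

collection = OrderedDict([
    (
        "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
        [
            "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/"
            "r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc"
        ],
    ),
])
fixed = normalise(collection)  # OrderedDict of ds id -> fixed xarray Dataset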

Data Utilities

daops.data_utils.coord_utils.add_scalar_coord(ds, **operands)[source]
Parameters
  • ds – Xarray Dataset

  • operands – (dict) Arguments for fix. Id, value and data type of scalar coordinate to add.

Returns

Xarray Dataset
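
What such a fix does, expressed in plain xarray (an illustration of the effect, not the daops internals; the coordinate name and value are hypothetical):

import numpy as np
import xarray as xr

ds = xr.Dataset({"zostoga": ("time", np.zeros(3))})
# add a scalar coordinate 'height' with a float64 value
ds = ds.assign_coords(height=np.float64(2.0))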

daops.data_utils.coord_utils.squeeze_dims(ds, **operands)[source]
Parameters
  • ds – Xarray Dataset

  • operands – (dict) Arguments for fix. Dims (list) to remove.

Returns

Xarray Dataset
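
Again expressed in plain xarray terms (illustrative, not the daops internals):

import numpy as np
import xarray as xr

ds = xr.Dataset({"zostoga": (("lev", "time"), np.zeros((1, 3)))})
ds = ds.squeeze(dim="lev")  # drop the length-1 'lev' dimension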

Processor

daops.processor.dispatch(operation, dset, **kwargs)[source]
daops.processor.process(operation, dset, mode='serial', **kwargs)[source]

Runs the processing operation on the dataset in the correct mode (in series or parallel).
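
A minimal sketch of that dispatch pattern (illustrative, not the daops source):

def process_sketch(operation, dset, mode="serial", **kwargs):
    # serial mode simply runs the operation directly on the dataset
    if mode == "serial":
        return operation(dset, **kwargs)
    raise NotImplementedError(f"mode {mode!r} is not handled in this sketch")

assert process_sketch(lambda d: d + 1, 41) == 42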

Examples

[1]:
from daops.ops.subset import subset

# remove previously created example file
import os
if os.path.exists("./output_001.nc"):
    os.remove("./output_001.nc")

Subset

Daops has a subsetting operation that calls clisops.ops.subset.subset from the clisops library.

Before making the call to the subset operation, daops will look up a database of known fixes. If there are any fixes for the requested dataset, the data will be loaded and fixed using the xarray library, and the subsetting operation is then carried out by clisops.

Results of subset and applying a fix

The results of the subsetting operation in daops are returned as an ordered dictionary of the input dataset id and the output in the chosen format (xarray dataset, netcdf file paths or zarr file paths).

The example below uses a dataset that requires a fix, so the elasticsearch index is consulted.

It also demonstrates the results of the operation.

[2]:
# An example of subsetting a dataset that requires a fix - the elasticsearch index is consulted.

ds = "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"
result = subset(
        ds,
        time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
        output_dir=None,
        output_type="xarray",
    )

result._results
2020-11-19 11:59:47,774 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 1 files in time range: ...
2020-11-19 11:59:47,804 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - File 0: badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc
2020-11-19 11:59:48,201 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Kept 1 files
2020-11-19 11:59:48,205 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/normalise.py - INFO - Working on datasets: OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', ['badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc'])])
2020-11-19 11:59:48,684 - elasticsearch - INFO - GET https://elasticsearch.ceda.ac.uk:443/roocs-fix/_doc/f34d45e4f7f5e187f64021b685adc447 [status:200 request:0.475s]
2020-11-19 11:59:48,708 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/core.py - INFO - Running post-processing function: squeeze_dims
2020-11-19 11:59:48,715 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/processor.py - INFO - Running subset [serial]: on Dataset with args: {'time': Time period to subset over
 start time: 1955-01-01T00:00:00
 end time: 2013-12-30T00:00:00, 'area': Area to subset over:
 None, 'level': Level range to subset over
 first_level: None
 last_level: None, 'output_type': 'xarray', 'output_dir': None, 'split_method': 'time:auto', 'file_namer': 'standard'}
2020-11-19 11:59:48,742 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py - INFO - Processing subset for times: ('2006-01-16', '2013-12-16')
2020-11-19 11:59:48,745 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - fmt_method=None, output_type=xarray
2020-11-19 11:59:48,748 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - Returning output as <class 'xarray.core.dataset.Dataset'>
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "start_date" not found within input date time range. Defaulting to minimum time step in xarray object.
  result = subset_time(ds, **kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  result = subset_time(ds, **kwargs)
[2]:
OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga',
              [<xarray.Dataset>
               Dimensions:    (bnds: 2, time: 96)
               Coordinates:
                   lev        float64 0.0
                 * time       (time) object 2006-01-16 12:00:00 ... 2013-12-16 12:00:00
               Dimensions without coordinates: bnds
               Data variables:
                   lev_bnds   (bnds) float64 dask.array<chunksize=(2,), meta=np.ndarray>
                   time_bnds  (time, bnds) object dask.array<chunksize=(96, 2), meta=np.ndarray>
                   zostoga    (time) float32 dask.array<chunksize=(96,), meta=np.ndarray>
               Attributes:
                   institution:            INM (Institute for Numerical Mathematics,  Moscow...
                   institute_id:           INM
                   experiment_id:          rcp45
                   source:                 inmcm4 (2009)
                   model_id:               inmcm4
                   forcing:                N/A
                   parent_experiment_id:   historical
                   branch_time:            56940.0
                   contact:                Evgeny Volodin, volodin@inm.ras.ru,INM RAS, Gubki...
                   history:                Mon Mar  9 11:49:38 2020: ncks -d lev,,,8 -v zost...
                   comment:                no comments
                   references:             Volodin, Diansky, Gusev 2010. Climate model INMCM...
                   initialization_method:  1
                   physics_version:        1
                   tracking_id:            e16ae391-db18-4e82-b2b8-46ff24aeec77
                   product:                output
                   experiment:             RCP4.5
                   frequency:              mon
                   creation_date:          2010-11-19T08:18:56Z
                   Conventions:            CF-1.4
                   project_id:             CMIP5
                   table_id:               Table Omon (12 May 2010) f2afe576fb73a3a11aaa3cc8...
                   title:                  inmcm4 model output prepared for CMIP5 RCP4.5
                   parent_experiment:      Historical
                   modeling_realm:         ocean
                   realization:            1
                   cmor_version:           2.0.0
                   NCO:                    4.7.3])])

File paths of output

If the outputs are file paths, it is also possible to access just the file paths from the results object. This is demonstrated below.

[3]:
# An example of subsetting a dataset that requires a fix - the elasticsearch index is consulted.

ds = "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"
result = subset(
        ds,
        time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
        output_dir=".",
        output_type="netcdf",
        file_namer="simple"
    )

print("output file paths = ", result.file_paths)
2020-11-19 11:59:48,785 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 1 files in time range: ...
2020-11-19 11:59:48,928 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - File 0: badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc
2020-11-19 11:59:49,306 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Kept 1 files
2020-11-19 11:59:49,310 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/normalise.py - INFO - Working on datasets: OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', ['badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc'])])
2020-11-19 11:59:49,766 - elasticsearch - INFO - GET https://elasticsearch.ceda.ac.uk:443/roocs-fix/_doc/f34d45e4f7f5e187f64021b685adc447 [status:200 request:0.452s]
2020-11-19 11:59:49,790 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/core.py - INFO - Running post-processing function: squeeze_dims
2020-11-19 11:59:49,798 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/processor.py - INFO - Running subset [serial]: on Dataset with args: {'time': Time period to subset over
 start time: 1955-01-01T00:00:00
 end time: 2013-12-30T00:00:00, 'area': Area to subset over:
 None, 'level': Level range to subset over
 first_level: None
 last_level: None, 'output_type': 'netcdf', 'output_dir': '.', 'split_method': 'time:auto', 'file_namer': 'simple'}
2020-11-19 11:59:49,829 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py - INFO - Processing subset for times: ('2006-01-16', '2013-12-16')
2020-11-19 11:59:49,832 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - fmt_method=to_netcdf, output_type=netcdf
2020-11-19 11:59:49,879 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - Wrote output file: ./output_001.nc
output file paths =  ['./output_001.nc']
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "start_date" not found within input date time range. Defaulting to minimum time step in xarray object.
  result = subset_time(ds, **kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
  result = subset_time(ds, **kwargs)

Checks implemented by daops

Daops will check that files exist in the requested time range:

[4]:
ds = "/badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"

try:
    result = subset(
            ds,
            time=("1955-01-01T00:00:00", "1990-12-30T00:00:00"),
            output_dir=None,
            output_type="xarray",
        )

except Exception as exc:
    print(exc)
2020-11-19 11:59:49,903 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 0 files in time range: ...
no files to open

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/roocs/daops/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

daops could always use more documentation, whether as part of the official daops docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/roocs/daops/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up daops for local development.

  1. Fork the daops repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your-name/daops.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    # For virtualenv environments:
    $ mkvirtualenv daops
    
    # For Anaconda/Miniconda environments:
    $ conda create -n daops python=3.6
    
    $ cd daops/
    $ pip install -e .
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally!

  5. When you’re done making changes, verify your changes with black and run the tests, including testing other Python versions with tox:

    # For virtualenv environments:
    $ pip install black pytest tox
    
    # For Anaconda/Miniconda environments:
    $ conda install -c conda-forge black pytest tox
    
    $ black daops tests
    $ python setup.py test
    $ tox
    
  6. Before committing your changes, we ask that you install pre-commit in your virtualenv. Pre-commit runs git hooks that ensure your code matches the project’s style, catching and correcting any small errors or inconsistencies when you run git commit:

    # For virtualenv environments:
    $ pip install pre-commit
    
    # For Anaconda/Miniconda environments:
    $ conda install -c conda-forge pre_commit
    
    $ pre-commit install
    
  7. Commit your changes and push your branch to GitHub:

    $ git add *
    
    $ git commit -m "Your detailed description of your changes."
    # `pre-commit` will run checks at this point:
    # if no errors are found, changes will be committed.
    # if errors are found, modifications will be made. Simply `git commit` again.
    
    $ git push origin name-of-your-bugfix-or-feature
    
  8. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, please follow these guidelines:

  1. Open an issue on our GitHub repository describing the bug you’d like to fix or the feature you’d like to implement.

  2. Perform the changes, commit and push them either to a new branch within roocs/daops or to your personal fork of daops.

Warning

Try to keep your contributions within the scope of the issue that you are addressing. While it might be tempting to fix other aspects of the library as they come up, it’s better to simply flag the problems, as others may already be working on them.

Consider adding a “# TODO:” comment if the need arises.

  3. Pull requests should raise test coverage for the daops library. Code coverage is an indicator of how extensively tested the library is. If you are adding a new set of functions, they must be tested and the coverage percentage should not significantly decrease.

  4. If the pull request adds functionality, your functions should include docstring explanations. So long as the docstrings are syntactically correct, sphinx-autodoc will be able to parse the information automatically. Please ensure that the docstrings adhere to a recognised docstring standard.

  5. The pull request should work for Python 3.6, 3.7 and 3.8, as well as raise test coverage. Pull requests are also checked for documentation build status and for PEP8 compliance.

    The build statuses and build errors for pull requests can be found at:

    https://travis-ci.org/roocs/daops/pull_requests

Warning

PEP8 and Black are strongly enforced. Ensure that your changes pass Flake8 and Black tests prior to pushing your final commits to your branch. Code formatting errors are treated as build errors and will block your pull request from being accepted.

Credits

Development Lead

Co-Developers

Contributors

None yet. Why not be the first?

Version History

v0.3.0 (2020-11-19)

Updated docstrings and documentation.

Breaking Changes

  • clisops>=0.4.0 and roocs-utils>=0.1.4 are now used.

  • data_refs parameter of daops.ops.subset.subset renamed to collection.

  • space parameter of daops.ops.subset.subset renamed to area.

  • chunk_rules parameter of daops.ops.subset.subset renamed to split_method.

  • filenamer parameter of daops.ops.subset.subset renamed to file_namer.

  • output_type parameter option added to daops.ops.subset.subset.

  • data_root_dir parameter is no longer needed by daops.ops.subset.subset.

  • data_root_dir no longer a parameter of daops.utils.consolidate.consolidate.

New Features

  • Added notebook with example usage.

  • Config file now exists at daops.etc.roocs.ini. This can be overridden by setting the environment variable ROOCS_CONFIG to the file path of a config file.

  • split_method implemented to split output files if they exceed the memory limit, file_size_limit, set in clisops.etc.roocs.ini. Currently only time:auto exists, which splits evenly on time ranges.

  • file_namer implemented in subset operation. This has simple and standard options. simple numbers output files whereas standard names them according to the input dataset.

  • Directories, file paths and dataset ids can now be used as inputs to the subset operation.

  • Fixer class now looks up fixes on our elasticsearch index.

Other Changes

  • Updated documentation.

  • Functions that take the data_refs parameter have been changed to use collection parameter instead.

  • Functions that take the data_ref parameter have been changed to use dset parameter instead.

v0.2.0 (2020-06-22)

  • Updated to use clisops v0.2.0 (#17)

  • Added xarray aggregation tests (#16)

v0.1.0 (2020-04-27)

  • First release with clisops v0.1.0.
