Welcome to daops’s documentation!¶
Quick Guide¶
daops - data-aware operations¶
The daops library (pronounced “day-ops”) provides a Python interface to a set of operations suitable for working with climate simulation outputs. It is typically used with ESGF data sets that are described in NetCDF files. daops is unique in that it accesses a store of fixes defined for data sets that are irregular when compared with others in their population.
When a daops operation, such as subset, is requested, the library will look up a database of known fixes before performing any calculations or transformations. The data will be loaded and fixed using the xarray library before any actual operations are sent to its sister library clisops.
Free software: BSD
Documentation: https://daops.readthedocs.io
Features¶
The package has the following features:
Ability to run data-reduction operations on large climate data sets.
Knowledge of irregularities/anomalies in some climate data sets.
Ability to apply fixes to those data sets before operating on them. This process is called normalisation of the data sets.
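As a quick orientation, the sketch below shows the typical entry point, the subset operation documented later in this guide. The dataset id is the CMIP5 example used in the Examples section; treat all values as placeholders for your own data.

from daops.ops.subset import subset

# daops looks up known fixes for the dataset, applies them via xarray,
# then hands the actual subsetting over to clisops.
result = subset(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
    time=("2006-01-01T00:00:00", "2010-12-30T00:00:00"),
    output_type="xarray",
)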
Credits¶
This package was created with Cookiecutter
and the cedadev/cookiecutter-pypackage
project template.
Cookiecutter: https://github.com/audreyr/cookiecutter
cookiecutter-pypackage: https://github.com/cedadev/cookiecutter-pypackage
Installation¶
Stable release¶
Warning
daops requires libspatialindex-dev and libudunits2-dev to be installed prior to installation by pip.
To install daops, run this command in your terminal:
$ pip install daops
This is the preferred method to install daops, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can guide you through the process.
From sources¶
The sources for daops can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone git://github.com/roocs/daops
Get the submodules with test data:
$ git submodule update --init
Create Conda environment named daops:
$ conda env create -f environment.yml
$ source activate daops
Install daops in development mode:
$ pip install -r requirements.txt
$ pip install -r requirements_dev.txt
$ python setup.py develop
Run tests:
$ pytest -v tests/
API¶
Subset operation¶
daops.ops.subset.subset(collection, time=None, area=None, level=None, output_dir=None, output_type='netcdf', split_method='time:auto', file_namer='standard')[source]¶
Subset the input dataset according to the given parameters. The data can be subsetted by level, area and time.
- Parameters
collection (Collection of datasets to process, sequence or string of comma separated dataset identifiers.)
time (Time period - Time range to subset over, sequence of two time values or string of two / separated time values)
area (Area to subset over, sequence or string of comma separated lat and lon bounds. Must contain 4 values.)
level (Level range - Level values to subset over, sequence of two level values or string of two / separated level values)
output_dir (str or path like object describing output directory for output files.)
output_type ({“netcdf”, “nc”, “zarr”, “xarray”})
split_method ({“time:auto”})
file_namer ({“standard”, “simple”})
- Returns
List of outputs in the selected type (a list of xarray Datasets or file paths.)
Examples
collection: ("cmip6.ukesm1.r1.gn.tasmax.v20200101",)
time: ("1999-01-01T00:00:00", "2100-12-30T00:00:00")
area: (-5., 49., 10., 65.)
level: (1000.,)
output_type: "netcdf"
output_dir: "/cache/wps/procs/req0111"
split_method: "time:decade"
file_namer: "facet_namer"
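Expressed as a call, the example values above look like the sketch below. The collection id and output directory are illustrative; note that the sketch uses the split_method and file_namer options documented in the signature ("time:auto", "standard") rather than the "time:decade" and "facet_namer" values listed above.

from daops.ops.subset import subset

result = subset(
    ("cmip6.ukesm1.r1.gn.tasmax.v20200101",),
    time=("1999-01-01T00:00:00", "2100-12-30T00:00:00"),
    area=(-5., 49., 10., 65.),
    level=(1000.,),
    output_type="netcdf",
    output_dir="/cache/wps/procs/req0111",
    split_method="time:auto",
    file_namer="standard",
)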
Utilities¶
daops.utils.consolidate.consolidate(collection, **kwargs)[source]
Finds the file paths relating to each input dataset. If a time range has been supplied then only the files relating to this time range are recorded.
- Parameters
collection – (roocs_utils.CollectionParameter) The collection of datasets to process.
kwargs – Arguments of the operation taking place e.g. subset, average, or re-grid.
- Returns
An ordered dictionary of each dataset from the collection argument and the file paths relating to it.
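As a sketch, calling consolidate directly might look like the following; normally an operation such as subset does this internally. The CollectionParameter import path and the time keyword value are assumptions based on the parameter descriptions above.

from daops.utils.consolidate import consolidate
from roocs_utils.parameter.collection_parameter import CollectionParameter  # assumed path

collection = CollectionParameter(
    "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"
)
# Returns an ordered dictionary of ds id -> matching file paths.
file_map = consolidate(collection, time=("2006-01-01", "2010-12-30"))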
daops.utils.consolidate.convert_to_ds_id(dset)[source]
Converts the input dataset to a drs id form to use with the elasticsearch index.
- Parameters
dset – Dataset to process. Formats currently accepted are file paths and paths to directories.
- Returns
The ds id for the input dataset.
daops.utils.core.is_characterised(collection, require_all=False)[source]
Takes in a collection (an individual data reference or a sequence of them). Returns an ordered dictionary of the ids in the collection, with a boolean value for each stating whether the dataset has been characterised.
If require_all is True: return a single Boolean value.
- Parameters
collection – one or more data references
require_all – Boolean to require that all must be characterised
- Returns
Ordered Dictionary OR Boolean (if require_all is True)
daops.utils.core.is_dataref_characterised(dset)[source]
daops.utils.core.open_dataset(ds_id, file_paths)[source]
Opens an xarray Dataset and applies fixes if required. Fixes are applied to the data either before or after the dataset is opened. Whether a fix is a ‘pre-processor’ or ‘post-processor’ is defined in the fix itself.
- Parameters
ds_id – Dataset identifier in the form of a drs id e.g. cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga
file_paths – (list) The file paths corresponding to the ds id.
- Returns
xarray Dataset with fixes applied to the data.
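A sketch of using open_dataset with the example ds id from the parameter description; the file path is taken from the example logs later in this guide and would normally come from consolidate.

from daops.utils.core import open_dataset

ds_id = "cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga"
file_paths = [
    "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc"
]
ds = open_dataset(ds_id, file_paths)  # xarray Dataset with fixes applied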
class daops.utils.fixer.Fixer(ds_id)[source]
Bases: object
Fixer class to look up fixes to apply to the input dataset from the elasticsearch index. Gathers fixes into pre- and post-processors. Pre-process fixes are chained together to allow them to be executed with one call.
class daops.utils.fixer.FuncChainer(funcs)[source]
Bases: object
Chains functions together to allow them to be executed in one call.
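To illustrate the idea of chaining, here is a conceptual sketch, not the actual FuncChainer implementation: each function is applied to the output of the previous one, so a sequence of pre-process fixes can run as a single call.

from functools import reduce

def chain(funcs):
    # Build a single callable that applies each function in turn.
    return lambda value: reduce(lambda acc, func: func(acc), funcs, value)

# Hypothetical fixes chained into one call:
process = chain([str.strip, str.lower])
print(process("  Some Raw Value  "))  # -> "some raw value"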
class daops.utils.normalise.ResultSet(inputs=None)[source]
Bases: object
A class to hold the results from an operation e.g. subset.

add(dset, result)[source]
Adds outputs to an ordered dictionary with the ds id as the key. If the output is a file path, this is also added to the file_paths variable so that a list of file paths can be accessed independently.
daops.utils.normalise.normalise(collection)[source]
Takes file paths and opens and fixes the datasets they make up.
- Parameters
collection – Ordered dictionary of ds ids and their related file paths.
- Returns
An ordered dictionary of ds ids and their fixed xarray Dataset.
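A sketch of calling normalise on a mapping of the shape consolidate produces; the ds id and file path are illustrative placeholders.

from collections import OrderedDict
from daops.utils.normalise import normalise

# Placeholder mapping of ds id -> file paths, as produced by consolidate().
file_map = OrderedDict([
    ("cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga",
     ["/path/to/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc"]),
])
fixed = normalise(file_map)  # OrderedDict of ds id -> fixed xarray Dataset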
Data Utilities¶
daops.data_utils.coord_utils.add_scalar_coord(ds, **operands)[source]
- Parameters
ds – Xarray Dataset
operands – (dict) Arguments for fix. Id, value and data type of scalar coordinate to add.
- Returns
Xarray Dataset
daops.data_utils.coord_utils.squeeze_dims(ds, **operands)[source]
- Parameters
ds – Xarray Dataset
operands – (dict) Arguments for fix. Dims (list) to remove.
- Returns
Xarray Dataset
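To illustrate what these two fixes do, here is the equivalent manipulation in plain xarray, a conceptual sketch on a toy dataset rather than the daops functions themselves.

import numpy as np
import xarray as xr

# Toy dataset with a singleton 'lev' dimension.
ds = xr.Dataset(
    {"zostoga": (("time", "lev"), np.random.rand(4, 1))},
    coords={"time": np.arange(4), "lev": [0.0]},
)

# squeeze_dims-style fix: remove the singleton dimension.
squeezed = ds.squeeze(dim="lev")

# add_scalar_coord-style fix: attach a scalar coordinate.
with_height = squeezed.assign_coords(height=2.0)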
Examples¶
[1]:
from daops.ops.subset import subset
# remove previously created example file
import os
if os.path.exists("./output_001.nc"):
os.remove("./output_001.nc")
Subset¶
Daops has a subsetting operation that calls clisops.ops.subset.subset from the clisops library.
Before making the call to the subset operation, daops will look up a database of known fixes. If there are any fixes for the requested dataset then the data will be loaded and fixed using the xarray library, and the subsetting operation is then carried out by clisops.
Results of subset and applying a fix¶
The results of the subsetting operation in daops are returned as an ordered dictionary of the input dataset id and the output in the chosen format (xarray dataset, netcdf file paths or zarr file paths).
The example below requires a fix, so the elasticsearch index has been consulted. It also demonstrates the results of the operation.
[2]:
# An example of subsetting a dataset that requires a fix - the elasticsearch index is consulted.
ds = "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"
result = subset(
ds,
time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
output_dir=None,
output_type="xarray",
)
result._results
2020-11-19 11:59:47,774 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 1 files in time range: ...
2020-11-19 11:59:47,804 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - File 0: badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc
2020-11-19 11:59:48,201 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Kept 1 files
2020-11-19 11:59:48,205 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/normalise.py - INFO - Working on datasets: OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', ['badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc'])])
2020-11-19 11:59:48,684 - elasticsearch - INFO - GET https://elasticsearch.ceda.ac.uk:443/roocs-fix/_doc/f34d45e4f7f5e187f64021b685adc447 [status:200 request:0.475s]
2020-11-19 11:59:48,708 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/core.py - INFO - Running post-processing function: squeeze_dims
2020-11-19 11:59:48,715 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/processor.py - INFO - Running subset [serial]: on Dataset with args: {'time': Time period to subset over
start time: 1955-01-01T00:00:00
end time: 2013-12-30T00:00:00, 'area': Area to subset over:
None, 'level': Level range to subset over
first_level: None
last_level: None, 'output_type': 'xarray', 'output_dir': None, 'split_method': 'time:auto', 'file_namer': 'standard'}
2020-11-19 11:59:48,742 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py - INFO - Processing subset for times: ('2006-01-16', '2013-12-16')
2020-11-19 11:59:48,745 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - fmt_method=None, output_type=xarray
2020-11-19 11:59:48,748 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - Returning output as <class 'xarray.core.dataset.Dataset'>
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "start_date" not found within input date time range. Defaulting to minimum time step in xarray object.
result = subset_time(ds, **kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
result = subset_time(ds, **kwargs)
[2]:
OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga',
[<xarray.Dataset>
Dimensions: (bnds: 2, time: 96)
Coordinates:
lev float64 0.0
* time (time) object 2006-01-16 12:00:00 ... 2013-12-16 12:00:00
Dimensions without coordinates: bnds
Data variables:
lev_bnds (bnds) float64 dask.array<chunksize=(2,), meta=np.ndarray>
time_bnds (time, bnds) object dask.array<chunksize=(96, 2), meta=np.ndarray>
zostoga (time) float32 dask.array<chunksize=(96,), meta=np.ndarray>
Attributes:
institution: INM (Institute for Numerical Mathematics, Moscow...
institute_id: INM
experiment_id: rcp45
source: inmcm4 (2009)
model_id: inmcm4
forcing: N/A
parent_experiment_id: historical
branch_time: 56940.0
contact: Evgeny Volodin, volodin@inm.ras.ru,INM RAS, Gubki...
history: Mon Mar 9 11:49:38 2020: ncks -d lev,,,8 -v zost...
comment: no comments
references: Volodin, Diansky, Gusev 2010. Climate model INMCM...
initialization_method: 1
physics_version: 1
tracking_id: e16ae391-db18-4e82-b2b8-46ff24aeec77
product: output
experiment: RCP4.5
frequency: mon
creation_date: 2010-11-19T08:18:56Z
Conventions: CF-1.4
project_id: CMIP5
table_id: Table Omon (12 May 2010) f2afe576fb73a3a11aaa3cc8...
title: inmcm4 model output prepared for CMIP5 RCP4.5
parent_experiment: Historical
modeling_realm: ocean
realization: 1
cmor_version: 2.0.0
NCO: 4.7.3])])
File paths of output¶
If the output is returned as file paths, it is also possible to access just the output file paths from the results object. This is demonstrated below.
[3]:
# An example of subsetting a dataset that requires a fix - the elasticsearch index is consulted.
ds = "badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"
result = subset(
ds,
time=("1955-01-01T00:00:00", "2013-12-30T00:00:00"),
output_dir=".",
output_type="netcdf",
file_namer="simple"
)
print("ouptut file paths = ", result.file_paths)
2020-11-19 11:59:48,785 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 1 files in time range: ...
2020-11-19 11:59:48,928 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - File 0: badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc
2020-11-19 11:59:49,306 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Kept 1 files
2020-11-19 11:59:49,310 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/normalise.py - INFO - Working on datasets: OrderedDict([('cmip5.output1.INM.inmcm4.rcp45.mon.ocean.Omon.r1i1p1.latest.zostoga', ['badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/zostoga_Omon_inmcm4_rcp45_r1i1p1_200601-210012.nc'])])
2020-11-19 11:59:49,766 - elasticsearch - INFO - GET https://elasticsearch.ceda.ac.uk:443/roocs-fix/_doc/f34d45e4f7f5e187f64021b685adc447 [status:200 request:0.452s]
2020-11-19 11:59:49,790 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/core.py - INFO - Running post-processing function: squeeze_dims
2020-11-19 11:59:49,798 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/processor.py - INFO - Running subset [serial]: on Dataset with args: {'time': Time period to subset over
start time: 1955-01-01T00:00:00
end time: 2013-12-30T00:00:00, 'area': Area to subset over:
None, 'level': Level range to subset over
first_level: None
last_level: None, 'output_type': 'netcdf', 'output_dir': '.', 'split_method': 'time:auto', 'file_namer': 'simple'}
2020-11-19 11:59:49,829 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py - INFO - Processing subset for times: ('2006-01-16', '2013-12-16')
2020-11-19 11:59:49,832 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - fmt_method=to_netcdf, output_type=netcdf
2020-11-19 11:59:49,879 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/utils/output_utils.py - INFO - Wrote output file: ./output_001.nc
output file paths = ['./output_001.nc']
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "start_date" not found within input date time range. Defaulting to minimum time step in xarray object.
result = subset_time(ds, **kwargs)
/home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/clisops/ops/subset.py:34: UserWarning: "end_date" has been nudged to nearest valid time step in xarray object.
result = subset_time(ds, **kwargs)
Checks implemented by daops¶
Daops will check that files exist in the requested time range.
[4]:
ds = "/badc/cmip5/data/cmip5/output1/INM/inmcm4/rcp45/mon/ocean/Omon/r1i1p1/latest/zostoga/*.nc"
try:
result = subset(
ds,
time=("1955-01-01T00:00:00", "1990-12-30T00:00:00"),
output_dir=None,
output_type="xarray",
)
except Exception as exc:
print(exc)
2020-11-19 11:59:49,903 - /home/docs/checkouts/readthedocs.org/user_builds/daops/conda/release-v0.3.0/lib/python3.9/site-packages/daops/utils/consolidate.py - INFO - Testing 0 files in time range: ...
no files to open
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://github.com/roocs/daops/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features¶
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation¶
daops could always use more documentation, whether as part of the official daops docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback¶
The best way to send feedback is to file an issue at https://github.com/roocs/daops/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!¶
Ready to contribute? Here’s how to set up daops for local development.
Fork the daops repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your-name/daops.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
# For virtualenv environments:
$ mkvirtualenv daops

# For Anaconda/Miniconda environments:
$ conda create -n daops python=3.6

$ cd daops/
$ pip install -e .
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally!
When you’re done making changes, verify that your changes pass black formatting and run the tests, including testing other Python versions with tox:
# For virtualenv environments:
$ pip install black pytest tox

# For Anaconda/Miniconda environments:
$ conda install -c conda-forge black pytest tox

$ black daops tests
$ python setup.py test
$ tox
Before committing your changes, we ask that you install pre-commit in your virtualenv. Pre-commit runs git hooks that ensure that your code resembles that of the project and catches and corrects any small errors or inconsistencies when you git commit:
# For virtualenv environments:
$ pip install pre-commit

# For Anaconda/Miniconda environments:
$ conda install -c conda-forge pre_commit

$ pre-commit install
Commit your changes and push your branch to GitHub:
$ git add *
$ git commit -m "Your detailed description of your changes."
# `pre-commit` will run checks at this point:
# if no errors are found, changes will be committed.
# if errors are found, modifications will be made. Simply `git commit` again.
$ git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines¶
Before you submit a pull request, please follow these guidelines:
Open an issue on our GitHub repository with your issue that you’d like to fix or feature that you’d like to implement.
Perform the changes, commit and push them either to a new branch within roocs/daops or to your personal fork of daops.
Warning
Try to keep your contributions within the scope of the issue that you are addressing. While it might be tempting to fix other aspects of the library as they come up, it’s better to simply flag the problems in case others are already working on them.
Consider adding a “# TODO:” comment if the need arises.
Pull requests should raise test coverage for the daops library. Code coverage is an indicator of how extensively tested the library is. If you are adding a new set of functions, they must be tested and coverage percentage should not significantly decrease.
If the pull request adds functionality, your functions should include docstring explanations. So long as the docstrings are syntactically correct, sphinx-autodoc will be able to automatically parse the information. Please ensure that the docstrings adhere to one of the following standards:
The pull request should work for Python 3.6, 3.7 and 3.8 as well as raise test coverage. Pull requests are also checked for documentation build status and for PEP8 compliance.
- The build statuses and build errors for pull requests can be found at:
Warning
PEP8 and Black compliance are strongly enforced. Ensure that your changes pass Flake8 and Black tests prior to pushing your final commits to your branch. Code formatting errors are treated as build errors and will block your pull request from being accepted.
Credits¶
Development Lead¶
Ag Stephens ag.stephens@stfc.ac.uk
Co-Developers¶
Eleanor Smith eleanor.smith@stfc.ac.uk
Carsten Ehbrecht ehbrecht@dkrz.de
Contributors¶
None yet. Why not be the first?
Version History¶
v0.3.0 (2020-11-19)¶
Updating doc strings and documentation.
Breaking Changes¶
clisops>=0.4.0 and roocs-utils>=0.1.4 are now used.
The data_refs parameter of daops.ops.subset.subset was renamed to collection.
The space parameter of daops.ops.subset.subset was renamed to area.
The chunk_rules parameter of daops.ops.subset.subset was renamed to split_method.
The filenamer parameter of daops.ops.subset.subset was renamed to file_namer.
An output_type parameter option was added to daops.ops.subset.subset.
The data_root_dir parameter is no longer needed in daops.ops.subset.subset.
data_root_dir is no longer a parameter of daops.utils.consolidate.consolidate.
New Features¶
Added notebook with example usage.
Config file now exists at daops.etc.roocs.ini. This can be overwritten by setting the environment variable ROOCS_CONFIG to the file path of a config file.
split_method implemented to split output files if they exceed the memory limit named file_size_limit provided in clisops.etc.roocs.ini. Currently only time:auto exists, which splits evenly on time ranges.
file_namer implemented in the subset operation, with simple and standard options: simple numbers output files whereas standard names them according to the input dataset.
Directories, file paths and dataset ids can now be used as inputs to the subset operation.
Fixer class now looks up fixes on our elasticsearch index.
Other Changes¶
Updated documentation.
Functions that take the data_refs parameter have been changed to use the collection parameter instead.
Functions that take the data_ref parameter have been changed to use the dset parameter instead.
v0.2.0 (2020-06-22)¶
Updated to use clisops v0.2.0 (#17)
Added xarray aggregation tests (#16)
v0.1.0 (2020-04-27)¶
First release with clisops v0.1.0.