Cdo{rb,py}

Why the ...

Are you happy using CDO in a shell, but hesitant to use a scripting language instead because shells are so easy to use? Here are some features you will hardly get in a shell context:

  • Direct data access via numpy/narray
    If you have python-netcdf4 or scipy installed, you will have direct access to the values of your fields. Python and Ruby offer a rich set of scientific libraries to work with these data arrays, e.g. plotting.
  • Automatic tempfile handling
    When dealing with temporary data, you never have to manage files yourself: no cleanup, no manual creation or renaming of files. For whatever you want to keep, give a real output file name - the rest is done automatically.
  • Flexible parallelization
    You have a huge, long-running job that could be done in parallel? In general, shells do not offer fine-grained control over parallelism - a plain & is not enough for 100 to 1000 jobs, because your system will slow down dramatically and within a minute you'll get a call from the administrators.
    Both Python and Ruby offer just what you need: run 1000 routine calls with only, say, 12 concurrent processes or threads. With Cdo{rb,py} you can now make use of it!
  • Conditional processing
    Avoid re-processing if the output file is already on disk. Used wisely, this can speed up your scripts tremendously. This behaviour can be switched on globally or for single calls. E.g. when analysing output data from a running experiment, you can run the same script over and over again and always get the latest results.
  • Write new operators out of old ones
    The Python and Ruby bindings are open libraries that can be extended at runtime. If you need a new operator, just write it in Python or Ruby!

There is a github repository for easy code sharing, where the changelog is also tracked.

If you have questions, please use the CDO forum.

what it is (not) ...

This scripting language package is essentially a wrapper around the CDO binary: it parses method arguments and options, builds a command line and executes it. There is no shared-library backend that calls CDO operators directly. This has some advantages (a sketch of the mapping follows the list):

  • operator chaining is fully supported
  • multiple CDO binaries can be used at the same time using setCdo()
  • packages are highly portable, because they are pure python/ruby implementations
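
As a rough sketch of this mapping (the file names are placeholders), enabling debug output shows the command line a method call is translated into:

    from cdo import Cdo

    cdo = Cdo()
    cdo.debug = True   # print the constructed command lines as they are executed

    # roughly equivalent to running: cdo -f nc timmin ifile.nc ofile.nc
    cdo.timmin(input='ifile.nc', output='ofile.nc', options='-f nc')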

Usage

Almost all features are covered by unit tests. These should be a good starting point to get an impression of how to use the package.

Both bindings are tested with the unix and the win32 versions of CDO. Please note that returning arrays by setting returnCdf is not tested, due to the lack of a corresponding netcdf library on windows. There are precompiled windows versions of netcdf, but I will not spend time getting them to run.

Before doing anything else, the libraries must be loaded in the usual way:

from cdo import *   # python version
cdo = Cdo()

In the python version an object has to be created for internal reasons, whereas this is not necessary in Ruby. This may change in the future, but for now it is only a minor difference:
require 'cdo'       # ruby version

online/offline help

For all non-operator methods, the automatically generated documentation from rubygems might be helpful. Operator documentation can be viewed online or directly by calling

cdo -h <operator pattern>
or within the interactive python/ruby shell. Both of the following examples display the built-in help for sinfov:

  • Python:
    from cdo import *
    help(Cdo().sinfov)
    
    or
    from cdo import *
    cdo = Cdo()
    help(cdo.sinfov)
  • Ruby
    require 'cdo'
    cdo = Cdo.new
    cdo.help('sinfov')     # or Cdo.help(:sinfov)
    

IO

Input and output files can be set with the keywords input and output:

    cdo.infov(:input => ifile)      #ruby version
    cdo.showlevel(:input => ifile)
    cdo.infov(input=ifile)          #python version
    cdo.showlevel(input=ifile)
    cdo.timmin(:input => ifile ,:output => ofile)   #ruby version
    cdo.timmin(input = ifile,    output =  ofile)   #python version

Options

Command-line options like '-f' or '-P' can be used via the options keyword:

    cdo.timmin(:input => ifile ,:output => ofile,:options => '-f nc') #ruby version
    cdo.timmin(input = ifile,    output = ofile,  options = '-f nc')  #python version

Operator arguments have to be given as the first method argument:

    cdo.remap(gridFile,    weightFile,:input => ifile,:output => ofile,:options => '-f nc') #ruby version
    cdo.remap(gridFile+","+weightFile,  input =  ifile, output =  ofile, options = '-f nc')  #python version

or
    cdo.seltimestep('1/10',:input => ifile,:output => ofile,:options => '-r -b F64') #ruby version
    cdo.seltimestep('1/10', input =  ifile, output =  ofile, options =  '-r -b F64') #python version

Operator Chains

To take real advantage of CDO's internal parallelism, you should work with operator chains as much as possible:

    cdo.setname('random',:input => "-mul -random,r10x10 -enlarge,r10x10 -setyear,2000 -for,1,4",:output => ofile,:options => '-f nc') #ruby version
    cdo.setname('random', input =  "-mul -random,r10x10 -enlarge,r10x10 -setyear,2000 -for,1,4", output =  ofile, options =  '-f nc') #python version

Another good example, taken from the Tutorial, illustrates the different ways of chaining: the chain

cdo sub -dayavg ifile2 -timavg ifile1 ofile
is represented by

cdo.sub(:input => "-dayavg #{ifile2} -timavg #{ifile1}", :output => ofile)  #ruby
cdo.sub(input = "-dayavg " + ifile2 + " -timavg " +ifile1, output = ofile)  #python

The serial version (which prevents internal parallelism, creates unnecessary temporary files and is mentioned for educational reasons only) would look like

cdo.sub(:input => Cdo.dayavg(:input => ifile2) + " " + Cdo.timavg(:input => ifile1), :output => ofile)  #ruby
cdo.sub(input  =  cdo.dayavg(input  =  ifile2) + " " + cdo.timavg(input  =  ifile1), output  =  ofile)  #python

or using the join-method:
cdo.sub(:input => [Cdo.dayavg(:input => ifile2),Cdo.timavg(:input => ifile1)].join(" "), :output => ofile)  #ruby
cdo.sub(input  =  " ".join([cdo.dayavg(input  =  ifile2),cdo.timavg(input  =  ifile1)]), output  =  ofile)  #python

Special Features

Tempfile handling

If the output stream is omitted, a temporary file is written and its name is the return value of the call:

    ofile = cdo.timmin(:input => ifile ,:options => '-f nc')   #ruby version
    ofile = cdo.timmin(input  =  ifile,  options =  '-f nc')   #python version

Here, the output files are automatically removed when the script finishes. Manual cleanup is not necessary any more, unless you encounter a crash. In that case you can use the new (1.4.0) method

cdo.cleanTempDir()
or set an alternative directory for storing temporary files with
cdo = Cdo(tempdir='/path/to/new/tempdir') #python
cdo = Cdo.new(tempdir: '/path/to/new/tempdir') #ruby

Conditional Processing

When processing a large number of input files, as is the case in a running experiment, it can be very helpful to suppress the creation of intermediate output if those files are already there. This can speed up your post-processing. By default, output is created no matter whether something is overwritten or not. Conditional processing can be used in two different ways (a combined sketch follows the list):

  • global setting
    cdo.forceOutput = False   #python
    or
    Cdo.forceOutput = false   #ruby
    This switch changes the default behavior (example)
  • operator option
    cdo.stdatm("0,10,20",output = ofile, force =  False)  #python
    or
    Cdo.stdatm(0,10,20,:output => ofile,:force => false)  #ruby
    The usage of this option allows you to set up the output action very precisely without changing the default (example for a good place to use this feature)
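
A minimal combined sketch of both variants (python; the file names are placeholders) might look like this:

    from cdo import Cdo

    cdo = Cdo()
    cdo.forceOutput = False   # global setting: reuse existing output files

    # the first call computes monmean.nc, the second one finds it on disk
    # and returns immediately
    cdo.monmean(input='ifile.nc', output='monmean.nc')
    cdo.monmean(input='ifile.nc', output='monmean.nc')

    # per-call override: force recomputation despite the global setting
    cdo.monmean(input='ifile.nc', output='monmean.nc', force=True)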

Multi-threading - Ruby

When things can be done in parallel, Python and Ruby offer a smart way to handle this without too much overhead. A Ruby example (see also the Tutorial) should illustrate how it can be done:

require 'cdo'
require 'parallel'

iFile                 = ARGV[0].nil? ? 'ifs_oper_T1279_2011010100.grb' : ARGV[0]
targetGridFile        = ARGV[1].nil? ? 'cell_grid-r2b07.nc'            : ARGV[1] # grid file
targetGridweightsFile = ARGV[2].nil? ? 'cell_weight-r2b07.nc'          : ARGV[2] # pre-computed interpolation weights
nWorkers              = ARGV[3].nil? ? 8                               : ARGV[3].to_i # number of parallel processes

cdo = Cdo.new

# lets work in debug mode
cdo.debug = true

# split the input file wrt to variable names,codes,levels,grids,timesteps,...
splitTag = "ifs2icon_skel_split_" 
cdo.splitname(:input => iFile, :output => splitTag,:options => '-f nc')

# collect files from the split
files = Dir.glob("#{splitTag}*.nc")

# remap variables in parallel
ofiles = Parallel.map(files,:in_processes => nWorkers) {|file|
    basename = file[0..-(File.extname(file).size+1)]
    ofile = cdo.remap(targetGridFile,targetGridweightsFile,
                      :input => file,
                      :output => "remapped_#{basename}.nc")
}

# Merge all the results together
cdo.merge(:input => ofiles.join(" "),:output => 'mergedResults.nc')

In this case the parallelization is done per variable: the input is split by variable name, and the remapping of the resulting files is distributed over a user-defined number of worker processes with the parallel gem. Only a few lines (the require 'parallel', the nWorkers argument and the Parallel.map call) had to be added compared to a serial version. A corresponding python version based on multiprocessing is shown below.

Multi-threading - Python

A multiprocessing based example may look like

from cdo import *
import multiprocessing

def showlevel(arg):
    return cdo.showlevel(input=arg)

cdo       = Cdo()
cdo.debug = True
ifile     = '/home/ram/local/data/cdo/GR30L20_fx.nc'
pool      = multiprocessing.Pool(1)
results   = []

for i in range(0,5):
    results.append(pool.apply_async(showlevel, [ifile]))

pool.close()
pool.join()

for res in results:
    print(res.get())

Data access via numpy/narray/xarray/cdf

When working with netcdf, it is possible to get access to the data in several additional ways:

  1. a file handle: Using a file handle offers the flexibility to go through the whole file with all its information like variables, dimensions and attributes. To get such a handle from a cdo call, use the returnCdf keyword or use the readCdf method:
    cdo.stdatm("0", options = "-f nc", returnCdf  =  True).variables["P"][:]  #python, access variable 'P' with scipy.io
    cdo.stdatm(0, :options => "-f nc", :returnCdf => true).var("P").get       #ruby , access with ruby-netcdf
    or return the pure handle with
    cdo.readCdf(ifile)  #python
    cdo.readCdf(ifile)  #ruby
    
  2. a numpy/narray object: If a certain variable should be read in, use the returnArray instead of returnCdf:
    pressure = cdo.stdatm("0", options = "-f nc",  returnArray = 'P')  #python
    pressure = cdo.stdatm(0, :options => "-f nc", :returnArray => 'P')  #ruby
  3. a masked array: If the target variable has missing values, i.e. makes use of the _FillValue attribute, the returned structure reflects this: the python version returns a masked array, the ruby version uses a special version of NArray called NArrayMiss. As an example, let's mask out the ocean from the global topography:
    oro = cdo.setrtomiss(-10000,0, input =  cdo.topo( options =  '-f nc'), returnMaArray =  'topo')  #python
    oro = cdo.setrtomiss(-10000,0,:input => cdo.topo(:options => '-f nc'),:returnMaArray => 'topo')  #ruby
  4. netcdf file-handles with
    oro = cdo.setrtomiss(-10000,0, input =  cdo.topo( options =  '-f nc'), returnCdf = True)  #python
    oro = cdo.setrtomiss(-10000,0,:input => cdo.topo(:options => '-f nc'),:returnCdf => true)  #ruby

As a python-only extra there is the option to return xarray DataArray/Dataset objects with

oro = cdo.setrtomiss(-10000,0, input =  cdo.topo( options =  '-f nc'), returnXArray =  'topo')
oro = cdo.setrtomiss(-10000,0, input =  cdo.topo( options =  '-f nc'), returnXDataset = True)

Have a look into the documentation of the underlying netcdf libraries to get an overview of their functionality.
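
As a small sketch of what such a handle offers (assuming the scipy/netcdf4 backend is installed; the exact attributes depend on the backend actually used):

    from cdo import Cdo

    cdo = Cdo()
    handle = cdo.readCdf(cdo.topo(options='-f nc'))   # plain handle of a generated file

    print(handle.variables.keys())       # variable names, e.g. 'topo'
    print(handle.dimensions)             # dimensions of the file
    print(handle.variables['topo'][:])   # the raw data array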

Prerequisites

The python module requires scipy.io (or pycdf as a fallback), whereas the ruby module needs ruby-netcdf. These dependencies are not handled automatically by pip or gem, because they are optional. Scipy and netcdf4-python are available as precompiled packages for most linux/unix distributions. If this is not the case for your favorite one, you can also install them via pip. The ruby-netcdf package is available as a gem:

  • Ruby:
    gem install ruby-netcdf
    or
    gem install ruby-netcdf --user-install
  • Python:
    pip install scipy
    or visit the homepage for help on manual installation

Use Cases: Plotting

Examples: Python

from cdo import *
cdo   = Cdo()                                                         # create the CDO caller
ifile = 'tsurf.nc'                                                    # input: surface temperature
cdo.fldsum(input=ifile)                                               # compute the timeseries of global sum, return a temporary filename
vals  = cdo.fldsum(input=ifile,returnCdf=True).variables['tsurf'][:]  # return the timeseries as numpy array
print(cdo.fldsum(input=ifile,returnCdf=True).variables)               # get a list of all variables in the file 

Basic plotting:

from cdo import *
import matplotlib.pyplot as plt
ifile = 'EH5_AMIP_1_TSURF_1991-1995.nc'
cdo   = Cdo()

# Compute the field mean value timeseries and return it as a numpy array
vals  = cdo.fldmean(input=ifile,returnCdf=True).variables['tsurf'][:] 

# make it 1D
vals = vals.flatten()

# Plot the cumulative sum of the variation
plt.plot((vals-vals.mean()).cumsum())
plt.show()
produces: (figure: cumulative sum of the global mean surface temperature variation)

2D plotting:

import xarray
from cdo import Cdo
import numpy
from matplotlib import pylab

cdo = Cdo()

# plotting topography with XArray
cdo.topo(returnXArray='topo').plot()
pylab.show()

# plotting based on masked numpy arrays
#   orography with missing values below 0
oro = cdo.setrtomiss(-20000,0,input='-sellonlatbox,-20,60,20,60  -topo',returnMaArray='topo')
pylab.imshow(numpy.flipud(oro))
pylab.show()
(figures: XArray topography plot and masked-numpy orography plot)

Examples: Ruby

require 'cdo'
cdo   = Cdo.new
ifile = 'tsurf.nc'                                                   # input: surface temperature
vals  = cdo.fldsum(:input => ifile,:returnCdf => true).var('tsurf').get  # return the global sum timeseries as narray
puts cdo.fldsum(:input => ifile,:returnCdf => true).var_names        # get a list of all variables in the file 

If you want some basic plotting, use the Ruby bindings of the GNU Scientific Library. You can install them like cdo, via gem. Here's a short example:

require 'cdo'
require 'gsl'

cdo   = Cdo.new
ifile = "data/examples/EH5_AMIP_1_TSURF_1991-1995.nc" 
tmean = cdo.fldmean(:input => ifile,:returnCdf => true).var('tsurf').get
tmean.to_gv.plot("w d title 'AMIP global mean surface temp'")

which shows: (figure: global mean surface temperature timeseries)

In this context the variable tmean is of type NArray, which is the ruby counterpart of numpy and has several methods of its own. For filtering out the temporal behaviour of the above time series, you could subtract the mean value and display the cumulative sum by adding:

(tmean-tmean.mean)[0,0,0..-1].cumsum.to_gv.plot("w d title 'CUMSUM of global mean surface temp variation'")

with the result: (figure: cumulative sum of the global mean surface temperature variation)

Use Cases: Interpolation, Root finding, Data fitting, ...

Through the numpy/narray interface, both the python and the ruby version offer a huge amount of extra functionality via several 3rd-party libraries.
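
As one example (a sketch; the file and variable names are placeholders), a linear trend can be fitted to a field-mean timeseries with plain numpy:

    import numpy as np
    from cdo import Cdo

    cdo = Cdo()

    # field-mean timeseries of a surface temperature file
    vals = cdo.fldmean(input='tsurf.nc', returnCdf=True).variables['tsurf'][:].flatten()

    # least-squares fit of a linear trend
    t = np.arange(vals.size)
    slope, offset = np.polyfit(t, vals, 1)
    print("trend per timestep:", slope)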

Write your own operators
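
Since the bindings are ordinary python/ruby libraries, a new "operator" can simply be a function or method that combines existing calls. A minimal sketch (the operator name tanom and the file names are made up for illustration):

    from cdo import Cdo

    cdo = Cdo()

    # hypothetical new operator: anomalies with respect to the temporal mean,
    # built entirely from existing operators and chaining
    def tanom(ifile, ofile):
        return cdo.sub(input="{0} -timmean {0}".format(ifile), output=ofile)

    tanom('tsurf.nc', 'tsurf_anomalies.nc')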

Future versions

Neither cdo module is directly linked to a specific CDO version. Instead you can switch to whatever CDO version you have installed: use the setCdo method to select another CDO binary. When CDO is updated and new operators are available, they are usable in the python and ruby modules automatically, without any update of the bindings.
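
A short sketch (the binary path and input file are placeholders):

    from cdo import Cdo

    cdo = Cdo()
    cdo.setCdo('/path/to/another/cdo')   # switch to an alternative binary at runtime

    # subsequent calls use the selected binary
    cdo.sinfov(input='ifile.nc')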

Installation

CDO can easily be accessed via Ruby and Python. For each of these two there is a dedicated package which can be installed from public servers with their own package management systems: gem for Ruby and pypi for Python. The interfaces of both packages are identical.

Note on Windows: There is no native windows build of CDO, since it is designed for POSIX-compatible systems. Please have a look at Win32 for how to run CDO on windows-based systems. The short version is: install cygwin plus some additional cygwin packages (incl. Ruby or Python) and use the pre-built cdo package from the download area.

Ruby

Ruby's package system is called _gem_. The cdo module is located here. Its installation is rather easy: Just type

gem install cdo
and you'll get the latest version installed on your system. Installation usually requires root privileges, so it might be necessary to prepend sudo to the command. gem has a great built-in help:
gem help install
will show everything you need for installation. If you do not have root access to your machine, another installation directory should be chosen with the --install-dir option, or you use
gem install cdo --user-install
for an installation under $HOME/.gem.

Ruby 1.9.x comes with gem included, but some distros like debian and its derivatives ship it as an extra package. You might have to watch out for a rubygems package.

Python: pip

The cdo module can also be installed for python using pypi, the python package index. Cdo can be found here. If _pip_ is installed on your system, just type

pip install cdo
For user installations, use
pip install cdo --user
Please Note: For upgrading with pip, you have to remove the temporary directories first. Otherwise the upgrade will not take place:
rm -rf /tmp/pip /tmp/src && pip install --build=/tmp/pip --src=/tmp/src --user cdo --upgrade

Without pip, you should download the tar file and run (possibly requiring root privileges)
python setup.py install
in the extracted directory.

Python: conda

The conda-forge channel has separate packages for the CDO binary and the python bindings:

                     CDO binary   python bindings
conda package name   cdo          python-cdo

Installation works as usual with

 conda install -c conda-forge python-cdo

For JupyterHub users - Watch out for your temporary data

In contrast to plain python scripts, the JupyterHub environment needs special attention when using CDO's python bindings.

General handling of intermediate results

cdo.py allows you to omit output files - instead it creates temporary files for you on the go and cleans everything up after the script has finished:

>>> ofile = cdo.fldmean(input="myfile.nc")
>>> print(ofile)
/tmp/cdoPyhhp_h4yr

Compared to bash, this feature eases the process of writing analysis scripts, because large parts of the boilerplate code are gone.

Normal python scripts run on the order of seconds or minutes, so temporary data is removed rather quickly. JupyterHub sessions, on the other hand, can run for hours. Furthermore, they share their nodes with other sessions.

On many such systems /tmp resides in main memory (tmpfs), so if there are too many temporary files at the same time, the node becomes unusable and has to be restarted.

How to avoid this?

  1. Cleanup whenever possible: cdo.py comes with a routine
    cdo.cleanTempDir()
    This will remove all temporary files which were created by cdo.py and are owned by you, so it also works on temporary files created on that host in a former session.
    Add this call in between larger blocks of code where the temporary data is not needed (e.g. because you used an explicit name for the main results).
  2. Choose your own directory for temporary data right at the start with
    cdo = Cdo(tempdir="/work/mh0287/{}/tmp".format(os.environ['USER']))
    With a setup like this you don't interfere with main memory and have a lot more space. The price you pay is file system performance: every normal file system is a lot slower than `/tmp` in main memory. But you get more control and a more stable system.

Final thoughts

JupyterHub is not a normal script environment, because there is no prescribed order of execution: cells can be executed top-down, but they don't have to be. That's why `cdo.cleanTempDir()` alone is not a good solution here. If you organize your code in functions, you can add it there, but the interactive nature of the workflow does not ensure that the cleanup runs on a regular basis.

My recommendation for JupyterHub is

cdo = Cdo(tempdir="<big-directory>")
In my opinion the performance penalty is worth the stability of the system. Technically `cdo.cleanTempDir()` can still be used on whatever tempdir you choose, but moving away from `/tmp` really helps.