Atmospheric Algorithm Antics

Where the sky and programming ability are the limits

Accessing NetCDF datasets with Python - Part 1

Since writing my original tutorial Python - NetCDF reading and writing example with plotting, I have received a lot of questions and feedback. As a result, I decided to expand my original tutorial into a multi-part blog post. In this series, we will cover

  • what are NetCDF files,
  • classic NetCDF vs NetCDF-4,
  • reading NetCDF files into Python,
  • plotting data,
  • accessing online data sets,
  • generating NetCDF files,
  • Climate and Forecast Convention compliance, and
  • file size/compression.

In this series, we will discuss what Unidata NetCDF (Network Common Data Form) files are and then transition to accessing NetCDF file data with Python. Specifically, we will focus on using the netCDF4 Python module developed by NOAA's Jeff Whitaker.

Throughout this series, we will use the NCEP/NCAR Reanalysis I (Kalnay et al. 1996) [NCEP/NCAR Reanalysis data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their Web site at http://www.esrl.noaa.gov/psd/].

OK, let's get started!

What is a NetCDF file?

At the most basic level, NetCDF exists to keep a new file format from popping up for every new data set. Each file format requires its own drivers and utilities, which is a problem for everyone: users must not only learn the format but also write new code to read the files. This can be very time consuming.

Enter NetCDF! Unidata NetCDF stands for Network Common Data Form. As the name suggests, its goal is to make a universal data file format. One format to rule them all, one format to... I digress. UCAR’s Unidata created the format as an offshoot of NASA’s Common Data Format in hopes of making the file format platform independent. NetCDF is nice because it also helps manage big data (No, not the band Big Data. Dealing with them might be a different story.) We are talking about large, multidimensional data sets. In weather and climate work, the state of the atmosphere is represented by state variables that are typically defined at points of latitude, longitude, height, and time. These data sets can have file sizes that quickly grow into the gigabytes.

OK, it is a universal file format which works well for the types of data used in weather and climate. However, NetCDF doesn't stop there. Borrowing from the FAQ section on Unidata’s website, NetCDF data is:

  • Self-Describing. A NetCDF file includes information about the data it contains.
  • Portable. A NetCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
  • Scalable. A small subset of a large dataset may be accessed efficiently.
  • Appendable. Data may be appended to a properly structured NetCDF file without copying the dataset or redefining its structure.
  • Sharable. One writer and multiple readers may simultaneously access the same NetCDF file.
  • Archivable. Access to all earlier forms of NetCDF data will be supported by current and future versions of the software.

Why use NetCDF?

As highlighted in the scientific journal Nature's special Challenges in irreproducible research, the academic community is quickly moving to enact standards that address problems related to irreproducibility. As a result, many journals now mandate that the data used in the research be included with the manuscript submission. As we will discuss in more detail shortly, NetCDF by its construction assists in achieving these goals because the files are self-describing, portable, sharable, and archivable.

How is the data self-describing?

Every NetCDF file contains metadata about the data in the file. This metadata is broken down into variables, dimensions, and attributes.

  • Variables. Variables contain data stored in the NetCDF file. This data is typically in the form of a multidimensional array. Scalar values are stored as 0-dimension arrays.
  • Dimensions. Dimensions can be used to describe physical space (latitude, longitude, height, and time) or indices of other quantities (e.g. weather station identifiers).
  • Attributes. Attributes are modifiers for variables and dimensions. Attributes act as ancillary data to help provide context. An example of an attribute would be a variable's units or fill/missing values.

It sounds like self-describing can get out of hand. Does anyone standardize the descriptions?

Yes, they do! Many agencies and groups created NetCDF conventions. The main convention being used today is CF Conventions (Climate and Forecast). However, if you are curious or encounter data using a different convention, Unidata maintains a list you can use to find out more information. In this series, we will generate files that are CF compliant. If you are not in a field associated with weather or climate, the CF Conventions have general data guidelines that can be adapted to your purposes.

What is in the description?

I’ve talked a lot about the file being self-describing but what does that actually mean? I think the best thing to do is walk through an example. In this example, we will be looking at output generated by a Python function called ncdump. This function mimics the header output of the Unidata ncdump utility. Please note: at this stage, I will only be discussing the output from this code.

from netCDF4 import Dataset
from ncdump import ncdump  # local helper that mimics the ncdump utility's header output

# Open a remote NCEP/NCAR Reanalysis file read-only via OPeNDAP
nc_fid = Dataset("http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.2012.nc", 'r')
nc_attrs, nc_dims, nc_vars = ncdump(nc_fid)
NetCDF Global Attributes:
    Conventions: u'COARDS'
    title: u'4x daily NMC reanalysis (2012)'
    description: u'Data is from NMC initialized reanalysis\n(4x/day).  These are the 0.9950 sigma level values.'
    platform: u'Model'
    references: u'http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html'
    history: u'created 2011/12 by Hoop (netCDF2.3)\nConverted to chunked, deflated non-packed NetCDF4 2014/09'
    DODS_EXTRA.Unlimited_Dimension: u'time'
NetCDF dimension information:
    Name: time
        size: 1464
        type: dtype('float64')
        long_name: u'Time'
        delta_t: u'0000-00-00 06:00:00'
        standard_name: u'time'
        axis: u'T'
        units: u'hours since 1800-01-01 00:00:0.0'
        actual_range: array([ 1858344.,  1867122.])
        _ChunkSize: 1
    Name: lat
        size: 73
        type: dtype('float32')
        units: u'degrees_north'
        actual_range: array([ 90., -90.], dtype=float32)
        long_name: u'Latitude'
        standard_name: u'latitude'
        axis: u'Y'
    Name: lon
        size: 144
        type: dtype('float32')
        units: u'degrees_east'
        long_name: u'Longitude'
        actual_range: array([   0. ,  357.5], dtype=float32)
        standard_name: u'longitude'
        axis: u'X'
NetCDF variable information:
    Name: air
        dimensions: (u'time', u'lat', u'lon')
        size: 15389568
        type: dtype('float32')
        long_name: u'4xDaily Air temperature at sigma level 995'
        units: u'degK'
        precision: 2
        least_significant_digit: 1
        GRIB_id: 11
        GRIB_name: u'TMP'
        var_desc: u'Air temperature'
        dataset: u'NMC Reanalysis'
        level_desc: u'Surface'
        statistic: u'Individual Obs'
        parent_stat: u'Other'
        missing_value: -9.96921e+36
        actual_range: array([ 191.1000061,  323.       ], dtype=float32)
        valid_range: array([ 185.16000366,  331.16000366], dtype=float32)
        _ChunkSize: array([  1,  73, 144])

In the output generated by the short snippet of code, we see that there are three main sections (global attributes, dimensions, and variables). Under each of the primary sections, you will see additional information.

In the global attribute section, you will see attributes, as the name suggests. A well-constructed NetCDF file will record the convention used (in this case, 'COARDS'), a title, a description, and a history of how the file has been modified.

In the dimension and variable sections, you will see the name of each dimension or variable followed by its attributes. These attributes typically include units, a long_name that offers a more detailed description, data range information, etc. Variables are distinguished from dimensions in that variables are typically functions of one or more dimensions. In our example, 'air' has time, lat, and lon as its dimensions.

Types of NetCDF files

There are four NetCDF format variants according to the Unidata NetCDF FAQ page:

  • the classic format,
  • the 64-bit offset format,
  • the NetCDF-4 format, and
  • the NetCDF-4 classic model format.

While this seems to add even more complexity to using NetCDF files, the reality is that unless you are generating NetCDF files, most applications read NetCDF files regardless of type with no issues. This aspect has been abstracted for the general user!

The classic format has its roots in the original version of the NetCDF standard. It is the default for new files and is the format of the NCEP/NCAR Reanalysis we will use in a later part of the series.

The 64-bit offset format simply allows larger datasets to be created. Prior to the offset, files were limited to 2 GiB. A 64-bit machine is not required to read a 64-bit offset file, so this point should not be a concern for most users.

The NetCDF-4 format adds many new features related to compression and multiple unlimited dimensions (we'll discuss both of these points later). NetCDF-4 is essentially a special case of the HDF5 file format.

The NetCDF-4 classic model format attempts to bridge gaps between the original NetCDF file and NetCDF-4.

Luckily for us, the NetCDF4 Python module handles many of these differences. The main decision when picking a type is to think about your user. If the user is going to access your data via Fortran, the classic format might be the best choice. If you have a large dataset that can benefit from compression, NetCDF-4 might be a better choice.

Wrapping up

Alright! This concludes the first part of our NetCDF journey. If you are interested in learning more about what NetCDF files are, I would strongly urge you to explore Unidata's NetCDF website. As noted several times, this post relied heavily on the content on the NetCDF website. If you are trying to figure out how data in a file is actually structured and how to access that data, we'll address this with a hands-on approach in the next post.

This post was written using an IPython notebook. You can download this notebook, or see a static view here.

*Please note that Accessing NetCDF datasets with Python - Part 1 by Chris Slocum is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.*
