Python includes built‐in structures and mathematical commands, but its capabilities can be dramatically extended using modules. Modules are written in Python or a compiled language like C to help simplify common, general, or redundant tasks. For instance, the datetime module helps programmers manipulate calendar dates and times using a variety of units. Packages contain one or more modules, which are often designed to facilitate tasks that follow a central theme. The terms library and distribution are often used interchangeably with package.
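As a minimal sketch of the kind of task the datetime module simplifies, the lines below add a time interval to a date and express the difference in hours. The dates and variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical observation time: noon on 1 January 2020
start = datetime(2020, 1, 1, 12, 0)

# Add 36 hours to the starting time
later = start + timedelta(hours=36)

# Subtracting two datetimes gives a timedelta object
elapsed = later - start
elapsed_hours = elapsed.total_seconds() / 3600

print(later.isoformat())     # ISO 8601 timestamp
print(elapsed_hours)         # elapsed time in hours
print(later.strftime("%j"))  # day of year as a zero-padded string
```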
At the time of writing, there are over 200,000 Python packages registered on pypi.org and more that live on the internet in code repositories such as GitHub (https://github.com/). Many of the most popular packages are developed and maintained by large online communities. This massive effort benefits you as a scientist because many common tasks have already been implemented in Python by someone else. It can also create a dilemma for scientists and researchers: the time saved by using existing code must be weighed against the time spent researching and vetting so many code options. Additionally, because many of these packages do not have full‐time staff support, the projects can be abandoned by their development teams, and your code could eventually become obsolete.
In your research, I suggest you use three rules when choosing packages to learn and work with:
1 Use established packages.
2 Use packages that have a large community of support.
3 Use code that is efficient, both in reducing coding time and in increasing speed of performance.
Following is a list of the main Python packages that I will cover in this text.
2.2.1 NumPy
NumPy is the fundamental package for scientific computing with Python. It can work with multidimensional arrays, contains many advanced mathematical functions, and is useful for linear algebra, Fourier transforms, and for generating random numbers. NumPy also allows users to encapsulate data efficiently. If you are familiar with MATLAB, you will feel very comfortable using this package.
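The short sketch below illustrates a few of these capabilities: a multidimensional array, elementwise arithmetic, a small linear algebra solve, and random number generation. The values and variable names are invented for illustration only.

```python
import numpy as np

# Hypothetical 2 x 3 grid of temperatures in degrees Celsius
temps_c = np.array([[10.0, 12.5, 15.0],
                    [20.0, 22.5, 25.0]])

# Arithmetic is applied to every element at once (no explicit loop)
temps_k = temps_c + 273.15

# Average over the first axis (rows), leaving one value per column
col_mean = temps_c.mean(axis=0)

# Linear algebra: solve the system A x = b
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)

# Random numbers from a seeded generator, so results are repeatable
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=temps_c.shape)

print(temps_k.shape, col_mean, x)
```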
2.2.2 Pandas
Pandas is a library that permits using data frame (stylized DataFrame) structures and includes a suite of I/O and data manipulation tools. Unlike NumPy, Pandas allows you to reference named columns instead of using indices. With Pandas, you can perform the same kinds of essential tasks that are available in spreadsheet programs (but now automated and with fewer mouse clicks!). For those who are familiar with the R programming language, the Pandas DataFrame mimics the R data.frame structure.
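To illustrate referencing columns by name, here is a minimal sketch using a small, made-up table of station temperatures. The column names and values are assumptions for the example, not data from this text.

```python
import pandas as pd

# Hypothetical observations from two stations
obs = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "temp_c": [10.0, 14.0, 20.0, 22.0],
})

# Select rows by a condition on a named column
warm = obs[obs["temp_c"] > 12.0]

# Spreadsheet-style summary: mean temperature for each station
station_mean = obs.groupby("station")["temp_c"].mean()

print(warm)
print(station_mean)
```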
A limitation of Pandas is that it can only operate with 2D data structures. More recently, the xarray package has been developed to handle higher‐dimensional datasets. In addition, Pandas can be somewhat inefficient because the library is technically a wrapper for NumPy, so it can consume up to three times as much memory, particularly in Jupyter Notebook. For larger row operations (500K rows or greater), the differences can even out (Goutham, 2017).
2.2.3 Matplotlib
Matplotlib is a plotting library, arguably the most popular one. Matplotlib can generate histograms, power spectra, bar charts, error charts, and scatterplots with a few lines of code. The plots can be completely customized to suit your aesthetics. Because Matplotlib's plotting commands resemble those in MATLAB, this is another package where MATLAB experience may come in handy.
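As a rough sketch of how few lines a basic figure requires, the example below draws a histogram and a scatterplot from randomly generated numbers. The data, file name, and styling choices are arbitrary, and the non-interactive Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; figures are written to file
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 500 samples from a standard normal distribution
rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# One figure with two side-by-side panels
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: histogram with 20 bins
ax_hist.hist(samples, bins=20, color="steelblue")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Count")

# Right panel: each sample plotted against the next one
ax_scatter.scatter(samples[:-1], samples[1:], s=5, color="darkorange")
ax_scatter.set_xlabel("Sample i")
ax_scatter.set_ylabel("Sample i + 1")

fig.tight_layout()
fig.savefig("histogram_scatter.png", dpi=100)
```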
2.2.4 netCDF4 and h5py
I will discuss two common self‐describing data formats, netCDF and HDF, in Section 3.2.3. Two major packages for importing these formats are the netCDF4 and h5py packages. These tools are advantageous because the user does not need to know how to parse the input files, so long as the files follow standard formatting. These two packages import the data, which can then be converted to NumPy arrays to perform more rigorous data operations.
2.2.5 Cartopy
Cartopy is a package for projecting geospatial data to maps. It can also be used to access a wealth of features, including land/ocean masks and topography. Many projections are available, and you can easily transform the data between them.
Previously, Basemap was the primary package for creating maps. You may come across examples that use it online. However, the package is now deprecated and Cartopy has become the primary package that interfaces with Matplotlib.
Cartopy is available from the SciTools organization. It was originally developed by the UK Met Office and has since expanded into a community collaboration.
2.3 Maturing Packages
The packages detailed in this section are worth mentioning because they may apply to your specific project. Further, some features are too good to ignore, so they are highlighted below. However, if your code requires a long‐term shelf life, it may be best to find alternative solutions, as the following packages may change more rapidly than those listed in Section 2.2.
2.3.1 xarray
xarray is a package that borrows heavily from Pandas to organize multidimensional data. Mathematical operations are lightning fast thanks to dimensional and coordinate indexing. Visualization is also easy. xarray is valuable to Earth scientists because it permits opening multiple netCDF files with ease. Interpolation and group operations are also possible.
The xarray syntax can be challenging to newcomers. It can be difficult to wrangle the data into the format needed. Nevertheless, this tool is worth the time investment due to the many features of interest to Earth science.
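To give a feel for the syntax, here is a minimal sketch that builds a small labeled array and then selects and averages by dimension name and coordinate value. The dimension names, coordinates, and data are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up data: 4 daily time steps at 2 latitudes
times = pd.date_range("2020-01-01", periods=4, freq="D")
lats = [10.0, 20.0]
data = np.arange(8, dtype=float).reshape(4, 2)

temps = xr.DataArray(
    data,
    dims=("time", "lat"),
    coords={"time": times, "lat": lats},
    name="temperature",
)

# Select by coordinate value rather than by integer position
at_lat20 = temps.sel(lat=20.0)

# Choose the closest available latitude to a requested value
nearest = temps.sel(lat=18.0, method="nearest")

# Reduce along a named dimension
time_mean = temps.mean(dim="time")

print(at_lat20.values, float(nearest.lat), time_mean.values)
```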
2.3.2 Dask
Dask interfaces with Pandas, Scikit‐Learn, and NumPy to perform parallel processing and out‐of‐memory operations, which read data in chunks so that the full dataset never has to reside in the computer’s RAM at once. This is very useful for working with large datasets. If speed needs to be prioritized, it would be worth learning this package.
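The chunked approach can be sketched with a Dask array, which mirrors much of the NumPy interface. The array size and chunk size below are arbitrary choices for illustration. Nothing is computed until compute() is called.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into 1000 x 1000 chunks
big = da.ones((4000, 4000), chunks=(1000, 1000))

# Operations are recorded lazily; no chunk has been processed yet
total = (big * 2.0).sum()

# compute() processes the chunks (potentially in parallel)
# and combines the partial sums
result = total.compute()

print(big.numblocks, result)
```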
2.3.3 Iris
Iris is a format‐agnostic Python library for analyzing and visualizing Earth science data. If datasets follow the standard CF formatting conventions, Iris can easily load the data. The Iris package has a steep learning curve but can be useful for performing meteorological computations. Like Cartopy, Iris is a package available from the SciTools organization.
2.3.4 MetPy
MetPy is a collection of tools in Python for reading, visualizing, and performing calculations with weather data. MetPy enables downloading a curated collection of remote sensing datasets. Unit conversions are easy to perform, which is helpful when making calculations of meteorological variables. MetPy is maintained by Unidata in Boulder, Colorado.
2.3.5 cfgrib and eccodes
Cfgrib is a useful package for reading GRIB1 and GRIB2 data, which is a common format for