Commit d4c4682

DOC: refresh "Why xarray" and shorten top-level description (#2657)
* DOC: refresh "Why xarray" and shorten top-level description

  This documentation revamp builds upon rabernat's rewrite in GH2430. The main
  change is that the three paragraph description felt too long to me, so I
  moved the background paragraph on multi-dimensional arrays into the next
  section, on "Why xarray". I also ended up rewriting most of that page, and
  made a few adjustments to the FAQ and related projects pages.

* 'why xarray' in setup.py, too
* Updates per review
1 parent 6795fd0 commit d4c4682

6 files changed

+139
-134
lines changed

README.rst

+25 -59
@@ -9,49 +9,47 @@ xarray: N-D labeled arrays and datasets
99
:target: https://coveralls.io/r/pydata/xarray
1010
.. image:: https://readthedocs.org/projects/xray/badge/?version=latest
1111
:target: http://xarray.pydata.org/
12-
.. image:: https://img.shields.io/pypi/v/xarray.svg
13-
:target: https://pypi.python.org/pypi/xarray/
14-
.. image:: https://zenodo.org/badge/13221727.svg
15-
:target: https://zenodo.org/badge/latestdoi/13221727
1612
.. image:: http://img.shields.io/badge/benchmarked%20by-asv-green.svg?style=flat
1713
:target: http://pandas.pydata.org/speed/xarray/
18-
.. image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A
19-
:target: http://numfocus.org
14+
.. image:: https://img.shields.io/pypi/v/xarray.svg
15+
:target: https://pypi.python.org/pypi/xarray/
2016

2117
**xarray** (formerly **xray**) is an open source project and Python package
2218
that makes working with labelled multi-dimensional arrays simple,
2319
efficient, and fun!
2420

25-
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
26-
"tensors") are an essential part of computational science.
27-
They are encountered in a wide range of fields, including physics, astronomy,
28-
geoscience, bioinformatics, engineering, finance, and deep learning.
29-
In Python, NumPy_ provides the fundamental data structure and API for
30-
working with raw ND arrays.
31-
However, real-world datasets are usually more than just raw numbers;
32-
they have labels which encode information about how the array values map
33-
to locations in space, time, etc.
21+
Xarray introduces labels in the form of dimensions, coordinates and
22+
attributes on top of raw NumPy_-like arrays, which allows for a more
23+
intuitive, more concise, and less error-prone developer experience.
24+
The package includes a large and growing library of domain-agnostic functions
25+
for advanced analytics and visualization with these data structures.
3426

35-
By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
36-
NumPy-like arrays, xarray is able to understand these labels and use them to
37-
provide a more intuitive, more concise, and less error-prone experience.
38-
Xarray also provides a large and growing library of functions for advanced
39-
analytics and visualization with these data structures.
4027
Xarray was inspired by and borrows heavily from pandas_, the popular data
4128
analysis package focused on labelled tabular data.
42-
Xarray can read and write data from most common labeled ND-array storage
43-
formats and is particularly tailored to working with netCDF_ files, which were
44-
the source of xarray's data model.
29+
It is particularly tailored to working with netCDF_ files, which were the
30+
source of xarray's data model, and integrates tightly with dask_ for parallel
31+
computing.
4532

46-
.. _NumPy: http://www.numpy.org/
33+
.. _NumPy: http://www.numpy.org
4734
.. _pandas: http://pandas.pydata.org
35+
.. _dask: http://dask.org
4836
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
4937

5038
Why xarray?
5139
-----------
5240

53-
Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many
54-
powerful array operations possible:
41+
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
42+
"tensors") are an essential part of computational science.
43+
They are encountered in a wide range of fields, including physics, astronomy,
44+
geoscience, bioinformatics, engineering, finance, and deep learning.
45+
In Python, NumPy_ provides the fundamental data structure and API for
46+
working with raw ND arrays.
47+
However, real-world datasets are usually more than just raw numbers;
48+
they have labels which encode information about how the array values map
49+
to locations in space, time, etc.
50+
51+
Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
52+
powerful and concise interface. For example:
5553

5654
- Apply operations over dimensions by name: ``x.sum('time')``.
5755
- Select values by label instead of integer location:
@@ -65,42 +63,10 @@ powerful array operations possible:
6563
- Keep track of arbitrary metadata in the form of a Python dictionary:
6664
``x.attrs``.
6765

68-
pandas_ provides many of these features, but it does not make use of dimension
69-
names, and its core data structures are fixed dimensional arrays.
70-
71-
Why isn't pandas enough?
72-
------------------------
73-
74-
pandas_ excels at working with tabular data. That suffices for many statistical
75-
analyses, but physical scientists rely on N-dimensional arrays -- which is
76-
where xarray comes in.
77-
78-
xarray aims to provide a data analysis toolkit as powerful as pandas_ but
79-
designed for working with homogeneous N-dimensional arrays
80-
instead of tabular data. When possible, we copy the pandas API and rely on
81-
pandas's highly optimized internals (in particular, for fast indexing).
82-
83-
Why netCDF?
84-
-----------
85-
86-
Because xarray implements the same data model as the netCDF_ file format,
87-
xarray datasets have a natural and portable serialization format. But it is also
88-
easy to robustly convert an xarray ``DataArray`` to and from a numpy ``ndarray``
89-
or a pandas ``DataFrame`` or ``Series``, providing compatibility with the full
90-
`PyData ecosystem <http://pydata.org/>`__.
91-
92-
Our target audience is anyone who needs N-dimensional labeled arrays, but we
93-
are particularly focused on the data analysis needs of physical scientists --
94-
especially geoscientists who already know and love netCDF_.
95-
96-
.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
97-
.. _pandas: http://pandas.pydata.org
98-
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
99-
10066
Documentation
10167
-------------
10268

103-
The official documentation is hosted on ReadTheDocs at http://xarray.pydata.org/
69+
Learn more about xarray in its official documentation at http://xarray.pydata.org/
10470

10571
Contributing
10672
------------

doc/faq.rst

+8 -6
@@ -18,8 +18,9 @@ pandas is a fantastic library for analysis of low-dimensional labelled data -
1818
if it can be sensibly described as "rows and columns", pandas is probably the
1919
right choice. However, sometimes we want to use higher dimensional arrays
2020
(`ndim > 2`), or arrays for which the order of dimensions (e.g., columns vs
21-
rows) shouldn't really matter. For example, climate and weather data is often
22-
natively expressed in 4 or more dimensions: time, x, y and z.
21+
rows) shouldn't really matter. For example, the images of a movie can be
22+
natively represented as an array with four dimensions: time, row, column and
23+
color.
2324

2425
Pandas has historically supported N-dimensional panels, but deprecated them in
2526
version 0.20 in favor of Xarray data structures. There are now built-in methods
@@ -39,9 +40,8 @@ if you were using Panels:
3940
xarray ``Dataset``.
4041

4142
You can :ref:`read about switching from Panels to Xarray here <panel transition>`.
42-
Pandas gets a lot of things right, but scientific users need fully multi-
43-
dimensional data structures.
44-
43+
Pandas gets a lot of things right, but many science, engineering and complex
44+
analytics use cases need fully multi-dimensional data structures.
4545

4646
How do xarray data structures differ from those found in pandas?
4747
----------------------------------------------------------------
@@ -65,7 +65,9 @@ multi-dimensional data-structures.
6565

6666
That said, you should only bother with xarray if some aspect of data is
6767
fundamentally multi-dimensional. If your data is unstructured or
68-
one-dimensional, stick with pandas.
68+
one-dimensional, pandas is usually the right choice: it has better performance
69+
for common operations such as ``groupby`` and you'll find far more usage
70+
examples online.
6971

7072

7173
Why don't aggregations return Python scalars?

doc/index.rst

+11 -19
@@ -5,29 +5,21 @@ xarray: N-D labeled arrays and datasets in Python
55
that makes working with labelled multi-dimensional arrays simple,
66
efficient, and fun!
77

8-
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
9-
"tensors") are an essential part of computational science.
10-
They are encountered in a wide range of fields, including physics, astronomy,
11-
geoscience, bioinformatics, engineering, finance, and deep learning.
12-
In Python, NumPy_ provides the fundamental data structure and API for
13-
working with raw ND arrays.
14-
However, real-world datasets are usually more than just raw numbers;
15-
they have labels which encode information about how the array values map
16-
to locations in space, time, etc.
17-
18-
By introducing *dimensions*, *coordinates*, and *attributes* on top of raw
19-
NumPy-like arrays, xarray is able to understand these labels and use them to
20-
provide a more intuitive, more concise, and less error-prone experience.
21-
Xarray also provides a large and growing library of functions for advanced
22-
analytics and visualization with these data structures.
8+
Xarray introduces labels in the form of dimensions, coordinates and
9+
attributes on top of raw NumPy_-like arrays, which allows for a more
10+
intuitive, more concise, and less error-prone developer experience.
11+
The package includes a large and growing library of domain-agnostic functions
12+
for advanced analytics and visualization with these data structures.
13+
2314
Xarray was inspired by and borrows heavily from pandas_, the popular data
2415
analysis package focused on labelled tabular data.
25-
Xarray can read and write data from most common labeled ND-array storage
26-
formats and is particularly tailored to working with netCDF_ files, which were
27-
the source of xarray's data model.
16+
It is particularly tailored to working with netCDF_ files, which were the
17+
source of xarray's data model, and integrates tightly with dask_ for parallel
18+
computing.
2819

29-
.. _NumPy: http://www.numpy.org/
20+
.. _NumPy: http://www.numpy.org
3021
.. _pandas: http://pandas.pydata.org
22+
.. _dask: http://dask.org
3123
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
3224

3325
Documentation

doc/related-projects.rst

+12 -6
@@ -3,7 +3,7 @@
33
Xarray related projects
44
-----------------------
55

6-
Here below is a list of several existing libraries that build
6+
Here below is a list of existing open source projects that build
77
functionality upon xarray. See also section :ref:`internals` for more
88
details on how to build xarray extensions.
99

@@ -39,11 +39,16 @@ Geosciences
3939

4040
Machine Learning
4141
~~~~~~~~~~~~~~~~
42-
- `cesium <http://cesium-ml.org/>`_: machine learning for time series analysis
42+
- `ArviZ <https://arviz-devs.github.io/arviz/>`_: Exploratory analysis of Bayesian models, built on top of xarray.
4343
- `Elm <https://ensemble-learning-models.readthedocs.io>`_: Parallel machine learning on xarray data structures
4444
- `sklearn-xarray (1) <https://phausamann.github.io/sklearn-xarray>`_: Combines scikit-learn and xarray (1).
4545
- `sklearn-xarray (2) <https://sklearn-xarray.readthedocs.io/en/latest/>`_: Combines scikit-learn and xarray (2).
4646

47+
Other domains
48+
~~~~~~~~~~~~~
49+
- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
50+
- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python
51+
4752
Extend xarray capabilities
4853
~~~~~~~~~~~~~~~~~~~~~~~~~~
4954
- `Collocate <https://github.com/cistools/collocate>`_: Collocate xarray trajectories in arbitrary physical dimensions
@@ -61,9 +66,10 @@ Visualization
6166
- `hvplot <https://hvplot.pyviz.org/>`_ : A high-level plotting API for the PyData ecosystem built on HoloViews.
6267
- `psyplot <https://psyplot.readthedocs.io>`_: Interactive data visualization with python.
6368

64-
Other
65-
~~~~~
66-
- `ptsa <https://pennmem.github.io/ptsa_new/html/index.html>`_: EEG Time Series Analysis
67-
- `pycalphad <https://pycalphad.org/docs/latest/>`_: Computational Thermodynamics in Python
69+
Non-Python projects
70+
~~~~~~~~~~~~~~~~~~~
71+
- `xframe <https://github.com/QuantStack/xframe>`_: C++ data structures inspired by xarray.
72+
- `AxisArrays <https://github.com/JuliaArrays/AxisArrays.jl>`_ and
73+
`NamedArrays <https://github.com/davidavdav/NamedArrays.jl>`_: similar data structures for Julia.
6874

6975
More projects can be found at the `"xarray" Github topic <https://github.com/topics/xarray>`_.

doc/why-xarray.rst

+48 -28
@@ -1,11 +1,21 @@
11
Overview: Why xarray?
22
=====================
33

4-
Features
5-
--------
6-
7-
Adding dimensions names and coordinate indexes to numpy's ndarray_ makes many
8-
powerful array operations possible:
4+
What labels enable
5+
------------------
6+
7+
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called
8+
"tensors") are an essential part of computational science.
9+
They are encountered in a wide range of fields, including physics, astronomy,
10+
geoscience, bioinformatics, engineering, finance, and deep learning.
11+
In Python, NumPy_ provides the fundamental data structure and API for
12+
working with raw ND arrays.
13+
However, real-world datasets are usually more than just raw numbers;
14+
they have labels which encode information about how the array values map
15+
to locations in space, time, etc.
16+
17+
Xarray doesn't just keep track of labels on arrays -- it uses them to provide a
18+
powerful and concise interface. For example:
919

1020
- Apply operations over dimensions by name: ``x.sum('time')``.
1121
- Select values by label instead of integer location:
@@ -19,20 +29,22 @@ powerful array operations possible:
1929
- Keep track of arbitrary metadata in the form of a Python dictionary:
2030
``x.attrs``.
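
As a minimal sketch (an illustration only, not part of this commit), the label-based operations in the list above might look like the following. The array values, the integer year labels, and the ``units`` attribute are assumptions made up for the example; ``sum``, ``sel``, ``isel`` and ``attrs`` are the xarray features being shown.

```python
import numpy as np
import xarray as xr

# Hypothetical data: one value per year, labelled along a "time" dimension.
x = xr.DataArray(
    np.array([1.0, 2.0, 3.0]),
    dims="time",
    coords={"time": [2014, 2015, 2016]},
    attrs={"units": "K"},
)

# Apply an operation over a dimension by name.
total = x.sum("time")

# Select by coordinate label, compared with selecting by integer position.
by_label = x.sel(time=2015)
by_position = x.isel(time=1)

# Arbitrary metadata lives in a plain Python dictionary.
units = x.attrs["units"]
```

Here ``by_label`` and ``by_position`` pick out the same element; the label-based form keeps working if the data is later reordered or subset.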
2131

22-
pandas_ provides many of these features, but it does not make use of dimension
23-
names, and its core data structures are fixed dimensional arrays.
24-
2532
The N-dimensional nature of xarray's data structures makes it suitable for dealing
2633
with multi-dimensional scientific data, and its use of dimension names
2734
instead of axis labels (``dim='time'`` instead of ``axis=0``) makes such
2835
arrays much more manageable than the raw numpy ndarray: with xarray, you don't
2936
need to keep track of the order of array dimensions or insert dummy dimensions
3037
(e.g., ``np.newaxis``) to align arrays.
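
The contrast drawn in this paragraph can be sketched as follows (an illustration only, not part of this commit). The shapes, the dimension names ``time`` and ``space``, and the offset values are assumptions chosen for the example.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 3 time steps by 4 spatial points.
values = np.arange(12.0).reshape(3, 4)
x = xr.DataArray(values, dims=("time", "space"))

# Reduce over a dimension by name instead of by axis position.
total_by_name = x.sum(dim="time")
total_by_axis = values.sum(axis=0)

# Add a per-time offset. xarray aligns the operands by dimension name,
# while NumPy needs a dummy axis inserted by hand.
offset = np.array([10.0, 20.0, 30.0])
shifted = x + xr.DataArray(offset, dims="time")
shifted_np = values + offset[:, np.newaxis]
```

Both routes give the same numbers; the xarray version simply does not depend on knowing that ``time`` happens to be axis 0.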
3138

39+
The immediate payoff of using xarray is that you'll write less code. The
40+
long-term payoff is that you'll understand what you were thinking when you come
41+
back to look at it weeks or months later.
42+
3243
Core data structures
3344
--------------------
3445

35-
xarray has two core data structures. Both are fundamentally N-dimensional:
46+
xarray has two core data structures, which build upon and extend the core
47+
strengths of NumPy_ and pandas_. Both are fundamentally N-dimensional:
3648

3749
- :py:class:`~xarray.DataArray` is our implementation of a labeled, N-dimensional
3850
array. It is an N-D generalization of a :py:class:`pandas.Series`. The name
@@ -43,8 +55,6 @@ xarray has two core data structures. Both are fundamentally N-dimensional:
4355
shared dimensions, and serves a similar purpose in xarray to the
4456
:py:class:`pandas.DataFrame`.
4557

46-
.. _datarray: https://github.com/fperez/datarray
47-
4858
The value of attaching labels to numpy's :py:class:`numpy.ndarray` may be
4959
fairly obvious, but the dataset may need more motivation.
5060

@@ -69,23 +79,33 @@ metadata once, not every time you save a file.
6979
Goals and aspirations
7080
---------------------
7181

72-
pandas_ excels at working with tabular data. That suffices for many statistical
73-
analyses, but physical scientists rely on N-dimensional arrays -- which is
74-
where xarray comes in.
82+
Xarray contributes domain-agnostic data-structures and tools for labeled
83+
multi-dimensional arrays to Python's SciPy_ ecosystem for numerical computing.
84+
In particular, xarray builds upon and integrates with NumPy_ and pandas_:
85+
86+
- Our user-facing interfaces aim to be more explicit versions of those found in
87+
NumPy/pandas.
88+
- Compatibility with the broader ecosystem is a major goal: it should be easy
89+
to get your data in and out.
90+
- We try to keep a tight focus on functionality and interfaces related to
91+
labeled data, and leverage other Python libraries for everything else, e.g.,
92+
NumPy/pandas for fast arrays/indexing (xarray itself contains no compiled
93+
code), Dask_ for parallel computing, matplotlib_ for plotting, etc.
94+
95+
Xarray is a collaborative and community driven project, run entirely on
96+
volunteer effort (see :ref:`contributing`).
97+
Our target audience is anyone who needs N-dimensional labeled arrays in Python.
98+
Originally, development was driven by the data analysis needs of physical
99+
scientists (especially geoscientists who already know and love
100+
netCDF_), but it has become a much more broadly useful tool, and is still
101+
under active development.
102+
See our technical :ref:`roadmap` for more details, and feel free to reach out
103+
with questions about whether xarray is the right tool for your needs.
75104

76-
xarray aims to provide a data analysis toolkit as powerful as pandas_ but
77-
designed for working with homogeneous N-dimensional arrays
78-
instead of tabular data. When possible, we copy the pandas API and rely on
79-
pandas's highly optimized internals (in particular, for fast indexing).
80-
81-
Importantly, xarray has robust support for converting its objects to and
82-
from a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing
83-
compatibility with the full `PyData ecosystem <http://pydata.org/>`__.
84-
85-
Our target audience is anyone who needs N-dimensional labeled arrays, but we
86-
are particularly focused on the data analysis needs of physical scientists --
87-
especially geoscientists who already know and love netCDF_.
88-
89-
.. _ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
105+
.. _datarray: https://github.com/fperez/datarray
106+
.. _Dask: http://dask.org
107+
.. _matplotlib: http://matplotlib.org
90108
.. _netCDF: http://www.unidata.ucar.edu/software/netcdf
109+
.. _NumPy: http://www.numpy.org
91110
.. _pandas: http://pandas.pydata.org
111+
.. _SciPy: http://www.scipy.org
