DOC: Merge FAQ and gotcha (rebase of GH13768) #15222

Closed
133 changes: 133 additions & 0 deletions doc/source/advanced.rst
@@ -844,3 +844,136 @@ Of course if you need integer based selection, then use ``iloc``
.. ipython:: python

   dfir.iloc[0:5]

Miscellaneous indexing FAQ
--------------------------

Integer indexing
~~~~~~~~~~~~~~~~

Label-based indexing with integer axis labels is a thorny topic. It has been
discussed heavily on mailing lists and among various members of the scientific
Python community. In pandas, our general viewpoint is that labels matter more
than integer locations. Therefore, with an integer axis index *only*
Member

I would remove "with an integer axis index" -- I think this applies to everything.

Member Author

Yes, you are right. Originally this was a part about ix, and then this made more sense (the ix was replaced by loc after the deprecation)

I didn't really review the content of these parts, only moved them (apart from the comments that were already made on the other PR), but it is good to do that now as well.

label-based indexing is possible with the standard tools like ``.loc``. The
following code will generate exceptions:

.. code-block:: python

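   s = pd.Series(range(5))
   s[-1]        # KeyError: -1 is looked up as a label, and no such label exists
   df = pd.DataFrame(np.random.randn(5, 4))
   df
   df.loc[-2:]  # -2 is likewise treated as a label, not as a position from the end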

This deliberate decision was made to prevent ambiguities and subtle bugs (many
users reported finding bugs when the API change was made to stop "falling back"
on position-based indexing).
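
As a minimal sketch of the distinction (the variable names and the index values are illustrative assumptions, not taken from the original text), ``.loc`` always looks up labels and ``.iloc`` always looks up positions, so a negative position is only meaningful with ``.iloc``:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0..4.
s = pd.Series([10, 20, 30, 40, 50], index=[5, 6, 7, 8, 9])

by_label = s.loc[9]        # label lookup: the row whose label is 9
by_position = s.iloc[-1]   # positional lookup: the last row

try:
    s.loc[-1]              # -1 is not a label in the index
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(by_label, by_position, missing_label_raises)
```

Being explicit about ``.loc`` versus ``.iloc`` sidesteps the ambiguity described above, because neither indexer falls back to the other interpretation.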

Non-monotonic indexes require exact matches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds
of a label-based slice can be outside the range of the index, much like slice indexing a
normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and
``is_monotonic_decreasing`` attributes.

.. ipython:: python

   df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=range(5))
   df.index.is_monotonic_increasing

   # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
   df.loc[0:4, :]

   # both slice bounds are outside the index, so an empty DataFrame is returned
   df.loc[13:15, :]

On the other hand, if the index is not monotonic, then both slice bounds must be
*unique* members of the index.

.. ipython:: python

   df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=['data'], data=range(6))
   df.index.is_monotonic_increasing

   # OK because 2 and 4 are in the index
   df.loc[2:4, :]

.. code-block:: python

   # 0 is not in the index
   In [9]: df.loc[0:4, :]
   KeyError: 0

   # 3 is not a unique label
   In [11]: df.loc[2:3, :]
   KeyError: 'Cannot get right slice bound for non-unique label: 3'
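
One possible workaround, sketched here under the assumption that reordering the rows is acceptable, is to sort the index first. Once the index is monotonic, the slice bounds no longer have to be unique members of it (``df_sorted`` and ``result`` are illustrative names):

```python
import pandas as pd

# Same non-monotonic index as in the example above.
df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=['data'], data=range(6))

# Sorting makes the index monotonically increasing.
df_sorted = df.sort_index()

# 0 is not in the index and 3 is duplicated, but the slice now succeeds.
result = df_sorted.loc[0:4, :]

print(df_sorted.index.is_monotonic_increasing)
print(list(result.index))
```

The relative order of the two rows labeled ``3`` after sorting is not something this sketch relies on.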


Endpoints are inclusive
~~~~~~~~~~~~~~~~~~~~~~~

Compared with standard Python sequence slicing in which the slice endpoint is
not inclusive, label-based slicing in pandas **is inclusive**. The primary
reason for this is that it is often not possible to easily determine the
"successor" or next element after a particular label in an index. For example,
consider the following Series:

.. ipython:: python

   s = pd.Series(np.random.randn(6), index=list('abcdef'))
   s

Suppose we wished to slice from ``c`` to ``e``. Using integers, this would be:

.. ipython:: python

   s[2:5]

However, if you only had ``c`` and ``e``, determining the next element in the
index can be somewhat complicated. For example, the following does not work:

::

   s.loc['c':'e'+1]

A very common use case is to limit a time series to start and end at two
specific dates. To enable this, we made the design decision to make label-based
Member

This is a bad example -- it's easy to calculate a following timestamp!

(To be honest, I am no longer convinced that end inclusive indexing is a good idea. But we are stuck with it for backwards compatibility.)

slicing include both endpoints:

.. ipython:: python

   s.loc['c':'e']

This is most definitely a "practicality beats purity" sort of thing, but it is
something to watch out for if you expect label-based slicing to behave exactly
in the way that standard Python integer slicing works.
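
To make the difference concrete, the following sketch (with illustrative variable names) uses ``Index.slice_indexer`` to show the positional slice that corresponds to an inclusive label slice. The positional stop is one past the position of the end label:

```python
import pandas as pd

s = pd.Series(range(6), index=list('abcdef'))

# Label-based slice: both endpoints 'c' and 'e' are included.
by_label = s.loc['c':'e']

# Translate the label bounds into a positional (half-open) slice.
positional = s.index.slice_indexer('c', 'e')

# Applying that positional slice selects the same rows.
by_position = s.iloc[positional]

print(list(by_label.index))
print(positional)
print(by_label.equals(by_position))
```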


Indexing potentially changes underlying Series dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Different indexing operations can potentially change the dtype of a ``Series``.
Member

Let's clarify this a little more:

"The different indexing operation"
->
"Indexing with Series with a list of index locations that do not exist"

Might be good to cover DataFrame dtype changes with indexing here, too, but I think it's a bit more complicated because of block consolidation.


.. ipython:: python

   series1 = pd.Series([1, 2, 3])
   series1.dtype
   res = series1[[0, 4]]
   res.dtype
   res

.. ipython:: python

   series2 = pd.Series([True])
   series2.dtype
   res = series2.reindex_like(series1)
   res.dtype
   res

This is because the (re)indexing operations above silently insert ``NaNs`` and the ``dtype``
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs``
such as ``numpy.logical_and``.

See `this old issue <https://github.com/pydata/pandas/issues/2388>`__ for a more
detailed discussion.
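
A minimal sketch of the upcasting, and of one way to avoid it: passing ``fill_value`` to ``reindex`` fills the missing labels with a value of the original type, so no ``NaN`` is inserted. The variable names are illustrative, and whether a fill value is appropriate depends on the data:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True])

# Reindexing to labels that are missing inserts NaN, which forces an upcast.
ints_upcast = ints.reindex([0, 4])          # int64 -> float64
bools_upcast = bools.reindex(ints.index)    # bool -> object

# A fill_value of the original type avoids the NaN and therefore the upcast.
ints_kept = ints.reindex([0, 4], fill_value=0)
bools_kept = bools.reindex(ints.index, fill_value=False)

print(ints_upcast.dtype, bools_upcast.dtype)
print(ints_kept.dtype, bools_kept.dtype)
```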
2 changes: 1 addition & 1 deletion doc/source/basics.rst
@@ -1889,7 +1889,7 @@ gotchas

Performing selection operations on ``integer`` type data can easily upcast the data to ``floating``.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0)
See also :ref:`integer na gotchas <gotchas.intna>`
See also :ref:`Support for integer ``NA`` <gotchas.intna>`

.. ipython:: python

7 changes: 4 additions & 3 deletions doc/source/ecosystem.rst
@@ -93,12 +93,13 @@ targets the IPython Notebook environment.

`Plotly’s <https://plot.ly/>`__ `Python API <https://plot.ly/python/>`__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js <http://d3js.org/>`__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn <https://plot.ly/python/matplotlib-to-plotly-tutorial/>`__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks <https://plot.ly/ipython-notebooks/>`__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud <https://plot.ly/product/plans/>`__, `offline <https://plot.ly/python/offline/>`__, or `on-premise <https://plot.ly/product/enterprise/>`__ accounts for private use.

`Pandas-Qt <https://github.com/datalyze-solutions/pandas-qt>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Visualizing Data in Qt applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spun off from the main pandas library, the `Pandas-Qt <https://github.com/datalyze-solutions/pandas-qt>`__
Spun off from the main pandas library, the `qtpandas <https://github.com/draperjames/qtpandas>`__
library enables DataFrame visualization and manipulation in PyQt4 and PySide applications.


.. _ecosystem.ide:

IDE
115 changes: 0 additions & 115 deletions doc/source/faq.rst

This file was deleted.