DOC: Merge FAQ and gotcha (rebase of GH13768) #15222
Changes from all commits
f5e0af0
ab7fdf0
c9e41cc
7185dd4
53a6970
7abb65b
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -844,3 +844,136 @@ Of course if you need integer based selection, then use ``iloc`` | |
.. ipython:: python | ||
|
||
dfir.iloc[0:5] | ||
|
||
Miscellaneous indexing FAQ | ||
-------------------------- | ||
|
||
Integer indexing | ||
~~~~~~~~~~~~~~~~ | ||
|
||
Label-based indexing with integer axis labels is a thorny topic. It has been | ||
discussed heavily on mailing lists and among various members of the scientific | ||
Python community. In pandas, our general viewpoint is that labels matter more | ||
than integer locations. Therefore, with an integer axis index, *only* | ||
label-based indexing is possible with the standard tools like ``.loc``. In the | ||
following code, ``s[-1]`` raises a ``KeyError`` because ``-1`` is not a label, | ||
and ``df.loc[-2:]`` is a label slice, not a request for the last two rows: | ||
|
||
.. code-block:: python | ||
|
||
s = pd.Series(range(5)) | ||
s[-1]  # KeyError: -1 is not a label in the index | ||
df = pd.DataFrame(np.random.randn(5, 4)) | ||
df | ||
df.loc[-2:]  # label slice starting at -2: returns all five rows | ||
|
||
This deliberate decision was made to prevent ambiguities and subtle bugs (many | ||
users reported finding bugs when the API change was made to stop "falling back" | ||
on position-based indexing). | ||
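The distinction can be sketched as follows. This is a minimal illustration, assuming a reasonably recent pandas release; the variable names are made up for the example. Positional access has to be requested explicitly with ``.iloc``, while plain ``[]`` and ``.loc`` treat integers as labels when the index itself holds integers.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))

# Positional access must be requested explicitly with .iloc.
last_value = s.iloc[-1]

# Plain [] on an integer index is label-based, and -1 is not a label.
try:
    s[-1]
    raised = False
except KeyError:
    raised = True

df = pd.DataFrame(np.random.randn(5, 4))

# .iloc[-2:] selects the last two rows by position.
last_two = df.iloc[-2:]

# .loc[-2:] is a label slice starting at label -2; every label in the
# index (0..4) is >= -2, so all five rows come back.
from_label = df.loc[-2:]
```

If the last two rows are what you want, ``.iloc`` states that intent unambiguously regardless of what the index contains.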
|
||
Non-monotonic indexes require exact matches | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds | ||
of a label-based slice can be outside the range of the index, much like slice indexing a | ||
normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and | ||
``is_monotonic_decreasing`` attributes. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=range(5)) | ||
df.index.is_monotonic_increasing | ||
|
||
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: | ||
df.loc[0:4, :] | ||
|
||
# the slice bounds are outside the index, so an empty DataFrame is returned | ||
df.loc[13:15, :] | ||
|
||
On the other hand, if the index is not monotonic, then both slice bounds must be | ||
*unique* members of the index. | ||
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=range(6)) | ||
df.index.is_monotonic_increasing | ||
|
||
# OK because 2 and 4 are in the index | ||
df.loc[2:4, :] | ||
|
||
.. code-block:: python | ||
|
||
# 0 is not in the index | ||
In [9]: df.loc[0:4, :] | ||
KeyError: 0 | ||
|
||
# 3 is not a unique label | ||
In [11]: df.loc[2:3, :] | ||
KeyError: 'Cannot get right slice bound for non-unique label: 3' | ||
|
||
|
||
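One possible way around the exact-match requirement, sketched here as an assumption rather than as a recommendation from the original text, is to sort the index first. A sorted index is monotonic even when it contains duplicates, so slice bounds no longer need to be present or unique.

```python
import pandas as pd

df = pd.DataFrame({'data': range(6)}, index=[2, 3, 1, 4, 3, 5])

# Sorting the index makes it monotonically increasing (duplicates allowed).
sorted_df = df.sort_index()
is_monotonic = sorted_df.index.is_monotonic_increasing

# 0 is not a label and 3 is duplicated, yet both slices now succeed.
from_zero = sorted_df.loc[0:4, :]
up_to_three = sorted_df.loc[2:3, :]
```

Note that sorting changes the row order, so this only helps when the original ordering does not carry meaning.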
Endpoints are inclusive | ||
~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Compared with standard Python sequence slicing in which the slice endpoint is | ||
not inclusive, label-based slicing in pandas **is inclusive**. The primary | ||
reason for this is that it is often not possible to easily determine the | ||
"successor" or next element after a particular label in an index. For example, | ||
consider the following Series: | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series(np.random.randn(6), index=list('abcdef')) | ||
s | ||
|
||
Suppose we wished to slice from ``c`` to ``e``. Using integers, this would be: | ||
|
||
.. ipython:: python | ||
|
||
s[2:5] | ||
|
||
However, if you only had ``c`` and ``e``, determining the next element in the | ||
index can be somewhat complicated. For example, the following does not work: | ||
|
||
:: | ||
|
||
s.loc['c':'e'+1] | ||
|
||
A very common use case is to limit a time series to start and end at two | ||
specific dates. To enable this, we made the design decision to make label-based | ||
Review comment: This is a bad example -- it's easy to calculate a following timestamp! (To be honest, I am no longer convinced that end inclusive indexing is a good idea. But we are stuck with it for backwards compatibility.) |
||
slicing include both endpoints: | ||
|
||
.. ipython:: python | ||
|
||
s.loc['c':'e'] | ||
|
||
This is most definitely a "practicality beats purity" sort of thing, but it is | ||
something to watch out for if you expect label-based slicing to behave exactly | ||
in the way that standard Python integer slicing works. | ||
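The contrast between the two slicing conventions can be sketched side by side. This is an illustrative example only; the date-indexed series ``ts`` and the chosen dates are assumptions made for the sketch.

```python
import pandas as pd

s = pd.Series(range(6), index=list('abcdef'))

# Positional slice: the stop position 5 is excluded.
by_position = s.iloc[2:5]

# Label slice: the stop label 'e' is included.
by_label = s.loc['c':'e']

# Both select the labels 'c', 'd' and 'e'.
same = by_position.equals(by_label)

# The same rule applies to dates: both endpoints fall inside the result.
ts = pd.Series(range(10), index=pd.date_range('2017-01-01', periods=10))
window = ts.loc['2017-01-03':'2017-01-05']
```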
|
||
|
||
Indexing potentially changes underlying Series dtype | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Different indexing operations can potentially change the dtype of a ``Series``. | ||
Review comment: Let's clarify this is a little more: "The different indexing operation" Might be good to cover DataFrame dtype changes with indexing here, too, but I think it's a bit more complicated because of block consolidation. |
||
|
||
.. ipython:: python | ||
|
||
series1 = pd.Series([1, 2, 3]) | ||
series1.dtype | ||
res = series1.reindex([0, 4]) | ||
res.dtype | ||
res | ||
|
||
.. ipython:: python | ||
|
||
series2 = pd.Series([True]) | ||
series2.dtype | ||
res = series2.reindex_like(series1) | ||
res.dtype | ||
res | ||
|
||
This is because the (re)indexing operations above silently insert ``NaNs`` and the ``dtype`` | ||
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` | ||
such as ``numpy.logical_and``. | ||
|
||
See `this old issue <https://github.com/pydata/pandas/issues/2388>`__ for a more | ||
detailed discussion. |
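Where the upcast is unwanted, two options can be sketched. Both are assumptions beyond the original text: the ``fill_value`` argument of ``reindex`` avoids inserting ``NaN`` at all, and the nullable ``Int64`` extension dtype (available only in newer pandas releases) can hold a missing value without switching to float.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])

# Label 4 is missing, so NaN is inserted and the integers become floats.
upcast = series1.reindex([0, 4])

# Supplying an integer fill value avoids NaN, so the integer dtype is kept.
filled = series1.reindex([0, 4], fill_value=0)

# A nullable integer dtype represents the missing entry as pd.NA instead.
nullable = series1.astype('Int64').reindex([0, 4])
```

Which option fits depends on whether a sentinel such as ``0`` is acceptable in the data or the missing entry has to stay visibly missing.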
This file was deleted.
Review comment: I would remove "with an integer axis index" -- I think this applies to everything.

Reply: Yes, you are right. Originally this was a part about ``ix``, and then this made more sense (the ``ix`` was replaced by ``loc`` after the deprecation).

I didn't really review the content of these parts, only moved them (apart from the comments that were already made on the other PR), but good to do that now as well.