Commit 2619ee3

jorisvandenbossche authored and jreback committed
DOC: Merge FAQ and gotcha (rebase of GH13768)
Rebase and clean-up of #13768

closes #9809

Author: Joris Van den Bossche <[email protected]>
Author: sinhrks <[email protected]>

Closes #15222 from jorisvandenbossche/pr/13768 and squashes the following commits:

7abb65b [Joris Van den Bossche] Make 'indexing may change dtype' more general
53a6970 [Joris Van den Bossche] Move HTML libraries gotchas to html io docs
7185dd4 [Joris Van den Bossche] Keep original gotchas label for references
c9e41cc [Joris Van den Bossche] Redo updates after ix deprecation
ab7fdf0 [Joris Van den Bossche] restore file name
f5e0af0 [sinhrks] DOC: Merge FAQ and gotcha
1 parent 7277459 commit 2619ee3

7 files changed

+299
-374
lines changed


doc/source/advanced.rst

+133
@@ -844,3 +844,136 @@ Of course if you need integer based selection, then use ``iloc``
844844
.. ipython:: python
845845
846846
dfir.iloc[0:5]
847+
848+
Miscellaneous indexing FAQ
849+
--------------------------
850+
851+
Integer indexing
852+
~~~~~~~~~~~~~~~~
853+
854+
Label-based indexing with integer axis labels is a thorny topic. It has been
855+
discussed heavily on mailing lists and among various members of the scientific
856+
Python community. In pandas, our general viewpoint is that labels matter more
857+
than integer locations. Therefore, with an integer axis index *only*
858+
label-based indexing is possible with the standard tools like ``.loc``. The
859+
following code will generate exceptions:
860+
861+
.. code-block:: python
862+
863+
s = pd.Series(range(5))
864+
s[-1]
865+
df = pd.DataFrame(np.random.randn(5, 4))
866+
df
867+
df.loc[-2:]
868+
869+
This deliberate decision was made to prevent ambiguities and subtle bugs (many
870+
users reported finding bugs when the API change was made to stop "falling back"
871+
on position-based indexing).
872+
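
As a rough illustration of the point above (not part of the commit itself), the sketch below assumes a ``Series`` with the default integer index. Plain ``[]`` with ``-1`` is treated as a label lookup and fails, while ``.iloc`` is the explicit positional accessor. The variable names are made up for the example.

```python
import pandas as pd

s = pd.Series(range(5))  # default integer index with labels 0..4

# -1 is looked up as a *label*; there is no such label, so this raises.
try:
    s[-1]
    label_lookup_raised = False
except KeyError:
    label_lookup_raised = True

# Positional access has to be requested explicitly with .iloc.
last_by_position = s.iloc[-1]

print(label_lookup_raised, last_by_position)
```

Using ``.iloc`` for positions and ``.loc`` for labels keeps the intent unambiguous when the axis labels are themselves integers.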
873+
Non-monotonic indexes require exact matches
874+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
875+
876+
If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds
877+
of a label-based slice can be outside the range of the index, much like slice indexing a
878+
normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and
879+
``is_monotonic_decreasing`` attributes.
880+
881+
.. ipython:: python
882+
883+
df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=range(5))
884+
df.index.is_monotonic_increasing
885+
886+
# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
887+
df.loc[0:4, :]
888+
889+
# slice is outside the index, so an empty DataFrame is returned
890+
df.loc[13:15, :]
891+
892+
On the other hand, if the index is not monotonic, then both slice bounds must be
893+
*unique* members of the index.
894+
895+
.. ipython:: python
896+
897+
df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=range(6))
898+
df.index.is_monotonic_increasing
899+
900+
# OK because 2 and 4 are in the index
901+
df.loc[2:4, :]
902+
903+
.. code-block:: python
904+
905+
# 0 is not in the index
906+
In [9]: df.loc[0:4, :]
907+
KeyError: 0
908+
909+
# 3 is not a unique label
910+
In [11]: df.loc[2:3, :]
911+
KeyError: 'Cannot get right slice bound for non-unique label: 3'
912+
913+
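
The two cases above can be put side by side in one self-contained sketch. It reuses the index values from the examples in this section; the helper ``slice_raises`` is hypothetical and exists only to make the outcome easy to check.

```python
import pandas as pd

# Monotonic (but non-unique) index: slice bounds may fall outside the index.
mono = pd.DataFrame({"data": range(5)}, index=[2, 3, 3, 4, 5])
within = mono.loc[0:4]      # labels 2, 3, 3, 4
outside = mono.loc[13:15]   # nothing in range, so the result is empty

# Non-monotonic index: both bounds must be unique members of the index.
non_mono = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])
ok = non_mono.loc[2:4]      # 2 and 4 each appear exactly once


def slice_raises(frame, start, stop):
    """Hypothetical helper: report whether a label slice raises KeyError."""
    try:
        frame.loc[start:stop]
    except KeyError:
        return True
    return False


missing_bound = slice_raises(non_mono, 0, 4)    # 0 is not in the index
duplicate_bound = slice_raises(non_mono, 2, 3)  # 3 appears twice

print(list(within.index), len(outside), list(ok.index))
print(missing_bound, duplicate_bound)
```

If a non-monotonic index is not required, sorting it first (for example with ``sort_index()``) restores the more forgiving slicing behaviour.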
914+
Endpoints are inclusive
915+
~~~~~~~~~~~~~~~~~~~~~~~
916+
917+
Compared with standard Python sequence slicing in which the slice endpoint is
918+
not inclusive, label-based slicing in pandas **is inclusive**. The primary
919+
reason for this is that it is often not possible to easily determine the
920+
"successor" or next element after a particular label in an index. For example,
921+
consider the following Series:
922+
923+
.. ipython:: python
924+
925+
s = pd.Series(np.random.randn(6), index=list('abcdef'))
926+
s
927+
928+
Suppose we wished to slice from ``c`` to ``e``. Using integers, this would be:
929+
930+
.. ipython:: python
931+
932+
s[2:5]
933+
934+
However, if you only had ``c`` and ``e``, determining the next element in the
935+
index can be somewhat complicated. For example, the following does not work:
936+
937+
::
938+
939+
s.loc['c':'e'+1]
940+
941+
A very common use case is to limit a time series to start and end at two
942+
specific dates. To enable this, we made the design decision to make label-based
943+
slicing include both endpoints:
944+
945+
.. ipython:: python
946+
947+
s.loc['c':'e']
948+
949+
This is most definitely a "practicality beats purity" sort of thing, but it is
950+
something to watch out for if you expect label-based slicing to behave exactly
951+
in the way that standard Python integer slicing works.
952+
953+
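
A compact way to see the difference, as a sketch with made-up data: a label slice keeps both endpoints, while the positional slice needs a stop one past the last wanted element. The date range and variable names below are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series(range(6), index=list("abcdef"))

by_label = s.loc["c":"e"]    # both 'c' and 'e' are included
by_position = s.iloc[2:5]    # stop position 5 is excluded

# The time series use case mentioned above: both dates are kept.
ts = pd.Series(range(10), index=pd.date_range("2017-01-01", periods=10))
window = ts.loc["2017-01-03":"2017-01-05"]

print(list(by_label.index), by_label.equals(by_position), len(window))
```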
954+
Indexing potentially changes underlying Series dtype
955+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
956+
957+
Different indexing operations can potentially change the dtype of a ``Series``.
958+
959+
.. ipython:: python
960+
961+
series1 = pd.Series([1, 2, 3])
962+
series1.dtype
963+
res = series1[[0,4]]
964+
res.dtype
965+
res
966+
967+
.. ipython:: python
968+
series2 = pd.Series([True])
969+
series2.dtype
970+
res = series2.reindex_like(series1)
971+
res.dtype
972+
res
973+
974+
This is because the (re)indexing operations above silently insert ``NaNs`` and the ``dtype``
975+
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs``
976+
such as ``numpy.logical_and``.
977+
978+
See `this old issue <https://github.com/pydata/pandas/issues/2388>`__ for a more
979+
detailed discussion.
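
The dtype change can be checked directly. This is a minimal sketch rather than the commit's own example: it uses ``reindex`` for the missing-label case, because recent pandas releases raise ``KeyError`` when a list passed to ``[]`` contains labels that are not in the index.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])
original_dtype = series1.dtype

# Label 4 does not exist, so a NaN is inserted and the integers become floats.
res = series1.reindex([0, 4])

# A boolean Series gains NaNs for labels 1 and 2 and falls back to object.
series2 = pd.Series([True])
res_bool = series2.reindex_like(series1)

print(original_dtype, res.dtype, res_bool.dtype)
```

Checking ``dtype`` after a reindexing step is a cheap way to catch this before passing the result to a ``numpy`` ufunc.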

doc/source/basics.rst

+1-1
@@ -1889,7 +1889,7 @@ gotchas
18891889

18901890
Performing selection operations on ``integer`` type data can easily upcast the data to ``floating``.
18911891
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0)
1892-
See also :ref:`integer na gotchas <gotchas.intna>`
1892+
See also :ref:`Support for integer ``NA`` <gotchas.intna>`
18931893

18941894
.. ipython:: python
18951895

doc/source/ecosystem.rst

+4-3
@@ -93,12 +93,13 @@ targets the IPython Notebook environment.
9393

9494
`Plotly’s <https://plot.ly/>`__ `Python API <https://plot.ly/python/>`__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js <http://d3js.org/>`__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn <https://plot.ly/python/matplotlib-to-plotly-tutorial/>`__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks <https://plot.ly/ipython-notebooks/>`__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud <https://plot.ly/product/plans/>`__, `offline <https://plot.ly/python/offline/>`__, or `on-premise <https://plot.ly/product/enterprise/>`__ accounts for private use.
9595

96-
`Pandas-Qt <https://github.com/datalyze-solutions/pandas-qt>`__
97-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
96+
Visualizing Data in Qt applications
97+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9898

99-
Spun off from the main pandas library, the `Pandas-Qt <https://github.com/datalyze-solutions/pandas-qt>`__
99+
Spun off from the main pandas library, the `qtpandas <https://github.com/draperjames/qtpandas>`__
100100
library enables DataFrame visualization and manipulation in PyQt4 and PySide applications.
101101

102+
102103
.. _ecosystem.ide:
103104

104105
IDE

doc/source/faq.rst

-115
This file was deleted.
