Commit 9991579

ENH: Intervalindex
closes #7640
closes #8625

reprise of #8707

Author: Jeff Reback <[email protected]>
Author: Stephan Hoyer <[email protected]>

Closes #15309 from jreback/intervalindex and squashes the following commits:

11ab1e1 [Jeff Reback] merge conflicts
834df76 [Jeff Reback] more docs
fbc1cf8 [Jeff Reback] doc example and bug
7577335 [Jeff Reback] fixup on merge of changes in algorithms.py
3a3e02e [Jeff Reback] sorting example
4333937 [Jeff Reback] api-types test fixing
f0e3ad2 [Jeff Reback] pep
b2d26eb [Jeff Reback] more docs
e5f8082 [Jeff Reback] allow pd.cut to take an IntervalIndex for bins
4a5ebea [Jeff Reback] more tests & fixes for non-unique / overlaps rename _is_contained_in -> contains add sorting test
340c98b [Jeff Reback] CLN/COMPAT: IntervalIndex
74162aa [Stephan Hoyer] API/ENH: IntervalIndex
1 parent 3fde134 commit 9991579

54 files changed, +4195 -504 lines changed

asv_bench/benchmarks/indexing.py

+20
@@ -226,6 +226,26 @@ def time_is_monotonic(self):
         self.miint.is_monotonic


+class IntervalIndexing(object):
+    goal_time = 0.2
+
+    def setup(self):
+        self.monotonic = Series(np.arange(1000000),
+                                index=IntervalIndex.from_breaks(np.arange(1000001)))
+
+    def time_getitem_scalar(self):
+        self.monotonic[80000]
+
+    def time_loc_scalar(self):
+        self.monotonic.loc[80000]
+
+    def time_getitem_list(self):
+        self.monotonic[80000:]
+
+    def time_loc_list(self):
+        self.monotonic.loc[80000:]
+
+
 class PanelIndexing(object):
     goal_time = 0.2

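The benchmark above times scalar and slice lookups on a `Series` backed by a monotonic `IntervalIndex`. A minimal standalone sketch of the scalar case follows, outside the asv harness. The smaller size `N` and the use of `timeit` are assumptions made for illustration, not part of the commit.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller N than the benchmark's 1,000,000 keeps this sketch quick.
N = 100_000

# Monotonic, non-overlapping, right-closed intervals: (0, 1], (1, 2], ...
index = pd.IntervalIndex.from_breaks(np.arange(N + 1))
monotonic = pd.Series(np.arange(N), index=index)

# Scalar label lookup: 80000 is the right edge of (79999, 80000],
# which sits at position 79999.
value = monotonic.loc[80000]

# Rough timing of the scalar .loc lookup; numbers vary by machine and pandas version.
elapsed = timeit.timeit(lambda: monotonic.loc[80000], number=100)
print(f"value={value}, 100 scalar .loc lookups took {elapsed:.4f}s")
```

Only `.loc` is exercised here, since plain `[]` indexing with an integer key has changed meaning across pandas versions.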

doc/source/advanced.rst

+33
@@ -850,6 +850,39 @@ Of course if you need integer based selection, then use ``iloc``

    dfir.iloc[0:5]

+.. _indexing.intervallindex:
+
+IntervalIndex
+~~~~~~~~~~~~~
+
+.. versionadded:: 0.20.0
+
+.. warning::
+
+   These indexing behaviors are provisional and may change in a future version of pandas.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': [1, 2, 3, 4]},
+                     index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
+   df
+
+Label based indexing via ``.loc`` along the edges of an interval works as you would expect,
+selecting that particular interval.
+
+.. ipython:: python
+
+   df.loc[2]
+   df.loc[[2, 3]]
+
+If you select a label *contained* within an interval, this will also select the interval.
+
+.. ipython:: python
+
+   df.loc[2.5]
+   df.loc[[2.5, 3.5]]
+
+
 Miscellaneous indexing FAQ
 --------------------------

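The `.loc` behavior documented in this hunk can be tried as a plain script. The sketch below reuses the frame from the docs. The variable names `on_edge`, `inside` and `several` are illustrative, and it assumes the default right-closed intervals produced by `from_breaks`.

```python
import pandas as pd

# Intervals (0, 1], (1, 2], (2, 3], (3, 4]; closed on the right by default.
df = pd.DataFrame({'A': [1, 2, 3, 4]},
                  index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))

# A label on an edge selects the interval that contains that edge:
# 2 is the right edge of (1, 2].
on_edge = df.loc[2]

# A label strictly inside an interval selects that interval: 2.5 lies in (2, 3].
inside = df.loc[2.5]

# A list of labels selects one row per label.
several = df.loc[[2.5, 3.5]]

print(on_edge['A'], inside['A'], list(several['A']))
```

Since the docs mark these indexing behaviors as provisional, the exact results may differ between pandas versions.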

doc/source/api.rst

+21
@@ -1405,6 +1405,27 @@ Categorical Components
    CategoricalIndex.as_ordered
    CategoricalIndex.as_unordered

+.. _api.intervalindex:
+
+IntervalIndex
+-------------
+
+.. autosummary::
+   :toctree: generated/
+
+   IntervalIndex
+
+IntervalIndex Components
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: generated/
+
+   IntervalIndex.from_arrays
+   IntervalIndex.from_tuples
+   IntervalIndex.from_breaks
+   IntervalIndex.from_intervals
+
 .. _api.multiindex:

 MultiIndex

doc/source/reshaping.rst

+9-1
@@ -517,7 +517,15 @@ Alternatively we can specify custom bin-edges:

 .. ipython:: python

-   pd.cut(ages, bins=[0, 18, 35, 70])
+   c = pd.cut(ages, bins=[0, 18, 35, 70])
+   c
+
+.. versionadded:: 0.20.0
+
+If the ``bins`` keyword is an ``IntervalIndex``, then these will be
+used to bin the passed data.
+
+   pd.cut([25, 20, 50], bins=c.categories)


 .. _reshaping.dummies:
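To see the new ``bins`` handling in isolation, here is a short sketch of passing an ``IntervalIndex`` to ``pd.cut``. The contents of ``ages`` are not visible in this hunk, so the array below is an assumption, as are the variable names ``binned`` and ``outside``.

```python
import numpy as np
import pandas as pd

# Assumption: sample ages; the real array is defined earlier in reshaping.rst.
ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])

# Binning with explicit edges yields categories stored as an IntervalIndex:
# (0, 18], (18, 35], (35, 70].
c = pd.cut(ages, bins=[0, 18, 35, 70])
bins = c.categories

# The same IntervalIndex can be passed back as ``bins`` to bin other data.
binned = pd.cut([25, 20, 50], bins=bins)

# Values that fall in none of the intervals become missing (code -1).
outside = pd.cut([25, 100], bins=bins)

print(binned)
print(outside)
```

Reusing ``c.categories`` means the second data set is binned with exactly the same intervals as the first, without recomputing the edges.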

doc/source/whatsnew/v0.20.0.txt

+58
@@ -13,6 +13,7 @@ Highlights include:
 - ``Panel`` has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_panel>`
 - Improved user API when accessing levels in ``.groupby()``, see :ref:`here <whatsnew_0200.enhancements.groupby_access>`
 - Improved support for UInt64 dtypes, see :ref:`here <whatsnew_0200.enhancements.uint64_support>`
+- Addition of an ``IntervalIndex`` and ``Interval`` scalar type, see :ref:`here <whatsnew_0200.enhancements.intervalindex>`
 - A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
 - Window Binary Corr/Cov operations return a MultiIndexed ``DataFrame`` rather than a ``Panel``, as ``Panel`` is now deprecated, see :ref:`here <whatsnew_0200.api_breaking.rolling_pairwise>`
 - Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
@@ -314,6 +315,63 @@ To convert a ``SparseDataFrame`` back to sparse SciPy matrix in COO format, you

    sdf.to_coo()

+.. _whatsnew_0200.enhancements.intervalindex:
+
+IntervalIndex
+^^^^^^^^^^^^^
+
+pandas has gained an ``IntervalIndex`` with its own dtype, ``interval``, as well as the ``Interval`` scalar type. These allow first-class support for interval
+notation, specifically as a return type for the categories in ``pd.cut`` and ``pd.qcut``. The ``IntervalIndex`` allows some unique indexing, see the
+:ref:`docs <indexing.intervallindex>`. (:issue:`7640`, :issue:`8625`)
+
+Previous behavior:
+
+.. code-block:: ipython
+
+   In [2]: pd.cut(range(3), 2)
+   Out[2]:
+   [(-0.002, 1], (-0.002, 1], (1, 2]]
+   Categories (2, object): [(-0.002, 1] < (1, 2]]
+
+   # the returned categories are strings, representing Intervals
+   In [3]: pd.cut(range(3), 2).categories
+   Out[3]: Index(['(-0.002, 1]', '(1, 2]'], dtype='object')
+
+New behavior:
+
+.. ipython:: python
+
+   c = pd.cut(range(4), bins=2)
+   c
+   c.categories
+
+Furthermore, this allows one to bin *other* data with these same bins. ``NaN`` represents a missing
+value similar to other dtypes.
+
+.. ipython:: python
+
+   pd.cut([0, 3, 1, 1], bins=c.categories)
+
+These can also be used in ``Series`` and ``DataFrame``, and indexed.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'A': range(4),
+                      'B': pd.cut([0, 3, 1, 1], bins=c.categories)}
+                     ).set_index('B')
+
+Selecting a specific interval
+
+.. ipython:: python
+
+   df.loc[pd.Interval(1.5, 3.0)]
+
+Selecting via a scalar value that is contained in the intervals.
+
+.. ipython:: python
+
+   df.loc[0]
+
 .. _whatsnew_0200.enhancements.other:

 Other Enhancements
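The whatsnew entry introduces the ``Interval`` scalar alongside ``IntervalIndex``. A minimal sketch of the scalar on its own, and of the two selection styles mentioned above (by exact interval and by contained value), might look like this. The breaks ``[0.0, 1.5, 3.0]`` and the labels ``'low'`` / ``'high'`` are made up for illustration and do not come from the commit.

```python
import pandas as pd

# An Interval is closed on the right by default: (1.5, 3.0].
iv = pd.Interval(1.5, 3.0)
print(iv.left, iv.right, iv.closed)

# Membership respects the closed side: the left edge is excluded.
print(2 in iv, 1.5 in iv, 3.0 in iv)

# Hypothetical two-bin index: (0.0, 1.5] and (1.5, 3.0].
idx = pd.IntervalIndex.from_breaks([0.0, 1.5, 3.0])
s = pd.Series(['low', 'high'], index=idx)

# Selecting a specific interval requires an exact match on both edges.
by_interval = s.loc[iv]

# Selecting via a scalar picks the interval that contains it.
by_scalar = s.loc[2.0]

print(by_interval, by_scalar)
```

Explicit breaks are used here rather than ``pd.cut(range(4), bins=2)`` because ``pd.cut`` slightly widens the outer edge of the range, which makes exact interval matches on the first bin harder to write by hand.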

pandas/_libs/hashtable.pyx

-1
@@ -41,7 +41,6 @@ cdef extern from "Python.h":

 cdef size_t _INIT_VEC_CAP = 128

-
 include "hashtable_class_helper.pxi"
 include "hashtable_func_helper.pxi"
