Calling pandas.cut with timedelta series and incompatible bins should raise TypeError #20605

Closed
nmusolino opened this issue Apr 4, 2018 · 3 comments
Labels
Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff); Duplicate Report (Duplicate issue or pull request); Timedelta (Timedelta data type)

Comments

@nmusolino
Contributor

nmusolino commented Apr 4, 2018

Code Sample

In [1]: import pandas

In [3]: import numpy

In [10]: s = pandas.Series(numpy.timedelta64(i, 's') for i in range(5))

In [11]: s
Out[11]:
0   00:00:00
1   00:00:01
2   00:00:02
3   00:00:03
4   00:00:04
dtype: timedelta64[ns]

In [13]: pandas.cut(s, bins=[0, 2, 5])
Out[13]:
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
dtype: category
Categories (2, object): [(0, 2] < (2, 5]]

In [16]: pandas.cut(s, bins=[0.0, 2.5, 5.0])    # In contrast, the floating-point case raises.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-9dc1028f6406> in <module>()
----> 1 pandas.cut(s, bins=[0.0, 2.5, 5.0])    # In contrast, this raises.

C:\...\lib\site-packages\pandas\tools\tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest)
    117     return _bins_to_cuts(x, bins, right=right, labels=labels,
    118                          retbins=retbins, precision=precision,
--> 119                          include_lowest=include_lowest)
    120
    121

C:\...\lib\site-packages\pandas\tools\tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    189
    190     side = 'left' if right else 'right'
--> 191     ids = bins.searchsorted(x, side=side)
    192
    193     if len(algos.unique(bins)) < len(bins):

TypeError: invalid type promotion

Problem description

Calling pandas.cut with a timedelta64 series and integer bins returns an all-NaN series. This is inconsistent with two other results:

  1. Calling the function with float bins raises a TypeError as expected.
  2. Performing arithmetic comparisons with such a series (like s < 0) raises TypeError as expected.
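The second point can be checked with a small sketch. The helper name `comparison_raises` is hypothetical, introduced only for illustration, and the series is built from a list rather than a generator for simplicity:

```python
import numpy
import pandas

# Same data as in the code sample above.
s = pandas.Series([numpy.timedelta64(i, 's') for i in range(5)])


def comparison_raises(series, value):
    """Return True if `series < value` raises TypeError (hypothetical helper)."""
    try:
        series < value
    except TypeError:
        return True
    return False


# A plain integer is not comparable with timedelta values ...
print(comparison_raises(s, 0))
# ... whereas a Timedelta scalar is.
print(comparison_raises(s, pandas.Timedelta(0)))
```

The issue argues that pandas.cut should reject integer bin edges for the same reason the comparison does.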

Expected Output

Calling pandas.cut(s, bins=[0, 2, 5]) with the series s described above should raise a TypeError, because the bin edges are not of a type that is comparable with the series values.
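Until the installed pandas version enforces this itself, a caller could guard against the silent all-NaN result with a wrapper along these lines. This is only a sketch: `checked_cut` is a hypothetical name, not part of pandas, and it handles only the timedelta case discussed here.

```python
import numpy
import pandas


def checked_cut(x, bins, **kwargs):
    """Hypothetical wrapper around pandas.cut.

    Raises TypeError when `x` holds timedelta values but the explicit
    bin edges do not, instead of letting the call proceed.
    """
    x = pandas.Series(x)
    if numpy.iterable(bins):
        edges = pandas.Index(bins)
        # dtype.kind == 'm' identifies timedelta64 data.
        if x.dtype.kind == 'm' and edges.dtype.kind != 'm':
            raise TypeError(
                "bin edges of dtype {} are not comparable with "
                "timedelta values".format(edges.dtype)
            )
        bins = edges
    return pandas.cut(x, bins=bins, **kwargs)


s = pandas.Series([numpy.timedelta64(i, 's') for i in range(5)])

try:
    checked_cut(s, bins=[0, 2, 5])
except TypeError as exc:
    print("rejected:", exc)

# Timedelta bin edges pass the check and are forwarded to pandas.cut.
result = checked_cut(s, bins=pandas.to_timedelta([0, 2, 10], unit='s'))
print(result)
```

A scalar `bins` (a number of bins) is passed through unchanged, since the dtype mismatch only arises with explicit edges.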

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@nmusolino
Contributor Author

See also issue #19891.

@TomAugspurger
Contributor

TomAugspurger commented Apr 4, 2018

This has been changed / fixed (potentially just on master). Can you try with a newer version?

On master, both ints and floats correctly raise a ValueError. The bins argument should be an array of timedeltas.

In [18]: pandas.cut(s, bins=[0, 2, 5])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-d075ddd2f614> in <module>()
----> 1 pandas.cut(s, bins=[0, 2, 5])

/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/core/reshape/tile.pyc in cut(x, bins, right, labels, retbins, precision, include_lowest)
    193     else:
    194         bins = np.asarray(bins)
--> 195         bins = _convert_bin_to_numeric_type(bins, dtype)
    196         if (np.diff(bins) < 0).any():
    197             raise ValueError('bins must increase monotonically.')

/Users/taugspurger/sandbox/pandas-ip-2/pandas/pandas/core/reshape/tile.pyc in _convert_bin_to_numeric_type(bins, dtype)
    387             bins = to_timedelta(bins).view(np.int64)
    388         else:
--> 389             raise ValueError("bins must be of timedelta64 dtype")
    390     elif is_datetime64_dtype(dtype) or is_datetime64tz_dtype(dtype):
    391         if bins_dtype in ['datetime', 'datetime64']:

ValueError: bins must be of timedelta64 dtype

In [20]: pandas.cut(s, bins=pandas.to_timedelta([0, 2, 10], unit='s'))
Out[20]:
0                                   NaN
1    (0 days 00:00:00, 0 days 00:00:02]
2    (0 days 00:00:00, 0 days 00:00:02]
3    (0 days 00:00:02, 0 days 00:00:10]
4    (0 days 00:00:02, 0 days 00:00:10]
dtype: category
Categories (2, interval[timedelta64[ns]]): [(0 days 00:00:00, 0 days 00:00:02] < (0 days 00:00:02, 0 days 00:00:10]]

@TomAugspurger added the Timedelta and Algos labels Apr 4, 2018
@TomAugspurger
Contributor

Looks like #14737 added support for timedelta cut. Upgrading to 0.20+ should fix it for you.

@TomAugspurger added the Duplicate Report label Apr 4, 2018
@TomAugspurger added this to the No action milestone Apr 4, 2018