REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby #36003

Khris777 · 2020-08-31T07:03:11Z

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

Two different dates, one within the range of what pd.Timestamp can handle, the other outside of that range:

import pandas as pd
import datetime
df = pd.DataFrame({'A': ['X', 'Y'], 'B': [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
                                          datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)]})
print(df.groupby('A').B.max())

Problem description

pd.Timestamp cannot represent a date as far out as the year 3005, so to store such a date I need to use the datetime.datetime type. Before 1.1.1 (1.1.0?) this was not an issue, but now this code throws an assertion error:

Traceback (most recent call last):

  File "<ipython-input-38-8b8ec5e4e179>", line 5, in <module>
    print(df.groupby('A').B.max())

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1558, in max
    numeric_only=numeric_only, min_count=min_count, alias="max", npfunc=np.max

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in _agg_general
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\generic.py", line 261, in aggregate
    func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1083, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 644, in agg_series
    return self._aggregate_series_fast(obj, func)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 669, in _aggregate_series_fast
    result, counts = grouper.get_result()

  File "pandas\_libs\reduction.pyx", line 256, in pandas._libs.reduction.SeriesGrouper.get_result

  File "pandas\_libs\reduction.pyx", line 74, in pandas._libs.reduction._BaseGrouper._apply_to_group

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1060, in <lambda>
    f = lambda x: func(x, *args, **kwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "<__array_function__ internals>", line 6, in amax

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 2706, in amax
    keepdims=keepdims, initial=initial, where=where)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\generic.py", line 11460, in stat_func
    func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 4220, in _reduce
    delegate = self._values

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 572, in _values
    return self._mgr.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\managers.py", line 1615, in internal_values
    return self._block.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2019, in internal_values
    return self.array_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2022, in array_values
    return self._holder._simple_new(self.values)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\arrays\datetimes.py", line 290, in _simple_new
    assert values.dtype == "i8"

AssertionError

From testing with a mix of pd.Timestamp and datetime.datetime values, I presume pandas converts the dates that fit (the first line in the example) to pd.Timestamp while leaving the others as datetime.datetime, which leads to a mixed-type result column and the assertion error.

Expected Output

Since I'm explicitly operating with the datetime.datetime type, there should be no implicit conversion to pd.Timestamp unless all values are guaranteed to be within the range that pd.Timestamp allows.
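To make the range limitation concrete, here is a minimal sketch of how one might check whether a naive datetime.datetime falls inside the bounds exposed by pd.Timestamp.min and pd.Timestamp.max. The helper name fits_in_timestamp_range is hypothetical and not part of pandas; it assumes naive (timezone-free) datetimes, and the bounds are only accurate to the microsecond because datetime.datetime cannot hold nanoseconds.

```python
import datetime

import pandas as pd


def fits_in_timestamp_range(value: datetime.datetime) -> bool:
    """Hypothetical helper: True if a naive datetime lies within the
    nanosecond-resolution bounds given by pd.Timestamp.min / pd.Timestamp.max.
    """
    # warn=False suppresses the warning about dropped nanoseconds;
    # the bounds are therefore truncated to microsecond precision.
    lower = pd.Timestamp.min.to_pydatetime(warn=False)
    upper = pd.Timestamp.max.to_pydatetime(warn=False)
    return lower <= value <= upper


in_range = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
out_of_range = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

print(pd.Timestamp.min, pd.Timestamp.max)
print(fits_in_timestamp_range(in_range))      # True
print(fits_in_timestamp_range(out_of_range))  # False
```

A check like this only describes the data; it does not work around the groupby failure reported here.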

Output of `pd.show_versions()`

commit : f2ca0a2
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 50.0.0.post20200830
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.8.0
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.1


dsaxton · 2020-08-31T18:55:01Z

Thanks @Khris777. This seems to be a regression from 1.0.5:

In [1]: import pandas as pd
   ...: import datetime
   ...:
   ...: df = pd.DataFrame({'A': ['X', 'Y'], 'B': [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
   ...:                                           datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)]})
   ...: df.groupby("A")["B"].max()
   ...:
Out[1]:
A
X    2005-01-01 10:30:23.540000
Y    3005-01-01 10:30:23.540000
Name: B, dtype: object

In [2]: pd.__version__
Out[2]: '1.0.5'

Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.
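The 1.0.5 behaviour shown above could be captured as a small check, sketched here under the assumption that the expected result is the per-group maximum with the original values preserved. It compares values rather than dtypes, since the result dtype may differ between pandas versions. On an affected version (such as the 1.1.1 from the report) the groupby call itself is expected to raise before the assertions run.

```python
import datetime

import pandas as pd

in_range = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
out_of_range = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

df = pd.DataFrame({"A": ["X", "Y"], "B": [in_range, out_of_range]})

# Each group holds a single row, so the maximum is that row's value.
result = df.groupby("A")["B"].max()

assert result["X"] == in_range
assert result["Y"] == out_of_range
print(result)
```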

cc @jbrockmendel

simonjayhawkins · 2020-09-07T10:01:22Z

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

simonjayhawkins · 2020-09-07T15:25:18Z

> Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.

#31182

simonjayhawkins · 2020-09-26T17:27:59Z

> Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.

can confirm

first bad commit: [4edcc55] CLN: Make Series._values match Index._values (#31182)

https://github.com/simonjayhawkins/pandas/runs/1170442784?check_suite_focus=true

simonjayhawkins · 2020-10-05T13:00:46Z

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

simonjayhawkins · 2020-10-29T15:16:32Z

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

Khris777 added the Bug and Needs Triage labels Aug 31, 2020

dsaxton added the Regression and Datetime labels and removed the Bug and Needs Triage labels Aug 31, 2020

dsaxton changed the title ~~BUG: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby~~ REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby Aug 31, 2020

jorisvandenbossche added this to the 1.1.2 milestone Sep 1, 2020

simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 26, 2020

add code sample for pandas-dev#36003

b9f0344

simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020

simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020

jreback modified the milestones: 1.1.5, Contributions Welcome Nov 25, 2020

jorisvandenbossche modified the milestones: Contributions Welcome, 1.1.5 Nov 26, 2020

jorisvandenbossche mentioned this issue Nov 26, 2020

REGR: fix regression in groupby aggregation with out-of-bounds datetimes #38094

Merged

jorisvandenbossche closed this as completed in #38094 Nov 27, 2020
