REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby #36003

Closed
1 task
Khris777 opened this issue Aug 31, 2020 · 6 comments · Fixed by #38094
Labels
Datetime Datetime data dtype Regression Functionality that used to work in a prior pandas version
Milestone

@Khris777

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Two different dates, one within the range of what pd.Timestamp can handle, the other outside of that range:

import pandas as pd
import datetime
df = pd.DataFrame({'A': ['X', 'Y'], 'B': [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
                                          datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)]})
print(df.groupby('A').B.max())

Problem description

pd.Timestamp cannot represent a date as far out as the year 3005, so to store such a date I need to use the datetime.datetime type. Before 1.1.1 (1.1.0?) this was not an issue, but now this code raises an AssertionError:

Traceback (most recent call last):

  File "<ipython-input-38-8b8ec5e4e179>", line 5, in <module>
    print(df.groupby('A').B.max())

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1558, in max
    numeric_only=numeric_only, min_count=min_count, alias="max", npfunc=np.max

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in _agg_general
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\generic.py", line 261, in aggregate
    func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1083, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 644, in agg_series
    return self._aggregate_series_fast(obj, func)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 669, in _aggregate_series_fast
    result, counts = grouper.get_result()

  File "pandas\_libs\reduction.pyx", line 256, in pandas._libs.reduction.SeriesGrouper.get_result

  File "pandas\_libs\reduction.pyx", line 74, in pandas._libs.reduction._BaseGrouper._apply_to_group

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1060, in <lambda>
    f = lambda x: func(x, *args, **kwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "<__array_function__ internals>", line 6, in amax

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 2706, in amax
    keepdims=keepdims, initial=initial, where=where)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\generic.py", line 11460, in stat_func
    func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 4220, in _reduce
    delegate = self._values

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 572, in _values
    return self._mgr.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\managers.py", line 1615, in internal_values
    return self._block.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2019, in internal_values
    return self.array_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2022, in array_values
    return self._holder._simple_new(self.values)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\arrays\datetimes.py", line 290, in _simple_new
    assert values.dtype == "i8"

AssertionError

From testing with a mix of pd.Timestamp and datetime.datetime values, I presume pandas converts the dates that fit (the first one in the example) to pd.Timestamp while leaving the others as datetime.datetime, which leads to a mixed-type result column and the assertion error.
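To make the range question concrete, here is a minimal sketch that checks whether a datetime.datetime value lies inside the nanosecond-resolution bounds exposed as pd.Timestamp.min and pd.Timestamp.max. The helper name fits_in_ns_timestamp is hypothetical and only used for illustration; it is not part of pandas.

```python
import datetime

import pandas as pd


def fits_in_ns_timestamp(value: datetime.datetime) -> bool:
    """Hypothetical helper: True if `value` lies inside the nanosecond pd.Timestamp range."""
    # pd.Timestamp.min / pd.Timestamp.max are the nanosecond-resolution bounds
    # (roughly the years 1677 to 2262). warn=False silences the warning about
    # nanoseconds being dropped when converting to a plain datetime.
    lower = pd.Timestamp.min.to_pydatetime(warn=False)
    upper = pd.Timestamp.max.to_pydatetime(warn=False)
    return lower <= value <= upper


in_range = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
out_of_range = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

print(fits_in_ns_timestamp(in_range))      # the 2005 value is inside the bounds
print(fits_in_ns_timestamp(out_of_range))  # the 3005 value is outside the bounds
```

Only the first of the two values in the example fits, which is consistent with the presumed partial conversion described above.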

Expected Output

Since I'm explicitly operating with the datetime.datetime type, there should be no implicit conversion to pd.Timestamp unless all values are guaranteed to be within the range that pd.Timestamp allows.
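For comparison, the expected result can be computed without any conversion by grouping in plain Python. This is only a sketch of the desired semantics, assuming the keys and values are available as ordinary Python lists; max_per_group is a hypothetical helper, not a pandas function, and it is not a claim about how pandas should implement the fix.

```python
import datetime


def max_per_group(keys, values):
    """Hypothetical helper: return {key: largest value seen for that key}."""
    latest = {}
    for key, value in zip(keys, values):
        # Values stay plain datetime.datetime objects; nothing is converted.
        if key not in latest or value > latest[key]:
            latest[key] = value
    return latest


keys = ["X", "Y"]
values = [
    datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
    datetime.datetime(3005, 1, 1, 10, 30, 23, 540000),
]

result = max_per_group(keys, values)
for key, value in result.items():
    print(key, value)
```

Every value in the result is still a datetime.datetime, which is the behaviour asked for above.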

Output of pd.show_versions()

commit : f2ca0a2
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 50.0.0.post20200830
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.8.0
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.1

@Khris777 Khris777 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 31, 2020
@dsaxton dsaxton added Regression Functionality that used to work in a prior pandas version Datetime Datetime data dtype and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 31, 2020
@dsaxton dsaxton changed the title BUG: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby Aug 31, 2020
@dsaxton
Member

dsaxton commented Aug 31, 2020

Thanks @Khris777. This seems to be a regression from 1.0.5:

In [1]: import pandas as pd
   ...: import datetime
   ...:
   ...: df = pd.DataFrame({'A': ['X', 'Y'], 'B': [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
   ...:                                           datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)]})
   ...: df.groupby("A")["B"].max()
   ...:
Out[1]:
A
X    2005-01-01 10:30:23.540000
Y    3005-01-01 10:30:23.540000
Name: B, dtype: object

In [2]: pd.__version__
Out[2]: '1.0.5'

Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.
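The 1.0.5 behaviour shown above could be pinned down with a small check along these lines. This is a sketch only, not the test that was eventually added to pandas; the function name groupby_max_out_of_bounds is made up for illustration. It compares values rather than dtypes, since the result dtype may differ between pandas versions. On an affected version the groupby call itself would be expected to raise instead of returning.

```python
import datetime

import pandas as pd


def groupby_max_out_of_bounds() -> pd.Series:
    """Hypothetical check: per-group max over a column holding an out-of-range date."""
    in_range = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
    out_of_range = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)
    df = pd.DataFrame({"A": ["X", "Y"], "B": [in_range, out_of_range]})

    result = df.groupby("A")["B"].max()

    # Each group has a single row, so the max must equal that row's value.
    assert result.loc["X"] == in_range
    assert result.loc["Y"] == out_of_range
    return result


result = groupby_max_out_of_bounds()
print(result)
```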

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added this to the 1.1.2 milestone Sep 1, 2020
@simonjayhawkins simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020
@simonjayhawkins
Member

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

@simonjayhawkins
Member

Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.

#31182

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 26, 2020
@simonjayhawkins
Member

Actually this was raising a TypeError after 4edcc55 but prior to the AssertionError.

can confirm

first bad commit: [4edcc55] CLN: Make Series._values match Index._values (#31182)

https://github.com/simonjayhawkins/pandas/runs/1170442784?check_suite_focus=true

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020
@simonjayhawkins
Member

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Member

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

@jreback jreback modified the milestones: 1.1.5, Contributions Welcome Nov 25, 2020
@jorisvandenbossche jorisvandenbossche modified the milestones: Contributions Welcome, 1.1.5 Nov 26, 2020