dt-accessor fails to extract 'days' from timedelta-series #26285

Closed
stefansimik opened this issue May 5, 2019 · 8 comments
Labels
Datetime Datetime data dtype

Comments

@stefansimik

stefansimik commented May 5, 2019

Copy-pastable example

import pandas as pd
import datetime as dt

# Create data
df = pd.DataFrame({'start_date': [dt.date(2019, 3, 31), dt.date(2019, 3, 31)],
                   'end_date': [dt.date(2019, 3, 31), dt.date(3014, 1, 1)] })

# Calculate duration (timedelta objects)
df['duration'] = df['end_date'] - df['start_date']

# EXAMPLE 1: PROBLEM HERE - THIS FAILS !!!
df['duration_days'] = df['duration'].dt.days

# EXAMPLE 2: This works as a workaround
# df['duration_days'] = df['duration'].apply(lambda x: x.days)

Problem description

The dt accessor cannot extract the days value from a timedelta series
and raises an exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-ef0f32d83e8e> in <module>
----> 1 df['duration_days'] = df['duration'].dt.days

D:\data_science\bin\anaconda3\envs\nbo2\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

D:\data_science\bin\anaconda3\envs\nbo2\lib\site-packages\pandas\core\accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

D:\data_science\bin\anaconda3\envs\nbo2\lib\site-packages\pandas\core\indexes\accessors.py in __new__(cls, data)
    322             pass  # we raise an attribute error anyway
    323 
--> 324         raise AttributeError("Can only use .dt accessor with datetimelike "
    325                              "values")

AttributeError: Can only use .dt accessor with datetimelike values

Expected Output

The dt accessor should extract the days value from a timedelta series
without raising any exception.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.8.0
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented May 5, 2019

This may have something to do with Timestamp limitations:

http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations

Though that may not be the sole driver.

@mroeschke and @jbrockmendel

@WillAyd WillAyd added the Datetime Datetime data dtype label May 5, 2019
@mroeschke
Member

The issue is that you are using datetime.date objects, which are not a first-class data type in pandas and are treated as object dtype. Although the arithmetic yields datetime.timedelta objects, the result comes from two object-dtype operands and is therefore object dtype as well.

In [29]: df.dtypes
Out[29]:
start_date    object
end_date      object
duration      object
dtype: object

In [30]: df['duration'][0]
Out[30]: datetime.timedelta(0)

I can see an argument for pandas coercing in this case, but I believe we don't do this for other object data types (e.g. subtracting two ints with object dtype).

We recommend sticking with datetime.datetime objects in general. Supporting datetime.date directly would take a huge contribution from the community.
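A minimal sketch of the behavior described above, assuming a Series that is forced to object dtype to mimic the result of subtracting two object-dtype columns. The variable names are illustrative only and do not come from the report.

```python
import datetime as dt

import pandas as pd

# Force object dtype to mimic the result of subtracting two object-dtype
# columns of datetime.date values, as in the report above.
durations = pd.Series(
    [dt.timedelta(0), dt.date(3014, 1, 1) - dt.date(2019, 3, 31)],
    dtype=object,
)

print(durations.dtype)  # object

try:
    durations.dt.days
except AttributeError as exc:
    # .dt is only attached to datetime-like dtypes, not to object dtype,
    # even when every element is a datetime.timedelta.
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The elements themselves are ordinary datetime.timedelta objects; only the container dtype decides whether the accessor is available.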

@stefansimik
Author

stefansimik commented May 6, 2019

Thank you @mroeschke

I have reworked the example from date -> datetime.

import pandas as pd
import datetime as dt

# Create data
df = pd.DataFrame({'start_date': [dt.datetime(2019, 3, 31), dt.datetime(2019, 3, 31)],
                   'end_date': [dt.datetime(2019, 3, 31), dt.datetime(3014, 1, 1)] })

# Calculate duration (timedelta objects)
df['duration'] = df['end_date'] - df['start_date']

# EXAMPLE 1: PROBLEM HERE - THIS FAILS !!!
df['duration_days'] = df['duration'].dt.days

The code still fails, now with this error:

TypeError: ufunc subtract cannot use operands with types dtype('O') and dtype('<M8[ns]')

Full traceback below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
1504 try:
-> 1505 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1506 except TypeError:

~\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
207 if use_numexpr:
--> 208 return _evaluate(op, op_str, a, b, **eval_kwargs)
209 return _evaluate_standard(op, op_str, a, b)

~\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
122 if result is None:
--> 123 result = _evaluate_standard(op, op_str, a, b)
124

~\Anaconda3\lib\site-packages\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
67 with np.errstate(all='ignore'):
---> 68 return op(a, b)
69

TypeError: ufunc subtract cannot use operands with types dtype('O') and dtype('<M8[ns]')

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\ops.py in safe_na_op(lvalues, rvalues)
1528 with np.errstate(all='ignore'):
-> 1529 return na_op(lvalues, rvalues)
1530 except Exception:

~\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
1506 except TypeError:
-> 1507 result = masked_arith_op(x, y, op)
1508

~\Anaconda3\lib\site-packages\pandas\core\ops.py in masked_arith_op(x, y, op)
1008 result[mask] = op(xrav[mask],
-> 1009 com.values_from_object(yrav[mask]))
1010

TypeError: ufunc subtract cannot use operands with types dtype('O') and dtype('<M8[ns]')

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
in
7
8 # Calculate duration (timedelta objects)
----> 9 df['duration'] = df['end_date'] - df['start_date']
10
11 # EXAMPLE 1: PROBLEM HERE - THIS FAILS !!!

~\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(left, right)
1581 rvalues = rvalues.values
1582
-> 1583 result = safe_na_op(lvalues, rvalues)
1584 return construct_result(left, result,
1585 index=left.index, name=res_name, dtype=None)

~\Anaconda3\lib\site-packages\pandas\core\ops.py in safe_na_op(lvalues, rvalues)
1531 if is_object_dtype(lvalues):
1532 return libalgos.arrmap_object(lvalues,
-> 1533 lambda x: op(x, rvalues))
1534 raise
1535

pandas/_libs/algos.pyx in pandas._libs.algos.arrmap()

~\Anaconda3\lib\site-packages\pandas\core\ops.py in <lambda>(x)
1531 if is_object_dtype(lvalues):
1532 return libalgos.arrmap_object(lvalues,
-> 1533 lambda x: op(x, rvalues))
1534 raise
1535

TypeError: ufunc subtract cannot use operands with types dtype('O') and dtype('<M8[ns]')

If I replace the year 3014 with 2014, everything works fine.
It then works not only with dt.datetime but also with the original dt.date.

Because of this, I suspect that using dt.date instead of dt.datetime is not the only problem, and that it is related to the large year value 3014 used in the date/datetime.
It is strange that the dtype of the second column in the dataframe is:

  • object when the year 3014 is used,
  • datetime64[ns] when 2014 is used instead of 3014.

@mroeschke
Member

The object vs datetime64 difference you are seeing with the year has to do with the timestamp limitation @WillAyd mentioned.

In [1]: pd.Timestamp.max
Out[1]: Timestamp('2262-04-11 23:47:16.854775807')

@stefansimik
Author

stefansimik commented May 6, 2019

Thank you @mroeschke

Although I can imagine that the Timestamp limitation has some hidden impact under the hood,
it is not clear to me how or where it is involved in the calculation.

Let's test the simplest code possible:

import pandas as pd
import datetime as dt

series = pd.Series([dt.datetime(2019, 1, 1), dt.datetime(2262, 1, 1)])
series

OK - This results in a datetime series, as expected (the latest year used is 2262).

import pandas as pd
import datetime as dt

series = pd.Series([dt.datetime(2019, 1, 1), dt.datetime(2263, 1, 1)])
series

NOT OK - This results in an object series, even though all elements are valid datetimes (the latest year used is 2263, which is over the Timestamp limit).

Result:
The problem is definitely related to the Timestamp.max value.
Knowing this gives us a hint about the reasons behind the problem, but in my opinion it is very unexpected and buggy behavior of pandas that needs to be fixed.

Suggestion:
Datetime-series should work well with valid datetimes. It is not acceptable, that pd.Series changes its type from datetime to object, if all nested datetime-objects are fully valid and correct.

This bug practically excludes pandas from real-world projects where fully valid datetime objects (above year 2262) are used, and forces us into hacky workarounds (like Example 2 in the initial post).
This crazy, unexpected and impractical limitation should be fixed.

Pandas should work normally and predictably with ANY valid datetime.
I would suggest - Let's fix it :-)

What do you think about it?

@jbrockmendel
Member

It is not acceptable, that pd.Series changes its type from datetime to object, if all nested datetime-objects are fully valid and correct.

Note that the dtype is not datetime but datetime64[ns]. Timestamp.max isn't there just to make things difficult; it is the largest nanosecond timestamp that can be represented using an int64.
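The int64 bound can be reproduced with the standard library alone. This is a rough sketch, assuming the count starts at the 1970-01-01 epoch; the names INT64_MAX, EPOCH, latest and fits_by_year are illustrative.

```python
import datetime as dt

INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer
EPOCH = dt.datetime(1970, 1, 1)

# datetime.datetime stops at microseconds, so the last 807 ns are dropped.
latest = EPOCH + dt.timedelta(microseconds=INT64_MAX // 1000)
print(latest)  # 2262-04-11 23:47:16.854775

# Check the years used earlier in this thread against that bound.
fits_by_year = {
    year: dt.datetime(year, 1, 1) <= latest for year in (2262, 2263, 3014)
}
print(fits_by_year)  # {2262: True, 2263: False, 3014: False}
```

This matches the Timestamp.max value shown earlier in the thread, apart from the truncated nanoseconds, and shows why January 2262 still fits while January 2263 does not.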

Pandas should work normally and predictably with ANY valid datetime.

Similar to @mroeschke's comment about supporting datetime.date directly, it will take a huge effort/redesign to do what you're suggesting (and it has come up many times). If you're willing and able to make it happen, I think the feature would be welcome.

If you don't need nanosecond precision, you can use Period[s] dtype. It doesn't have all of the features of datetime64[ns], but at least gets around the 2263 problem.
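As a rough illustration of the Period route, the sketch below uses day frequency (an assumption, chosen because only whole days are needed here) and the two dates from the original report. It relies on the keyword form of pd.Period and on the .ordinal attribute.

```python
import datetime as dt

import pandas as pd

# Day-frequency periods for the dates from the original report.
start = pd.Period(year=2019, month=3, day=31, freq="D")
end = pd.Period(year=3014, month=1, day=1, freq="D")

# With freq="D", .ordinal counts days, so the difference of the two
# ordinals is the duration in whole days.
duration_days = end.ordinal - start.ordinal

# Cross-check against plain standard-library date arithmetic.
expected = (dt.date(3014, 1, 1) - dt.date(2019, 3, 31)).days
print(duration_days == expected)  # True
```

The year 3014 is representable here because a Period stores a count of its own frequency units rather than nanoseconds.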

@jreback
Contributor

jreback commented May 7, 2019

@stefansimik you can also have a look here: #7307

Closing for all the reasons @mroeschke and @jbrockmendel pointed out. Object-dtype Series do not expose the datetime accessors.
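One way to cope with both cases is to check the dtype before touching the accessor. The helper below is a hypothetical sketch, not part of pandas: it uses .dt.days for native timedelta64 data and falls back to the element-wise workaround from the original post for object dtype.

```python
import datetime as dt

import pandas as pd
from pandas.api.types import is_timedelta64_dtype


def duration_days(durations: pd.Series) -> pd.Series:
    """Return whole days per element (hypothetical helper, not a pandas API)."""
    if is_timedelta64_dtype(durations.dtype):
        # Native timedelta64 dtype: the vectorized accessor is available.
        return durations.dt.days
    # Object dtype: read .days from each timedelta object individually.
    return durations.apply(lambda td: td.days)


native = pd.Series(pd.to_timedelta([0, 5], unit="D"))
boxed = pd.Series([dt.timedelta(0), dt.timedelta(days=363_000)], dtype=object)

print(duration_days(native).tolist())  # [0, 5]
print(duration_days(boxed).tolist())   # [0, 363000]
```

The object-dtype branch is slower because it calls Python code per element, but it works for any duration that datetime.timedelta itself can hold.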

@jreback jreback closed this as completed May 7, 2019
@jreback jreback added this to the No action milestone May 7, 2019
@stefansimik
Author

Thank you all (@jbrockmendel, @jreback, @mroeschke)
for your explanations.
I understand now, and I see that a solution is not easily available at this time.

Let's hope for a better future in pandas v2.0 👍
