
.astype('object') misbehaves on Categorical containing Timestamp #18024


Closed
toobaz opened this issue Oct 29, 2017 · 10 comments · Fixed by #44930
Labels
good first issue, Needs Tests (Unit test(s) needed to prevent regressions)
Milestone

Comments

@toobaz
Member

toobaz commented Oct 29, 2017

Code Sample, a copy-pastable example if possible

In [2]: pd._libs.algos.ensure_object([pd.Timestamp('2014-01-01')]) # right
Out[2]: array([Timestamp('2014-01-01 00:00:00')], dtype=object)

In [3]: pd._libs.algos.ensure_object(pd.Categorical([pd.Timestamp('2014-01-01')])) # wrong
Out[3]: array([1388534400000000000], dtype=object)

Problem description

The two calls should produce the same output. The problem comes from:

In [3]: pd.Categorical([pd.Timestamp('2014-01-01')]).astype('object')
Out[3]: array([1388534400000000000], dtype=object)

... which in turn comes from

In [3]: np.array(pd.Categorical([pd.Timestamp('2014-01-01')]), dtype='object')
Out[3]: array([1388534400000000000], dtype=object)

Expected Output

Out[2]

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 4329a8d03555568b8593364caad680c24f91c9ad
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+18.g4329a8d03
pytest: 3.0.6
pip: 9.0.1
setuptools: None
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.1.0.dev
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz toobaz changed the title from "ensure_object misbehaves on Categorical containing Timestamp" to ".astype('object') misbehaves on Categorical containing Timestamp" on Oct 30, 2017
@gfyoung gfyoung added the Categorical (Categorical Data Type) and Datetime (Datetime data dtype) labels on Oct 30, 2017
@gfyoung
Member

gfyoung commented Oct 30, 2017

@toobaz: I'm not sure I fully follow your explanation of the problem's source, but your report seems reasonable to me. PR is welcome!

@toobaz
Member Author

toobaz commented Oct 30, 2017

Actually my explanation is not even a real explanation... I just dug a couple of steps down into the code looking for the origin of the problem, and stopped at the border with numpy code. A fix in pandas should still be possible, given that e.g. replacing Timestamp with Period works fine.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 30, 2017

I agree it is strange behaviour, but I am not fully sure we can do something about it at the source. Converting a Categorical to an array gives datetime64:

In [51]: np.asarray(pd.Categorical([pd.Timestamp('2014-01-01')]))
Out[51]: array(['2014-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

and converting that to object gives the underlying integer as object (numpy behaviour):

In [52]: np.asarray(pd.Categorical([pd.Timestamp('2014-01-01')])).astype(object)
Out[52]: array([1388534400000000000], dtype=object)

The difference with the plain Timestamp list example is that its array already has object dtype, so astype(object) does not change anything.

You get the same result when astyping datetime64 data:

In [55]: pd._libs.algos.ensure_object(np.array(['2012-01-01'], dtype='datetime64[ns]'))
Out[55]: array([1325376000000000000], dtype=object)

In [56]: np.array(['2012-01-01'], dtype='datetime64[ns]').astype(object)
Out[56]: array([1325376000000000000], dtype=object)
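
As a small illustration of the NumPy behaviour described above (a sketch, not pandas internals): what astype(object) returns depends on the datetime64 unit. Nanosecond data cannot be represented by datetime.datetime, so NumPy falls back to plain Python integers, while microsecond data is converted to datetime.datetime objects. The variable names here are only illustrative.

```python
import datetime

import numpy as np

ns = np.array(['2012-01-01'], dtype='datetime64[ns]')
us = np.array(['2012-01-01'], dtype='datetime64[us]')

# Nanosecond resolution: NumPy returns the underlying integer (ns since epoch).
as_obj_ns = ns.astype(object)

# Microsecond resolution: NumPy can box the values as datetime.datetime.
as_obj_us = us.astype(object)

print(type(as_obj_ns[0]).__name__, as_obj_ns[0])
print(type(as_obj_us[0]).__name__, as_obj_us[0])
```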

But I suppose we can add some workarounds to prevent this from happening on the pandas side (as is done for the Series(datetime64 data).astype(object) case).
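
For comparison, here is a minimal sketch of the Series case mentioned above, where pandas boxes datetime64 data into Timestamp objects instead of exposing the integers. The names raw, boxed and first are only illustrative.

```python
import numpy as np
import pandas as pd

raw = np.array(['2012-01-01'], dtype='datetime64[ns]')

# Going through a Series: pandas boxes each value as a Timestamp.
boxed = pd.Series(raw).astype(object)
first = boxed.iloc[0]

print(boxed.dtype, type(first).__name__)
```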

@toobaz
Member Author

toobaz commented Oct 30, 2017

Converting a Categorical to array gives datetime64

I'm not sure what we can do to prevent unwanted conversions, but what we should at least (be able to) fix is the fact that np.asarray(pd.Categorical(anything)) gives a different result than np.asarray(anything) (where in the case we are discussing, anything is [pd.Timestamp('2014-01-01')]).

@jorisvandenbossche
Member

but what we should at least (be able to) fix is the fact that np.asarray(pd.Categorical(anything)) gives a different result than np.asarray(anything)

Yes, but the 'anything', although being [pd.Timestamp('2014-01-01')], is actually a DatetimeIndex (once converted to Categorical, the categories are inferred to be datetime), and doing np.asarray(DatetimeIndex(..)) does give datetime64, not object Timestamps:

In [60]: np.asarray(pd.DatetimeIndex(["2012-01-01"]))
Out[60]: array(['2012-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
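
The inference described above can be checked directly. This is a small sketch; the exact datetime64 resolution of the categories may differ between pandas versions, so only the index type and dtype kind are inspected.

```python
import pandas as pd

values = [pd.Timestamp('2014-01-01')]
cat = pd.Categorical(values)

# The categories are inferred as datetime-like, not kept as generic objects.
print(type(cat.categories).__name__)
print(cat.categories.dtype.kind)  # 'M' is NumPy's kind code for datetime64
```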

@jreback
Contributor

jreback commented Oct 30, 2017

take is meant to emulate np.take, but it has to operate on ._values; in other words, it knows about extension types (in take_1d, which does all the work):

In [51]: from pandas.core.algorithms import take_1d

In [52]: c = pd.Categorical([pd.Timestamp('2014-01-01')])

In [53]: take_1d(c.categories.values, c.codes)
Out[53]: array(['2014-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [54]: c.categories.take(c.codes).values
Out[54]: array(['2014-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [56]: c.categories.values.take(c.codes)
Out[56]: array(['2014-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

so we want these to be identical.
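
A sketch of the same codes-plus-categories relationship using only public API (take_1d is an internal helper and may not be available under that name in every pandas version): taking the categories at the positions given by the codes rebuilds the original values. The example avoids missing values, since a code of -1 would need separate handling.

```python
import pandas as pd

stamps = [
    pd.Timestamp('2014-01-02'),
    pd.Timestamp('2014-01-01'),
    pd.Timestamp('2014-01-02'),
]
c = pd.Categorical(stamps)

# Categories are the sorted unique values; codes index into them.
rebuilt = c.categories.take(c.codes)

print(list(c.codes))
print(rebuilt)
```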

@jreback
Contributor

jreback commented Oct 30, 2017

This issue is not about take, but rather about .astype(object) not being handled. I agree this should work and be equivalent. It probably just needs some logic to have

In [59]: c.astype('object')
Out[59]: array([1388534400000000000], dtype=object)

In [60]: c.categories
Out[60]: DatetimeIndex(['2014-01-01'], dtype='datetime64[ns]', freq=None)

In [61]: c.categories.astype('object')
Out[61]: Index([2014-01-01 00:00:00], dtype='object')

work correctly.
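
One way such logic could look, as a sketch only: box the categories first (which already yields Timestamps, as Out[61] shows), then index them by the codes. The helper name datetime_categorical_to_object is hypothetical and is not the fix that was eventually merged; filling missing entries with pd.NaT is likewise an assumption made for datetime-like categories.

```python
import numpy as np
import pandas as pd


def datetime_categorical_to_object(cat):
    """Hypothetical helper: object array of Timestamps from a Categorical."""
    # Boxing the categories gives Timestamp objects rather than integers.
    boxed_categories = cat.categories.astype(object).to_numpy()

    out = np.empty(len(cat), dtype=object)
    missing = cat.codes == -1  # a code of -1 marks a missing value
    out[~missing] = boxed_categories[cat.codes[~missing]]
    out[missing] = pd.NaT  # assumption: use NaT for missing datetimes
    return out


c = pd.Categorical(pd.to_datetime(['2014-01-01', None, '2014-01-02']))
result = datetime_categorical_to_object(c)
print(result)
```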

@mroeschke mroeschke added the Bug label Apr 1, 2020
@mroeschke
Member

This appears to work on master. I suppose we could use a test for this:

In [7]: pd._libs.algos.ensure_object(pd.Categorical([pd.Timestamp('2014-01-01')])) # wrong
Out[7]: array([Timestamp('2014-01-01 00:00:00')], dtype=object)

In [8]: pd._libs.algos.ensure_object([pd.Timestamp('2014-01-01')]) # right
Out[8]: array([Timestamp('2014-01-01 00:00:00')], dtype=object)

@mroeschke mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug, Categorical (Categorical Data Type), and Datetime (Datetime data dtype) labels on Jun 12, 2021
@jbrockmendel
Member

@phofl I think one of your current PRs might solve this?

@phofl
Member

phofl commented Dec 21, 2021

Yep, added a test

Edit: The astype call did not work before
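
A regression test for this could look roughly like the sketch below. It is not the test added in the linked PR; the function name is made up for illustration, and it assumes a pandas version that includes the fix (the issue is on the 1.4 milestone).

```python
import numpy as np
import pandas as pd


def test_categorical_timestamp_astype_object():
    # Hypothetical test name; mirrors the example from the original report.
    ts = pd.Timestamp('2014-01-01')
    cat = pd.Categorical([ts])

    result = cat.astype(object)
    expected = np.array([ts], dtype=object)

    assert result.dtype == object
    # The values should be Timestamps, not nanosecond integers.
    assert isinstance(result[0], pd.Timestamp)
    assert (result == expected).all()


test_categorical_timestamp_astype_object()
```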

@jreback jreback added this to the 1.4 milestone Dec 22, 2021

7 participants