BUG: .values for objects containing categoricals with box-able categories #21658

Closed
jorisvandenbossche opened this issue Jun 27, 2018 · 3 comments · Fixed by #53524
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@jorisvandenbossche
Member

A bit of a cryptic title, but this follows from my investigation of #21390 (comment).

When a Categorical has "box-able" categories, e.g.:

In [1]: cat = pd.Categorical(pd.date_range("2012-01-01", periods=3, freq='H'))

In [3]: cat.tolist()
Out[3]: 
[Timestamp('2012-01-01 00:00:00', freq='H'),
 Timestamp('2012-01-01 01:00:00', freq='H'),
 Timestamp('2012-01-01 02:00:00', freq='H')]

the boxing to Timestamps works for the Categorical itself, but once the Categorical is combined into 2D data structures, the boxing fails and raw integers come back instead:

In [7]: midx = pd.MultiIndex.from_product([['a', 'b', 'c'], cat])

In [8]: midx.values
Out[8]: 
array([('a', 1325376000000000000), ('a', 1325379600000000000),
       ('a', 1325383200000000000), ('b', 1325376000000000000),
       ('b', 1325379600000000000), ('b', 1325383200000000000),
       ('c', 1325376000000000000), ('c', 1325379600000000000),
       ('c', 1325383200000000000)], dtype=object)

In [9]: df = pd.DataFrame({'a':['a', 'b', 'c'], 'b': cat, 'c': np.array(cat)})

In [10]: df.dtypes
Out[10]: 
a            object
b          category
c    datetime64[ns]
dtype: object

In [11]: df.values
Out[11]: 
array([['a', 1325376000000000000, Timestamp('2012-01-01 00:00:00')],
       ['b', 1325379600000000000, Timestamp('2012-01-01 01:00:00')],
       ['c', 1325383200000000000, Timestamp('2012-01-01 02:00:00')]], dtype=object)
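
To make the symptom concrete: the integers in the outputs above are the same instants expressed as nanoseconds since the Unix epoch, i.e. the unboxed representation of the Timestamps. The following is a minimal sketch of that correspondence; the variable names are illustrative and not part of the original report.

```python
import pandas as pd

# Raw integers copied from the midx.values / df.values outputs above.
raw = [1325376000000000000, 1325379600000000000, 1325383200000000000]

# pd.Timestamp interprets a bare integer as nanoseconds since the epoch,
# so boxing each value recovers the hourly Timestamps from cat.tolist().
boxed = [pd.Timestamp(v) for v in raw]

expected = [
    pd.Timestamp("2012-01-01 00:00:00"),
    pd.Timestamp("2012-01-01 01:00:00"),
    pd.Timestamp("2012-01-01 02:00:00"),
]
```

So the data is not wrong, only left unboxed when the Categorical is interleaved with other columns or levels.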
@mroeschke
Member

This appears to work on master now. It could use a test.

In [21]: midx = pd.MultiIndex.from_product([['a', 'b', 'c'], cat])
    ...:
    ...: midx.values
Out[21]:
array([('a', Timestamp('2012-01-01 00:00:00')),
       ('a', Timestamp('2012-01-01 01:00:00')),
       ('a', Timestamp('2012-01-01 02:00:00')),
       ('b', Timestamp('2012-01-01 00:00:00')),
       ('b', Timestamp('2012-01-01 01:00:00')),
       ('b', Timestamp('2012-01-01 02:00:00')),
       ('c', Timestamp('2012-01-01 00:00:00')),
       ('c', Timestamp('2012-01-01 01:00:00')),
       ('c', Timestamp('2012-01-01 02:00:00'))], dtype=object)

In [22]: df = pd.DataFrame({'a':['a', 'b', 'c'], 'b': cat, 'c': np.array(cat)})
    ...:
    ...: df.dtypes
Out[22]:
a            object
b          category
c    datetime64[ns]
dtype: object

In [23]: df.values
Out[23]:
array([['a', Timestamp('2012-01-01 00:00:00'),
        Timestamp('2012-01-01 00:00:00')],
       ['b', Timestamp('2012-01-01 01:00:00'),
        Timestamp('2012-01-01 01:00:00')],
       ['c', Timestamp('2012-01-01 02:00:00'),
        Timestamp('2012-01-01 02:00:00')]], dtype=object)
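
A regression test for this could mirror the two examples above. The following is only a rough sketch of what such a test might look like, not the test that was eventually merged: the helper `make_cat` and both test names are hypothetical, and the hourly values are built with `pd.to_datetime` rather than `date_range(..., freq='H')` so the sketch does not depend on a particular frequency alias.

```python
import numpy as np
import pandas as pd


def make_cat():
    # Hypothetical helper: a Categorical whose categories are Timestamps,
    # matching the three hourly values used in the report.
    stamps = pd.to_datetime(
        ["2012-01-01 00:00:00", "2012-01-01 01:00:00", "2012-01-01 02:00:00"]
    )
    return pd.Categorical(stamps)


def test_multiindex_values_boxes_categorical_timestamps():
    cat = make_cat()
    midx = pd.MultiIndex.from_product([["a", "b", "c"], cat])
    # Each element of .values is a tuple; the second entry should be a
    # boxed Timestamp rather than a raw integer.
    for _, ts in midx.values:
        assert isinstance(ts, pd.Timestamp)


def test_dataframe_values_boxes_categorical_timestamps():
    cat = make_cat()
    df = pd.DataFrame({"a": ["a", "b", "c"], "b": cat, "c": np.array(cat)})
    values = df.values
    assert values.dtype == object
    for row in values:
        # Column "b" (category) should box the same way as column "c"
        # (datetime64).
        assert isinstance(row[1], pd.Timestamp)
        assert row[1] == row[2]
```

Checking the category column against the plain datetime64 column keeps the expected values out of the test body, which may or may not match the style the pandas test suite prefers.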

@mroeschke added the "good first issue" and "Needs Tests" labels and removed the "Bug" and "Categorical" labels on Jun 20, 2021
@dhanushkamath

dhanushkamath commented Jul 31, 2022

Hi @mroeschke, I would like to contribute here. New to OSS. Can you give me some info on what should be covered in the test cases? Are the two examples that you've given enough?

@dhanushkamath

take
