Skip to content

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Dec 31, 2020 · 13 comments · Fixed by #41557
Closed

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

jbrockmendel opened this issue Dec 31, 2020 · 13 comments · Fixed by #41557
Labels
Bug Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@jbrockmendel
Copy link
Member

arrs =  [np.arange(0, 40, 10), np.arange(0, 40, 10), np.arange(0, 40, 10)]

arrs2 = arrs.copy()
arrs2[0] = pd.Categorical(arrs2[0])

df = pd.DataFrame(arrs)
df2 = pd.DataFrame(arrs2)
df3 = pd.DataFrame(arrs2[::-1])

>>> df.shape
(3, 4)
>>> df2.shape
(4, 3)
>>> df3.shape
(3, 4)

It's specific to Categorical, as we check isinstance(data[0], Categorical) in nested_data_to_arrays

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 31, 2020
@venaturum
Copy link
Contributor

Looks like the root cause is to_arrays, which is called by nested_data_to_arrays

(pandas/core/internals/construction.py)

def to_arrays(data, columns, dtype: Optional[DtypeObj] = None):
    """
    Return list of arrays, columns.
    """
    .
    .
    .
    elif isinstance(data[0], Categorical):
        if columns is None:
            columns = ibase.default_index(len(data))
        return data, columns

@simonjayhawkins simonjayhawkins added Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 31, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 31, 2020
@venaturum
Copy link
Contributor

There's a few functions in construction.py which seem to base behaviour around the first element's type in a nested list.

eg


pd.DataFrame([
    [1,2,3],
    pd.Series([4,5,6]),
    [7,8,9]
])

works but:

pd.DataFrame([
    pd.Series([1,2,3]),
    [4,5,6],
    [7,8,9]
])

results in an exception.

This begs the question, "what is the desired result when every nested array is a Categorical:

pd.DataFrame([pd.Categorical(arrs[i]) for i in range(3)])

Is it (by design):

	0	1	2
0	0	0	0
1	10	10	10
2	20	20	20
3	30	30	30

If it is, then this what is the desired result if all, but one, of the nested arrays are Categorical?

Seems like there needs to be clarification around design choices here. In any case, the order of the nested arrays shouldn't affect the final shape of the DataFrame and so using assumptions based on the type of the first nested array are not ideal.

@venaturum
Copy link
Contributor

Is the solution as simple as replacing code in construction.py of the form:

isinstance(data[0], type_to_check)

with

all(map(lambda x: isinstance(x, type_to_check), data))

?

@jbrockmendel
Copy link
Member Author

The first step is to get consensus that we want to change the behavior in the Categorical case. It may end up needing a deprecation cycle.

@MarcoGorelli MarcoGorelli added the Needs Discussion Requires discussion from core team before further action label Jan 3, 2021
@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

in an ideal world i think we would want to change this (though needing a deprecation cycle). @venaturum can you make the change locally and run asvs and see if performance is an issue (it might be)?

@venaturum
Copy link
Contributor

in an ideal world i think we would want to change this (though needing a deprecation cycle). @venaturum can you make the change locally and run asvs and see if performance is an issue (it might be)?

Sure thing.

@venaturum
Copy link
Contributor

@jreback

I've ran the asv. Do I zip and upload pandas/asv_bench/results?

What about the output in the terminal?

@jreback
Copy link
Contributor

jreback commented Jan 5, 2021

just post if anything is significantly different

@venaturum
Copy link
Contributor

This is the last section of the terminal output

GH38845 asv summary.txt

@jreback
Copy link
Contributor

jreback commented Jan 5, 2021

looks very odd

pick out a relevant benchmark or 2 and repeat

@venaturum
Copy link
Contributor

I ran the three worst reported benchmarks, 5 times each, to see what sort of consistency I get. I tried to leave the machine otherwise idle.
There are a few outliers in the ratios which are a bit bizarre (namely 0 and 180).

I'm not sure if much trust can be placed in the original benchmarks I uploaded - the machine certainly wasn't idle at the time.

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+        292±80ms        324±50ms     1.11  multiindex_object.GetLoc.time_large_get_loc
+        248±40ms        270±60ms     1.09  multiindex_object.GetLoc.time_large_get_loc
-        496±70ms        429±80ms     0.86  multiindex_object.GetLoc.time_large_get_loc
-        510±70ms        434±90ms     0.85  multiindex_object.GetLoc.time_large_get_loc
-        267±60ms        209±70ms     0.78  multiindex_object.GetLoc.time_large_get_loc
       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+        415±10μs        468±30μs     1.13  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        110±20ms        428±50μs     0.00  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        99.7±6ms         121±5ms     1.20  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+         117±8ms        130±20ms     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        102±10ms         113±7ms     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
          102±3ms         102±7ms     1.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-        114±10ms        401±20μs     0.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
+       587±100μs         106±7ms   180.55  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-        233±30ms        188±20ms     0.81  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
         191±10ms        194±30ms     1.02  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)

Currently rerunning pandas/asv_bench/benchmarks/ctors as I assume this is the most relevant to the changes I've made locally, but I'm only guessing on that one.

@venaturum
Copy link
Contributor

I ran the ctors benchmarks 13 times. The table below shows the median ratio reported.

Interestingly it looks like those benchmarks which have worsened were generally highlighted as having changed, much less thatn the benchmarks that have improved.

Sample size is still not great though.

                                                                                                median ratio    number of reports
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,False,'int')               1.320                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,True,'float')                   1.310                3
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,True,'float')              1.230                5
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_str>,False,'float')                1.230                1
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,True,'float')             1.210                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,False,'float')             1.210                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,False,'int')     1.190                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,False,'int')                    1.190                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,True,'int')               1.180                6
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_tuples>,False,'int')               1.170                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,True,'int')                     1.160                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,True,'float')    1.160                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,False,'float')            1.150                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,False,'float')  1.140                6
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_tuples>,False,'float')             1.120                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,True,'int')      1.110                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,True,'float')   1.110                4
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_str>,False,'int')                  1.110                3
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,True,'int')     1.100                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,False,'int')              1.025                4
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,True,'int')                1.010                8
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,False,'float')                  1.000                4
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,False,'int')                 0.995                10
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,True,'int')                  0.895                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,False,'float')   0.880                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,True,'float')                0.875                6
ctors.SeriesConstructors.time_series_constructor(<class'list'>,True,'int')                          0.830                10
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,False,'float')               0.825                8
ctors.SeriesDtypesConstructors.time_dtindex_from_series                                             0.825                2
ctors.SeriesDtypesConstructors.time_index_from_array_floats                                         0.805                2
ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series                                  0.800                3
ctors.SeriesDtypesConstructors.time_index_from_array_string                                         0.800                1
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,False,'int')    0.780                4
ctors.SeriesConstructors.time_series_constructor(<class'list'>,True,'float')                        0.770                9
ctors.SeriesConstructors.time_series_constructor(<class'list'>,False,'int')                         0.735                10
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,True,'float')                  0.710                9
ctors.SeriesConstructors.time_series_constructor(<class'list'>,False,'float')                       0.690                13
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,True,'int')                    0.625                12
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,False,'float')                 0.490                12
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,False,'int')                   0.480                13
ctors.MultiIndexConstructor.time_multiindex_from_iterables                                          0.390                3

@venaturum
Copy link
Contributor

I'm also curious as to how asv decides whether to declare performance is increased or decreased.

In the example below all but one of the benchmarks reported were faster, yet it is summarised as "PERFORMANCE DECREASED", which seems unexpected.

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+         124±5μs         158±60μs     1.28  ctors.SeriesConstructors.time_series_constructor(<function no_change>, True, 'int')
-      2.98±0.3ms       2.70±0.1ms     0.91  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, True, 'float')
-      2.51±0.5ms       2.25±0.1ms     0.90  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
-        655±90μs         583±90μs     0.89  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, True, 'int')
-      1.34±0.7ms       1.18±0.2ms     0.88  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'float')
-        3.42±1ms       2.98±0.4ms     0.87  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, True, 'float')
-      4.19±0.8ms       3.60±0.2ms     0.86  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, True, 'int')
-       785±200μs        668±200μs     0.85  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, True, 'float')
-      3.25±0.7ms       2.76±0.2ms     0.85  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, False, 'float')
-       659±300μs        555±200μs     0.84  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, False, 'int')
-      3.15±0.4ms       2.65±0.2ms     0.84  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, False, 'float')
-      3.35±0.7ms       2.79±0.2ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, True, 'int')
-      3.29±0.6ms       2.73±0.4ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, True, 'int')
-        3.61±1ms       2.98±0.3ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, False, 'float')
-       739±300μs        597±100μs     0.81  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, False, 'float')
-        40.9±9μs         32.6±1μs     0.80  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
-        3.77±1ms       2.89±0.4ms     0.77  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, True, 'int')
-        3.38±1ms       2.57±0.1ms     0.76  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, False, 'int')
-        3.68±1ms       2.80±0.1ms     0.76  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, True, 'float')
-        2.45±1ms       1.73±0.6ms     0.71  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'float')
-        3.83±1ms       2.67±0.2ms     0.70  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none>, False, 'int')
-        3.74±1ms      2.59±0.08ms     0.69  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none>, False, 'float')
-       317±100μs        214±200μs     0.68  ctors.SeriesConstructors.time_series_constructor(<function no_change>, True, 'float')
-        5.61±2ms         3.73±3ms     0.66  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
-        143±90μs        64.5±10μs     0.45  ctors.SeriesConstructors.time_series_constructor(<function no_change>, False, 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants