BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

jbrockmendel · 2020-12-31T03:47:50Z

arrs =  [np.arange(0, 40, 10), np.arange(0, 40, 10), np.arange(0, 40, 10)]

arrs2 = arrs.copy()
arrs2[0] = pd.Categorical(arrs2[0])

df = pd.DataFrame(arrs)
df2 = pd.DataFrame(arrs2)
df3 = pd.DataFrame(arrs2[::-1])

>>> df.shape
(3, 4)
>>> df2.shape
(4, 3)
>>> df3.shape
(3, 4)

It's specific to Categorical, as we check isinstance(data[0], Categorical) in nested_data_to_arrays

The text was updated successfully, but these errors were encountered:

venaturum · 2020-12-31T08:12:26Z

Looks like the root cause is to_arrays, which is called by nested_data_to_arrays

(pandas/core/internals/construction.py)

def to_arrays(data, columns, dtype: Optional[DtypeObj] = None):
    """
    Return list of arrays, columns.
    """
    .
    .
    .
    elif isinstance(data[0], Categorical):
        if columns is None:
            columns = ibase.default_index(len(data))
        return data, columns

venaturum · 2021-01-01T02:35:33Z

There's a few functions in construction.py which seem to base behaviour around the first element's type in a nested list.

eg


pd.DataFrame([
    [1,2,3],
    pd.Series([4,5,6]),
    [7,8,9]
])

works but:

pd.DataFrame([
    pd.Series([1,2,3]),
    [4,5,6],
    [7,8,9]
])

results in an exception.

This begs the question, "what is the desired result when every nested array is a Categorical:

pd.DataFrame([pd.Categorical(arrs[i]) for i in range(3)])

Is it (by design):

If it is, then this what is the desired result if all, but one, of the nested arrays are Categorical?

Seems like there needs to be clarification around design choices here. In any case, the order of the nested arrays shouldn't affect the final shape of the DataFrame and so using assumptions based on the type of the first nested array are not ideal.

venaturum · 2021-01-01T03:12:01Z

Is the solution as simple as replacing code in construction.py of the form:

isinstance(data[0], type_to_check)

with

all(map(lambda x: isinstance(x, type_to_check), data))

?

jbrockmendel · 2021-01-02T17:36:18Z

The first step is to get consensus that we want to change the behavior in the Categorical case. It may end up needing a deprecation cycle.

jreback · 2021-01-03T16:23:49Z

in an ideal world i think we would want to change this (though needing a deprecation cycle). @venaturum can you make the change locally and run asvs and see if performance is an issue (it might be)?

venaturum · 2021-01-03T21:33:23Z

in an ideal world i think we would want to change this (though needing a deprecation cycle). @venaturum can you make the change locally and run asvs and see if performance is an issue (it might be)?

Sure thing.

venaturum · 2021-01-05T04:56:44Z

@jreback

I've ran the asv. Do I zip and upload pandas/asv_bench/results?

What about the output in the terminal?

jreback · 2021-01-05T05:05:41Z

just post if anything is significantly different

venaturum · 2021-01-05T05:12:25Z

This is the last section of the terminal output

GH38845 asv summary.txt

jreback · 2021-01-05T05:19:30Z

looks very odd

pick out a relevant benchmark or 2 and repeat

venaturum · 2021-01-05T07:49:53Z

I ran the three worst reported benchmarks, 5 times each, to see what sort of consistency I get. I tried to leave the machine otherwise idle.
There are a few outliers in the ratios which are a bit bizarre (namely 0 and 180).

I'm not sure if much trust can be placed in the original benchmarks I uploaded - the machine certainly wasn't idle at the time.

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+        292±80ms        324±50ms     1.11  multiindex_object.GetLoc.time_large_get_loc
+        248±40ms        270±60ms     1.09  multiindex_object.GetLoc.time_large_get_loc
-        496±70ms        429±80ms     0.86  multiindex_object.GetLoc.time_large_get_loc
-        510±70ms        434±90ms     0.85  multiindex_object.GetLoc.time_large_get_loc
-        267±60ms        209±70ms     0.78  multiindex_object.GetLoc.time_large_get_loc

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+        415±10μs        468±30μs     1.13  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-        110±20ms        428±50μs     0.00  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        99.7±6ms         121±5ms     1.20  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+         117±8ms        130±20ms     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        102±10ms         113±7ms     1.11  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
          102±3ms         102±7ms     1.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-        114±10ms        401±20μs     0.00  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
+       587±100μs         106±7ms   180.55  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
-        233±30ms        188±20ms     0.81  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
         191±10ms        194±30ms     1.02  hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)

Currently rerunning pandas/asv_bench/benchmarks/ctors as I assume this is the most relevant to the changes I've made locally, but I'm only guessing on that one.

venaturum · 2021-01-05T11:11:38Z

I ran the ctors benchmarks 13 times. The table below shows the median ratio reported.

Interestingly it looks like those benchmarks which have worsened were generally highlighted as having changed, much less thatn the benchmarks that have improved.

Sample size is still not great though.

                                                                                                median ratio    number of reports
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,False,'int')               1.320                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,True,'float')                   1.310                3
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,True,'float')              1.230                5
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_str>,False,'float')                1.230                1
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,True,'float')             1.210                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,False,'float')             1.210                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,False,'int')     1.190                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,False,'int')                    1.190                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,True,'int')               1.180                6
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_tuples>,False,'int')               1.170                5
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,True,'int')                     1.160                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,True,'float')    1.160                5
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,False,'float')            1.150                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,False,'float')  1.140                6
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_tuples>,False,'float')             1.120                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,True,'int')      1.110                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,True,'float')   1.110                4
ctors.SeriesConstructors.time_series_constructor(<functiongen_of_str>,False,'int')                  1.110                3
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,True,'int')     1.100                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples>,False,'int')              1.025                4
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists>,True,'int')                1.010                8
ctors.SeriesConstructors.time_series_constructor(<functionarr_dict>,False,'float')                  1.000                4
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,False,'int')                 0.995                10
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,True,'int')                  0.895                6
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_lists_with_none>,False,'float')   0.880                7
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,True,'float')                0.875                6
ctors.SeriesConstructors.time_series_constructor(<class'list'>,True,'int')                          0.830                10
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_str>,False,'float')               0.825                8
ctors.SeriesDtypesConstructors.time_dtindex_from_series                                             0.825                2
ctors.SeriesDtypesConstructors.time_index_from_array_floats                                         0.805                2
ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series                                  0.800                3
ctors.SeriesDtypesConstructors.time_index_from_array_string                                         0.800                1
ctors.SeriesConstructors.time_series_constructor(<functionlist_of_tuples_with_none>,False,'int')    0.780                4
ctors.SeriesConstructors.time_series_constructor(<class'list'>,True,'float')                        0.770                9
ctors.SeriesConstructors.time_series_constructor(<class'list'>,False,'int')                         0.735                10
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,True,'float')                  0.710                9
ctors.SeriesConstructors.time_series_constructor(<class'list'>,False,'float')                       0.690                13
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,True,'int')                    0.625                12
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,False,'float')                 0.490                12
ctors.SeriesConstructors.time_series_constructor(<functionno_change>,False,'int')                   0.480                13
ctors.MultiIndexConstructor.time_multiindex_from_iterables                                          0.390                3

venaturum · 2021-01-05T11:14:25Z

I'm also curious as to how asv decides whether to declare performance is increased or decreased.

In the example below all but one of the benchmarks reported were faster, yet it is summarised as "PERFORMANCE DECREASED", which seems unexpected.

       before           after         ratio
     [cc9c9b9f]       [0976c4ca]
     <GH38672^2>       <GH38845>
+         124±5μs         158±60μs     1.28  ctors.SeriesConstructors.time_series_constructor(<function no_change>, True, 'int')
-      2.98±0.3ms       2.70±0.1ms     0.91  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, True, 'float')
-      2.51±0.5ms       2.25±0.1ms     0.90  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
-        655±90μs         583±90μs     0.89  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, True, 'int')
-      1.34±0.7ms       1.18±0.2ms     0.88  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'float')
-        3.42±1ms       2.98±0.4ms     0.87  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, True, 'float')
-      4.19±0.8ms       3.60±0.2ms     0.86  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, True, 'int')
-       785±200μs        668±200μs     0.85  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, True, 'float')
-      3.25±0.7ms       2.76±0.2ms     0.85  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, False, 'float')
-       659±300μs        555±200μs     0.84  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, False, 'int')
-      3.15±0.4ms       2.65±0.2ms     0.84  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, False, 'float')
-      3.35±0.7ms       2.79±0.2ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, True, 'int')
-      3.29±0.6ms       2.73±0.4ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, True, 'int')
-        3.61±1ms       2.98±0.3ms     0.83  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, False, 'float')
-       739±300μs        597±100μs     0.81  ctors.SeriesConstructors.time_series_constructor(<function list_of_str>, False, 'float')
-        40.9±9μs         32.6±1μs     0.80  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
-        3.77±1ms       2.89±0.4ms     0.77  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists>, True, 'int')
-        3.38±1ms       2.57±0.1ms     0.76  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none>, False, 'int')
-        3.68±1ms       2.80±0.1ms     0.76  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples>, True, 'float')
-        2.45±1ms       1.73±0.6ms     0.71  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'float')
-        3.83±1ms       2.67±0.2ms     0.70  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none>, False, 'int')
-        3.74±1ms      2.59±0.08ms     0.69  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none>, False, 'float')
-       317±100μs        214±200μs     0.68  ctors.SeriesConstructors.time_series_constructor(<function no_change>, True, 'float')
-        5.61±2ms         3.73±3ms     0.66  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
-        143±90μs        64.5±10μs     0.45  ctors.SeriesConstructors.time_series_constructor(<function no_change>, False, 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 31, 2020

simonjayhawkins added Categorical Categorical Data Type Constructors Series/DataFrame/Index/pd.array Constructors and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 31, 2020

simonjayhawkins added this to the Contributions Welcome milestone Dec 31, 2020

MarcoGorelli added the Needs Discussion Requires discussion from core team before further action label Jan 3, 2021

jbrockmendel mentioned this issue May 19, 2021

DEPR: DataFrame([categorical, ...]) special casing #41557

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3 May 19, 2021

jreback closed this as completed in #41557 May 21, 2021

ggold7046 mentioned this issue Aug 10, 2023

Modified doc/make.py to run sphinx-build -b linkcheck #54265

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

jbrockmendel commented Dec 31, 2020

venaturum commented Dec 31, 2020

venaturum commented Jan 1, 2021

venaturum commented Jan 1, 2021

jbrockmendel commented Jan 2, 2021

jreback commented Jan 3, 2021

venaturum commented Jan 3, 2021

venaturum commented Jan 5, 2021

jreback commented Jan 5, 2021

venaturum commented Jan 5, 2021

jreback commented Jan 5, 2021

venaturum commented Jan 5, 2021

venaturum commented Jan 5, 2021

venaturum commented Jan 5, 2021

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845

Comments

jbrockmendel commented Dec 31, 2020

venaturum commented Dec 31, 2020

venaturum commented Jan 1, 2021

venaturum commented Jan 1, 2021

jbrockmendel commented Jan 2, 2021

jreback commented Jan 3, 2021

venaturum commented Jan 3, 2021

venaturum commented Jan 5, 2021

jreback commented Jan 5, 2021

venaturum commented Jan 5, 2021

jreback commented Jan 5, 2021

venaturum commented Jan 5, 2021

venaturum commented Jan 5, 2021

venaturum commented Jan 5, 2021