
Pandas Tests rely on inconsistent array coercion #29978

Open
seberg opened this issue Dec 2, 2019 · 5 comments
Labels
DataFrame DataFrame data structure Deprecate Functionality to remove in pandas

Comments

@seberg
Contributor

seberg commented Dec 2, 2019

In numpy/numpy#14995 I have tried to make numpy consistent with respect to coercing dataframes (and other array-likes which also implement the sequence protocol) to numpy arrays.

With the new PR/behaviour, the __array__ interface would be fully preferred, and no mixed/inconsistent behaviour with respect to also being a sequence-like (with different behaviour) would occur.

Unfortunately, pandas DataFrames have this behaviour, since they are sequence-like. It kicks in during DataFrame coercion in the following case:

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
df2 = pd.DataFrame([df1, df1])

Where df2 is currently coerced as a dataframe with dataframes inside. Currently this happens due to the following logic:

        try:
            if is_list_like(values[0]) or hasattr(values[0], 'len'):  # <-- is hit
                # the following convert does nothing; `np.array()` then raises an error...
                values = np.array([convert(v) for v in values])
            elif isinstance(values[0], np.ndarray) and values[0].ndim == 0:
                # GH#21861
                values = np.array([convert(v) for v in values])
            else:
                values = convert(values)
        except (ValueError, TypeError):
            values = convert(values)  # <-- Ends up getting called and forces object array.

EDIT: additional code details: convert is a thin wrapper around:

def maybe_convert_platform(values):
    """ try to do platform conversion, allow ndarray or list here """

    if isinstance(values, (list, tuple, range)):
        values = construct_1d_object_array_from_listlike(values)
    # more logic

This takes the first branch (values is a list), which in turn forces a 1-D object array:

def construct_1d_object_array_from_listlike(values):
    # numpy will try to interpret nested lists as further dimensions, hence
    # making a 1D array that contains list-likes is a bit tricky:
    result = np.empty(len(values), dtype='object')
    result[:] = values
    return result
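
For readers unfamiliar with this idiom, here is a minimal sketch of why the empty-then-assign pattern is used. The helper name `one_d_object_array` is hypothetical and only mirrors the function quoted above; it uses plain nested lists rather than DataFrames so that the result does not depend on the NumPy changes discussed in this issue.

```python
import numpy as np

def one_d_object_array(values):
    # Hypothetical helper mirroring the quoted pandas function:
    # allocate first, then fill, so nesting is not turned into dimensions.
    result = np.empty(len(values), dtype=object)
    result[:] = values
    return result

nested = [[1, 2], [3, 4]]

# np.array() treats the inner lists as a second dimension.
direct = np.array(nested, dtype=object)
print(direct.shape)

# The empty-then-assign pattern should keep each inner list as one element.
packed = one_d_object_array(nested)
print(packed.shape, type(packed[0]))
```

The first call yields a 2-D object array, while the helper is expected to produce a 1-D array whose elements are the original lists.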

Because np.array([df1, df1]) raises an error due to the inconsistencies within NumPy, pandas ends up calling convert([df1, df1]), which in turn creates a NumPy dtype=object array with two DataFrames inside.
However, the new/correct behaviour for NumPy would be that np.array([df1, df1]) returns a 3-dimensional array. This ends up raising an error because pandas refuses to coerce a 3-D array to a DataFrame.

It seems safest not to try to squeeze this into the upcoming NumPy release (it is planned in a few days). However, I would like to change it in master soon after branching. I am not sure whether you see the current behaviour as important, but it would be nice if you could look into what the final intent should be here. If we (can) change this in NumPy, I am not sure there is a way for pandas to retain the old behaviour.

@jbrockmendel
Member

However, the new/correct behaviour for NumPy would be that np.array([df1, df1]) returns a 3-dimensional array.

In this case you have df1 twice, but what if you had two dataframes of different shapes in there?

pandas refuses to coerce a 3D array to a DataFrame.

I guess we could coerce to an xarray object

Nested listlikes are a PITA, but it isn't clear that there's a better alternative. Is there something specific we need to fix here? Or is this a "be aware of" kind of thing?

@seberg
Contributor Author

seberg commented Dec 2, 2019

If they have different shapes, things become interesting, since NumPy will automatically give the result fewer dimensions (we are changing that).

Pandas has 3 tests (I think) which would fail if I just do this. The question is if you think that there is any issue with breaking this behaviour. It does seem fairly useless to me, but we cannot deprecate it really. So if pandas users rely on it, it would suddenly be broken.

In other words: I expect there is nothing you need to do. Unless you want to use it as an excuse to start cleaning up the listlike coercion in general.

@jreback
Contributor

jreback commented Dec 2, 2019

I would expect

DataFrame([df1, df2]), no matter the shape of df1 and df2 (same or different), to raise a ValueError; the only way we would support this is if dtype=object is specified.

similar to how this is handled

In [2]: arr
Out[2]:
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [3]: pd.DataFrame([arr, arr])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-0a6417e1d1de> in <module>
----> 1 pd.DataFrame([arr, arr])

~/pandas/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    467                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    468                 else:
--> 469                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    470             else:
    471                 mgr = init_dict({}, index, columns, dtype=dtype)

~/pandas/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    155     # by definition an array here
    156     # the dtypes will be coerced to a single dtype
--> 157     values = prep_ndarray(values, copy=copy)
    158 
    159     if dtype is not None:

~/pandas/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    279         values = values.reshape((values.shape[0], 1))
    280     elif values.ndim != 2:
--> 281         raise ValueError("Must pass 2-d input")
    282 
    283     return values

ValueError: Must pass 2-d input

this is just too magical

In [4]: pd.DataFrame([pd.DataFrame(arr), pd.DataFrame(arr)])
Out[4]:
                                                   0
0     0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  ...
1     0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  ...

so I think we should actually deprecate / change the current behavior now.

@gfyoung gfyoung added DataFrame DataFrame data structure Dependencies Required and optional dependencies Deprecate Functionality to remove in pandas labels Dec 3, 2019
@seberg
Contributor Author

seberg commented Jan 29, 2020

Just a heads up, I have rebased that change in NumPy gh-14995, and would hope that fixing up pandas for it will be simple enough. It would be nice to get it over with (supporting such weird behaviours is just a pain moving forward). If you have concerns or we end up merging it and it is hard to catch up, let me know and we can revert...

@jbrockmendel
Member

Just ran the test suite on that branch and only found 2 failures, both of which look like we're doing something sketchy that can be fixed on our end without too much trouble. Thanks for the heads up.

@mroeschke mroeschke removed the Dependencies Required and optional dependencies label Jul 23, 2021