Pandas Tests rely on inconsistent array coercion #29978
Comments
In this case you have df1 twice, but what if you had two dataframes of different shapes in there?
I guess we could coerce to an xarray object. Nested listlikes are a PITA, but it isn't clear that there's a better alternative. Is there something specific we need to fix here? Or is this a "be aware of" kind of thing?
If they have different shapes, things become interesting, since NumPy will automatically give the result fewer dimensions (we are changing that). Pandas has 3 tests (I think) which would fail if I just do this. The question is whether you think there is any issue with breaking this behaviour. It does seem fairly useless to me, but we cannot really deprecate it, so if pandas users rely on it, it would suddenly be broken. In other words: I expect there is nothing you need to do, unless you want to use it as an excuse to start cleaning up the listlike coercion in general.
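One way to sidestep the shape-dependent result is to never let NumPy inspect the nested items at all. The sketch below is only an illustration of that idea: `to_1d_object_array` is a hypothetical helper name, not a pandas or NumPy function, and plain arrays stand in for DataFrames.

```python
import numpy as np


def to_1d_object_array(items):
    """Hypothetical helper: keep every item as one element of a 1-D object array."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        # Integer indexing into an object array stores the item itself,
        # so no nested shape discovery takes place.
        out[i] = item
    return out


# Items of equal shape and items of different shapes are treated the same way.
same = to_1d_object_array([np.zeros((2, 2)), np.zeros((2, 2))])
different = to_1d_object_array([np.zeros((2, 2)), np.zeros((3, 2))])
```

Both results are 1-D with two elements, regardless of whether the inner shapes happen to match.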
I would expect
similar to how this is handled
this is just too magical
so I think we should actually deprecate / change the current behavior now.
Just a heads up, I have rebased that change in NumPy gh-14995, and would hope that fixing up pandas for it will be simple enough. It would be nice to get it over with (supporting such weird behaviours is just a pain moving forward). If you have concerns or we end up merging it and it is hard to catch up, let me know and we can revert...
Just ran the test suite on that branch and only found 2 failures, both of which look like we're doing something sketchy that can be fixed on our end without too much trouble. Thanks for the heads up.
In numpy/numpy#14995 I have tried to make NumPy consistent with respect to coercing DataFrames (and other array-likes which also implement the sequence protocol) to NumPy arrays.

With the new PR/behaviour, the `__array__` interface would be fully preferred, and no mixed/inconsistent behaviour with respect to also being sequence-like (with different behaviour) would occur.

Unfortunately, pandas DataFrames have this mixed behaviour, since they are sequence-like. It kicks in during DataFrame coercion when a DataFrame `df2` is built from a list such as `[df1, df1]`: `df2` is currently coerced as a DataFrame with DataFrames inside.

EDIT: additional code details: `convert` is a thin wrapper around code that takes its first branch when `values` is a list, which in turn forces a 1-D object array. Because `np.array([df1, df1])` will raise an error due to the inconsistencies within NumPy, pandas ends up calling `convert([df1, df1])`, which in turn creates a NumPy `dtype=object` array with two DataFrames inside.

However, the new/correct behaviour for NumPy would be that `np.array([df1, df1])` returns a 3-dimensional array. This ends up raising an error because pandas refuses to coerce a 3-D array to a DataFrame.

It seems safest not to try to squeeze this into the upcoming NumPy release (it is planned in a few days). However, I would like to change it in master soon after branching. I am not sure whether you see the current behaviour as important, but it would be nice if you could look into what the final intent will be here. If we (can) change this in NumPy, I am not sure there is a way for pandas to retain the old behaviour.
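To make the "`__array__` is fully preferred" behaviour concrete, here is a minimal sketch that avoids pandas entirely. `FrameLike` is a hypothetical stand-in class, not a pandas type: like a DataFrame it exposes `__array__`, while its sequence protocol yields column labels rather than rows. The 3-D result assumes a NumPy version that includes the coercion change discussed above.

```python
import numpy as np


class FrameLike:
    """Hypothetical stand-in for a DataFrame: array-like and sequence-like."""

    def __init__(self, data, columns):
        self._data = np.asarray(data)
        self._columns = list(columns)

    def __array__(self, dtype=None, copy=None):
        # Array-like view: the full 2-D block of values.
        return np.array(self._data, dtype=dtype)

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        # Sequence-like view: iterating yields column labels, not rows.
        return iter(self._columns)

    def __getitem__(self, key):
        # Label-based lookup, returning one column of values.
        return self._data[:, self._columns.index(key)]


f1 = FrameLike([[1, 2], [3, 4]], columns=["a", "b"])

# With __array__ preferred, each element contributes its 2-D values,
# so the list of two frame-likes stacks into a 3-D array.
stacked = np.array([f1, f1])
```

A consumer that only accepts 1-D or 2-D input, as pandas does for DataFrame construction, would then have to reject `stacked` or build the object array explicitly.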