BUG: DataFrame([listlikes]) different shape if first listlike is Categorical #38845
Comments
Looks like the root cause is to_arrays, which is called by nested_data_to_arrays (pandas/core/internals/construction.py)
|
There are a few functions in construction.py that seem to base their behaviour on the type of the first element in a nested list, e.g.
works but:
results in an exception. This raises the question of what the desired result is when every nested array is a Categorical:
Is it (by design):
If it is, then what is the desired result if all but one of the nested arrays are Categorical? It seems there needs to be clarification of the design choices here. In any case, the order of the nested arrays shouldn't affect the final shape of the DataFrame, so relying on assumptions based on the type of the first nested array is not ideal. |
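To illustrate why dispatching on the first element is order-dependent, here is a minimal sketch in plain Python. It does not use pandas: `FakeCategorical`, `dispatch_on_first` and `dispatch_on_all` are hypothetical names made up for this example, and the real code in construction.py may look quite different. Whether an `all(...)` check (or an `any(...)` check) is the right rule is exactly the open design question.

```python
class FakeCategorical(list):
    """Hypothetical stand-in for pandas.Categorical (not the real class)."""


def dispatch_on_first(data):
    # Only data[0] is inspected, so the chosen path depends on element order.
    if isinstance(data[0], FakeCategorical):
        return "categorical path"
    return "generic path"


def dispatch_on_all(data):
    # Every element is inspected, so reordering the input cannot change the path.
    if all(isinstance(item, FakeCategorical) for item in data):
        return "categorical path"
    return "generic path"


cat = FakeCategorical("abc")
plain = list("def")

# Same two list-likes, two different orders.
first_ab = dispatch_on_first([cat, plain])
first_ba = dispatch_on_first([plain, cat])
all_ab = dispatch_on_all([cat, plain])
all_ba = dispatch_on_all([plain, cat])

print(first_ab, "|", first_ba)  # the two orders take different paths
print(all_ab, "|", all_ba)      # the two orders take the same path
```

With the first-element check, swapping the two inputs changes which code path runs; with the check over all elements, it does not.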
Is the solution as simple as replacing code in construction.py of the form:
with
? |
The first step is to get consensus that we want to change the behavior in the Categorical case. It may end up needing a deprecation cycle. |
in an ideal world i think we would want to change this (though needing a deprecation cycle). @venaturum can you make the change locally and run asvs and see if performance is an issue (it might be)? |
Sure thing. |
I've run the asv benchmarks. Do I zip and upload pandas/asv_bench/results? What about the output in the terminal? |
just post if anything is significantly different |
This is the last section of the terminal output |
Looks very odd. Pick out a relevant benchmark or two and repeat. |
I ran the three worst reported benchmarks, 5 times each, to see what sort of consistency I get. I tried to leave the machine otherwise idle. I'm not sure if much trust can be placed in the original benchmarks I uploaded - the machine certainly wasn't idle at the time.
Currently rerunning pandas/asv_bench/benchmarks/ctors as I assume this is the most relevant to the changes I've made locally, but I'm only guessing on that one. |
I ran the ctors benchmarks 13 times. The table below shows the median ratio reported. Interestingly, it looks like the benchmarks that worsened were generally highlighted as having changed much less than the benchmarks that improved. The sample size is still not great, though.
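For anyone wanting to repeat this kind of summary, a rough sketch of taking the per-benchmark median over repeated runs follows. The benchmark names and ratios here are made-up placeholders, not the results from this thread, and collecting the ratios from the asv output is assumed to have been done by hand.

```python
from statistics import median

# Hypothetical ratios (after / before) gathered from five repeated runs.
# Names and values are placeholders for illustration only.
ratios_by_benchmark = {
    "bench_a": [1.10, 0.95, 1.02, 1.30, 0.99],
    "bench_b": [0.80, 0.85, 0.79, 1.20, 0.82],
}

# The median is less sensitive than the mean to a single noisy run.
median_ratio = {
    name: median(ratios) for name, ratios in ratios_by_benchmark.items()
}

for name, value in sorted(median_ratio.items()):
    print(f"{name}: {value:.2f}")
```

A median ratio above 1 suggests the benchmark got slower after the change, and below 1 suggests it got faster.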
|
I'm also curious as to how asv decides whether to declare performance is increased or decreased. In the example below all but one of the benchmarks reported were faster, yet it is summarised as "PERFORMANCE DECREASED", which seems unexpected.
|
It's specific to Categorical, as we check isinstance(data[0], Categorical) in nested_data_to_arrays.