aggregating with a dictionary that specifies a column that has all nan's fails to use numpy #9149
Forgot to specify; here are the versions in play in my setup: pandas 0.14.1. Also to note, the structure of my dataframe is that my "data" columns are all prefixed with the string "Column"; that's why I'm doing the x.startswith('Column') above.
Can you show a complete reproducible example that I can copy/paste?
I did some more troubleshooting and I found out where the object dtype was being introduced. In my case I'm pd.concat()'ing a bunch of csv files, and one of them had 0 data rows, so its dataframe was created with columns of dtype object. Then the concat upcasts the float64 columns holding NaN to object. I can do due diligence to prevent this object upcasting, but I still believe there is a flaw in how agg() works with and without a dictionary specified for these columns. Here's a sample csv and ipython notebook output showing the problem: https://www.dropbox.com/s/1238l1m3g4zju4s/reproduce.csv?dl=0
@flyingchipmunk a small example that we can fully copy/paste is what is useful. Pls try to have as simple an example as possible in order to reproduce.
Ok here it is simplified:
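A minimal sketch along these lines (column names, the grouping key, and sizes are assumptions) reproduces the setup described above; on current pandas the concat may no longer upcast:

```python
import collections
import numpy as np
import pandas as pd

# a frame with real float data plus a column that is entirely NaN
full = pd.DataFrame({'Key': ['a', 'a', 'b', 'b'] * 25000,
                     'ColumnRef20': np.random.randn(100000),
                     'ColumnRef53': np.nan})

# concatenating with a 0-row object-dtype frame (e.g. from an empty csv)
# upcasts the all-NaN float64 column to object
empty = pd.DataFrame(columns=full.columns, dtype=object)
sq = pd.concat([full, empty])
sq_g = sq.groupby('Key')

sq_g.agg(sum)                                              # fast: cython/numpy path
sq_g.agg(collections.OrderedDict({'ColumnRef53': 'sum'}))  # slow: falls back to Python
```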
Profiling the first .agg() call yields this:
Profiling the second .agg() call yields this:
I would expect the second .agg() call with the dictionary to be significantly faster than a broad sum over the entire dataframe, since it only aggregates a single column. Hope this helps.
You have object dtype columns, which is why the aggregation falls back to Python.
Closing as this is a usage question.
Yes, I understand that, and I explained how/why the object dtype got introduced in my previous comments. My issue here is that under the hood pandas treats .agg(sum) over the entire dataframe differently (correctly) than .agg(collections.OrderedDict({'EmptyCol':'sum'})). If .agg(sum) can correctly use the cython-backed np.sum to execute the aggregation, then why can't the dictionary form? I still believe this is a flaw.
@flyingchipmunk it's not clear why your original data ended up as object dtype. It's also not performant to test on every aggregation whether an object column can actually be coerced to a numeric dtype. You can use convert_objects() to downcast the columns back to numeric.
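A sketch of that suggestion, assuming the frame from the example above (convert_objects is the pandas 0.14-era API; it was later deprecated in favor of pd.to_numeric):

```python
# Coerce object columns back to numeric where possible;
# an all-NaN object column comes back as float64.
sq = sq.convert_objects(convert_numeric=True)
```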
Thanks for the convert_objects() tip; that's very useful and will make my workaround code much simpler. The object dtype comes from the csv with 0 data rows, which produces a dataframe whose columns are all object. Then when that empty dataframe is concatenated, pandas has to go with the highest common denominator to hold the dtype of the columns from the different dataframes being concatenated. In my mind it comes down to the question of how pandas determines that the object dtype of an empty frame should win over the float64 of the frames that actually have data.
Hmm, on 2nd thought this might be a bug in the concat itself. It shouldn't upcast to object in this case, since it's concatenating with an empty frame. Might be an edge case; would welcome a PR to look at that. As far as how aggregation works: the cython routine tries with the entire frame first to see if it can reduce. In that case it works because numpy does the coercion. The 2nd part (the dictionary path) might be a bug. If you could walk the code, that would be helpful.
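A minimal sketch of the suspected concat edge case, with assumed data:

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({'ColumnRef53': [1.0, np.nan]})      # float64, has real data
empty = pd.DataFrame({'ColumnRef53': []}, dtype=object)  # 0-row frame, as read from an empty csv

# On the pandas of this era the concatenated column comes back as object,
# even though the only frame that actually has data is float64:
print(pd.concat([full, empty]).dtypes)
```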
Ok, thanks for the explanation. I'll dig further into walking the code and see if I can nail it down.
Looks to be fixed on master. Could use a test.
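A hedged sketch of what such a test might look like (the grouping key, column name, and testing import are illustrative, not taken from the repo):

```python
import numpy as np
import pandas as pd
import pandas.util.testing as tm  # pandas.testing in modern versions

def test_agg_dict_on_all_nan_object_column():
    # an object-dtype column holding nothing but NaN, as produced by the concat above
    df = pd.DataFrame({'key': ['a', 'a', 'b'],
                       'EmptyCol': np.array([np.nan] * 3, dtype=object)})
    g = df.groupby('key')
    # the dictionary path should agree with the whole-frame path
    result = g.agg({'EmptyCol': 'sum'})
    expected = g.agg(sum)[['EmptyCol']]
    tm.assert_frame_equal(result, expected)
```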
As the subject says, if I try to call .agg() with a dictionary that includes a column holding all np.nan values, it falls back to Python agg functions instead of numpy.
To reproduce: (my dataset is 60 cols, 100000 rows)
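A hedged sketch of the setup (the file name and grouping key are assumptions):

```python
import collections
import numpy as np
import pandas as pd

sq = pd.read_csv('data.csv')  # ~60 columns x 100000 rows; 'ColumnRef53' is entirely null
sq_g = sq.groupby('Key')      # hypothetical grouping column
```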
I imported a csv and one column was all null (np.nan). The column dtype was set to object. (That's one issue: why the large upcast container just to store np.nan?)
Without specifying a dictionary, using sum over the entire dataframe correctly uses the cython-optimized numpy sum:

```python
%timeit sq_g.agg(sum)
10 loops, best of 3: 48.3 ms per loop
```
Specifying a dictionary with a column that is of dtype object and consists entirely of np.nan rows falls back to Python (bad):

```python
%timeit sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
1 loops, best of 3: 7.26 s per loop
```
For reference (ColumnRef20 has floats, ColumnRef53 is entirely np.nan values):
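A sketch of that check (the dtypes follow from what is stated above):

```python
sq[['ColumnRef20', 'ColumnRef53']].dtypes
# ColumnRef20    float64
# ColumnRef53     object
```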
My workaround is to downcast these np.nan-filled columns back to float64; then the dictionary aggregation correctly uses the numpy-optimized functions and not Python:
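A sketch of the workaround, assuming a plain astype downcast:

```python
sq['ColumnRef53'] = sq['ColumnRef53'].astype(np.float64)
sq_g = sq.groupby('Key')  # re-group after the dtype fix; key name is an assumption
```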
Then the dictionary .agg() works as expected:

```python
%timeit sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
100 loops, best of 3: 6.2 ms per loop
```