
aggregating with a dictionary that specifies a column that has all nan's fails to use numpy #9149


Closed
flyingchipmunk opened this issue Dec 24, 2014 · 12 comments · Fixed by #30646
Labels: good first issue; Needs Tests (unit test(s) needed to prevent regressions)
Milestone: 1.0

Comments

@flyingchipmunk

As the subject says, if I call .agg() with a dictionary that specifies a column containing all np.nan values, it falls back to the Python agg functions instead of the numpy ones.

To reproduce: (my dataset is 60 cols, 100000 rows)

I imported a csv and one column was all null (np.nan). The column dtype was set to object. (That's one issue in itself: why use the large object container just to store np.nan?)

# sqFile and all_key_cols are defined earlier in my script
sq = pd.read_table(sqFile, sep='\t', skiprows=1, nrows=None, header=0)
sq_g = sq.groupby(all_key_cols, as_index=False, sort=False)

sq_g.agg(sum)
Without specifying a dictionary, using sum over the entire dataframe correctly uses the Cython-optimized numpy.sum:
10 loops, best of 3: 48.3 ms per loop

sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
Specifying a dictionary with a column of dtype object whose rows are entirely np.nan falls back to Python (bad):
1 loops, best of 3: 7.26 s per loop

For reference (ColumnRef20 holds floats, ColumnRef53 is entirely np.nan):

sq.dtypes
# Row               float64
Rowdesc.iphost       object
...
ColumnRef20         float64
ColumnRef53          object
...
dtype: object

My workaround is to downcast these np.nan-filled columns back to float64; then the dictionary aggregation correctly uses the numpy-optimized functions instead of falling back to Python:

# workaround for numpy groupby issue:
#  downcast columns with all NaN from object to float64 so agg() doesn't fallback to python.

# first find all columns whose rows are all np.nan
data_cols = [x for x in sq_concat.columns if x.startswith('Column')]
all_nan = pd.isnull(sq_concat[data_cols]).all()
all_nan_cols = all_nan[all_nan].index.tolist()

# only need to downcast if the dtype is object
obj_downcast = sq_concat[all_nan_cols].dtypes == object
obj_downcast_cols = obj_downcast[obj_downcast].index.tolist()

# downcast object to np.float64
for nan_col in obj_downcast_cols:
    sq_concat[nan_col] = sq_concat[nan_col].apply(np.float64)

Then the dictionary .agg() works as expected:
sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
100 loops, best of 3: 6.2 ms per loop
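
For reference, a more direct variant of the same downcast (a sketch, assuming the selected columns really are entirely NaN) casts each column in one shot instead of applying np.float64 element by element:

# equivalent sketch: astype casts the whole column at once
for nan_col in obj_downcast_cols:
    sq_concat[nan_col] = sq_concat[nan_col].astype(np.float64)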

@flyingchipmunk
Author

Forgot to specify, here are the versions in play in my setup:

pandas-0.14.1
numpy-1.8.2

Also to note: the structure of my dataframe is that my "data" columns are all prefixed with the string "Column"; that's why I'm doing the x.startswith('Column') above.

@jreback
Contributor

jreback commented Jan 2, 2015

Can you show a complete reproducible example that I can copy/paste?

@flyingchipmunk
Author

I did some more troubleshooting and found where the object dtype was being introduced. In my case I'm pd.concat()'ing a bunch of csv files, and one of them had 0 data rows, so its dataframe was created with columns of dtype object. When concatenating, that upcasts the float64 columns holding NaN to object. I can do due diligence to prevent this object upcasting.

But I still believe there is a flaw in how agg() works with and without a dictionary specified for these columns. Here's a sample csv and ipython notebook output showing the problem.

https://www.dropbox.com/s/1238l1m3g4zju4s/reproduce.csv?dl=0
https://www.dropbox.com/s/63k0i0yqxqkr4lh/ipython.html?dl=0

@jreback
Contributor

jreback commented Jan 2, 2015

@flyingchipmunk a small example that we can fully copy/paste is what is useful. Please keep the example as simple as possible while still reproducing the issue.

e.g.

# create the frame with the appropriate dtypes
df = DataFrame(....)

# perform the operations

# show what is expected

@flyingchipmunk
Author

Ok here it is simplified:

import collections
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"Row": [0, 1, 1], "EmptyCol": np.nan, "NumberCol": [1, 2, 3]})
df_2 = pd.DataFrame(columns=df_1.columns)
df_concat = pd.concat([df_1, df_2], axis=0)
df_g = df_concat.groupby('Row', as_index=False)

df_g.agg(sum)
df_g.agg(collections.OrderedDict({'EmptyCol': 'sum'}))

profiling the first .agg() call yields this:

1000 loops, best of 3: 888 µs per loop

profiling the second .agg() call yields this:

1000 loops, best of 3: 1.44 ms per loop

I would expect the second .agg() call with the dictionary to be significantly faster than doing a broad sum on the entire dataframe.

Hope this helps.

@jreback jreback added Groupby Performance Memory or execution speed performance labels Jan 2, 2015
@jreback
Contributor

jreback commented Jan 2, 2015

You have an object dtype on that empty column. Not sure how it got that way, but you always want base dtypes for performance. Converting to float makes the perf pretty close.

In [36]: df_concat.dtypes
Out[36]: 
EmptyCol      object
NumberCol    float64
Row          float64
dtype: object

In [37]: df2 = df_concat.copy()

In [38]: df2['EmptyCol'] = df2['EmptyCol'].astype(float)

In [39]: df2.dtypes
Out[39]: 
EmptyCol     float64
NumberCol    float64
Row          float64
dtype: object

In [40]: %timeit df2.groupby('Row',as_index=False).agg(sum)
1000 loops, best of 3: 1.14 ms per loop

In [41]: df_g2 = df2.groupby('Row',as_index=False)

In [42]: %timeit df_g2.agg(sum)
1000 loops, best of 3: 643 µs per loop

In [43]: %timeit df_g2.agg(collections.OrderedDict({'EmptyCol':'sum'}))
1000 loops, best of 3: 726 µs per loop

Closing as this is a usage question.

@flyingchipmunk
Author

Yes, I understand that, and I explained how/why the object dtype got introduced in my previous comments.

My issue here is that, under the hood, pandas treats .agg(sum) over the entire dataframe differently (correctly) from .agg(collections.OrderedDict({'EmptyCol':'sum'})).

If .agg(sum) can correctly use the numpy/Cython sum to execute the aggregation, then why can't the dictionary form? I still believe this is a flaw.

@jreback
Contributor

jreback commented Jan 2, 2015

@flyingchipmunk it's not clear why your original data ended up as object dtype.

It's also not performant to test, inside an operation, whether an object column can be converted.

You can use df.convert_objects() to try this if you wish.

In [52]: df_concat.dtypes
Out[52]: 
EmptyCol      object
NumberCol    float64
Row          float64
dtype: object

In [53]: df_concat.convert_objects()
Out[53]: 
   EmptyCol  NumberCol  Row
0       NaN          1    0
1       NaN          2    1
2       NaN          3    1

In [54]: df_concat.convert_objects().dtypes
Out[54]: 
EmptyCol     float64
NumberCol    float64
Row          float64
dtype: object

@flyingchipmunk
Author

Thanks for the convert_objects() tip, that's very useful and will make my workaround code much simpler.

The object dtype is what pandas assigned when it created an empty dataframe. In my specific case the source was a csv with only a header row and no data rows. The simplest way I could reproduce that was DataFrame(columns=df_1.columns).

Then, when that empty dataframe is concatenated, pandas has to pick the most general dtype that can hold the columns from the different dataframes. df_1.EmptyCol is dtype float64 because it is filled with np.nan; df_2.EmptyCol is dtype object. Thus, when calling df_concat = pd.concat([df_1, df_2], axis=0), the resulting dtype for df_concat.EmptyCol has to be object.
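
A minimal sketch of that dtype propagation, as observed on the pandas versions in this thread (later releases changed some of these promotion rules):

import numpy as np
import pandas as pd

df_a = pd.DataFrame({"EmptyCol": [np.nan, np.nan]})   # float64: NaN is a float
df_b = pd.DataFrame(columns=["EmptyCol"])             # empty frame: columns default to object
print(df_a["EmptyCol"].dtype)                         # float64
print(df_b["EmptyCol"].dtype)                         # object
print(pd.concat([df_a, df_b])["EmptyCol"].dtype)      # object: the most general dtype wins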

In my mind it comes down to this question: how does pandas determine that df_concat.EmptyCol can be aggregated with np.sum in the .agg(sum) code branch, but that the same column, when specified via .agg(collections.OrderedDict({'EmptyCol':'sum'})), cannot be?

@jreback
Contributor

jreback commented Jan 2, 2015

Hmm, on second thought this might be a bug in concat itself. It shouldn't upcast in this case, since it's concatenating with an empty frame. Might be an edge case.

Welcome a PR to look at that.

As far as how the aggregation works: the Cython routine first tries the entire frame to see if it can reduce it. In this case that works because numpy does the coercion. The second part might be a bug. If you could walk the code, that would be helpful.
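
One rough way to observe which path is taken (a hypothetical probe, assuming the df_g groupby object from the simplified example above is in scope) is to profile both calls and compare the number of Python-level function calls; the Cython path makes far fewer:

import cProfile

cProfile.run("df_g.agg(sum)")                    # cython path: few Python-level calls
cProfile.run("df_g.agg({'EmptyCol': 'sum'})")    # fallback path: many per-group Python calls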

@jreback jreback reopened this Jan 2, 2015
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions and removed Usage Question Performance Memory or execution speed performance labels Jan 2, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
@flyingchipmunk
Author

Ok, thanks for the explanation. I'll dig further into walking the code and see if I can nail it down.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jbrockmendel jbrockmendel added the Apply Apply, Aggregate, Transform, Map label Oct 17, 2019
@mroeschke
Member

Looks to be fixed on master. Could use a test.

In [61]: df_1 = pd.DataFrame({"Row":[0,1,1], "EmptyCol":np.nan, "NumberCol":[1,2,3]})
    ...: df_2 = pd.DataFrame(columns = df_1.columns)
    ...: df_concat = pd.concat([df_1, df_2], axis=0)
    ...: df_concat.dtypes
Out[61]:
Row          object
EmptyCol     object
NumberCol    object
dtype: object

In [62]: pd.__version__
Out[62]: '0.26.0.dev0+682.g08ab156eb'
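
A sketch of what such a regression test might look like (the test name and the exact expected frame are assumptions; the real test belongs in pandas' groupby test suite):

import collections
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

def test_agg_dict_all_nan_object_column():
    # GH#9149: dict-based agg on an all-NaN object column should
    # aggregate consistently with whole-frame agg
    df_1 = pd.DataFrame({"Row": [0, 1, 1], "EmptyCol": np.nan, "NumberCol": [1, 2, 3]})
    df_2 = pd.DataFrame(columns=df_1.columns)
    df_concat = pd.concat([df_1, df_2], axis=0)
    gb = df_concat.groupby("Row", as_index=False)

    result = gb.agg(collections.OrderedDict({"EmptyCol": "sum"}))
    expected = gb.agg(sum)[["Row", "EmptyCol"]]
    assert_frame_equal(result, expected)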

@mroeschke mroeschke removed Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Oct 26, 2019
@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Oct 26, 2019
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 3, 2020