
DataFrame.apply changes return type from Series to DataFrame if result is empty #3698

Closed
asmirnov69 opened this issue May 27, 2013 · 9 comments

Comments

@asmirnov69

import pandas as pd

pd.version.version ## returns '0.11.0'
x = pd.DataFrame({'a': range(10), 'b': range(10)})
type(x.apply(lambda x: x['a'] + x['b'], 1)) # <class 'pandas.core.series.Series'>
x['c'] = x.apply(lambda x: x['a'] + x['b'], 1) ## works

x = x[x['a'] < 0]
type(x.apply(lambda x: x['a'] + x['b'], 1)) # <class 'pandas.core.frame.DataFrame'>
x['c'] = x.apply(lambda x: x['a'] + x['b'], 1) ## FAILS

The code above breaks a quite common use of apply: creating a new DataFrame column from a function applied to each row. It fails for an empty DataFrame and works for a non-empty one.

@jreback
Contributor

jreback commented May 27, 2013

this is in conflict with #2476. The issue is how to determine whether a user-supplied lambda is a reduction or not. When you have an empty frame this check fails, so we return as if it's not a reduction

e.g. this will do what you want in this case:

x.apply(lambda x: x.get('a',0) + x.get('b',0), 1)
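To see why the two access styles behave differently, compare them on a Series that lacks the expected labels. This is a minimal sketch, assuming (as the comment above implies, not as a confirmed account of pandas internals) that apply probes the function with an empty Series when the frame has no rows; `probe` is a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for the empty Series seen when the frame has no rows.
probe = pd.Series([], dtype="float64")

# Label-based [] lookup raises on a missing label ...
try:
    probe["a"] + probe["b"]
    raised = False
except KeyError:
    raised = True

# ... whereas Series.get falls back to the supplied default.
total = probe.get("a", 0) + probe.get("b", 0)

print(raised, total)
```

With `[]` the function cannot be evaluated on the empty probe at all, while with `get` it returns a scalar, which is what lets it be recognised as a reduction.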

@asmirnov69
Author

I didn't realize that x in the lambda expression passed to apply is a Series. I will use your way of accessing row values from now on.

My only suggestion would be to introduce a new type to represent the row instead of using Series.
In a row-handling context the [] operator is misleading; IMHO, without it, access to row elements would be less confusing for users not familiar with the pandas code base.

@jreback
Contributor

jreback commented May 28, 2013

Actually, I think your suggestion is more confusing :). When you use axis=1 (that is the 1 in your expression), you are by definition asking for each row of the frame as a Series. You are then free to do what you want in your function; however, sometimes it is best to forgo the custom function,

e.g. (the second form, In [17], is much faster)

In [16]: x.apply(lambda z: z['a'] + z['b'],1)
Out[16]: 
0     0
1     2
2     4
3     6
4     8
5    10
6    12
7    14
8    16
9    18
dtype: int64

In [17]: x[['a','b']].sum(1) 
Out[17]: 
0     0
1     2
2     4
3     6
4     8
5    10
6    12
7    14
8    16
9    18
dtype: int64
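The vectorized form also sidesteps the empty-frame problem from the original report, because no user function has to be inspected. A minimal sketch using the same frame as above; the names `empty`, `by_sum`, `by_add`, and `full` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10)})
empty = df[df["a"] < 0]  # no rows match, but both columns survive

# Both vectorized forms return a Series, even with zero rows.
by_sum = empty[["a", "b"]].sum(axis=1)
by_add = empty["a"] + empty["b"]

# On the full frame they give the same row-wise totals as the apply call.
full = df[["a", "b"]].sum(axis=1)

print(type(by_sum).__name__, type(by_add).__name__, full.tolist())
```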

@asmirnov69
Author

After reading more of the pandas method docs and examples, I retract my suggestion about introducing any changes into DataFrame.apply(). Also, in light of my newly acquired knowledge, your workaround for my original problem, using Series.get with a default of 0, will not always work: there are situations where there is no convenient default. It may also mask a typo in a column name, which would be hard to track down.
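The typo concern can be illustrated with a single row. This is a small sketch; the row values and the misspelled label "bb" are made up for the example.

```python
import pandas as pd

row = pd.Series({"a": 1, "b": 2})

# Intended: row["a"] + row["b"]; "bb" is a deliberate typo.
# Series.get silently substitutes the default, so the sum is wrong but no error is raised.
masked = row.get("a", 0) + row.get("bb", 0)

# The [] operator surfaces the same typo immediately as a KeyError.
try:
    row["a"] + row["bb"]
    typo_detected = False
except KeyError:
    typo_detected = True

print(masked, typo_detected)
```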

One more illustration of the same problem is below. As you can see, the return type of apply changes depending on the condition applied to the original DataFrame. Does it look like a bug? The workaround in my case is to use a wrapper around the apply call, which I am going to do in my code.

>>> import pandas as pd
>>> pd.version.version
'0.11.0'
>>> df = pd.DataFrame({'a': range(10), 'b': range(10)})
>>> type(df[df['a']>0].apply(lambda x: x['a'] + x['b'], 1))
<class 'pandas.core.series.Series'>
>>> type(df[df['a']<0].apply(lambda x: x['a'] + x['b'], 1))
<class 'pandas.core.frame.DataFrame'>
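A wrapper of the kind mentioned above could look roughly like this. It is only a sketch of one possible approach, not the code actually used by the poster: `apply_rows` and its `dtype` parameter are hypothetical, and the choice of float64 for the empty result is an assumption.

```python
import pandas as pd

def apply_rows(df, func, dtype="float64"):
    """Hypothetical helper: apply func to each row and always return a Series."""
    if len(df.index) == 0:
        # Skip apply entirely so the result type does not depend on
        # how pandas treats empty frames.
        return pd.Series([], index=df.index, dtype=dtype)
    return df.apply(func, axis=1)

df = pd.DataFrame({"a": range(10), "b": range(10)})
non_empty = apply_rows(df[df["a"] > 0], lambda x: x["a"] + x["b"])
empty = apply_rows(df[df["a"] < 0], lambda x: x["a"] + x["b"])

print(type(non_empty).__name__, type(empty).__name__)
```

Because the empty case never reaches apply, the result can be assigned to a new column in both situations.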

@jreback
Contributor

jreback commented May 28, 2013

you can do a lot of things inside the apply, e.g. lambda x: x.get('a',np.nan) + x.get('b',np.nan), but logic applies (pun intended).

This is probably not the best way to do it, for the reason you indicate. The example you gave is not a bug but defined behavior.

@asmirnov69
Author

So it is a feature, not a bug :)
Well, I think it is fine, provided it is at least documented in the 'Caveats and Gotchas' section.
Some Stack Overflow readers might also find it helpful: http://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe

Thanks for the discussion, I've learned a bit more about pandas -- I hope never to go back to R again.

@jreback
Contributor

jreback commented May 28, 2013

yes, it's a bit undefined what you do with the empties

would always welcome a doc PR!

@jreback
Contributor

jreback commented Jun 4, 2013

going to move this to 0.12 for a doc update in groupby, as @asmirnov69 suggests above

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 26, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback
Contributor

jreback commented Feb 17, 2016

the reduce kw can be specified to deal with this
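For readers on later pandas releases: the reduce keyword was subsequently deprecated in favour of result_type. The sketch below assumes such a release and uses result_type="reduce" as the closest equivalent; exact behaviour on empty frames may differ between versions, so treat it as an illustration rather than a guarantee.

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10)})
empty = df[df["a"] < 0]

# Assumption: a pandas version where apply accepts result_type.
# Declaring the function a reduction up front means pandas does not
# have to guess from an empty frame.
result = empty.apply(lambda x: x["a"] + x["b"], axis=1, result_type="reduce")

print(type(result).__name__, len(result))
```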

@jreback jreback closed this as completed Feb 17, 2016