
Confusing behaviour of df.empty #12393


Closed
phil20686 opened this issue Feb 19, 2016 · 16 comments
Labels
Docs · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone
0.18.1

Comments

@phil20686

This is as much a documentation issue as anything else. Basically, it seems confusing that df.empty != df.dropna().empty, i.e. that a dataframe consisting entirely of NA is not treated as empty. Obviously this is a bit of an edge case, but it caused a bunch of failures for me when used with the pd.read_sql methods, as database tables will often have columns that are not available for particular entities, and so can return an entire series of NA.

It seems to me that in all cases df.empty should be the same as df.dropna().empty. I understand that opinions might differ on this point, but at the very least the behaviour should be clearly documented.

@jreback
Contributor

jreback commented Feb 19, 2016

Can you show an example? The following works. I agree we could certainly update the documentation (with some examples and such):

In [8]: df = DataFrame({'A' : [np.nan]})

In [9]: df.empty
Out[9]: False

In [10]: df.dropna().empty
Out[10]: True

@jreback added the Docs, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Difficulty Novice labels on Feb 19, 2016
@jreback added this to the 0.18.1 milestone on Feb 19, 2016
@phil20686
Author

That is exactly the behavior I was questioning; I think Out[9] should be True. It seems to me that a dataframe containing nothing but NA cells is "empty" according to most definitions.

Phil

@phil20686
Author

I certainly expected that df.empty would be True if a dataframe contained nothing but NA cells.

@jreback
Contributor

jreback commented Feb 19, 2016

No, that is not the normal definition of empty, which is 0-len. Nulls are real values that act as placeholders. The key here is that you actually have a valid index. Changing this would make the definition dependent on the data itself, which is not a good thing. Welcome to have a doc update with some examples though.
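
For illustration, a minimal sketch of that reading:

import numpy as np
import pandas as pd

pd.DataFrame().empty                 # True  -- no rows and no columns
pd.DataFrame({'A': [np.nan]}).empty  # False -- one row exists; the NaN is a placeholder value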

@phil20686
Author

Well, it's certainly not the case that df.empty is the same as len(df.index) == 0, e.g.

df = pd.DataFrame([], index=[0,1,2])
print(df.empty)            # True
print(len(df.index) == 0)  # False

Also

df = pd.DataFrame([], index=[0,1,2], columns=['A','B'])
df.empty #False

@phil20686
Author

So not only do I dispute that len == 0 is the right semantic definition of empty, it doesn't appear to be the implementation anyway.

@phil20686
Author

Also stuff like:

df = pd.DataFrame([], index=[0,1,2], columns=['A','B'])
df.set_value(1, 'A', 17)
df['B'].empty # False

which just seems plain wrong. If I have a column that I have never added any data to, it should not return False when asked if it's empty. I guess under the hood, when you specify both an index and columns, it autofills the dataframe somehow, and that results in this behaviour, but the definition of empty should really play nicely with the default dataframe constructor in these examples, imo. Otherwise it's just confusing.
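
For reference, a minimal sketch of the fill that seems to be happening (written without the data argument, purely illustrative):

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['A', 'B'])
df.shape       # (3, 2) -- the frame already holds 3x2 NaN placeholders
df['B'].empty  # False  -- the column has length 3, even though it is all NaN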

@jreback
Contributor

jreback commented Feb 19, 2016

It's very simple: it's empty only if all axes are len 0.

@kawochen
Contributor

It sounds like you're looking for some other collection of Series. A DataFrame is a tabular collection, and it makes sense to look at the shape.

@jorisvandenbossche
Member

@phil20686 You can e.g. use:

In [15]: df['B'].isnull().all()
Out[15]: True
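
For a whole frame, the same idea extends like this (a minimal sketch, not from the session above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 17], 'B': [np.nan, np.nan]})

df['B'].isnull().all()   # True  -- column B holds no real data
df.isnull().all().all()  # False -- column A does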

@phil20686
Author

@jreback my examples show that that is not the implemented behavior:

df = pd.DataFrame([], index=[0,1,2])
print(df.empty)     # True
sum(df.shape) == 0  # False
df = pd.DataFrame([], columns=['A','B'])
print(df.empty)     # True
sum(df.shape) == 0  # False

I really find it super weird that pre-allocation should result in empty=False, e.g. if you concatenate an empty series with a non-empty dataframe, it will get preallocated, and then extracting it means empty has changed from True to False. This seems very strange to me: in some abstract sense series C is the same object, but merely moving it around has changed its properties.

df = pd.DataFrame([], index=[0,1,2], columns=['A','B'])
df.set_value(1, 'A', 17)
series = pd.Series(name="C")
print(series.empty)
df2 = pd.concat([df, series], axis=1)
print(df2["C"].empty)

Anyway, my main point is that this behavior should be documented, because it's quite counter-intuitive; I don't want to argue about definitions of empty.

@jreback
Contributor

jreback commented Feb 19, 2016

@phil20686 every one of those results is correct; what exactly is counter-intuitive here?

There isn't any 'pre-allocation' at all. You have indices. If any of the indices has length 0 (whether 1-dim or not), then you are empty; otherwise you are not.

What exactly are you using .empty for? To be honest, I have rarely needed it. Most operations in pandas just work regardless of whether things are empty or not. I think the docs are fairly clear on this.

@phil20686
Author

Um. I had a series with .empty == True, I concatenated it with a dataframe, then I extracted the series, and magically .empty == False? Even though at no point has the user added data to it?

Similarly, you can create a dataframe with either an index or columns but no values and it's "empty", but if it has both and no values it's "non-empty".

You don't think that is counter-intuitive behavior?

Anyway, I think most people would assume that empty == contains no data. That clearly isn't the case, as it looks like it's the same as

not all(df.shape)
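
As a quick illustrative check (not from the thread's session), that equivalence holds for the constructor examples above:

import pandas as pd

for frame in (
    pd.DataFrame(),                                     # no index, no columns
    pd.DataFrame([], index=[0, 1, 2]),                  # index only
    pd.DataFrame([], columns=['A', 'B']),               # columns only
    pd.DataFrame(index=[0, 1, 2], columns=['A', 'B']),  # both -> filled with NaN
):
    assert frame.empty == (not all(frame.shape))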

Anyway, my main point was that the documentation of DataFrame.empty should note these behaviors.

@jreback
Contributor

jreback commented Feb 19, 2016

@phil20686 one of the highlights of pandas is that it aligns data. When you put in the series it was empty; however, the concat realigned the Series to the other values in the DataFrame:

In [19]: df2
Out[19]: 
     A    B   C
0  NaN  NaN NaN
1   17  NaN NaN
2  NaN  NaN NaN

In [20]: df2.columns
Out[20]: Index([u'A', u'B', u'C'], dtype='object')

In [21]: df2['C']
Out[21]: 
0   NaN
1   NaN
2   NaN
Name: C, dtype: float64

Then [21] is clearly NOT empty; yes, it is all null. Checking for all-null is a MUCH more common operation. .empty is a very blunt instrument and not really used much in practice for this very reason. It is correct as far as it goes.
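
Putting the two notions side by side for that column (continuing from df2 above):

df2['C'].empty           # False -- the column has three rows
df2['C'].isnull().all()  # True  -- every one of them is NaN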

@jorisvandenbossche
Member

@phil20686 In trying to clear some things up, I think we have to make a distinction between two points:

  • The definition of empty is clear**: it returns True if any of the axes has length 0 (you could see it as "len(index) x len(columns) == 0" for a dataframe). It is by definition not about having all NaNs or not. It is quite possible you find this not the best definition, but taking this definition as a starting point, all the return values in the examples you showed are consistent and as expected.
  • What is maybe more surprising in some cases, leading to the unintuitive behaviour you described, is the way pandas fills Series/DataFrames with NaNs (and once filled with NaNs, it is not empty anymore). Pandas will fill a DataFrame with NaNs once it has both an index and columns. And there are indeed operations (e.g. concat) where index/columns can be added, leading to filling with NaNs.

I just want to point out that I think the confusion you are running into has a different root cause than the empty method itself.
The current empty method is just not the method you are looking for, I think. It would be more something like allnan or allnull, which you can obtain with isnull().all().

** I don't mean it is 'clear' in the docs; I mean it is clear in the implementation.

But indeed, the docs of empty can certainly point that out. Do you want to do a PR to specify that this is not about NaNs?
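
As a rough sketch of the kind of check this thread is really after (the effectively_empty name is made up here, not a pandas API):

import pandas as pd

def effectively_empty(obj):
    # Hypothetical helper, not part of pandas: True if a Series/DataFrame has a
    # zero-length axis or contains nothing but missing values.
    if obj.empty:
        return True
    if isinstance(obj, pd.DataFrame):
        return bool(obj.isnull().all().all())
    return bool(obj.isnull().all())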

@masongallo
Contributor

I agree that the docs of empty could point this out (I actually had a student just ask me about this). Since it's been a few days, I can do a quick PR.
