
DataFrame.loc[] returns inconsistent types depending on row count #11224

Closed
jerryatmda opened this issue Oct 2, 2015 · 9 comments
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Usage Question

Comments

@jerryatmda

If a DataFrame has a single row for a given index label, .loc[] returns a Series for that label. If it
has two or more rows for that label, .loc[] returns a DataFrame. I believe that it should return a
DataFrame in either case for consistency.

An image is attached. I tried to attach a small dataframe and a notebook exhibiting the problem, but
GitHub rejects both, even with a .txt suffix.
So I'm pasting the text fragment after the image.

[Image: pandasinconsistenttypes]

import pandas as pd

txt = '''
Locus,Decision,Group,Var,Region,Gene,Rows,Mutation,Profile
chr01:0018961727,Homopolymer,VS,CA,exonic,PAX7,1.0,synonymous SNV,000000010010001000000000001101
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,000000000000000000001000100000
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,000000000000000000011001110100
chr01:0027057772,Bad,IR-PM-VS,CA,exonic,ARID1A,1.0,nonsynonymous SNV,100000000001010000010001110110
'''
# Split the text into rows, dropping the blank first and last lines
rows = [line.split(',') for line in txt.strip().split('\n')]
# Use the first row as column names and the Locus column as the index
tdf = pd.DataFrame.from_records(rows[1:], columns=rows[0], index='Locus')
tdf
# The first label is unique in the index, the second appears three times
type(tdf.loc['chr01:0018961727']), type(tdf.loc['chr01:0027057772'])

@jerryatmda
Author

Sorry, the [0:1] in the image might mislead (I was trying to return exactly the first matching line of the group), but note that the call in the text is simply for the .loc[]

A second thing I forgot to mention was tying this to this issue, because I think they may be related:
#5839

I say that because I first encountered the problem above in the context of .groupby() where a group of one row is a Series, and a group of two or more is a DataFrame.

Oh, and because I couldn't add the notebook, I forgot to mention this is pandas 0.16.2 with Python 2.7.
[Image: pandasinconsistenttypesversions]

@shoyer
Member

shoyer commented Oct 2, 2015

I agree this might be a good idea, but it would certainly be a major break in the API. So I think it's unlikely to be feasible for pandas.

@TomAugspurger added the Indexing (Related to indexing on series/frames, not to indexes themselves) label Oct 2, 2015
@TomAugspurger
Contributor

Agreed that this would be too disruptive a change.

@jerryatmda
Author

It certainly warrants a modification to
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc
describing the different return types.
Right now that page does not discuss return values at all.

@jreback
Contributor

jreback commented Oct 2, 2015

@jerryatmda

This is solely due to the fact that you have duplicates in the index.

With a unique index, you will always get the same type back for a single label.

So a doc note is fine, but this is actually a very rare case.
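
A minimal sketch of the point above, using made-up toy frames (the names `unique`, `dupes`, and the labels "a"/"b" are illustrative assumptions, not the data from the report). It compares the type returned by a single-label `.loc[]` lookup on a unique index and on an index with a repeated label:

```python
import pandas as pd

# Hypothetical toy data, not the frame from the original report
unique = pd.DataFrame({"gene": ["PAX7", "ARID1A"]}, index=["a", "b"])
dupes = pd.DataFrame({"gene": ["PAX7", "ARID1A", "ARID1A"]}, index=["a", "b", "b"])

# Unique index: every single-label lookup yields one row, returned as a Series
unique_types = [type(unique.loc[label]).__name__ for label in unique.index]

# Index with duplicates: the type depends on how many rows share the label
dupe_types = {label: type(dupes.loc[label]).__name__ for label in ["a", "b"]}

print(unique_types)           # ['Series', 'Series']
print(dupe_types)             # {'a': 'Series', 'b': 'DataFrame'}
print(dupes.index.is_unique)  # False
```

Checking `index.is_unique` before relying on the return type is one way to detect this situation up front.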

@jreback
Contributor

jreback commented Oct 2, 2015

It is much simpler to use a guaranteed syntax, i.e. one whose return type does not depend on the data,

e.g.

df.loc[[....]], which will always return a frame.

If you would like to add a note about selecting from an index with duplicates (and how to use the guaranteed syntax), that would be fine.

@jerryatmda
Author

OK, that works, but I guess I don't understand "guaranteed syntax."
I just searched it in the docs, and came up with a single reference to the word "syntax."
This is pretty clearly a pandas term of art that has somehow escaped documentation in the manual thus far.
Since I don't know what it means, I am not the person to write that, sorry.

@TomAugspurger
Contributor

Jeff meant that passing in a list as an indexer will always return a DataFrame. So in your case it's tdf.loc[['chr01:0018961727']] (notice the two sets of square brackets).
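
As a rough illustration of the list-indexer form, here is a sketch on a made-up frame (the name `df` and the labels "a"/"b" are assumptions for the example, not the reporter's data). Wrapping the label in a list keeps the result a DataFrame whether one row or several rows match:

```python
import pandas as pd

# Hypothetical toy frame with one unique label and one repeated label
df = pd.DataFrame({"gene": ["PAX7", "ARID1A", "ARID1A"]}, index=["a", "b", "b"])

one = df.loc[["a"]]   # list indexer, one matching row
many = df.loc[["b"]]  # list indexer, two matching rows

print(type(one).__name__, one.shape)    # DataFrame (1, 1)
print(type(many).__name__, many.shape)  # DataFrame (2, 1)

# Taking only the first matching row still yields a DataFrame
first_match = df.loc[["b"]].iloc[:1]
print(type(first_match).__name__, first_match.shape)  # DataFrame (1, 1)
```

Because the result is always a DataFrame, downstream code can treat the one-row and many-row cases identically.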

@jerryatmda
Author

Oh, and yes, I agree, it's due to the duplicates in the index, and sure, the wise DBA normalizes his data to 4th normal form -- unless he wants to use it.

Thanks to all for your help. I hope to propose a note for the docs, but I really don't know where to start.

Thanks again,
Jerry


4 participants