BUG: Rowwise subset of a DataFrame based on index using .loc #7690

Closed
Gitman-code opened this issue Jul 8, 2014 · 4 comments
Labels
Duplicate Report Duplicate issue or pull request Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@Gitman-code

https://stackoverflow.com/questions/24536734/rowwise-subset-of-a-dataframe-based-on-index

This should act like an inner join, but instead it adds NULL rows to the table being operated on. At the very least, it should raise an error when an index label is not found.

@jreback
Contributor

jreback commented Jul 8, 2014

dupe here: #2033

This was an original API decision to make .loc/.ix act like a reindex when presented with a slice or an index-like, e.g.

df.loc['a':'f'] only requires that the end-points be included, so
df.loc[list(....)] only requires that at least 1 point be included.

You can simply use df.loc[list(...)].dropna() if you are doing an isin type of operation (or you can do some sort of join if you want).
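To make the reindex-like behavior concrete, here is a minimal sketch. The frame and labels are made up for illustration, and it calls `reindex` directly rather than `.loc`, because list-based `.loc` with missing labels raises a `KeyError` in later pandas releases.

```python
import pandas as pd

# Illustrative data; "z" is deliberately absent from the index.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
labels = ["a", "c", "z"]

# Reindex-style selection: every requested label gets a row,
# and labels missing from df.index are filled with NaN.
reindexed = df.reindex(labels)

# Dropping the NaN rows gives an isin-style result
# (only safe when the data has no genuine NaN values).
via_dropna = reindexed.dropna()

# A boolean mask keeps only existing rows and never adds any.
via_isin = df[df.index.isin(labels)]
```

Here `via_dropna` and `via_isin` hold the same two rows, but only because `df` contains no NaN values of its own.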

Further, requiring every label to be present would completely break a very common operation (I suppose it could work with special casing, maybe):

In [3]: df = pd.DataFrame(np.arange(6),index=pd.MultiIndex.from_product([['a','b'],range(3)]))

In [4]: df
Out[4]: 
     0
a 0  0
  1  1
  2  2
b 0  3
  1  4
  2  5

In [6]: df.loc[['a']]
Out[6]: 
     0
a 0  0
  1  1
  2  2

In [8]: df.index.values
Out[8]: array([('a', 0), ('a', 1), ('a', 2), ('b', 0), ('b', 1), ('b', 2)], dtype=object)

If you would like to enhance the docs, that would be fine.

Closing as a dupe.

@jreback jreback closed this as completed Jul 8, 2014
@jreback jreback modified the milestone: 0.15.0 Jul 8, 2014
@jreback
Contributor

jreback commented Jul 8, 2014

As a side note, it's possible that something like:

df.loc(strict=True)[.....] could be done; that could set a 'strict' mode (and possibly have an option for this).
Not sure whether this is useful enough to be worth the trouble, though.
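The strict mode floated here does not exist as a `.loc` argument. As a rough sketch of what such a check could look like in user code, the helper below (`loc_strict` is a hypothetical name, not a pandas API) raises when any requested label is absent. It assumes a flat, unique index and would not handle the partial MultiIndex selection shown earlier.

```python
import pandas as pd

def loc_strict(df, labels):
    """Hypothetical helper: select rows by label, raising if any label is absent."""
    labels = pd.Index(labels)
    # Labels requested but not present in the frame's index.
    missing = labels.difference(df.index)
    if len(missing) > 0:
        raise KeyError(f"labels not found in index: {list(missing)}")
    return df.loc[labels]

# Illustrative data.
df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

ok = loc_strict(df, ["a", "c"])  # all labels present

try:
    loc_strict(df, ["a", "z"])   # "z" is missing
    raised = False
except KeyError:
    raised = True
```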

@Gitman-code
Author

I would be strongly in favor of such an implementation; however, I do not know how much effort it would take, so whether it is worth it is unknown. I would think that this is basic functionality, and I can't think of a case where the current behavior would be useful. Perhaps an option with set theory ('Union', 'Intersection') or SQL ('Inner', 'Outer') semantics would be clearer than 'strict'.

The current workaround, Smalldf = Smalldf[Smalldf.index.isin(Largedf.index)], as given in the Stack Overflow answer, is more cumbersome and possibly slower than Smalldf = Smalldf.loc[Largedf.index]

Also, Smalldf = Smalldf.loc[Largedf.index].dropna() does not work because I often have meaningful NULL values.
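A small sketch of why `dropna()` is lossy when NULLs are meaningful, next to two selections that keep them. The data is invented for illustration, the lowercase names `small` and `large` stand in for Smalldf and Largedf, and `reindex` is used to reproduce the reindex-like behavior discussed in this thread.

```python
import numpy as np
import pandas as pd

# Row "b" carries a meaningful NaN and exists in both frames.
small = pd.DataFrame({"v": [1.0, np.nan, 3.0]}, index=["a", "b", "c"])
large = pd.DataFrame({"w": [10, 20, 30]}, index=["b", "c", "d"])

# Reindex then dropna: removes the padded row "d" but also the real row "b".
lossy = small.reindex(large.index).dropna()

# Boolean mask on the index: keeps "b" with its NaN intact.
kept_mask = small[small.index.isin(large.index)]

# Alternative: intersect the indexes first, then select by label.
common = small.index.intersection(large.index)
kept_loc = small.loc[common]
```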

@jreback
Contributor

jreback commented Jul 8, 2014

This is an API issue and, as I stated, this has long been the behavior.

This is the use case for isin. The point is that you may be selecting values, but you don't need a KeyError just because, say, a small number of values are missing. Pandas always wants to align to the index.

You have many options on what to do.

You are effectively doing a join, so you should explore that as well.
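For the join route, a minimal sketch with made-up frames: an inner join on the index keeps only labels present in both frames and pads nothing.

```python
import pandas as pd

# Illustrative data; the frames share labels "b" and "c".
small = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
large = pd.DataFrame({"w": [10, 20, 30]}, index=["b", "c", "d"])

# Inner join on the index: rows whose label appears in both frames.
inner = small.join(large, how="inner")

# Keep only the original columns of `small` if the extra ones are not wanted.
small_only = inner[small.columns]
```

Unlike the mask-based selection, the join also brings along the columns of the other frame, which is why the last line trims them off again.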
