
ENH: Improve Filter function with Filter_Columns and Filter_Rows #55289


Open
1 of 3 tasks
speed650 opened this issue Sep 25, 2023 · 12 comments
Assignees
Labels
Enhancement Filters e.g. head, tail, nth Needs Discussion Requires discussion from core team before further action

Comments

@speed650

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Using the [ ] syntax to apply filters to a dataframe can get messy and complicated. The current filter() function is also confusing to use. I propose adding dedicated functions to quickly filter columns and rows, to make pandas easier to use.

Feature Description

I propose adding two new functions:

Filter_Columns(), Filter_Rows()

def Filter_Columns(columns: list, inverse: bool, inplace: bool)
def Filter_Rows(rows: list, inverse: bool, inplace: bool)

Usage:

Filter_Columns( ['Names', 'Ages' ], inverse=False, inplace=True)

Shows the columns for name and age. inverse is used to hide those columns and show the other columns instead.

Filter_Rows( [ ('name' == 'bob'), ('age' > 20) ], inverse=False, inplace=True)

Shows the rows where the value in the name column equals bob and the value in the age column is greater than 20.

Chained

Filter_Columns( ['Names', 'Ages' ], inverse=False, inplace=True).Filter_Rows( [ ('name' == 'bob'), ('age' > 20) ], inverse=False, inplace=True)
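
For comparison, the chained example can be written with indexing that pandas already provides. This is only a minimal sketch: the frame, its values, and the extra City column are made up for illustration and are not part of the proposal.

```python
import pandas as pd

# Illustrative data; not part of the original proposal.
df = pd.DataFrame(
    {
        "Names": ["bob", "alice", "bob"],
        "Ages": [25, 30, 18],
        "City": ["Oslo", "Lima", "Rome"],
    }
)

# Column filter: keep only Names and Ages.
subset = df[["Names", "Ages"]]

# "Inverse" column filter: keep everything except Names and Ages.
others = df.drop(columns=["Names", "Ages"])

# Row filter: Names equals "bob" and Ages is greater than 20.
mask = (subset["Names"] == "bob") & (subset["Ages"] > 20)
result = subset[mask]
```

Each condition needs its own parentheses because & binds more tightly than == and > in Python.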

Alternative Solutions

Use the [ ] syntax, which is more confusing.

Additional Context

No response

@speed650 speed650 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 25, 2023
@jbrockmendel
Member

I’d be on board with deprecating filter so we can change it to a more standard row-filter

@speed650
Author

speed650 commented Sep 25, 2023 via email

@Cappuchinoo

take

@phofl
Member

phofl commented Oct 21, 2023

cc @pandas-dev/pandas-core

@rhshadrach
Member

rhshadrach commented Oct 22, 2023

Most pandas methods offer an axis argument. I'm not a fan of adding fragmentation to the API by sometimes having axis and sometimes having two methods (*_rows and *_columns). I also prefer fewer methods with more arguments as opposed to more methods with fewer arguments, especially where there aren't arguments that go unused in certain cases. I can be on board if we want to move away from the axis argument to having two methods, but I think it should be across the API rather than just for some methods.

pyspark calls our current method select and uses filter for conditions (similar to what's proposed here). It looks like polars is similar. I like this terminology.

In 2.x, I propose we:

  1. Alias filter to select
  2. Add filter_cond (or something similar) - in my opinion this does not need, but can have, an axis argument
  3. Deprecate filter

In the future, we then have the option to:

  1. Alias filter_cond to filter
  2. Deprecate filter_cond

For the filter_cond method, it should only accept Boolean conditions (e.g. not lists of labels), does not need an inverse argument (negation is easy enough), and should not have an inplace argument (it can't be done inplace).
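
A rough sketch of what such a method might look like, written as a standalone function. filter_cond does not exist in pandas; the name comes from the comment above, and the signature, the callable support, and the error handling here are assumptions made purely for illustration.

```python
import pandas as pd


def filter_cond(df, cond):
    """Hypothetical helper: keep the rows where a Boolean condition holds.

    `cond` is a Boolean Series aligned with `df`, or a callable that takes
    the frame and returns one. A new object is always returned.
    """
    mask = cond(df) if callable(cond) else cond
    # Only plain Boolean Series are accepted, so label lists are rejected.
    if not isinstance(mask, pd.Series) or mask.dtype != bool:
        raise TypeError("cond must evaluate to a Boolean Series")
    return df.loc[mask]


# Illustrative data.
df = pd.DataFrame({"name": ["bob", "alice", "bob"], "age": [25, 30, 18]})

kept = filter_cond(df, lambda d: (d["name"] == "bob") & (d["age"] > 20))

# Negating the mask with ~ takes the place of an `inverse` argument.
dropped = filter_cond(df, lambda d: ~((d["name"] == "bob") & (d["age"] > 20)))
```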

@jreback
Contributor

jreback commented Oct 22, 2023

@rhshadrach there have been a number of issues to do somewhat similar

not averse but it would likely need a dedicated issue (pls link the original) and comprehensive schedules for this

@rhshadrach rhshadrach added Filters e.g. head, tail, nth Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2023
@Dr-Irv
Contributor

Dr-Irv commented Oct 24, 2023

I did a PR that didn't go anywhere about having a filter for Index that would create some better syntax for "filtering" on an index, IMHO: #51370 That might cover some of the use cases here.

OP wrote:

The current Filter() function is also confusing to use. I Propose adding dedicated functions to quickly filter out columns and rows to make Pandas easier to use.

I guess the only benefit of this proposal is to allow a list of conditions to be applied.

So why not just add an argument to DataFrame.filter() that allows that list to be specified, and avoid filter_cond ?

@bashtage
Contributor

So why not just add an argument to DataFrame.filter() that allows that list to be specified, and avoid filter_cond ?

Most pandas methods offer an axis argument. I'm not a fan of adding fragmentation to the API by sometimes having axis and sometimes having two methods (*_rows and *_columns).

I fully agree with these two. DataFrame.filter already accepts axis. It seems the only suggestion here is to add an inverse argument which would make filter drop?

I'm not sure I see where the value add of this proposal lies?

Unless I am missing something, these are pretty easy to do.

def Filter_Columns(columns, inverse=False) -> df.filter(columns)
def Filter_Rows(index, inverse=False) -> df.filter(index, axis=0)
def Filter_Columns(columns, inverse=True) -> df.drop(columns, axis=1)
def Filter_Rows(index, inverse=True) -> df.drop(index, axis=0)
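
As a runnable illustration of this mapping, here is a small sketch using the existing DataFrame.filter and DataFrame.drop methods. The frame and its labels are made up for the example.

```python
import pandas as pd

# Illustrative data with explicit row labels.
df = pd.DataFrame(
    {"Names": ["bob", "alice"], "Ages": [25, 30], "City": ["Oslo", "Lima"]},
    index=["r1", "r2"],
)

# Keep the listed columns.
keep_cols = df.filter(items=["Names", "Ages"])

# Keep the listed row labels.
keep_rows = df.filter(items=["r1"], axis=0)

# "Inverse" on columns: drop the listed columns.
drop_cols = df.drop(columns=["Names", "Ages"])

# "Inverse" on rows: drop the listed row labels.
drop_rows = df.drop(index=["r1"])
```

Note that drop works on the row index by default, so dropping columns needs columns=... or axis=1.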

@rhshadrach
Member

rhshadrach commented Oct 24, 2023

@bashtage

I fully agree with these two. DataFrame.filter already accepts axis. It seems the only suggestion here is to add an inverse argument which would make filter drop?

I don't think this is accurate. The OP is also asking to be able to filter based on conditions.

@Dr-Irv

So why not just add an argument to DataFrame.filter() that allows that list to be specified, and avoid filter_cond ?

A few reasons:

  1. The like and regex arguments don't make sense when filtering by condition. I think having arguments that don't make sense in the presence of values of other arguments is not good API design.
  2. I think it is more common one would filter by labels with columns, and filter by conditions with rows. If this is the case, there isn't a good default for axis if it were all one method.
  3. R, pyspark, and polars all use select for "filter by label" and filter for "filter by condition".

@Dr-Irv
Contributor

Dr-Irv commented Oct 24, 2023

A few reasons:

  1. The like and regex arguments don't make sense when filtering by condition. I think having arguments that don't make sense in the presence of values of other arguments is not good API design.

I agree on the design part, but at least for filter today, we already have that items, like and regex are enforced to be mutually exclusive, so adding a condition would be consistent with that (admittedly - not a great design to begin with).
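
To illustrate the mutual exclusivity mentioned here, a minimal sketch with made-up column names: each of like and regex works on its own, while passing both together raises a TypeError.

```python
import pandas as pd

# Illustrative frame; the column names are made up.
df = pd.DataFrame({"age_min": [1], "age_max": [2], "name": ["bob"]})

# Each keyword works on its own (the default axis for a DataFrame is columns).
by_like = df.filter(like="age")
by_regex = df.filter(regex=r"^age_")

# Combining them is rejected.
try:
    df.filter(like="age", regex=r"^age_")
except TypeError as exc:
    error = exc
else:
    error = None
```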

  2. I think it is more common one would filter by labels with columns, and filter by conditions with rows. If this is the case, there isn't a good default for axis if it were all one method.

I have had use cases where filtering by condition on column names would be useful, unless you want a complex regex. Also, to filter rows by condition, you can just use query()

  3. R, pyspark, and polars all use select for "filter by label" and filter for "filter by condition".

A reasonable argument to change things along the lines you propose

@rhshadrach
Member

I have had use cases where filtering by condition on column names would be useful, unless you want a complex regex. Also, to filter rows by condition, you can just use query()

Agreed there are cases - but I'm curious about your perception as to how common one case is vs another, as this is discussing the default value of axis.

query is great, I use it a lot in ad-hoc analysis, but it comes with a ton of overhead and I avoid it otherwise.

@Dr-Irv
Contributor

Dr-Irv commented Oct 24, 2023

Agreed there are cases - but I'm curious about your perception as to how common one case is vs another, as this is discussing the default value of axis.

Since filter has a default of columns as the axis, I'd vote for that. Coming up with a way to select a subset of columns based on properties of the column names is something that only filter can do right now.

query is great, I use it a lot in ad-hoc analysis, but it comes with a ton of overhead and I avoid it otherwise.

I used to think that, but after some testing I did a few years ago, I didn't see the performance difference, and the syntax is pretty clean.
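
For readers comparing the two spellings, a small sketch of the same row filter written with query() and with a Boolean mask. The data is made up, and no performance claim is implied either way.

```python
import pandas as pd

# Illustrative data.
df = pd.DataFrame({"name": ["bob", "alice", "bob"], "age": [25, 30, 18]})

# Row filter expressed as a query string.
via_query = df.query("name == 'bob' and age > 20")

# The same filter expressed as a Boolean mask.
via_mask = df[(df["name"] == "bob") & (df["age"] > 20)]
```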


8 participants