Allow to disselect features #137

Closed
datajanko opened this issue Feb 2, 2018 · 6 comments · Fixed by #246

Comments

@datajanko

Hey,

assume you used DataFrameMapper to preprocess some of your columns and generated a lot of new columns. Now you want to use a large portion of those columns to impute another subset of columns using some kind of regression. In such a case it might be easier to deselect just a handful of columns and use the rest to perform this task.
Why not pass those columns to default?

  • I would need a do-nothing transformer for the columns to deselect
  • if I want to do some further processing on a subset of those many columns, I'd need to do this later in the pipeline

How could it be implemented?
A keyword argument deselect, placed after the alias names, with default value False.

Problems: this could interfere with the way the default columns are calculated.

Any objections to this idea? What further problems can you think of?

@dukebody
Collaborator

dukebody commented Feb 4, 2018

I don't see this as a very common feature, but it could be cleanly implemented using a class of "column selectors":

  • By default, if the first element of the feature definition is a string or a list, select those columns.
  • If the first element is an instance of ColumnSelector or of one of its subclasses, use column_selector.select(dataframe).

Default implementation:

class ColumnSelector:
    def __init__(self, columns):
        self.columns = columns

    def select(self, dataframe):
        return dataframe[self.columns]

This way one can easily define subclasses of this class like:

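class ExcludeColumnsSelector(ColumnSelector):
    def __init__(self, columns_to_exclude):
        # Accept a single column name or a list of names.
        if isinstance(columns_to_exclude, str):
            columns_to_exclude = [columns_to_exclude]
        self.columns_to_exclude = columns_to_exclude

    def select(self, dataframe):
        # Keep every column that is not excluded, preserving column order.
        columns = [c for c in dataframe.columns
                   if c not in self.columns_to_exclude]
        return dataframe[columns]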

And use them like:

mapper = DataFrameMapper([
    (ExcludeColumnsSelector('a'), LabelBinarizer())
])

Opinions?

@devforfu
Collaborator

devforfu commented Feb 4, 2018

Probably the only thing to consider is the naming of the inherited classes: ExcludeColumnsSelector sounds a bit vague (though maybe it is just a matter of taste) and could be something like this instead:

mapper1 = DataFrameMapper([
    (ExcludeColumns('a'), LabelBinarizer())
])

mapper2 = DataFrameMapper([
    (Skip('a'), LabelBinarizer())
])

I would say that this feature could be quite useful, especially for more sophisticated transformations or filters. Also, one could add flags to these classes, like enable=True, to decide whether a column should be "selected" or not. (Though that probably sounds like a replication of the feature selection transformers from sklearn.)
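To make the enable flag concrete, here is a minimal sketch under one possible reading of it: with enable=False the selector stops skipping anything. The Skip class and its enable parameter are hypothetical names from this thread, not a released sklearn-pandas API. In the proposal it would subclass the ColumnSelector base class; it stands alone here only to keep the sketch self-contained.

```python
class Skip:
    """Hypothetical selector that drops the named columns unless disabled."""

    def __init__(self, columns, enable=True):
        # Accept a single column name or a list of names.
        self.columns = [columns] if isinstance(columns, str) else list(columns)
        self.enable = enable

    def select(self, dataframe):
        # `dataframe` is assumed to be pandas-like: it exposes `.columns`
        # and can be indexed with a list of column names.
        if not self.enable:
            # Disabled: pass every column through unchanged.
            return dataframe
        kept = [c for c in dataframe.columns if c not in self.columns]
        return dataframe[kept]
```

With a frame that has columns a, b and c, Skip('a').select(frame) would return only b and c, while Skip('a', enable=False).select(frame) would return the frame untouched.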

@dukebody
Collaborator

dukebody commented Feb 4, 2018

@devforfu The ExcludeColumnsSelector was an example :) I think we can provide the base selector class, and users can then implement any behavior by subclassing it; what they do with it would be up to them.

We can however provide the selector to exclude columns as an example. Both ExcludeColumnsSelector and ExcludeColumns sound OK to me. I don't mind as long as it's implemented and works correctly. :)

@devforfu
Collaborator

devforfu commented Feb 4, 2018

@dukebody Ok, understood =) So, as I see it, the main thing is to add support for that additional interface: it could be provided instead of a list of strings, and the mapper would call its select() method on the dataframe before passing the result down to the next steps.
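A rough sketch of that dispatch, assuming nothing about how the mapper is actually implemented: ColumnSelector follows the proposal in this thread, and resolve_columns is a hypothetical helper name invented for illustration.

```python
class ColumnSelector:
    """Base selector as proposed in this thread (hypothetical, not a released API)."""

    def __init__(self, columns):
        self.columns = columns

    def select(self, dataframe):
        # Default behavior: plain column lookup.
        return dataframe[self.columns]


def resolve_columns(first_element, dataframe):
    """Hypothetical helper: pick the data for one feature definition.

    `dataframe` is assumed to behave like a pandas DataFrame, i.e. it
    supports indexing with a column name or a list of column names.
    """
    if isinstance(first_element, ColumnSelector):
        # New interface: let the selector decide which columns to return.
        return first_element.select(dataframe)
    # Existing behavior: a string or a list of strings names the columns.
    return dataframe[first_element]
```

Because the check is an isinstance test on the base class, any user-defined subclass that overrides select() would be picked up without further changes to the mapper.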

@dukebody
Collaborator

dukebody commented Feb 4, 2018 via email

@datajanko
Author

Currently I am using column selector transformer objects. But your definition of a column selector looks really nice. I hadn't thought of inserting the column selection at that point.

3 participants