Allow specifying columns with a filter function #245

Closed
wants to merge 2 commits into from

kitmonisit
Contributor

A use case for this would be when you do not have full knowledge of all the columns in the dataframe.

In the current implementation, I cannot do the following, because df is not defined at the point where the mapper is constructed. The mapper could be used as a middle step of a pipeline, where a previous step produces a variable number of columns starting with "special_":

mapper = DataFrameMapper(
    [
        (
            [col for col in df.columns if col.startswith("special_")],
            SomeTransformer(),
        )
    ]
)

In my implementation, I could do this:

mapper = DataFrameMapper(
    [
        (
            lambda x: x.startswith("special_"),
            SomeTransformer(),
        )
    ]
)
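
To illustrate the idea, here is a minimal sketch of how a column spec could be resolved when it may be either a predicate or an explicit list. The helper name resolve_columns is hypothetical and is not part of sklearn-pandas; it only shows a callable() check followed by the built-in filter over the column names.

```python
def resolve_columns(selector, available_columns):
    """Hypothetical helper: turn a column spec into a concrete list of names."""
    if callable(selector):
        # Predicate case: keep the names for which selector(name) is truthy.
        return list(filter(selector, available_columns))
    if isinstance(selector, str):
        # A single column name.
        return [selector]
    # Anything else is treated as an explicit iterable of column names.
    return list(selector)


# Column names as they might arrive from an earlier pipeline step.
columns = ["id", "special_a", "price", "special_b"]
selected = resolve_columns(lambda name: name.startswith("special_"), columns)
print(selected)
```

Because the predicate only sees column names, it can be evaluated lazily once the data frame is actually available, which is the point of the proposal.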

I don't mean to hijack #239, but I thought this would be a good start.

What do you think?

@ragrawal
Collaborator

Hi @kitmonisit, thanks for the MR. Let me review this. I definitely agree this is important functionality to provide. However, I am wondering whether this might be a good opportunity to look into #239 and see if there is merit in using a column selector instead.

@kitmonisit
Contributor Author

kitmonisit commented Apr 28, 2021

Thank you @ragrawal. I did this in a hurry (and for quite selfish reasons :P). Yes, I think that the column selector is a better approach.

The if callable() approach has seen its dry run in my use case and I'm quite satisfied. The column selector referred to by #239 also returns a callable. I guess the same if callable() entry point can be applied.

What do you think?

EDIT: Now I remember why I didn't go with the column selector and spun my own solution. It's because I'm terrible with regex. startswith is so much easier. haha

@ragrawal
Collaborator

ragrawal commented May 8, 2021

Hi @kitmonisit, I looked into your changes more closely and I think this is not the correct approach. The problem is that you can have different data frames during fit and transform, and this might create all kinds of inconsistencies. For instance, let's assume I use a data frame with the following columns for the "fit" operation: x1, x2, x3, a1, a2. However, during the "transform" operation, I have a data frame with these columns: x1, x2, x3, x4, a1, a2. Note that there is an additional column, x4. If you have a condition to select all columns that start with "x", it will also pick up x4 at transform time, even though the transformer never saw that column during fit. I think we need to apply the condition to the data frame during the "fit" operation and generate the list of columns then. I have some ideas; let me work on them.
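
The mismatch can be reproduced with plain lists of column names. The sketch below is only an illustration under stated assumptions: FitTimeSelector is a hypothetical class, not the sklearn-pandas implementation, and it works on column names rather than real data frames. It evaluates the predicate once during fit and reuses the stored list during transform.

```python
class FitTimeSelector:
    """Hypothetical wrapper: evaluate a column predicate once, at fit time."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.columns_ = None  # set by fit()

    def fit(self, column_names):
        # Freeze the matching columns as seen in the training data.
        self.columns_ = [c for c in column_names if self.predicate(c)]
        return self

    def transform(self, column_names):
        if self.columns_ is None:
            raise RuntimeError("fit must be called before transform")
        # Fail loudly if a column seen during fit has disappeared.
        missing = [c for c in self.columns_ if c not in column_names]
        if missing:
            raise KeyError(f"columns seen during fit are missing: {missing}")
        # Extra columns such as x4 are ignored: only the frozen list is used.
        return list(self.columns_)


def starts_with_x(name):
    return name.startswith("x")


fit_columns = ["x1", "x2", "x3", "a1", "a2"]
transform_columns = ["x1", "x2", "x3", "x4", "a1", "a2"]

# Re-evaluating the predicate on every call gives two different selections.
naive_fit = [c for c in fit_columns if starts_with_x(c)]
naive_transform = [c for c in transform_columns if starts_with_x(c)]

# Resolving once at fit time keeps the selection stable.
selector = FitTimeSelector(starts_with_x).fit(fit_columns)
frozen = selector.transform(transform_columns)
print(naive_fit, naive_transform, frozen)
```

With the naive approach the transformer would receive four columns at transform time after being fitted on three; the frozen list avoids that by construction.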

@kitmonisit
Contributor Author

Thanks for reviewing, @ragrawal. You are right. That's the reason why I asked for the ability to pass a custom function that is compatible with filter. My startswith use case is only one of many possible use cases.

@ragrawal
Collaborator

ragrawal commented May 8, 2021

This is now fixed in the V2.2.0 release. It leverages a callable function. Please see the Dynamic column name section for an example.

@ragrawal ragrawal closed this May 8, 2021