Skip to content

CategoricalImputer enhancements. #87

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

arnau126
Copy link
Collaborator

@arnau126 arnau126 commented Apr 10, 2017

Enhancements:

  • makes that CategoricalImputer also inherits from BaseEstimator.
  • add missing_values param: to specify which is the placeholder for the missing values.
  • add copy param: to specify whether to perfom the imputation in a copy of X or inplace.
  • add y param in fit for Pipeline compatibility.
  • raise NotFittedError in transform if the imputer was not previously fitted.

Fix bugs:

  • If no category has more than 1 observation, fitting would have crashed because Series.mode returns

empty if nothing occurs at least 2 times

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html

@arnau126
Copy link
Collaborator Author

any thoughts, @dukebody ?

@dukebody
Copy link
Collaborator

Checking it now, sorry for the delay, recovering from the party. ;P

mask = _get_mask(X, self.missing_values)
X = X[~mask]

self.fill_ = Counter(X).most_common(1)[0][0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I'm concerned about the scalability of this, since it will iterate the whole array in Python. Let's do some timings...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, for long arrays Counter is slower than series.mode, but for short arrays is significantly faster:

x = pd.Series(['a', 'b', 'c', 'a']* int(1e6))
    ...: 
    ...: %timeit Counter(x).most_common(1)[0][0]
    ...: %timeit x.mode()
    ...: 

1 loop, best of 3: 227 ms per loop
10 loops, best of 3: 141 ms per loop
x = pd.Series(['a', 'b', 'c', 'a']* int(1e2))
    ...: 
    ...: %timeit Counter(x).most_common(1)[0][0]
    ...: %timeit x.mode()
    ...: 
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 146 µs per loop

The break point seems to be around 1e3 elements and I expect many datasets to be as big as that, so let's better use the mode method. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, seems fair.

@dukebody
Copy link
Collaborator

Regarding

If no category has more than 1 observation, fitting would have crashed because Series.mode returns
empty if nothing occurs at least 2 times

I'm not sure what's best to do there. Does it make sense at all to use the 'most frequent' strategy to impute values if no value is repeated more than once? I think it's better to error out in that case.

Another issue is what to do when there is more than one mode, i.e. when there are two values that are repeated exactly the same number of times. What sklearn Imputer (that doesn't work with strings) does in this case is to take the smallest one, only because scipy's mode function does so. We can return a random one here as well.

@dukebody
Copy link
Collaborator

See #82 (comment)

@arnau126
Copy link
Collaborator Author

arnau126 commented Apr 18, 2017

Series.mode already returns all modes lexicographically sorted, by default.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html

So, just returning the first one seems a good option.

@dukebody
Copy link
Collaborator

Superseded by #89. Thanks @arnau126 !

@dukebody dukebody closed this Apr 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants