CategoricalImputer enhancements. #87
Conversation
Any thoughts, @dukebody?
Checking it now, sorry for the delay; recovering from the party. ;P
```python
mask = _get_mask(X, self.missing_values)
X = X[~mask]

self.fill_ = Counter(X).most_common(1)[0][0]
```
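As a self-contained sketch of what this snippet does (the `_get_mask` here is a simplified stand-in for the project's actual helper, and the placeholder value is illustrative):

```python
from collections import Counter

import numpy as np

def _get_mask(X, value):
    # Simplified stand-in for the helper in the diff: mark entries equal to
    # the missing-value placeholder (here, None).
    return np.array([v == value for v in X])

X = np.array(['a', 'b', None, 'a', 'b', 'a'], dtype=object)
mask = _get_mask(X, None)
X = X[~mask]  # keep only the non-missing entries

# Most common value among the non-missing entries.
fill_ = Counter(X).most_common(1)[0][0]
print(fill_)  # 'a'
```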
Hmmm, I'm concerned about the scalability of this, since it will iterate over the whole array in Python. Let's do some timings...
Interesting: for long arrays `Counter` is slower than `Series.mode`, but for short arrays it is significantly faster:
```python
x = pd.Series(['a', 'b', 'c', 'a'] * int(1e6))

%timeit Counter(x).most_common(1)[0][0]
# 1 loop, best of 3: 227 ms per loop

%timeit x.mode()
# 10 loops, best of 3: 141 ms per loop

x = pd.Series(['a', 'b', 'c', 'a'] * int(1e2))

%timeit Counter(x).most_common(1)[0][0]
# The slowest run took 4.14 times longer than the fastest.
# This could mean that an intermediate result is being cached.
# 10000 loops, best of 3: 37.2 µs per loop

%timeit x.mode()
# 10000 loops, best of 3: 146 µs per loop
```
The break-even point seems to be around 1e3 elements, and I expect many datasets to be at least that big, so we're better off using the `mode` method. :)
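A plain-Python version of that comparison, using the stdlib `timeit` module instead of IPython magics (sizes and repeat counts here are illustrative, not the exact ones above), could look like:

```python
import timeit
from collections import Counter

import pandas as pd

for reps in (int(1e2), int(1e4)):
    x = pd.Series(['a', 'b', 'c', 'a'] * reps)
    # Time both strategies for finding the most frequent value.
    t_counter = timeit.timeit(lambda: Counter(x).most_common(1)[0][0], number=20)
    t_mode = timeit.timeit(lambda: x.mode()[0], number=20)
    print('%d elements: Counter %.4fs, mode %.4fs' % (len(x), t_counter, t_mode))
```

Both expressions compute the same value (`'a'` here); only the timings differ with array length.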
Okay, seems fair.
Regarding the snippet above: I'm not sure what's best to do there. Does it make sense at all to use the 'most frequent' strategy to impute values if no value is repeated more than once? I think it's better to error out in that case. Another issue is what to do when there is more than one mode, i.e. when two values are repeated exactly the same number of times. What does sklearn do?
See #82 (comment)

So, just returning the first one seems a good option.
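For example (assuming pandas' documented behavior of returning all modes in sorted order):

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'a', 'c'])

modes = s.mode()    # 'a' and 'b' both appear twice
print(list(modes))  # ['a', 'b'] -- sorted, so ties break deterministically
fill_ = modes[0]    # take the first mode as the imputation value
print(fill_)        # 'a'
```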
Enhancements:

- Inherit from `BaseEstimator`.
- `missing_values` param: to specify which is the placeholder for the missing values.
- `copy` param: to specify whether to perform the imputation in a copy of X or inplace.
- `y` param in `fit` for `Pipeline` compatibility.
- Raise `NotFittedError` in `transform` if the imputer was not previously fitted.

Fix bugs:

- `Series.mode` returns … (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html)