CategoricalImputer enhancements. #87
Conversation
Any thoughts, @dukebody?
Checking it now, sorry for the delay; recovering from the party. ;P
```python
mask = _get_mask(X, self.missing_values)
X = X[~mask]

self.fill_ = Counter(X).most_common(1)[0][0]
```
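As a self-contained sketch of what this snippet does (the `_get_mask` here is a simplified stand-in for the project's actual helper, and the placeholder value is illustrative):

```python
from collections import Counter

import numpy as np

def _get_mask(X, value):
    # Simplified stand-in for the helper in the diff: mark entries equal to
    # the missing-value placeholder (here, None).
    return np.array([v == value for v in X])

X = np.array(['a', 'b', None, 'a', 'b', 'a'], dtype=object)
mask = _get_mask(X, None)
X = X[~mask]  # keep only the non-missing entries

# Most common value among the non-missing entries.
fill_ = Counter(X).most_common(1)[0][0]
print(fill_)  # 'a'
```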
Hmmm, I'm concerned about the scalability of this, since it will iterate over the whole array in Python. Let's do some timings...
Interesting: for long arrays `Counter` is slower than `Series.mode`, but for short arrays it is significantly faster:
```python
x = pd.Series(['a', 'b', 'c', 'a'] * int(1e6))

%timeit Counter(x).most_common(1)[0][0]
# 1 loop, best of 3: 227 ms per loop

%timeit x.mode()
# 10 loops, best of 3: 141 ms per loop

x = pd.Series(['a', 'b', 'c', 'a'] * int(1e2))

%timeit Counter(x).most_common(1)[0][0]
# The slowest run took 4.14 times longer than the fastest.
# This could mean that an intermediate result is being cached.
# 10000 loops, best of 3: 37.2 µs per loop

%timeit x.mode()
# 10000 loops, best of 3: 146 µs per loop
```
The break-even point seems to be around 1e3 elements, and I expect many datasets to be at least that big, so we're better off using the `mode` method. :)
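A plain-Python version of that comparison, using the stdlib `timeit` module instead of IPython magics (sizes and repeat counts here are illustrative, not the exact ones above), could look like:

```python
import timeit
from collections import Counter

import pandas as pd

for reps in (int(1e2), int(1e4)):
    x = pd.Series(['a', 'b', 'c', 'a'] * reps)
    # Time both strategies for finding the most frequent value.
    t_counter = timeit.timeit(lambda: Counter(x).most_common(1)[0][0], number=20)
    t_mode = timeit.timeit(lambda: x.mode()[0], number=20)
    print('%d elements: Counter %.4fs, mode %.4fs' % (len(x), t_counter, t_mode))
```

Both expressions compute the same value (`'a'` here); only the timings differ with array length.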
Okay, seems fair.
Regarding the snippet above: I'm not sure what's best to do there. Does it make sense at all to use the 'most frequent' strategy to impute values if no value is repeated more than once? I think it's better to error out in that case. Another issue is what to do when there is more than one mode, i.e. when two values are repeated exactly the same number of times. What does sklearn do?
See #82 (comment)

So, just returning the first one seems a good option.
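For example (assuming pandas' documented behavior of returning all modes in sorted order):

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'a', 'c'])

modes = s.mode()    # 'a' and 'b' both appear twice
print(list(modes))  # ['a', 'b'] -- sorted, so ties break deterministically
fill_ = modes[0]    # take the first mode as the imputation value
print(fill_)        # 'a'
```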
Enhancements:

- Inherit from `BaseEstimator`.
- `missing_values` param: to specify which is the placeholder for the missing values.
- `copy` param: to specify whether to perform the imputation in a copy of X or inplace.
- `y` param in `fit` for `Pipeline` compatibility.
- Raise `NotFittedError` in `transform` if the imputer was not previously fitted.

Fix bugs:

- `Series.mode` returns … (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html)