Skip to content

BUG: fix Categorical.astype for dtype=np.int32 argument #39615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 8, 2021

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Feb 5, 2021

Fixing a bug I introduced in #37355

@jreback jreback added Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version labels Feb 5, 2021
@jreback jreback added this to the 1.2.2 milestone Feb 5, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@simonjayhawkins
Copy link
Member

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@arw2019 when do you think you can do this? 1.2.2 release scheduled for today. we have a couple of blockers though.

@simonjayhawkins simonjayhawkins mentioned this pull request Feb 7, 2021
@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

looks fine. can you add a note in fixed regressions for 1.2.2 & re-run appropriate asv's to make sure not much change.

@arw2019 when do you think you can do this? 1.2.2 release scheduled for today. we have a couple of blockers though.

@simonjayhawkins this is ready to go assuming nothing comes up in benchmarks. Running them now, will post once they're done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to merge ,a followup comment (which can be here or in a followon for 1.3)

@@ -94,6 +94,14 @@ def cmp(a, b):
result = ser.astype("object").astype(CategoricalDtype())
tm.assert_series_equal(result, roundtrip_expected)

def test_categorical_int_to_int32(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be named test_categorical_astype_to_int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -94,6 +94,14 @@ def cmp(a, b):
result = ser.astype("object").astype(CategoricalDtype())
tm.assert_series_equal(result, roundtrip_expected)

def test_categorical_int_to_int32(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize on the any_int_dtype, ideally this also works on any_int_nullable_dtype (but not sure that will just work).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - created a new fixture any_int_or_nullable_int_dtype

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

@jreback addressed test comments here - does work already for all int dtypes, including nullable

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

Doesn't seem to affect basic Categorical benchmarks (but will run some others that I think hit this)

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b Categorical
       before           after         ratio
     [87f72bb8]       [e8d9650c]
     <categorical-astype-int32~3^2>       <categorical-astype-int32~3>
-      1.52±0.1μs         868±20ns     0.57  dtypes.Dtypes.time_pandas_dtype(<class 'pandas.core.dtypes.dtypes.CategoricalDtype'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@@ -1242,6 +1242,32 @@ def any_nullable_int_dtype(request):
return request.param


@pytest.fixture(params=tm.ALL_INT_DTYPES + tm.ALL_EA_INT_DTYPES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i guess this is a missing one

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

Some issues in groupby benchmarks (I don't have experience with how noisy these are - can rerun to see)

       before           after         ratio
     [098b9707]       [97c981f0]
     <categorical-astype-int32^2^2>       <categorical-astype-int32>
+         579±7μs       1.99±0.7ms     3.43  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'transformation')
+        509±40μs       1.36±0.4ms     2.68  groupby.GroupByMethods.time_dtype_as_group('float', 'mean', 'transformation')
+         281±6μs        592±400μs     2.11  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'transformation')
+         482±9μs        999±500μs     2.07  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct')
+        300±10μs        605±100μs     2.02  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct')
+         521±7μs       1.03±0.6ms     1.97  groupby.GroupByMethods.time_dtype_as_field('datetime', 'nunique', 'direct')
+         468±8μs        854±400μs     1.82  groupby.GroupByMethods.time_dtype_as_group('float', 'median', 'direct')
+        567±20μs       1.01±0.3ms     1.78  groupby.GroupByMethods.time_dtype_as_group('float', 'quantile', 'transformation')
+         471±6μs        693±200μs     1.47  groupby.GroupByMethods.time_dtype_as_group('object', 'rank', 'transformation')
+         251±5μs         329±30μs     1.31  groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'direct')
-       3.22±0.2s       2.27±0.02s     0.70  groupby.GroupByMethods.time_dtype_as_group('int', 'describe', 'direct')
-        458±30μs         308±20μs     0.67  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'transformation')
-        227±30μs          153±6μs     0.67  groupby.GroupByMethods.time_dtype_as_group('int', 'count', 'direct')
-        391±70μs          259±4μs     0.66  groupby.GroupByMethods.time_dtype_as_group('datetime', 'head', 'direct')
-       567±100μs         351±20μs     0.62  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'direct')
-       280±100μs         167±10μs     0.60  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
-       701±100μs         405±30μs     0.58  groupby.GroupByMethods.time_dtype_as_group('int', 'bfill', 'direct')
-        1.74±1ms      1.01±0.06ms     0.58  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'transformation')
-       679±200μs         367±40μs     0.54  groupby.GroupByMethods.time_dtype_as_group('int', 'var', 'transformation')
-       334±100μs         139±10μs     0.42  groupby.GroupByMethods.time_dtype_as_field('datetime', 'size', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

re-running the groupby benchmarks that had been affected:

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods.time_dtype_as_group

       before           after         ratio
     [098b9707]       [97c981f0]
     <categorical-astype-int32^2^2>       <categorical-astype-int32>
+        276±10μs        974±300μs     3.53  groupby.GroupByMethods.time_dtype_as_group('float', 'cummax', 'direct')
+        309±40μs       1.03±0.1ms     3.33  groupby.GroupByMethods.time_dtype_as_group('float', 'cumsum', 'direct')
+         149±5ms        352±200ms     2.36  groupby.GroupByMethods.time_dtype_as_group('datetime', 'unique', 'transformation')
+         273±6μs        560±200μs     2.05  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'transformation')
+     1.03±0.04ms         1.91±1ms     1.86  groupby.GroupByMethods.time_dtype_as_group('float', 'value_counts', 'direct')
+        321±20μs        556±200μs     1.73  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation')
+         308±3μs         491±40μs     1.59  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct')
+         454±3μs        703±400μs     1.55  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation')
+         581±7μs        853±300μs     1.47  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
+         250±3μs        351±200μs     1.41  groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct')
+         306±2μs         399±50μs     1.30  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'transformation')
+         457±7μs         533±60μs     1.16  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct')
+         241±2ms         273±10ms     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')
-       564±200μs          181±9μs     0.32  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'direct')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

since a different combination seems to be affected on a re-run i'm inclined to think that the perf regressions are (at least largely?) noise

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm am concerned about this led issue

let's push this to 1.2.3 until can look at closer

@jreback jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021
@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

hmm am concerned about this led issue

let's push this to 1.2.3 until can look at closer

ok! anything i can do to move this forward (is it worth getting reliable numbers for some of these with timeit for example?)

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

since categorical is used in groupby this slows down everything

see if it's possible to bypass / special case the conversion (when it doesn't need to be converted); eg it's already int64 - might fix it

@arw2019
Copy link
Member Author

arw2019 commented Feb 7, 2021

Benchmarks as of e108f0d (change is to use np.asarray instead of extract_array) are below. The big regressions are (mostly) gone

(pandas-dev) andrewwieteska@Andrews-MacBook-Pro asv_bench % asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods.time_dtype_as_group

       before           after         ratio
     [098b9707]       [97c981f0]
     <maybe_box_native^2>       <categorical-astype-int32>
+        682±90μs         1.23±4ms     1.81  groupby.GroupByMethods.time_dtype_as_group('float', 'rank', 'direct')
-        353±30μs          314±6μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'direct')
-      1.67±0.2ms      1.47±0.02ms     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'transformation')
-        573±60μs          500±4μs     0.87  groupby.GroupByMethods.time_dtype_as_group('float', 'nunique', 'transformation')
-        304±60μs          265±3μs     0.87  groupby.GroupByMethods.time_dtype_as_group('datetime', 'head', 'transformation')
-         286±2ms          247±5ms     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')
-        164±10ms          141±1ms     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
-      1.76±0.3ms      1.46±0.02ms     0.83  groupby.GroupByMethods.time_dtype_as_group('float', 'pct_change', 'direct')
-         113±4ms         92.1±5ms     0.81  groupby.GroupByMethods.time_dtype_as_group('int', 'unique', 'transformation')
-        298±10ms          241±4ms     0.81  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation')
-       902±100μs          718±5μs     0.80  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
-        581±70μs          450±4μs     0.77  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation')
-        405±60μs          313±2μs     0.77  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'transformation')
-       367±400μs          259±3μs     0.71  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cummin', 'transformation')
-       706±100μs          488±6μs     0.69  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'transformation')
-        423±20μs          291±4μs     0.69  groupby.GroupByMethods.time_dtype_as_group('float', 'cumsum', 'transformation')
-        506±50μs          340±7μs     0.67  groupby.GroupByMethods.time_dtype_as_group('int', 'var', 'transformation')
-        428±80μs          283±3μs     0.66  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct')
-        228±50μs          132±2μs     0.58  groupby.GroupByMethods.time_dtype_as_group('object', 'size', 'direct')
-      1.50±0.7ms         570±10μs     0.38  groupby.GroupByMethods.time_dtype_as_group('datetime', 'quantile', 'transformation')
-      1.75±0.3ms         599±20μs     0.34  groupby.GroupByMethods.time_dtype_as_group('datetime', 'rank', 'direct')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

ok great

@jreback jreback modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021
@arw2019 arw2019 closed this Feb 8, 2021
@arw2019 arw2019 reopened this Feb 8, 2021
@jreback jreback merged commit f773083 into pandas-dev:master Feb 8, 2021
@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

thanks @arw2019

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

@meeseeksdev backport 1.2.x

@lumberbot-app

This comment has been minimized.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

cc @simonjayhawkins @arw2019 for manual backport

@simonjayhawkins
Copy link
Member

the conflict is due to test_astype_categorical_invalid_conversions, xref #37730 and #37677. there are two instances of this test on 1.2.x

will leave unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: astype from categorical to np.int32 conversion returns np.int64 in pandas 1.2.0 and 1.2.0
3 participants