Skip to content

PERF: improve conversion to BooleanArray from int/float array #30095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 9, 2019
Merged

PERF: improve conversion to BooleanArray from int/float array #30095

merged 8 commits into from
Dec 9, 2019

Conversation

ethanywang
Copy link

@ethanywang ethanywang commented Dec 5, 2019

@ethanywang ethanywang marked this pull request as ready for review December 5, 2019 22:15
@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 5, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this issue number onto the list for adding boolean array in the whatsnew.

can you add some tests specifically for this (we might have them, but pls point them out)

@ethanywang
Copy link
Author

Related Test:

def test_to_boolean_array_from_integer_array():
result = pd.array(np.array([1, 0, 1, 0]), dtype="boolean")
expected = pd.array([True, False, True, False], dtype="boolean")
tm.assert_extension_array_equal(result, expected)
# with missing values
result = pd.array(np.array([1, 0, 1, None]), dtype="boolean")
expected = pd.array([True, False, True, None], dtype="boolean")
tm.assert_extension_array_equal(result, expected)
def test_to_boolean_array_from_float_array():
result = pd.array(np.array([1.0, 0.0, 1.0, 0.0]), dtype="boolean")
expected = pd.array([True, False, True, False], dtype="boolean")
tm.assert_extension_array_equal(result, expected)
# with missing values
result = pd.array(np.array([1.0, 0.0, 1.0, np.nan]), dtype="boolean")
expected = pd.array([True, False, True, None], dtype="boolean")
tm.assert_extension_array_equal(result, expected)

(np.array([np.nan, np.nan], dtype=float), [None, None]),

@ethanywang ethanywang requested a review from jreback December 6, 2019 05:28
@jreback jreback added this to the 1.0 milestone Dec 6, 2019
@jreback
Copy link
Contributor

jreback commented Dec 6, 2019

ok lgtm. can you do a simple benchmark on this and show it (+1 if you can add it to the asvs)

@jreback jreback added the Performance Memory or execution speed performance label Dec 6, 2019
@ethanywang ethanywang requested a review from WillAyd December 7, 2019 00:38
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as well outside of @jreback comments

@ethanywang
Copy link
Author

ethanywang commented Dec 7, 2019

I used pytest-benchmark to do the simple benchmarking, as I'm not quite sure where should I write the benchmark for the boolean array in the asv folder.

Original Branch:

------------------------------------------------------------------------------------------------ benchmark: 2 tests ------------------------------------------------------------------------------------------------
Name (time in us)                          Min                   Max                Mean              StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_from_integer_array     632.0190 (1.0)      4,040.8400 (1.0)      959.2070 (1.11)     343.6619 (1.04)     859.9870 (1.13)     310.8663 (1.28)        86;37        1.0425 (0.90)        829           1
test_benchmark_from_float_array       650.8890 (1.03)     4,129.7300 (1.02)     863.9706 (1.0)      330.2377 (1.0)      761.4645 (1.0)      242.4200 (1.0)         96;64        1.1574 (1.0)         978           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

PR Branch:

------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------
Name (time in us)                          Min                   Max                Mean              StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_from_float_array       580.4190 (1.0)      2,836.9650 (1.53)     687.1475 (1.02)     196.7836 (1.69)     611.1900 (1.0)      97.8967 (2.22)      117;146        1.4553 (0.98)       1359           1
test_benchmark_from_integer_array     609.3710 (1.05)     1,856.8680 (1.0)      675.1754 (1.0)      116.3234 (1.0)      637.5215 (1.04)     44.0175 (1.0)         62;95        1.4811 (1.0)         760           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Is it now okay to merge? @jreback

@jreback
Copy link
Contributor

jreback commented Dec 8, 2019

@ethanywang see the asv docs here: https://dev.pandas.io/docs/development/contributing.html#running-the-performance-test-suite

can you construct some asvs which add some benchmarks (and then show the results here)

you can create a new file in benchmarks/

call it array.py (and then use pd.array for the construction).

@ethanywang
Copy link
Author

@jreback Using the asv bechmark. The results are:

       before           after         ratio
     [c0f6428b]       [3dcbbd61]
     <master>         <int_float_to_boolean>
-        90.7±1μs       60.2±0.7μs     0.66  array.BooleanArray.time_from_float_array
-        95.0±4μs         60.5±2μs     0.64  array.BooleanArray.time_from_integer_array

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@jreback
Copy link
Contributor

jreback commented Dec 8, 2019

perfect @ethanywang

would a follow up PR for similar asvs for IntegerArray and StringArray

(we might have some in series somewhere for IntegerArray already)
can move those

@ethanywang
Copy link
Author

@jreback So you mean I can remove the array.py in the asv_benchmark folder, and not submit it in this PR?

@jreback
Copy link
Contributor

jreback commented Dec 8, 2019

@jreback So you mean I can remove the array.py in the asv_benchmark folder, and not submit it in this PR?

no in a follow up PR i would like to add asv constructions for IntegerArray and StringArray in array.py

we may have some construction benchmarks already for Integer dtypes in Series which we can move

@jorisvandenbossche jorisvandenbossche merged commit fb50258 into pandas-dev:master Dec 9, 2019
@jorisvandenbossche
Copy link
Member

@ethanywang Thanks a lot!

@ethanywang ethanywang deleted the int_float_to_boolean branch December 9, 2019 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: improve conversion to BooleanArray from int/float array
4 participants