-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) #44955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note for the perf improvement. is there a bug fix here too?
do we have asv's that cover these cases?
Yes, here also the bugfix, I needed to fix them because some test was failed. Seems it would be better to create separate feature. # before fix
s = pd.arrays.SparseArray([0,2,3,4,0,0,0],fill_value=0)
s.isna()
#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([1, 2, 3])
#after fix
s
#[0, 2, 3, 4, 0, 0, 0]
#Fill: 0
#IntIndex
#Indices: array([1, 2, 3])
s.isna()
#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([], dtype=int32) About asv, no, sorry I've missed it in contribution guides (btw as well as result variable in tests...) |
@bdrum fine on the bugfix, just add a whatsnew note describing (and you can reference this PR number) |
@jreback Thanks! What about increasing performance. I've added benchmark to asv, but looks strange
Because when I ran the same code via timeit, I've got huge performance increasing for False fill_value and a bit (about 5-7%) when fill_value is True. |
136a880
to
73fd5c4
Compare
Sorry, a bit confusion after merging conflicts... Now I have 26 failed test in my branch as well as in current master:
=============================================================== short test summary info ===============================================================
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other6]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-DataFrame-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[numexpr-tzlocal()] - OSError: [...
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other6]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-DataFrame-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other3]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other7]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other9]
FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[python-tzlocal()] - OSError: [E...
FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[3] - AssertionError: asser...
FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[rhs1] - AssertionError: as...
====================================== 26 failed, 4 passed, 2 skipped, 151259 deselected, 29 warnings in 52.32s =======================================
|
Can you post the traceback for one of the arithmetic tests that failed? |
Sure. Here it is: FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3] <?xml version="1.0" encoding="utf-8"?><testsuites><testsuite name="pytest" errors="0" failures="1" skipped="0" tests="1" time="0.974" timestamp="2021-12-22T10:48:13.621099" hostname="RUMLAB"><testcase classname="pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons" name="test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]" time="0.012"><failure message="OSError: [Errno 22] Invalid argument">self = <pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons object at 0x000001F5FE6B3790>
other = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
tz_naive_fixture = tzlocal()
box_with_array = <class 'pandas.core.indexes.base.Index'>
@pytest.mark.parametrize(
"other",
[
# GH#4968 invalid date/int comparisons
list(range(10)),
np.arange(10),
np.arange(10).astype(np.float32),
np.arange(10).astype(object),
pd.timedelta_range("1ns", periods=10).array,
np.array(pd.timedelta_range("1ns", periods=10)),
list(pd.timedelta_range("1ns", periods=10)),
pd.timedelta_range("1 Day", periods=10).astype(object),
pd.period_range("1971-01-01", freq="D", periods=10).array,
pd.period_range("1971-01-01", freq="D", periods=10).astype(object),
],
)
def test_dt64arr_cmp_arraylike_invalid(
self, other, tz_naive_fixture, box_with_array
):
tz = tz_naive_fixture
dta = date_range("1970-01-01", freq="ns", periods=10, tz=tz)._data
obj = tm.box_expected(dta, box_with_array)
> assert_invalid_comparison(obj, other, box_with_array)
pandas\tests\arithmetic\test_datetime64.py:122:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\tests\arithmetic\common.py:111: in assert_invalid_comparison
result = xbox2(left == right)
pandas\core\ops\common.py:70: in new_method
return method(self, other)
pandas\core\arraylike.py:40: in __eq__
return self._cmp_method(other, operator.eq)
pandas\core\indexes\base.py:6591: in _cmp_method
result = op(self._values, other)
pandas\core\ops\common.py:70: in new_method
return method(self, other)
pandas\core\arraylike.py:40: in __eq__
return self._cmp_method(other, operator.eq)
pandas\core\arrays\datetimelike.py:1024: in _cmp_method
op, np.asarray(self.astype(object)), other
pandas\core\arrays\datetimes.py:666: in astype
return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
pandas\core\arrays\datetimelike.py:413: in astype
converted = ints_to_pydatetime(
pandas\_libs\tslibs\vectorized.pyx:177: in pandas._libs.tslibs.vectorized.ints_to_pydatetime
local_value = tz_convert_utc_to_tzlocal(value, tz)
pandas\_libs\tslibs\tzconversion.pyx:383: in pandas._libs.tslibs.tzconversion.tz_convert_utc_to_tzlocal
return _tz_convert_tzlocal_utc(utc_val, tz, to_utc=False, fold=fold)
pandas\_libs\tslibs\tzconversion.pyx:594: in pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc
delta = _tzlocal_get_offset_components(val, tz, to_utc, fold)
pandas\_libs\tslibs\tzconversion.pyx:554: in pandas._libs.tslibs.tzconversion._tzlocal_get_offset_components
dt = dt.astimezone(tz)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:144: in fromutc
return f(self, dt)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:261: in fromutc
_fold = self._fold_status(dt, dt_wall)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:196: in _fold_status
if self.is_ambiguous(dt_wall):
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:254: in is_ambiguous
naive_dst = self._naive_is_dst(dt)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = tzlocal(), dt = datetime.datetime(1969, 12, 31, 21, 0, tzinfo=tzlocal())
def _naive_is_dst(self, dt):
timestamp = _datetime_to_timestamp(dt)
> return time.localtime(timestamp + time.timezone).tm_isdst
E OSError: [Errno 22] Invalid argument
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:260: OSError</failure></testcase></testsuite></testsuites> |
asv report:
|
@bdrum how's this coming? |
@jreback Thanks!) I think that is done right now. I resolved your changes request: modified whatsnew file and added benchmark to asv. |
some failing builds which are relevant benchmarks and i think the doc build |
Ah, you right that's my fault. Doc doesn't know what is pull, in this case as I can see I have to use :issue: and number of this pr. Sorry about that. I just noticed that some pr was merged despite on the fact that not all check was done, so that is the reason why I wasn't very attentive about that. Now I will. Let me fix it. |
well i check the failures to see if they are unrelated but generally we need all green to avoid issues |
@jreback Now failed checks looks like not by my fault. |
thanks @bdrum |
This broke some things with greater comparisons. We should fix them before 1.4 is released or revert here.
returned
which was correct. Now this returns
|
Usage of direct indexing is working very fast, but calculating setdiff1d is working very slow, the good solution for both case (much faster than setdiff1d and bit slower than direct indices) using masks. Here is the plot of comparison old (densifying) implementation and new one (masks):
I understand that using mask will take more memory, but I don't think that it is crucial point here especially for bool8 type, but let's discuss.
We can combine both approaches and use direct indices when fill value is False and use mask in the opposite case.
One more question to discuss related with case when True value of Boolean sparse array lead to the fill value of original array. I assume that in this case fill value should be return.
I've got benchmark results via pytest-benchmark framework.
Code for benchmarking: