-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: add masked algorithm for mean() function #34814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
See here: pandas/pandas/core/arrays/masked.py Lines 337 to 339 in 4628665
You can add "mean" there |
@jorisvandenbossche, Should tests be added? and if yes then where? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add tests (we likely aready have some setup for sum, you can just add onto those)
can you add some asv's for the existing and this new case (perf metrics) |
Alright, so I need to add tests in
Where exactly do I add the asv's in? |
This operation is already covered by pandas/asv_bench/benchmarks/series_methods.py Lines 250 to 280 in 107ad15
(I added Int64 in general for all ops, so including mean, when I added a masked sum/prod algorithm) So no need to write additional benchmarks I think. For tests, it would be good to add it here:
|
We are also going to need to decide what to return for an empty array/Series. Right now we return np.nan, but this can also be pd.NA. This touches a bit the discussion we had about NaN vs NA for the floating dtype, but I suppose for now we probably want to go with pd.NA (that's what we do now). |
This would involve
|
@Akshatt You could also do a length check before any calculation as with other functions here, e.g. something like |
Hi @dsaxton, thanks for replying!
This would suffice, right? |
@Akshatt Since we're always returning NA on empty input regardless of skipna could also if not values.size:
return pd.NA
# logic for non-empty input |
@@ -688,6 +688,12 @@ def test_ops_consistency_on_empty(self, method): | |||
result = getattr(Series(dtype=float), method)() | |||
assert pd.isna(result) | |||
|
|||
# Empty Mean | |||
if method == "mean": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is already being tested directly above? Probably want tests for non-empty input also (i.e., non-empty with some missing, non-empty with no missing, non-empty with all missing, parametrizing over skipna).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes redundant now. Previously np.nan
was being returned for empty values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should i add those tests here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under /pandas/tests/series/test_arithmetic.py
(they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.
As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:
[ins] In [1]: import pandas as pd
[ins] In [2]: arr = pd.array([1, 2, 3])
[ins] In [3]: arr.sum()
Out[3]: 6
[ins] In [4]: arr.mean()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-057c7202e3e0> in <module>
----> 1 arr.mean()
AttributeError: 'IntegerArray' object has no attribute 'mean'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under
/pandas/tests/series/test_arithmetic.py
(they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.
I'm not sure exactly where to add the tests? In /pandas/tests/series/test_arithmetic.py
should I add them in Unsorted section for now?
Also, I would be adding tests for:
- Nullable integer dtype
- Nullable boolean dtype
yes?
and how do I check if the tests are passing?
As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:
[ins] In [1]: import pandas as pd [ins] In [2]: arr = pd.array([1, 2, 3]) [ins] In [3]: arr.sum() Out[3]: 6 [ins] In [4]: arr.mean() --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-4-057c7202e3e0> in <module> ----> 1 arr.mean() AttributeError: 'IntegerArray' object has no attribute 'mean'
yes that makes sense, although I'm not sure how to go about doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also probably good to merge master since it's been a while
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsaxton
I have added these tests under
def test_ops_consistency_on_empty(self, method): |

Any quick fixes? also, how do I ensure these tests are passing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dsaxton,
pushed to branch named tests
. Please have a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Akshatt please push it to this branch (the same as this PR is based on, so which is your master branch), then we can see the tests here in the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry @Akshatt if that was confusing. I meant in the future it's good to avoid modifying your master branch, not to work on a separate branch for this PR.
Regarding your test you'll also want to explicitly assert result is pd.NA
which (oddly enough) is slightly different from pd.isna(result)
.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@Akshatt Is this still active? Can you address previous comments if so? |
this is fine (you are repeating an old commment which i agree is not relevant). the asv's and whatsnew are still needed. |
Jeff, I am answering @Akshatt because he asked exactly that 2 hours ago |
Since setting up ASV is not easiest thing to do, I quickly ran the aforementioned benchmark that covers this case ( master:
this branch:
So you see a modest speedup for the Int64 / boolean case (although apparently for larger data, the relative benefit diminishes) |
as did i but ok |
Hi, sorry to have caused all the misunderstanding. |
Yes, that's all. |
thanks @Akshatt |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff