Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Support uncertainties as EA Dtype #53970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support uncertainties as EA Dtype #53970
Changes from 7 commits
4c04e30
a165278
0b833aa
2a1913b
7d8ba79
8ada662
42ff1a8
e53a62c
b8083a9
93be04f
ed69055
65cf17a
2f92811
e572b3d
2964f44
6644fe0
c9e0de3
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this is decribing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is describing a change of behavior of the
GroupBy.first
helper function to not assume that np.nan is a valid NA value for the EA in question. The idea of the change (which, if valid, should also be applied toGroupBy.last
) is that if we cannot find any non-NA values in the array, but if there are NA values, the return value should be the first (or last) NA value in the array. I did this because I thought that my EA could only deal with NaN values that were UFloats (having a nominal value and a std_dev value). But as I traced through the code and observed things more closely, I see that the EA does have tolerance for np.nan as an element among UFloats, giving np.nan as the nominal value and zero as the std_dev. So I will remove this change (by being less strict in some of my own assumptions in checking EA consistency).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would you even get here with uncertainties? is your EA subclassing BaseMaskedArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EA itself is a very vanilla EA. The test suite instantiates the EA in myriad ways for test purposes. In this case the EA gets instantiated as a MaskedArray (which derives from BaseMaskedArray) for testing MaskedArray functionality. I think the test suite has every right to rigorously instantiate EAs in every way imaginable, which it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im still confused here. the EA i see in hgrecco/pint-pandas#140 subclasses ExtensionArray but not BaseMaskedArray, so i don't see how the code change here would affect it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK...I looked more deeply into this when I get back the computer where I did original work and didn't upload some notes to the cloud. I'm pretty sure it's due to some conditions created when trying to merge two DataFrames that I dummied up in my notes. I'll create a test case that can go into the Pandas test suite. It won't depend on uncertainties, but if and when used with uncertainties, it should trigger the path in question. Should be done by Monday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that was a lot more challenging to reconstruct than I expected, but I re-created the condition. TL;DR: now that I use pd.NA as my NA value for my particular EA, this change that caught your eye is indeed unnecessary and I can revert it.
But, since you asked, here's the path I took initially. Starting with
PintType
we define this:The idea (before settling on pd.NA, which works without this particular change) was to use a NaN to which we could cast any numerical type (we cannot downcast uncertainties to
float
).In the test code we set up
data_missing
anddata_missing_for_sorting
to useufloat(np.nan, 0)
as the NA value where appropriate. The test casetest_groupby_agg_extension
usesdata_for_grouping
, which is called withoutnumeric_dtype
asNone
, and this is where we start going off the rails.Here's the
data_for_grouping
code:The way the test code is run, because
numeric_dtype
isNone
we get:When we execute
expected = df.iloc[[0, 2, 4, 7]]
there comes a point where we come here (/Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/internals/managers.py(646)<listcomp>()
):With
fill_value
as None, we inherit the fill value from the PintArray, which is the uncertain NaN. Here's the stack trace down from there:In this particular case, because none of the indexes are -1, no filling is going to happen. But should we have things to fill, they'd be filled with the uncertain nan. And because this code really doesn't want to fill add new NaNs to the array, I put this extra defensive code in. And that's how we hit this condition.
I'm going to take this extraneous test out (
fill_value == fill_value
). Does this answer your question?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change to track the identical changes in test_integer.py and test_floating.py. The it does not matter whether the NA values in this case are pd.NA, np.nan, or a mixture. But I did think it better to make the three files consistent. I'm happy to revert this change if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this should be removed. It's ideal to minimize extraneous changes