WIP: Maintain Int64 Precision on Construction #26272
Conversation
Hello @WillAyd! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
tm.assert_frame_equal(result, expected)
# See why tm.assert_frame_equal doesn't fail...
Not sure I follow what's going on here exactly?
I didn't dig through the assert_* methods, but I'm pretty sure there is internal casting to float64 going on, so you can lose precision pretty easily.
Here's an example that passes without issue:
>>> ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
>>> ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
>>> tm.assert_series_equal(ser1, ser2) # should fail but doesn't
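Outside of the assert helpers, the collision is visible with plain Python floats (a minimal illustration, not pandas code):

```python
# Both integers round to the same IEEE-754 double: at a magnitude of
# ~1.5e18 the spacing between adjacent doubles is 256, far larger than
# the difference of 1 between these two values.
a = 1556559573141592653
b = 1556559573141592654
print(float(a) == float(b))  # True: the values collide after a float64 cast
print(a == b)                # False: they are distinct as integers
```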
Totally. I think quite a few more need to go into test_integers for construction.
When I started this I wasn't expecting the precision issue with DataFrame construction and the assert methods. It ended up being a Pandora's box pretty quickly, so I'm posting this for discussion to see how we want to tackle the various components.
Hmm, that's awkward. We should open an issue about this. Our handling of large numbers is definitely more bug-prone across the codebase, relatively speaking.
FWIW I can get assert_series_equal to work if check_exact=True is passed. My guess is that the float conversion happens due to defaulting to checking equality up to a given precision:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+476.g9feb3ad92c'
In [2]: ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
In [3]: ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
In [4]: pd.util.testing.assert_series_equal(ser1, ser2)
In [5]: pd.util.testing.assert_series_equal(ser1, ser2, check_exact=True)
---------------------------------------------------------------------------
AssertionError: Series are different
Series values are different (50.0 %)
[left]: [1556559573141592653, 1556559573141592654]
[right]: [1556559573141592654, 1556559573141592654]
I guess it could make sense to default to exact checks for integer dtypes instead of doing precision checks, but that doesn't really seem like a clean solution since it'd require implementing dtype-specific behavior.
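The same tolerance-versus-exact distinction can be reproduced with plain NumPy (a sketch of the underlying issue, not pandas' actual comparison code):

```python
import numpy as np

left = np.array([1556559573141592653, 1556559573141592654], dtype=np.int64)
right = np.array([1556559573141592654, 1556559573141592654], dtype=np.int64)

# A tolerance-based check goes through float64, so it misses a difference
# of 1 at this magnitude ...
print(bool(np.isclose(left, right).all()))  # True
# ... while an exact elementwise integer comparison catches it.
print(bool((left == right).all()))          # False
```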
@@ -509,3 +509,27 @@ def test_numeric_dtype(all_parsers, dtype):
    result = parser.read_csv(StringIO(data), header=None, dtype=dtype)
    tm.assert_frame_equal(expected, result)


def test_intna_precision(all_parsers):
Going to need several more tests to check this behavior, but we can revisit this point after the actual implementation has been critiqued.
expected = DataFrame([
    [1556559573141592653],
    [1556559573141592654],
    [0],
Just to clarify the TODO below: if you use np.nan, the precision is lost before the DataFrame gets constructed, hence why I've constructed this with just integers first and subsequently assigned nan here.
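A minimal NumPy illustration of why the integers-first construction matters (assuming nothing beyond NumPy itself):

```python
import numpy as np

big = 1556559573141592653

# Mixing np.nan with integers forces a float64 array, silently rounding
# the large value before any DataFrame/Int64 logic ever runs.
mixed = np.array([big, np.nan])
print(mixed.dtype)           # float64
print(int(mixed[0]) == big)  # False: precision already lost

# Building from integers only keeps exact int64 values; missing entries
# can then be tracked separately, e.g. with a boolean mask.
exact = np.array([big, 0], dtype=np.int64)
mask = np.array([False, True])
print(exact[0] == big)       # True
```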
# undesirable cast to float which may lose precision
mask = isna(strings)
masked_strings = np.where(mask, 0, strings)
I think we can do this instead in to_numeric itself.
Was thinking that too, but the downside I think is we'd have to return both an array of values and a mask from that method, otherwise we'd keep hitting the same issue (unless we have pd.to_numeric construct the Int64 array directly).
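One way to picture the "values plus mask" return discussed here is a small standalone sketch (the helper name and missing-value convention are hypothetical, not pandas API):

```python
import numpy as np

def parse_integers_with_mask(strings):
    """Hypothetical sketch: parse an object array of strings to int64
    values plus a missing-value mask, avoiding any float64 round-trip."""
    strings = np.asarray(strings, dtype=object)
    mask = np.array([s is None or s == "" for s in strings])
    # Fill missing slots with "0" so the integer parse never sees None/NaN.
    filled = np.where(mask, "0", strings)
    values = np.array([int(s) for s in filled], dtype=np.int64)
    return values, mask

values, mask = parse_integers_with_mask(["1556559573141592653", None])
print(values[0] == 1556559573141592653)  # True: parsed exactly
print(mask.tolist())                     # [False, True]
```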
> (unless we have pd.to_numeric construct the Int64 array directly)

Yes, that is a desirable outcome.
> (unless we have pd.to_numeric construct the Int64 array directly)
>
> yes that is a desirable outcome
I agree that is a desirable outcome. However, we will need to think a bit about how to achieve it, as it would be a breaking change, and one probably wants to be able to choose whether you get a nullable integer array or not. A keyword to trigger the use of nullable ints as output? (Not very nice, but I can't directly think of good solutions.)
Yes, but a desirable one; I don't think it's a big deal to simply change this with an appropriate example in the whatsnew. downcast='integer' and 'unsigned' would simply return Int64/UInt64 arrays; this would only be a breaking change if you had floats previously.
@WillAyd can you merge master and update
I'll give this another look over next few days
Haven't looked at this in a few months - closing to clear up queue
    git diff upstream/master -u -- "*.py" | flake8 --diff

This is not the final implementation, but I'm providing it for discussion and feedback. Consideration points are:

- `_from_sequence_of_strings` was hacked to just return an IntegerArray, which isn't correct. Do we want to potentially pass through a mask to `_from_sequence` -> `integer_array` -> `IntegerArray()` instead? I think there is also a bug when constructing the DataFrame correctly which this might fix
- Tests in `arrays.test_integer` which may either supplement or replace the CSV test here
- `tm.assert_frame_equal` and large integers, where precision could be lost. This probably applies to all of the `tm.assert_*_equal` functions
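The last two points can be connected with a hedged sketch of the kind of precision test described above (the test name and placement are illustrative, not from the PR): constructing a nullable Int64 array from large Python ints should keep the values exactly, with missing entries tracked separately rather than via a float NaN.

```python
import pandas as pd

def test_int64_construction_keeps_precision():
    vals = [1556559573141592653, 1556559573141592654]
    arr = pd.array(vals + [None], dtype="Int64")
    # Values survive exactly; the missing entry is masked, not a rounded NaN.
    assert [int(x) for x in arr[:2]] == vals
    assert pd.isna(arr[2])

test_int64_construction_keeps_precision()
print("ok")
```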