-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
WIP: Maintain Int64 Precision on Construction #26272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -509,3 +509,27 @@ def test_numeric_dtype(all_parsers, dtype): | |
|
||
result = parser.read_csv(StringIO(data), header=None, dtype=dtype) | ||
tm.assert_frame_equal(expected, result) | ||
|
||
|
||
def test_intna_precision(all_parsers): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Going to need several more tests to check this behavior, but we can get revisit this point after the actual implementation has been critiqued. |
||
parser = all_parsers | ||
data = "1556559573141592653\n1556559573141592654\n\n1556559573141592655" | ||
dtype = 'Int64' | ||
|
||
expected = DataFrame([ | ||
[1556559573141592653], | ||
[1556559573141592654], | ||
[0], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to clarify the below TODO if you use |
||
[1556559573141592655], | ||
], dtype=dtype) | ||
expected.iloc[2] = np.nan # TODO: fix general bug on df construction | ||
|
||
result = parser.read_csv(StringIO(data), header=None, dtype=dtype, | ||
skip_blank_lines=False) | ||
|
||
tm.assert_frame_equal(result, expected) | ||
|
||
# See why tm.assert_frame_equal doesn't fail... | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure I follow what's going here exactly? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't dig through the assert_*_methods but I'm pretty sure there is internal casting to float64 going on, so you can lose precision pretty easily. Here's an example that passes without issue: >>> ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
>>> ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
>>> tm.assert_series_equal(ser1, ser2) # should fail but doesn't There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Totally. I think quite a few more need to go into test_integers for construction. When I started this I wasn't expecting the precision issue with DataFrame construction and the assert methods. Ended up being a Pandora's box pretty quick, so posting this for discussion to see how we want to tackle the various components There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm...that's awkward. Should open an issue about this. Our handling of large numbers is definitely more bug-prone across the codebase relatively speaking. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW I can get In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+476.g9feb3ad92c'
In [2]: ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
In [3]: ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
In [4]: pd.util.testing.assert_series_equal(ser1, ser2)
In [5]: pd.util.testing.assert_series_equal(ser1, ser2, check_exact=True)
---------------------------------------------------------------------------
AssertionError: Series are different
Series values are different (50.0 %)
[left]: [1556559573141592653, 1556559573141592654]
[right]: [1556559573141592654, 1556559573141592654] I guess it could make sense to default to exact checks for integer dtypes instead of doing precision checks, but it doesn't really seem like a clean solution since it'd require implementing dtype specific behavior. |
||
assert result.iloc[0] == expected.iloc[0] | ||
assert result.iloc[1] == expected.iloc[1] | ||
assert result.iloc[3] == expected.iloc[3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can do this instead in to_numeric itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking that too but the downside I think is we'd have to return both an array of values and a mask from that method otherwise we'd keep hitting the same issue (unless we have pd.to_numeric construct the Int64 array directly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that is a desirable outcome