
WIP: Maintain Int64 Precision on Construction #26272


Closed
wants to merge 3 commits

Conversation

WillAyd
Member

@WillAyd WillAyd commented May 3, 2019

This is not the final implementation; I'm posting it for discussion and feedback. Points to consider:

  • _from_sequence_of_strings was hacked to just return an IntegerArray, which isn't correct. Do we want to potentially pass through a mask to _from_sequence -> integer_array -> IntegerArray() instead? I think there is also a bug when constructing the DataFrame which this might fix (see the sketch after this list)
  • I'm not super familiar with the IntegerArray tests just yet, but I think I need to add a construction test in arrays.test_integer, which may either supplement or replace the CSV test herein
  • There appears to be a bug in tm.assert_frame_equal where precision can be lost with large integers. This probably applies to all of the tm.assert_*_equal functions
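
A minimal sketch of the mask pass-through idea from the first bullet, assuming strings are parsed straight to int64 and IntegerArray is handed a values/mask pair; the function name and the 0 placeholder are illustrative, not this PR's actual code:

import numpy as np
import pandas as pd

def from_sequence_of_strings(strings):
    # Parse to int64 up front so values never round-trip through
    # float64, whose 53-bit mantissa truncates integers this large.
    arr = np.asarray(strings, dtype=object)
    mask = pd.isna(arr)
    # 0 is an arbitrary placeholder for missing slots; IntegerArray
    # ignores values wherever mask is True.
    values = np.array([0 if m else int(s) for s, m in zip(arr, mask)],
                      dtype="int64")
    return pd.arrays.IntegerArray(values, mask)

from_sequence_of_strings(["1556559573141592653", None])  # exact Int64 values, one NA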

@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 3, 2019
@pep8speaks

Hello @WillAyd! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 529:1: W293 blank line contains whitespace

@WillAyd WillAyd changed the title WIP: Int64 Precision on Construction WIP: Maintain Int64 Precision on Construction May 3, 2019

tm.assert_frame_equal(result, expected)

# See why tm.assert_frame_equal doesn't fail...
Member

Not sure I follow what's going on here exactly?

Member Author

I didn't dig through the assert_*_equal methods, but I'm pretty sure there is internal casting to float64 going on, so you can lose precision pretty easily.

Here's an example that passes without issue:

>>> ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
>>> ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
>>> tm.assert_series_equal(ser1, ser2) # should fail but doesn't

Member Author

Totally. I think quite a few more need to go into test_integers for construction.

When I started this I wasn't expecting the precision issue with DataFrame construction and the assert methods. It ended up being a Pandora's box pretty quickly, so I'm posting this for discussion to see how we want to tackle the various components.

Member

@gfyoung gfyoung May 3, 2019

Hmm... that's awkward. We should open an issue about this. Relatively speaking, our handling of large numbers is definitely more bug-prone across the codebase.

Member

FWIW I can get assert_series_equal to catch the difference if check_exact=True is passed. My guess is that the float conversion happens because the default is to check equality only up to a given precision:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+476.g9feb3ad92c'

In [2]: ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')

In [3]: ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')

In [4]: pd.util.testing.assert_series_equal(ser1, ser2)  # passes silently

In [5]: pd.util.testing.assert_series_equal(ser1, ser2, check_exact=True)
---------------------------------------------------------------------------
AssertionError: Series are different

Series values are different (50.0 %)
[left]:  [1556559573141592653, 1556559573141592654]
[right]: [1556559573141592654, 1556559573141592654]

I guess it could make sense to default to exact checks for integer dtypes instead of doing precision checks, but that doesn't really seem like a clean solution since it'd require implementing dtype-specific behavior.
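
A minimal sketch of that dtype-specific default, assuming the assert machinery treats an unspecified check_exact as None; resolve_check_exact is a hypothetical helper name, while is_integer_dtype is the real pandas utility:

from pandas.api.types import is_integer_dtype

def resolve_check_exact(check_exact, dtype):
    # None means the caller didn't specify: compare integer dtypes
    # exactly, and keep the approximate precision check for floats.
    if check_exact is None:
        return is_integer_dtype(dtype)
    return bool(check_exact)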

@@ -509,3 +509,27 @@ def test_numeric_dtype(all_parsers, dtype):

result = parser.read_csv(StringIO(data), header=None, dtype=dtype)
tm.assert_frame_equal(expected, result)


def test_intna_precision(all_parsers):
Member

Going to need several more tests to check this behavior, but we can revisit this point after the actual implementation has been critiqued.

expected = DataFrame([
[1556559573141592653],
[1556559573141592654],
[0],
Member Author

Just to clarify the below TODO: if you use np.nan, the precision is lost before the DataFrame gets constructed, hence why I've constructed this with integers first and subsequently assigned NaN here.
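
For illustration, a quick demonstration of that cast; the rounded value shown is what float64 produces at this magnitude (its spacing between representable integers here is 256):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1556559573141592653], [np.nan]])
df.dtypes[0]         # float64: the np.nan forced the cast
int(df.iloc[0, 0])   # 1556559573141592576, not ...653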

# undesirable cast to float which may lose precision
mask = isna(strings)
masked_strings = np.where(mask, 0, strings)

Contributor

I think we can do this in to_numeric itself instead.

Member Author

I was thinking that too, but the downside is that we'd have to return both an array of values and a mask from that method, otherwise we'd keep hitting the same issue (unless we have pd.to_numeric construct the Int64 array directly).

Contributor

(unless we have pd.to_numeric construct the Int64 array directly)

Yes, that is a desirable outcome.


@jorisvandenbossche
Member

unless we have pd.to_numeric construct the Int64 array directly

yes that is a desirable outcome

I agree that is a desirable outcome. However, we will need to think a bit about how to achieve it, as it would be a breaking change, and one probably wants to be able to choose whether or not to get a nullable integer array.

A keyword to trigger nullable ints as output? (Not very nice, but I can't directly think of a better solution...)
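
For concreteness, a hedged sketch of what the opt-in could look like; the nullable keyword is purely hypothetical and does not exist in pd.to_numeric:

import pandas as pd

s = pd.Series(["1556559573141592653", None])

pd.to_numeric(s)                   # today: float64, precision lost
# pd.to_numeric(s, nullable=True)  # proposed: Int64, exact values kept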

@jreback
Contributor

jreback commented May 12, 2019

I agree that is a desirable outcome. However, we will need to think a bit about how to achieve this, as it would be a breaking change, and one probably wants to be able to choose if you get nullable integer array or not.

Yes, but a desirable one; I don't think it's a big deal to simply change this, with an appropriate example in the whatsnew. downcast='integer' and 'unsigned' would simply return Int64/UInt64 arrays; this would only be a breaking change if you had floats previously.

@jreback
Contributor

jreback commented Jun 8, 2019

@WillAyd can you merge master and update?

@WillAyd
Member Author

WillAyd commented Jun 8, 2019

I'll give this another look over the next few days.

@WillAyd
Member Author

WillAyd commented Jul 11, 2019

Haven't looked at this in a few months - closing to clear up the queue.

Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Development

Successfully merging this pull request may close these issues.

Int64Dtype in read_csv leads to unexpected values
6 participants