WIP: Maintain Int64 Precision on Construction #26272
Conversation
Hello @WillAyd! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
tm.assert_frame_equal(result, expected)
# See why tm.assert_frame_equal doesn't fail...
Not sure I follow what's going on here exactly?
I didn't dig through the assert_* methods, but I'm pretty sure there is internal casting to float64 going on, so you can lose precision pretty easily.
Here's an example that passes without issue:
>>> ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
>>> ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
>>> tm.assert_series_equal(ser1, ser2) # should fail but doesn't
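Outside of the assert helpers, the collision is visible with plain Python floats (a minimal illustration, not pandas code):

```python
# Both integers round to the same IEEE-754 double: at a magnitude of
# ~1.5e18 the spacing between adjacent doubles is 256, far larger than
# the difference of 1 between these two values.
a = 1556559573141592653
b = 1556559573141592654
print(float(a) == float(b))  # True: the values collide after a float64 cast
print(a == b)                # False: they are distinct as integers
```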
Totally. I think quite a few more need to go into test_integers for construction.
When I started this I wasn't expecting the precision issue with DataFrame construction and the assert methods. It ended up being a Pandora's box pretty quickly, so I'm posting this for discussion to see how we want to tackle the various components.
Hmm, that's awkward. We should open an issue about this. Our handling of large numbers is definitely more bug-prone across the codebase, relatively speaking.
FWIW I can get assert_series_equal to work if check_exact=True is passed. My guess is that the float conversion happens due to defaulting to checking equality up to a given precision:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+476.g9feb3ad92c'
In [2]: ser1 = pd.Series([1556559573141592653, 1556559573141592654], dtype='Int64')
In [3]: ser2 = pd.Series([1556559573141592654, 1556559573141592654], dtype='Int64')
In [4]: pd.util.testing.assert_series_equal(ser1, ser2)
In [5]: pd.util.testing.assert_series_equal(ser1, ser2, check_exact=True)
---------------------------------------------------------------------------
AssertionError: Series are different
Series values are different (50.0 %)
[left]: [1556559573141592653, 1556559573141592654]
[right]: [1556559573141592654, 1556559573141592654]
I guess it could make sense to default to exact checks for integer dtypes instead of doing precision checks, but that doesn't really seem like a clean solution since it'd require implementing dtype-specific behavior.
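The same tolerance-versus-exact distinction can be reproduced with plain NumPy (a sketch of the underlying issue, not pandas' actual comparison code):

```python
import numpy as np

left = np.array([1556559573141592653, 1556559573141592654], dtype=np.int64)
right = np.array([1556559573141592654, 1556559573141592654], dtype=np.int64)

# A tolerance-based check goes through float64, so it misses a difference
# of 1 at this magnitude ...
print(bool(np.isclose(left, right).all()))  # True
# ... while an exact elementwise integer comparison catches it.
print(bool((left == right).all()))          # False
```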
@@ -509,3 +509,27 @@ def test_numeric_dtype(all_parsers, dtype):
    result = parser.read_csv(StringIO(data), header=None, dtype=dtype)
    tm.assert_frame_equal(expected, result)


def test_intna_precision(all_parsers):
Going to need several more tests to check this behavior, but we can revisit this point after the actual implementation has been critiqued.
expected = DataFrame([
    [1556559573141592653],
    [1556559573141592654],
    [0],
Just to clarify the TODO below: if you use np.nan, the precision is lost before the DataFrame gets constructed, hence why I've constructed this with just integers first and subsequently assigned nan here.
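A minimal NumPy illustration of why the integers-first construction matters (assuming nothing beyond NumPy itself):

```python
import numpy as np

big = 1556559573141592653

# Mixing np.nan with integers forces a float64 array, silently rounding
# the large value before any DataFrame/Int64 logic ever runs.
mixed = np.array([big, np.nan])
print(mixed.dtype)           # float64
print(int(mixed[0]) == big)  # False: precision already lost

# Building from integers only keeps exact int64 values; missing entries
# can then be tracked separately, e.g. with a boolean mask.
exact = np.array([big, 0], dtype=np.int64)
mask = np.array([False, True])
print(exact[0] == big)       # True
```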
# undesirable cast to float which may lose precision
mask = isna(strings)
masked_strings = np.where(mask, 0, strings)
I think we can do this instead in to_numeric itself.
Was thinking that too, but the downside I think is we'd have to return both an array of values and a mask from that method, otherwise we'd keep hitting the same issue (unless we have pd.to_numeric construct the Int64 array directly).
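One way to picture the "values plus mask" return discussed here is a small standalone sketch (the helper name and missing-value convention are hypothetical, not pandas API):

```python
import numpy as np

def parse_integers_with_mask(strings):
    """Hypothetical sketch: parse an object array of strings to int64
    values plus a missing-value mask, avoiding any float64 round-trip."""
    strings = np.asarray(strings, dtype=object)
    mask = np.array([s is None or s == "" for s in strings])
    # Fill missing slots with "0" so the integer parse never sees None/NaN.
    filled = np.where(mask, "0", strings)
    values = np.array([int(s) for s in filled], dtype=np.int64)
    return values, mask

values, mask = parse_integers_with_mask(["1556559573141592653", None])
print(values[0] == 1556559573141592653)  # True: parsed exactly
print(mask.tolist())                     # [False, True]
```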
> (unless we have pd.to_numeric construct the Int64 array directly)

Yes, that is a desirable outcome.
> (unless we have pd.to_numeric construct the Int64 array directly)
>
> yes that is a desirable outcome
I agree that is a desirable outcome. However, we will need to think a bit about how to achieve it, as it would be a breaking change, and one probably wants to be able to choose whether you get a nullable integer array or not. A keyword to trigger the use of nullable ints as output? (Not very nice, but I can't directly think of good solutions.)
Yes, but a desirable one; I don't think it's a big deal to simply change this with an appropriate example in the whatsnew. downcast='integer' and 'unsigned' would simply return Int64/UInt64 arrays; this would only be a breaking change if you had floats previously.
@WillAyd can you merge master and update
I'll give this another look over next few days
Haven't looked at this in a few months - closing to clear up queue
    git diff upstream/master -u -- "*.py" | flake8 --diff

This is not the final implementation, but I'm providing it for discussion and feedback. Consideration points are:

- `_from_sequence_of_strings` was hacked to just return an IntegerArray, which isn't correct. Do we want to potentially pass through a mask to `_from_sequence` -> `integer_array` -> `IntegerArray()` instead? I think there is also a bug when constructing the DataFrame correctly which this might fix
- Tests in `arrays.test_integer` which may either supplement or replace the CSV test here
- `tm.assert_frame_equal` and large integers, where precision could be lost. This probably applies to all of the `tm.assert_*_equal` functions
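The last two points can be connected with a hedged sketch of the kind of precision test described above (the test name and placement are illustrative, not from the PR): constructing a nullable Int64 array from large Python ints should keep the values exactly, with missing entries tracked separately rather than via a float NaN.

```python
import pandas as pd

def test_int64_construction_keeps_precision():
    vals = [1556559573141592653, 1556559573141592654]
    arr = pd.array(vals + [None], dtype="Int64")
    # Values survive exactly; the missing entry is masked, not a rounded NaN.
    assert [int(x) for x in arr[:2]] == vals
    assert pd.isna(arr[2])

test_int64_construction_keeps_precision()
print("ok")
```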