Incorrect reading of CSV containing large integers Issue#52505 #54679

kvn4 · 2023-08-21T23:20:25Z

mroeschke · 2023-08-22T18:40:14Z

pandas/io/parsers/arrow_parser_wrapper.py

@@ -223,5 +223,10 @@ def read(self) -> DataFrame:
        elif using_pyarrow_string_dtype():
            frame = table.to_pandas(types_mapper=arrow_string_types_mapper())
        else:
-            frame = table.to_pandas()
+            if self.kwds.get("dtype") is not None and issubclass(


Can you use isinstance instead of issubclass?
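
For context on this request: `issubclass` expects a class as its first argument and raises `TypeError` when given an instance, whereas `isinstance` checks the object itself. The hunk above is truncated, so the exact arguments the PR passed are not shown. The sketch below only illustrates the general difference, with `dtype_arg` as a hypothetical stand-in for the value of `self.kwds.get("dtype")`:

```python
# Hypothetical stand-in for the user-supplied dtype mapping.
dtype_arg = {"ID_DEAL": "Int64"}

# isinstance inspects the object directly.
is_mapping = isinstance(dtype_arg, dict)

# issubclass only accepts classes, so the instance must be wrapped in type().
is_mapping_via_type = issubclass(type(dtype_arg), dict)

# Passing the instance itself to issubclass raises TypeError.
try:
    issubclass(dtype_arg, dict)
    raised = False
except TypeError:
    raised = True
```

Both checks agree once `type()` is applied, but `isinstance` is the shorter and more conventional spelling for testing a value.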

mroeschke

This needs tests and a whatsnew note in v2.2.0.rst

mroeschke · 2023-08-23T15:54:25Z

doc/source/whatsnew/v2.2.0.rst

@@ -139,7 +139,7 @@ Timezones

 Numeric
 ^^^^^^^
-
+- Bug in :func:`_read`, pyarrow engine not defaulting to float64 causing precision errors when specifying a dtype; fixed by explicitly setting dtype if dtype not none and isinstance of dict (:issue:`52505`)


Can you phrase this in terms of a public API?

mroeschke · 2023-08-23T15:55:15Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+
+def test_accurate_parsing_of_large_integers(all_parsers):
+    # GH#52505
+    data = """SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL


Can you reduce the unnecessary data for testing?

mroeschke · 2023-08-23T15:55:44Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+MSFT,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
+NVDA,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000"""
+    orders = pd.read_csv(StringIO(data), dtype={"ID_DEAL": pd.Int64Dtype()})
+    print(len(orders.query("ID_DEAL==2023552585717263360", engine="python")))


Suggested change

-    print(len(orders.query("ID_DEAL==2023552585717263360", engine="python")))

mroeschke · 2023-08-23T15:56:12Z

pandas/tests/io/parser/dtypes/test_dtypes_basic.py

+    orders = pd.read_csv(StringIO(data), dtype={"ID_DEAL": pd.Int64Dtype()})
+    print(len(orders.query("ID_DEAL==2023552585717263360", engine="python")))
+    tm.assert_equal(
+        len(orders.query("ID_DEAL==2023552585717263360", engine="python")), 2


Can you not use query here?

Also a plain assert should be fine

mroeschke · 2023-08-23T21:16:48Z

doc/source/whatsnew/v2.2.0.rst

@@ -139,7 +139,7 @@ Timezones

 Numeric
 ^^^^^^^
-
+- Bug in :func:`_read`, pyarrow engine defaulting to float64 causing rounding errors for large integers; now processes input appropriately (:issue:`52505`)


Suggested change

-- Bug in :func:`_read`, pyarrow engine defaulting to float64 causing rounding errors for large integers; now processes input appropriately (:issue:`52505`)
+- Bug in :func:`read_csv` with `engine="pyarrow"` causing rounding errors for large integers (:issue:`52505`)
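
The rounding described in this note can be illustrated in plain Python. This is a minimal sketch of the float64 round trip only, assuming (as the note states) that the values passed through float64; it does not reproduce the pyarrow code path. The integer is one of the `ID_DEAL` values from the test data in this PR.

```python
# An ID_DEAL value from the test data; it needs 61 bits.
exact = 2023552585717263361

# float64 carries a 53-bit significand, so integers above 2**53 are
# rounded to the nearest representable value.
via_float = int(float(exact))

print(exact > 2**53)       # True
print(via_float == exact)  # False
print(via_float)           # 2023552585717263360
```

At this magnitude adjacent float64 values are 256 apart, which is why neighbouring identifiers collapse onto the same number.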

mroeschke · 2023-08-23T21:18:25Z

pandas/io/parsers/arrow_parser_wrapper.py

+            if self.kwds.get("dtype") is not None and isinstance(
+                type(self.kwds.get("dtype")), dict


Suggested change

-            if self.kwds.get("dtype") is not None and isinstance(
-                type(self.kwds.get("dtype")), dict
+            if isinstance(self.kwds.get("dtype"), dict):
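
A short sketch of why the one-line form is preferable, using a hypothetical `kwds` dictionary in place of `self.kwds`: `type(...)` returns the class `dict`, which is not itself an instance of `dict`, and `isinstance(None, dict)` is already `False`, so the explicit `None` check adds nothing.

```python
# Hypothetical stand-in for self.kwds in the parser wrapper.
kwds = {"dtype": {"ID_DEAL": "Int64"}}

# Original form: type(...) returns the class `dict`, and a class is
# not an instance of dict, so this condition is always False.
original = kwds.get("dtype") is not None and isinstance(
    type(kwds.get("dtype")), dict
)

# Suggested form: isinstance on the value itself.
suggested = isinstance(kwds.get("dtype"), dict)

# With no dtype given, .get() returns None and the check is simply False.
missing = isinstance({}.get("dtype"), dict)
```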

mroeschke · 2023-08-24T16:26:00Z

Thanks @kvn4

mend

a7d8838

kvn4 changed the title ~~mend~~ Incorrect reading of CSV containing large integers Issue#52505 Aug 21, 2023

mroeschke reviewed Aug 22, 2023


mroeschke requested changes Aug 22, 2023


mroeschke added the Bug, IO CSV (read_csv, to_csv), and Arrow (pyarrow functionality) labels Aug 22, 2023

mmend

4429c6a

mroeschke reviewed Aug 23, 2023


mmend

6f23751

mroeschke reviewed Aug 23, 2023


mmend

e2d51c4

mroeschke approved these changes Aug 24, 2023


mroeschke added this to the 2.2 milestone Aug 24, 2023

mroeschke merged commit 766e2fc into pandas-dev:main Aug 24, 2023
