BUG: Fix some more arrow CSV tests #52087
Conversation
@@ -158,6 +160,19 @@ def read(self) -> DataFrame:
        parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
        convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
    )

    # Convert all pa.null() cols -> float64
    # TODO: There has to be a better way... right?
What I'm looking for here is something like a combination of select_dtypes + astype.
I couldn't find anything in the pyarrow docs, but maybe I'm just bad at Googling.
Sadly the types_mapper doesn't work, since you can only feed it EA types.
Maybe @jorisvandenbossche knows a better way?
Does Table.select not work?
I don't think there is an easier way. We currently don't have something like select_dtypes, so what you do (iterating over the schema, checking the type, and if needed adapt the new schema) seems the best option.
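For concreteness, a minimal sketch of that schema-iteration approach (an illustration, not the exact PR code; the table variable here stands in for the parsed pyarrow Table):

import pyarrow as pa

table = pa.table({"a": pa.nulls(3), "b": [1, 2, 3]})

# Retype any pa.null() columns in the schema, then cast the table once.
new_schema = table.schema
for i, field in enumerate(table.schema):
    if pa.types.is_null(field.type):
        new_schema = new_schema.set(i, field.with_type(pa.float64()))
table = table.cast(new_schema)
print(table.schema)  # a: double, b: int64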
dtype_backend = self.kwds["dtype_backend"]

# Convert all pa.null() cols -> float64 (non nullable)
# else Int64 (nullable case)
This was really confusing for me, but apparently even convert_dtypes on an all-null column (float64) will change it to an Int64.
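For reference, a minimal repro of that behavior (assuming a recent pandas; the empty set of non-null values passes the integer check vacuously):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([np.nan, np.nan])
>>> s.dtype
dtype('float64')
>>> s.convert_dtypes().dtype
Int64Dtype()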
I am not sure we should replicate that here.
In convert_dtypes, I suppose that is the consequence of trying to cast a float column where all non-null values can faithfully be represented as an integer. Which gives this corner case where this set of values is empty.
But if you don't know the dtype you are starting with (like here), I think using float is safer (it won't give problems later on if you then want to use it for floats and not just integers). Or actually using object dtype might even be safest (as that can represent everything), but of course we don't (yet) have a nullable object dtype, so that's probably not really an option.
The problem is that the C and python parsers already convert to an Int64. Is this something that we need to fix before 2.0?
>>> import pandas as pd
>>> from io import StringIO
>>> f = StringIO("a\n1,")
>>> pd.read_csv(f, dtype_backend="numpy_nullable")
a
1 <NA>
>>> f.seek(0)
0
>>> pd.read_csv(f, dtype_backend="numpy_nullable").dtypes
a Int64
dtype: object
>>> f.seek(0)
0
>>> pd.read_csv(f)
a
1 NaN
>>> f.seek(0)
0
>>> pd.read_csv(f).dtypes
a float64
dtype: object
cc @phofl.
We do this everywhere, so I am +1 on doing Int64 here as well. Initially this came from read_parquet, I think.
pandas/io/_util.py (Outdated)
@@ -8,6 +8,9 @@
def _arrow_dtype_mapping() -> dict:
    pa = import_optional_dependency("pyarrow")
    return {
        # All nulls should still give Float64 not object
        # TODO: This breaks parquet
        # pa.null(): pd.Float64Dtype(),
In what way does it break parquet?
Well, first I hit this error.
>>> import pandas as pd
>>> import pyarrow as pa
>>> a = pa.nulls(0)
>>> pd.Float64Dtype().__from__arrow__(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Float64Dtype' object has no attribute '__from__arrow__'
After patching that (another PR incoming), an empty DataFrame of dtype object (pd.DataFrame({"value": pd.array([], dtype=object)})) now returns a Float64Dtype when roundtripped.
Best guess here is that the types_mapper is somehow overriding the pandas metadata.
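For what it's worth, that guess about precedence can be sketched directly; this assumes a pandas version where Float64Dtype.__from_arrow__ accepts null-typed arrays (i.e. after the follow-up fix mentioned above):

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({"value": pd.array([], dtype=object)})
>>> table = pa.Table.from_pandas(df)  # column type is null; pandas metadata records object
>>> table.to_pandas().dtypes
value    object
dtype: object
>>> table.to_pandas(types_mapper={pa.null(): pd.Float64Dtype()}.get).dtypes
value    Float64
dtype: object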
Anyone want to have another look?
if dtype_backend != "pyarrow":
    new_schema = table.schema
    if dtype_backend == "numpy_nullable":
        new_type = pa.int64()
Can't you just patch the _arrow_dtype_mapping?
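As an illustration of that suggestion (a sketch only: _arrow_dtype_mapping is a private helper, the Int64 choice follows the discussion above, and this again assumes the nullable dtypes' __from_arrow__ accepts null-typed arrays):

>>> import pandas as pd
>>> import pyarrow as pa
>>> from pandas.io._util import _arrow_dtype_mapping  # private helper
>>> mapping = _arrow_dtype_mapping()
>>> mapping[pa.null()] = pd.Int64Dtype()
>>> pa.table({"a": pa.nulls(2)}).to_pandas(types_mapper=mapping.get).dtypes
a    Int64
dtype: object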
Will need to wait for #52223 then.
LGTM, but yeah, since #52223 is close, let's wait for that one and then merge this.
"h": pd.Series( | ||
[pd.NA if parser.engine != "pyarrow" else "", "a"], dtype="string" | ||
), | ||
"h": pd.Series([pd.NA, "a"], dtype="string"), |
Thx for these
thx @lithomas1
@meeseeksdev backport 2.0.x
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the Still Needs Manual Backport label once the PR gets merged. If these instructions are inaccurate, feel free to suggest an improvement.