BUG: Fix some more arrow CSV tests #52087
Conversation
@@ -158,6 +160,19 @@ def read(self) -> DataFrame:
        parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
        convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
    )

    # Convert all pa.null() cols -> float64
    # TODO: There has to be a better way... right?
What I'm looking for here is something like a combination of select_dtypes + astype.
I couldn't find anything in the pyarrow docs, but maybe I'm just bad at Googling.
Sadly the types_mapper doesn't work, since you can only feed it EA types.
Maybe @jorisvandenbossche knows a better way?
Does Table.select not work?
I don't think there is an easier way. We currently don't have something like select_dtypes, so what you do (iterating over the schema, checking the type, and if needed adapt the new schema) seems the best option.
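For concreteness, a minimal sketch of that schema-iteration approach (an illustration, not the exact PR code; the table variable here stands in for the parsed pyarrow Table):

import pyarrow as pa

table = pa.table({"a": pa.nulls(3), "b": [1, 2, 3]})

# Retype any pa.null() columns in the schema, then cast the table once.
new_schema = table.schema
for i, field in enumerate(table.schema):
    if pa.types.is_null(field.type):
        new_schema = new_schema.set(i, field.with_type(pa.float64()))
table = table.cast(new_schema)
print(table.schema)  # a: double, b: int64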
dtype_backend = self.kwds["dtype_backend"]

# Convert all pa.null() cols -> float64 (non nullable)
# else Int64 (nullable case)
This was really confusing for me, but apparently even convert_dtypes on an all-null column (float64) will change it to an Int64.
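For reference, a minimal repro of that behavior (assuming a recent pandas; the empty set of non-null values passes the integer check vacuously):

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([np.nan, np.nan])
>>> s.dtype
dtype('float64')
>>> s.convert_dtypes().dtype
Int64Dtype()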
I am not sure we should replicate that here.
In convert_dtypes, I suppose that is the consequence of trying to cast a float column where all non-null values can faithfully be represented as an integer. Which gives this corner case where this set of values is empty.
But if you don't know the dtype you are starting with (like here), I think using float is safer (it won't give problems later on if you then want to use it for floats and not just integers). Or actually using object dtype might even be safest (as that can represent everything), but of course we don't (yet) have a nullable object dtype, so that's probably not really an option.
The problem is that the C and python parsers already convert to an Int64. Is this something that we need to fix before 2.0?
>>> import pandas as pd
>>> from io import StringIO
>>> f = StringIO("a\n1,")
>>> pd.read_csv(f, dtype_backend="numpy_nullable")
a
1 <NA>
>>> f.seek(0)
0
>>> pd.read_csv(f, dtype_backend="numpy_nullable").dtypes
a Int64
dtype: object
>>> f.seek(0)
0
>>> pd.read_csv(f)
a
1 NaN
>>> f.seek(0)
0
>>> pd.read_csv(f).dtypes
a float64
dtype: object
cc @phofl.
We do this everywhere, so I am +1 on doing Int64 here as well. Initially this came from read_parquet, I think.
pandas/io/_util.py (Outdated)
@@ -8,6 +8,9 @@
def _arrow_dtype_mapping() -> dict:
    pa = import_optional_dependency("pyarrow")
    return {
        # All nulls should still give Float64 not object
        # TODO: This breaks parquet
        # pa.null(): pd.Float64Dtype(),
In what way does it break parquet?
Well, first I hit this error.
>>> import pandas as pd
>>> import pyarrow as pa
>>> a = pa.nulls(0)
>>> pd.Float64Dtype().__from__arrow__(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Float64Dtype' object has no attribute '__from__arrow__'
After patching that (another PR incoming), an empty DataFrame of dtype object (pd.DataFrame({"value": pd.array([], dtype=object)})) now returns a Float64Dtype when roundtripped.
Best guess here is that the types_mapper is somehow overriding the pandas metadata.
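For what it's worth, that guess about precedence can be sketched directly; this assumes a pandas version where Float64Dtype.__from_arrow__ accepts null-typed arrays (i.e. after the follow-up fix mentioned above):

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({"value": pd.array([], dtype=object)})
>>> table = pa.Table.from_pandas(df)  # column type is null; pandas metadata records object
>>> table.to_pandas().dtypes
value    object
dtype: object
>>> table.to_pandas(types_mapper={pa.null(): pd.Float64Dtype()}.get).dtypes
value    Float64
dtype: object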
Anyone want to have another look?
if dtype_backend != "pyarrow":
    new_schema = table.schema
    if dtype_backend == "numpy_nullable":
        new_type = pa.int64()
Can't you just patch the _arrow_dtype_mapping?
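As an illustration of that suggestion (a sketch only: _arrow_dtype_mapping is a private helper, the Int64 choice follows the discussion above, and this again assumes the nullable dtypes' __from_arrow__ accepts null-typed arrays):

>>> import pandas as pd
>>> import pyarrow as pa
>>> from pandas.io._util import _arrow_dtype_mapping  # private helper
>>> mapping = _arrow_dtype_mapping()
>>> mapping[pa.null()] = pd.Int64Dtype()
>>> pa.table({"a": pa.nulls(2)}).to_pandas(types_mapper=mapping.get).dtypes
a    Int64
dtype: object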
Will need to wait for #52223 then.
LGTM, but yeah, since #52223 is close, let's wait for that one and then merge this.
"h": pd.Series( | ||
[pd.NA if parser.engine != "pyarrow" else "", "a"], dtype="string" | ||
), | ||
"h": pd.Series([pd.NA, "a"], dtype="string"), |
Thx for these
thx @lithomas1
@meeseeksdev backport 2.0.x
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the Still Needs Manual Backport label once the PR gets merged. If these instructions are inaccurate, feel free to suggest an improvement.