BUG: Fix some more arrow CSV tests #52087
@@ -80,6 +80,7 @@ def _get_pyarrow_options(self) -> None:
                 "decimal_point",
             )
         }
+        self.convert_options["strings_can_be_null"] = "" in self.kwds["null_values"]
         self.read_options = {
             "autogenerate_column_names": self.header is None,
             "skip_rows": self.header
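For context, the added `strings_can_be_null` option is what lets pyarrow treat empty fields in string columns as null when `""` appears in `null_values`. Below is a minimal standalone sketch of that behavior (illustrative only, not part of the diff; the tiny CSV is made up):

```python
# Illustrative sketch of pyarrow.csv's strings_can_be_null option
# (separate from the pandas change above).
import io

import pyarrow.csv as pyarrow_csv

data = b"a,b\n,1\nx,\n"

# With strings_can_be_null=True and "" listed in null_values, the empty field
# in the string column "a" is parsed as null instead of as an empty string.
opts = pyarrow_csv.ConvertOptions(null_values=[""], strings_can_be_null=True)
table = pyarrow_csv.read_csv(io.BytesIO(data), convert_options=opts)
print(table.column("a"))  # first value is null, second is "x"
```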
@@ -149,6 +150,7 @@ def read(self) -> DataFrame:
         DataFrame
             The DataFrame created from the CSV file.
         """
+        pa = import_optional_dependency("pyarrow")
         pyarrow_csv = import_optional_dependency("pyarrow.csv")
         self._get_pyarrow_options()
@@ -158,6 +160,18 @@ def read(self) -> DataFrame:
             parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
             convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
         )
+
+        # Convert all pa.null() cols -> float64
+        # TODO: There has to be a better way... right?
+        new_schema = table.schema
+        for i, arrow_type in enumerate(table.schema.types):
+            if pa.types.is_null(arrow_type):
+                new_schema = new_schema.set(
+                    i, new_schema.field(i).with_type(pa.float64())
+                )
+
+        table = table.cast(new_schema)
+
         if self.kwds["dtype_backend"] == "pyarrow":
             frame = table.to_pandas(types_mapper=pd.ArrowDtype)
         elif self.kwds["dtype_backend"] == "numpy_nullable":

Review thread on the TODO comment above:

- What I'm looking for here is something like a combination of [...]. I couldn't find anything in the pyarrow docs, but maybe I'm just bad at Googling. Sadly, types_mapper doesn't work, since you can only feed it EA types.
- Maybe @jorisvandenbossche knows a better way?
- Does Table.select not work?
- I don't think there is an easier way. We currently don't have something like select_dtypes, so what you do (iterating over the schema, checking the type, and if needed adapting the new schema) seems the best option.
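To make the workaround discussed in the thread above concrete, here is a standalone sketch (not the PR code itself) that rewrites an all-null column's schema field to float64 and casts the table. `types_mapper` cannot express this, since it maps Arrow types to pandas extension dtypes rather than to other Arrow types.

```python
# Standalone sketch of the null -> float64 schema rewrite discussed above.
import pyarrow as pa

table = pa.table({"a": pa.array([None, None]), "b": [1, 2]})
print(table.schema)  # a: null, b: int64

new_schema = table.schema
for i, arrow_type in enumerate(table.schema.types):
    if pa.types.is_null(arrow_type):
        # Replace the null-typed field with a float64 field of the same name.
        new_schema = new_schema.set(i, new_schema.field(i).with_type(pa.float64()))

table = table.cast(new_schema)
print(table.schema)  # a: double, b: int64
```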
A second review thread:

- In what way does it break parquet?
- Well, first I hit this error. After patching that (another PR incoming), an empty DataFrame of object dtype (pd.DataFrame({"value": pd.array([], dtype=object)})) now returns a Float64Dtype when roundtripped. My best guess is that the types_mapper is somehow overriding the pandas metadata.
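For reference, a rough sketch of the roundtrip described above. This is a reconstruction, not the exact failing test: the dtype_backend="numpy_nullable" argument to read_parquet is an assumption on my part, chosen because Float64Dtype is the masked dtype that backend produces; the report above is that the empty object-dtype column comes back as Float64Dtype rather than object.

```python
# Rough reconstruction (not the exact failing test) of the parquet roundtrip
# described in the comment above.
import io

import pandas as pd

df = pd.DataFrame({"value": pd.array([], dtype=object)})
print(df.dtypes)  # value: object

buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# dtype_backend="numpy_nullable" is an assumed reproduction detail; the report
# above says the column comes back as Float64Dtype instead of object.
result = pd.read_parquet(buf, dtype_backend="numpy_nullable")
print(result.dtypes)
```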