BUG: Fix some more arrow CSV tests #52087

Merged · 14 commits · Apr 10, 2023
Changes from 9 commits
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v2.1.0.rst
@@ -214,7 +214,8 @@ Period
- :meth:`arrays.PeriodArray.map` can now take a ``na_action`` argument. :meth:`PeriodIndex.map` with ``na_action="ignore"`` now works as expected. (:issue:`51644`)
- Bug in :class:`PeriodDtype` constructor raising ``ValueError`` instead of ``TypeError`` when an invalid type is passed (:issue:`51790`)
- Bug in :meth:`arrays.PeriodArray.map` and :meth:`PeriodIndex.map`, where the supplied callable operated array-wise instead of element-wise (:issue:`51977`)
-
- Bug in :func:`read_csv` not processing empty strings as a null value, with ``engine="pyarrow"`` (:issue:`52087`)
- Bug in :func:`read_csv` returning ``object`` dtype columns instead of ``float64`` dtype columns for columns that are all null with ``engine="pyarrow"`` (:issue:`52087`)

Plotting
^^^^^^^^
3 changes: 3 additions & 0 deletions pandas/io/_util.py
@@ -8,6 +8,9 @@
def _arrow_dtype_mapping() -> dict:
    pa = import_optional_dependency("pyarrow")
    return {
        # All nulls should still give Float64 not object
        # TODO: This breaks parquet
        # pa.null(): pd.Float64Dtype(),
Member:

In what way does it break parquet?

Member Author:

Well, first I hit this error.

>>> import pandas as pd
>>> import pyarrow as pa
>>> a = pa.nulls(0)
>>> pd.Float64Dtype().__from__arrow__(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Float64Dtype' object has no attribute '__from__arrow__'

After patching that (another PR incoming), an empty DataFrame of dtype object (pd.DataFrame({"value": pd.array([], dtype=object)})) now returns a Float64Dtype when roundtripped.

Best guess here is that the types_mapper is somehow overriding the pandas metadata.
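
A hypothetical repro of that roundtrip concern (assumes the __from_arrow__ fix mentioned above has landed and that pa.null() is mapped in the types_mapper; not output from this thread):

>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> df = pd.DataFrame({"value": pd.array([], dtype=object)})
>>> pq.write_table(pa.Table.from_pandas(df), "empty.parquet")
>>> # The types_mapper entry for pa.null() would override the pandas
>>> # metadata stored in the file, so "value" no longer round-trips
>>> # as object.
>>> pq.read_table("empty.parquet").to_pandas(
...     types_mapper={pa.null(): pd.Float64Dtype()}.get
... ).dtypes
value    Float64
dtype: object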

        pa.int8(): pd.Int8Dtype(),
        pa.int16(): pd.Int16Dtype(),
        pa.int32(): pd.Int32Dtype(),
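
For orientation, a mapping like this one is consumed as a types_mapper callable when an arrow table is converted; a sketch using the documented pyarrow hook, not code from this PR:

>>> import pandas as pd
>>> import pyarrow as pa
>>> mapping = {pa.int64(): pd.Int64Dtype()}
>>> table = pa.table({"a": pa.array([1, None], type=pa.int64())})
>>> # dict.get returns None for unmapped types, so pyarrow falls back
>>> # to its default conversion for those columns.
>>> table.to_pandas(types_mapper=mapping.get).dtypes
a    Int64
dtype: object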
26 changes: 24 additions & 2 deletions pandas/io/parsers/arrow_parser_wrapper.py
@@ -80,6 +80,7 @@ def _get_pyarrow_options(self) -> None:
"decimal_point",
)
}
self.convert_options["strings_can_be_null"] = "" in self.kwds["null_values"]
self.read_options = {
"autogenerate_column_names": self.header is None,
"skip_rows": self.header
@@ -149,6 +150,7 @@ def read(self) -> DataFrame:
DataFrame
    The DataFrame created from the CSV file.
"""
pa = import_optional_dependency("pyarrow")
pyarrow_csv = import_optional_dependency("pyarrow.csv")
self._get_pyarrow_options()
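
For context, the strings_can_be_null toggle set in _get_pyarrow_options above can be exercised standalone; a sketch of the assumed pyarrow.csv behavior, not output from this PR:

>>> import io
>>> import pyarrow.csv as pyarrow_csv
>>> data = io.BytesIO(b"a,b\n1,\n2,x\n")
>>> opts = pyarrow_csv.ConvertOptions(strings_can_be_null=True)
>>> pyarrow_csv.read_csv(data, convert_options=opts).column("b")
<pyarrow.lib.ChunkedArray object at 0x...>
[
  [
    null,
    "x"
  ]
]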

@@ -158,9 +160,29 @@ def read(self) -> DataFrame:
    parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
    convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
)
if self.kwds["dtype_backend"] == "pyarrow":

dtype_backend = self.kwds["dtype_backend"]

# Convert all pa.null() cols -> float64 (non nullable)
# else Int64 (nullable case)
Member Author:

This was really confusing for me, but apparently even convert_dtypes on an all-null float64 column will change it to Int64.
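
A minimal REPL sketch of that behavior (output is what I'd expect from current pandas, not copied from this thread):

>>> import pandas as pd
>>> pd.Series([None, None], dtype="float64").convert_dtypes().dtype
Int64Dtype()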

Member:

I am not sure we should replicate that here.
In convert_dtypes, I suppose that is the consequence of trying to cast a float column where all non-null values can faithfully be represented as integers, which gives this corner case when that set of values is empty.

But if you don't know the dtype you are starting with (like here), I think using float is safer (it won't give problems later on if you then want to use it for floats and not just integers). Or actually using object dtype might even be safest (as that can represent everything), but of course we don't (yet) have a nullable object dtype, so that's probably not really an option.

Member Author:

The problem is that the C and Python parsers already convert to an Int64. Is this something that we need to fix before 2.0?

>>> import pandas as pd
>>> from io import StringIO
>>> f = StringIO("a\n1,")
>>> pd.read_csv(f, dtype_backend="numpy_nullable")
      a
1  <NA>
>>> f.seek(0)
0
>>> pd.read_csv(f, dtype_backend="numpy_nullable").dtypes
a    Int64
dtype: object
>>> f.seek(0)
0
>>> pd.read_csv(f)
    a
1 NaN
>>> f.seek(0)
0
>>> pd.read_csv(f).dtypes
a    float64
dtype: object

Member Author:

cc @phofl.

Member:

We do this everywhere, so I am +1 on doing Int64 here as well. Initially this came from read_parquet, I think.

# TODO: There has to be a better way... right?
if dtype_backend != "pyarrow":
    new_schema = table.schema
    if dtype_backend == "numpy_nullable":
        new_type = pa.int64()
Member:

Can't you just patch the _arrow_dtype_mapping?

Member Author:

Will need to wait for #52223 then.

Member:

LGTM, but since #52223 is close, let's wait for that one and then merge this.

    else:
        new_type = pa.float64()
    for i, arrow_type in enumerate(table.schema.types):
        if pa.types.is_null(arrow_type):
            new_schema = new_schema.set(
                i, new_schema.field(i).with_type(new_type)
            )

    table = table.cast(new_schema)

if dtype_backend == "pyarrow":
    frame = table.to_pandas(types_mapper=pd.ArrowDtype)
elif self.kwds["dtype_backend"] == "numpy_nullable":
elif dtype_backend == "numpy_nullable":
    frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
else:
    frame = table.to_pandas()
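
Standalone, the null-column retyping in this hunk amounts to the following; an illustrative sketch against the public pyarrow API, assuming a table whose second column is inferred as pa.null():

>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2], "b": pa.nulls(2)})
>>> new_schema = table.schema
>>> for i, arrow_type in enumerate(table.schema.types):
...     if pa.types.is_null(arrow_type):
...         new_schema = new_schema.set(
...             i, new_schema.field(i).with_type(pa.float64())
...         )
...
>>> table.cast(new_schema).schema
a: int64
b: double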
3 changes: 3 additions & 0 deletions pandas/io/parsers/readers.py
@@ -1458,8 +1458,11 @@ def _get_options_with_defaults(self, engine: CSVEngine) -> dict[str, Any]:
value = kwds[argname]

if engine != "c" and value != default:
    # TODO: Refactor this logic, it's pretty convoluted
    if "python" in engine and argname not in _python_unsupported:
        pass
    elif "pyarrow" in engine and argname not in _pyarrow_unsupported:
        pass
    else:
        raise ValueError(
            f"The {repr(argname)} option is not supported with the "
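
From the user side, this check is what produces errors like the following; a sketch where the exact wording of the message tail is assumed, since it is truncated in the hunk above:

>>> import pandas as pd
>>> from io import StringIO
>>> pd.read_csv(StringIO("a\n1"), engine="pyarrow", skipfooter=1)
Traceback (most recent call last):
  ...
ValueError: The 'skipfooter' option is not supported with the 'pyarrow' engine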
12 changes: 3 additions & 9 deletions pandas/tests/io/parser/dtypes/test_dtypes_basic.py
@@ -423,13 +423,9 @@ def test_dtype_backend(all_parsers):
"e": pd.Series([pd.NA, 6], dtype="Int64"),
"f": pd.Series([pd.NA, 7.5], dtype="Float64"),
"g": pd.Series([pd.NA, True], dtype="boolean"),
"h": pd.Series(
[pd.NA if parser.engine != "pyarrow" else "", "a"], dtype="string"
),
"h": pd.Series([pd.NA, "a"], dtype="string"),
Member:

Thx for these

"i": pd.Series([Timestamp("2019-12-31")] * 2),
"j": pd.Series(
[pd.NA, pd.NA], dtype="Int64" if parser.engine != "pyarrow" else object
),
"j": pd.Series([pd.NA, pd.NA], dtype="Int64"),
}
)
tm.assert_frame_equal(result, expected)
Expand All @@ -451,7 +447,6 @@ def test_dtype_backend_and_dtype(all_parsers):
tm.assert_frame_equal(result, expected)


@pytest.mark.usefixtures("pyarrow_xfail")
def test_dtype_backend_string(all_parsers, string_storage):
# GH#36712
pa = pytest.importorskip("pyarrow")
@@ -499,7 +494,6 @@ def test_dtype_backend_pyarrow(all_parsers, request):
# GH#36712
pa = pytest.importorskip("pyarrow")
parser = all_parsers
engine = parser.engine

data = """a,b,c,d,e,f,g,h,i,j
1,2.5,True,a,,,,,12-31-2019,
@@ -516,7 +510,7 @@ def test_dtype_backend_pyarrow(all_parsers, request):
"f": pd.Series([pd.NA, 7.5], dtype="float64[pyarrow]"),
"g": pd.Series([pd.NA, True], dtype="bool[pyarrow]"),
"h": pd.Series(
[pd.NA if engine != "pyarrow" else "", "a"],
[pd.NA, "a"],
dtype=pd.ArrowDtype(pa.string()),
),
"i": pd.Series([Timestamp("2019-12-31")] * 2),
15 changes: 10 additions & 5 deletions pandas/tests/io/parser/test_na_values.py
@@ -20,7 +20,6 @@
xfail_pyarrow = pytest.mark.usefixtures("pyarrow_xfail")


@skip_pyarrow
def test_string_nas(all_parsers):
    parser = all_parsers
    data = """A,B,C
@@ -36,7 +35,6 @@ def test_detect_string_na(all_parsers):
tm.assert_frame_equal(result, expected)


@skip_pyarrow
def test_detect_string_na(all_parsers):
    parser = all_parsers
    data = """A,B
@@ -89,7 +87,6 @@ def test_non_string_na_values(all_parsers, data, na_values):
tm.assert_frame_equal(result, expected)


@skip_pyarrow
def test_default_na_values(all_parsers):
_NA_VALUES = {
    "-1.#IND",
@@ -138,6 +135,7 @@ def f(i, v):
tm.assert_frame_equal(result, expected)


# TODO: needs skiprows list support in pyarrow
@skip_pyarrow
@pytest.mark.parametrize("na_values", ["baz", ["baz"]])
def test_custom_na_values(all_parsers, na_values):
@@ -172,6 +170,7 @@ def test_bool_na_values(all_parsers):
tm.assert_frame_equal(result, expected)


# TODO: Needs pyarrow support for dictionary in na_values
@skip_pyarrow
def test_na_value_dict(all_parsers):
    data = """A,B,C
@@ -191,7 +190,6 @@ def test_na_value_dict(all_parsers):
tm.assert_frame_equal(df, expected)


@skip_pyarrow
@pytest.mark.parametrize(
    "index_col,expected",
    [
@@ -225,6 +223,7 @@ def test_na_value_dict_multi_index(all_parsers, index_col, expected):
tm.assert_frame_equal(result, expected)


# TODO: xfail components of this test, the first one passes
@skip_pyarrow
@pytest.mark.parametrize(
    "kwargs,expected",
@@ -287,7 +286,6 @@ def test_na_values_keep_default(all_parsers, kwargs, expected):
tm.assert_frame_equal(result, expected)


@skip_pyarrow
def test_no_na_values_no_keep_default(all_parsers):
    # see gh-4318: passing na_values=None and
    # keep_default_na=False yields 'None' as a na_value
@@ -314,6 +312,7 @@ def test_no_na_values_no_keep_default(all_parsers):
tm.assert_frame_equal(result, expected)


# TODO: Blocked on na_values dict support in pyarrow
@skip_pyarrow
def test_no_keep_default_na_dict_na_values(all_parsers):
    # see gh-19227
@@ -326,6 +325,7 @@ def test_no_keep_default_na_dict_na_values(all_parsers):
tm.assert_frame_equal(result, expected)


# TODO: Blocked on na_values dict support in pyarrow
@skip_pyarrow
def test_no_keep_default_na_dict_na_scalar_values(all_parsers):
    # see gh-19227
@@ -338,6 +338,7 @@ def test_no_keep_default_na_dict_na_scalar_values(all_parsers):
tm.assert_frame_equal(df, expected)


# TODO: Blocked on na_values dict support in pyarrow
@skip_pyarrow
@pytest.mark.parametrize("col_zero_na_values", [113125, "113125"])
def test_no_keep_default_na_dict_na_values_diff_reprs(all_parsers, col_zero_na_values):
@@ -368,6 +369,7 @@ def test_no_keep_default_na_dict_na_values_diff_reprs(all_parsers, col_zero_na_values):
tm.assert_frame_equal(result, expected)


# TODO: Empty null_values doesn't work properly on pyarrow
@skip_pyarrow
@pytest.mark.parametrize(
    "na_filter,row_data",
@@ -390,6 +392,7 @@ def test_na_values_na_filter_override(all_parsers, na_filter, row_data):
tm.assert_frame_equal(result, expected)


# TODO: Arrow parse error
@skip_pyarrow
def test_na_trailing_columns(all_parsers):
    parser = all_parsers
@@ -418,6 +421,7 @@ def test_na_trailing_columns(all_parsers):
tm.assert_frame_equal(result, expected)


# TODO: xfail the na_values dict case
@skip_pyarrow
@pytest.mark.parametrize(
    "na_values,row_data",
@@ -495,6 +499,7 @@ def test_empty_na_values_no_default_with_index(all_parsers):
tm.assert_frame_equal(result, expected)


# TODO: Missing support for na_filter keyword
@skip_pyarrow
@pytest.mark.parametrize(
    "na_filter,index_data", [(False, ["", "5"]), (True, [np.nan, 5.0])]
12 changes: 1 addition & 11 deletions pandas/tests/io/parser/test_parse_dates.py
@@ -1252,17 +1252,7 @@ def test_bad_date_parse(all_parsers, cache_dates, value):
parser = all_parsers
s = StringIO((f"{value},\n") * 50000)

if parser.engine == "pyarrow":
    # None in input gets converted to 'None', for which
    # pandas tries to guess the datetime format, triggering
    # the warning. TODO: parse dates directly in pyarrow, see
    # https://github.com/pandas-dev/pandas/issues/48017
    warn = UserWarning
else:
    warn = None
parser.read_csv_check_warnings(
    warn,
    "Could not infer format",
parser.read_csv(
    s,
    header=None,
    names=["foo", "bar"],