ENH: allow opt-in to inferring pyarrow strings #54430
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2670,6 +2670,41 @@ def test_construct_with_strings_and_none(self): | |
expected = DataFrame({"a": ["1", "2", None]}, dtype="str") | ||
tm.assert_frame_equal(df, expected) | ||
|
||
def test_frame_string_inference(self): | ||
# GH#54430 | ||
pa = pytest.importorskip("pyarrow") | ||
dtype = pd.ArrowDtype(pa.string()) | ||
expected = DataFrame( | ||
{"a": ["a", "b"]}, dtype=dtype, columns=Index(["a"], dtype=dtype) | ||
) | ||
with pd.option_context("future.infer_string", True): | ||
df = DataFrame({"a": ["a", "b"]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
expected = DataFrame( | ||
{"a": ["a", "b"]}, | ||
dtype=dtype, | ||
columns=Index(["a"], dtype=dtype), | ||
index=Index(["x", "y"], dtype=dtype), | ||
) | ||
with pd.option_context("future.infer_string", True): | ||
df = DataFrame({"a": ["a", "b"]}, index=["x", "y"]) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
expected = DataFrame( | ||
{"a": ["a", 1]}, dtype="object", columns=Index(["a"], dtype=dtype) | ||
) | ||
with pd.option_context("future.infer_string", True): | ||
df = DataFrame({"a": ["a", 1]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
expected = DataFrame( | ||
{"a": ["a", "b"]}, dtype="object", columns=Index(["a"], dtype=dtype) | ||
) | ||
with pd.option_context("future.infer_string", True): | ||
df = DataFrame({"a": ["a", "b"]}, dtype="object") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a case with null/nan/None in it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added, needed a fix... |
||
tm.assert_frame_equal(df, expected) | ||
|
||
|
||
class TestDataFrameConstructorIndexInference: | ||
def test_frame_from_dict_of_series_overlapping_monthly_period_indexes(self): | ||
|
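For readers who want to try the constructor behaviour outside the test suite, here is a minimal sketch. It assumes a pandas build that ships the `future.infer_string` option and has pyarrow installed; the helper name `constructor_dtypes` is hypothetical and not part of this PR. The concrete string dtype reported may differ between pandas releases, so the sketch only distinguishes "object" from "not object".

```python
import importlib.util


def constructor_dtypes():
    """Return dtype names seen under future.infer_string, or None if unavailable."""
    # Same idea as pytest.importorskip: bail out when a dependency is missing.
    if any(importlib.util.find_spec(m) is None for m in ("pandas", "pyarrow")):
        return None
    import pandas as pd

    try:
        with pd.option_context("future.infer_string", True):
            strings = pd.DataFrame({"a": ["a", "b"]})  # all strings
            mixed = pd.DataFrame({"a": ["a", 1]})  # strings mixed with ints
            explicit = pd.DataFrame({"a": ["a", "b"]}, dtype="object")
    except (KeyError, ImportError):
        # Assumption: a pandas without this option raises OptionError
        # (a KeyError subclass); an unusable pyarrow raises ImportError.
        return None

    return {
        "strings": str(strings["a"].dtype),
        "mixed": str(mixed["a"].dtype),
        "explicit_object": str(explicit["a"].dtype),
    }


if __name__ == "__main__":
    print(constructor_dtypes())
```

Only the all-string column should change dtype. The mixed column and the column with an explicit `dtype="object"` stay `object`, which matches the last two cases in the test.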
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -538,3 +538,24 @@ def test_ea_int_avoid_overflow(all_parsers): | |
} | ||
) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
def test_string_inference(all_parsers): | ||
# GH#54430 | ||
pa = pytest.importorskip("pyarrow") | ||
dtype = pd.ArrowDtype(pa.string()) | ||
|
||
data = """a,b | ||
x,1 | ||
y,2""" | ||
parser = all_parsers | ||
if parser.engine == "pyarrow": | ||
pytest.skip("TODO: Follow up") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did we ever discuss pyarrow backed engines returning pyarrow types by default? I think from a user perspective it is less than ideal to have to specify both this option and the dtype backend for any pyarrow backed IO methods There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's move this discussion to Basel, but there is an issue about this: #51846 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you link the GH issue in this skip message? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. #54431 already addresses this |
||
with pd.option_context("future.infer_string", True): | ||
result = parser.read_csv(StringIO(data)) | ||
|
||
expected = DataFrame( | ||
{"a": pd.Series(["x", "y"], dtype=dtype), "b": [1, 2]}, | ||
columns=pd.Index(["a", "b"], dtype=dtype), | ||
) | ||
tm.assert_frame_equal(result, expected) |
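The parser test can be reproduced interactively along the same lines. This is a sketch under the same assumptions as the test (a pandas with `future.infer_string`, pyarrow installed, and the default non-pyarrow CSV engine); `read_csv_dtypes` is a hypothetical helper name.

```python
import importlib.util
from io import StringIO


def read_csv_dtypes():
    """Return {column: dtype name} for a small CSV read under future.infer_string."""
    # Skip quietly when pandas or pyarrow is not installed.
    if any(importlib.util.find_spec(m) is None for m in ("pandas", "pyarrow")):
        return None
    import pandas as pd

    data = "a,b\nx,1\ny,2"
    try:
        with pd.option_context("future.infer_string", True):
            result = pd.read_csv(StringIO(data))
    except (KeyError, ImportError):
        # Assumption: the option is missing (OptionError is a KeyError
        # subclass) or pyarrow cannot be used by this pandas.
        return None

    return {col: str(dtype) for col, dtype in result.dtypes.items()}


if __name__ == "__main__":
    print(read_csv_dtypes())
```

Column `a` should come back with a string dtype rather than `object`, while the numeric column `b` is unaffected by the option.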
Review comment: are there any tests that hit this?
Reply: Added one