Skip to content

Fix inference for fixed with numpy strings with arrow string option #54496

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 21, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 11, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@phofl phofl requested a review from jbrockmendel August 11, 2023 12:15
@phofl phofl added this to the 2.1 milestone Aug 11, 2023
@mroeschke mroeschke added the Strings String extension data type and string data label Aug 11, 2023
@phofl
Copy link
Member Author

phofl commented Aug 16, 2023

@jbrockmendel gentle ping

pa = pytest.importorskip("pyarrow")
expected = Series(["a", "b"], dtype=pd.ArrowDtype(pa.string()))
with pd.option_context("future.infer_string", True):
ser = Series(np.array(["a", "b"]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to specify dtype explictly here.

Not sure how numpy's string dtype work is coming along, but just in case that becomes the default, it would be good to future-proof this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep this, we should infer the default type as arrow backed strings for now, that's the mos common use case for users

{"a": ["a", "b"], "b": ["c", "d"]},
dtype=dtype,
columns=Index(["a", "b"], dtype=dtype),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for a 2-D string ndarray?
e.g.

a = np.array([['a', 'b'], ['c', 'd']])
df = pd.DataFrame(a)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@lithomas1
Copy link
Member

Moving off milestone for now.

If Brock reviews in time, can always put it back.

@lithomas1 lithomas1 removed this from the 2.1 milestone Aug 21, 2023
@phofl
Copy link
Member Author

phofl commented Aug 21, 2023

@jbrockmendel ping

@phofl phofl added this to the 2.1 milestone Aug 21, 2023
@phofl phofl merged commit 3060df9 into pandas-dev:main Aug 21, 2023
@phofl phofl deleted the arrow_strings_numpy_strings branch August 21, 2023 19:59
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 21, 2023
mroeschke pushed a commit that referenced this pull request Aug 21, 2023
…y strings with arrow string option) (#54672)

Backport PR #54496: Fix inference for fixed with numpy strings with arrow string option

Co-authored-by: Patrick Hoefler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants