Skip to content

REGR: ensure numpy fixed-size binary/string arrays are converted to object dtype for Index creation #57995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

PF2100
Copy link
Contributor

@PF2100 PF2100 commented Mar 25, 2024

  • closes BUG: Cannot use numpy FLS as indicies since pandas 2.2.1 #57645
  • [Tests added and passed] if fixing a bug or adding a new feature
  • All [code checks passed].
  • Added [type annotations] to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS S{value} dtype, which was not recognized by the function _dtype_to_subclass and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.

@PF2100 PF2100 force-pushed the 57645-cannot-detect-FLSdtype-in-base-_dtype_to_subclass branch from 39fcc6a to 4b27a30 Compare March 25, 2024 23:24
@PF2100 PF2100 marked this pull request as ready for review March 26, 2024 00:17
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 26, 2024
@jorisvandenbossche jorisvandenbossche changed the title FIX #57645: Cannot use numpy FLS as indicies since pandas 2.2.1 REGR: ensure numpy fixed-size binary/string arrays are converted to object dtype for Index creation Apr 29, 2024
@jorisvandenbossche jorisvandenbossche added Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version Index Related to the Index class or subclasses and removed Stale labels Apr 29, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.2.3 milestone Apr 29, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for working on this, and apologies for the lack of reviews!

The original PR that introduced this change edited the array_tuplesafe function (#57139), which is called inside Index.__new__ before calling _dtype_to_subclass. So if we would change array_tuplesafe to still return object dtype instead of numpy S/U, that would fix it before getting to _dtype_to_subclass.

The current fix also essentially allows to create an Index with an underlying numpy array with S6 dtype, while I think we want to restore the behaviour of converting to object dtype (I think my suggestion above should do that)

Other comment: could you also add a test for direct Index constructor from a Series object (i.e. pd.Index(pd.Series(arr))), because that is affected as well, in addition to set_index

Comment on lines 621 to 626
string_length = 6
in_dtype, df_name = f"S{string_length}", "fruit"
data = ["apple", "banana", "orange", "grape"]

# Create array with FLS(|S{value}) dtype
arr = np.array(data, dtype=in_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purpose of the tests, we can simplify the snippet from the issue report a bit

Suggested change
string_length = 6
in_dtype, df_name = f"S{string_length}", "fruit"
data = ["apple", "banana", "orange", "grape"]
# Create array with FLS(|S{value}) dtype
arr = np.array(data, dtype=in_dtype)
# Create array with FLS(|S{value}) dtype
arr = np.array( ["apple", "banana", "orange", "grape"], dtype="S6")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the sugestion and I will change the test. Regarding the asarray_tuplesafe, I went into a little more detail in this PR's issue (#57645) but looking back into the original PR, the line elif isinstance(values, ABCSeries) was added which allows for any Series to keep their original types and values without performing any check if their type is compatible or not.

Could a possible solution be something of the sort:

elif isinstance(values, ABCSeries) and values.dtype.kind != "S":
        return values_values

By refactoring this line, the elif will fail if the values have dtype ="S" and creates a new array with dtype object as it used to. Before I proceed, could I know your thoughts on the suggestion I've made?

@PF2100 PF2100 force-pushed the 57645-cannot-detect-FLSdtype-in-base-_dtype_to_subclass branch from 4b27a30 to 77138d5 Compare May 13, 2024 21:56
PF2100 added 2 commits May 16, 2024 23:11
….2.1

While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS S{value} dtype, which was not recognized by the function _dtype_to_subclass and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.
@PF2100 PF2100 force-pushed the 57645-cannot-detect-FLSdtype-in-base-_dtype_to_subclass branch 3 times, most recently from 0785383 to 1bf53b3 Compare May 17, 2024 15:57
@PF2100 PF2100 requested a review from jorisvandenbossche May 19, 2024 12:54
…he function attempted to create a new index with a dtype of FLS

S{value}. This dtype was not recognized by the function _dtype_to_subclass, which raised a NotImplementedError. To address this, I added
a verification to the function asarray_tuplesafe that converts data to an array with object type, allowing the index to be created succes
sfully.

Additionally, I created a new test and simplified a previously created test. I also reverted the test file test_parquet.py to restore
the intended FLS behavior.
@PF2100 PF2100 force-pushed the 57645-cannot-detect-FLSdtype-in-base-_dtype_to_subclass branch from 1bf53b3 to aaa4563 Compare May 21, 2024 10:14
@PF2100
Copy link
Contributor Author

PF2100 commented May 25, 2024

@jorisvandenbossche Hi, sorry to bother you, but would it be possible to get some feedback on this?
Thank you!

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Cannot use numpy FLS as indicies since pandas 2.2.1
3 participants