-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: ensure numpy fixed-size binary/string arrays are converted to object dtype for Index creation #57995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
39fcc6a
to
4b27a30
Compare
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this, and apologies for the lack of reviews!
The original PR that introduced this change edited the array_tuplesafe
function (#57139), which is called inside Index.__new__
before calling _dtype_to_subclass
. So if we would change array_tuplesafe
to still return object dtype instead of numpy S/U, that would fix it before getting to _dtype_to_subclass
.
The current fix also essentially allows to create an Index with an underlying numpy array with S6 dtype, while I think we want to restore the behaviour of converting to object dtype (I think my suggestion above should do that)
Other comment: could you also add a test for direct Index constructor from a Series object (i.e. pd.Index(pd.Series(arr))
), because that is affected as well, in addition to set_index
string_length = 6 | ||
in_dtype, df_name = f"S{string_length}", "fruit" | ||
data = ["apple", "banana", "orange", "grape"] | ||
|
||
# Create array with FLS(|S{value}) dtype | ||
arr = np.array(data, dtype=in_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the purpose of the tests, we can simplify the snippet from the issue report a bit
string_length = 6 | |
in_dtype, df_name = f"S{string_length}", "fruit" | |
data = ["apple", "banana", "orange", "grape"] | |
# Create array with FLS(|S{value}) dtype | |
arr = np.array(data, dtype=in_dtype) | |
# Create array with FLS(|S{value}) dtype | |
arr = np.array( ["apple", "banana", "orange", "grape"], dtype="S6") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the sugestion and I will change the test. Regarding the asarray_tuplesafe, I went into a little more detail in this PR's issue (#57645) but looking back into the original PR, the line elif isinstance(values, ABCSeries)
was added which allows for any Series to keep their original types and values without performing any check if their type is compatible or not.
Could a possible solution be something of the sort:
elif isinstance(values, ABCSeries) and values.dtype.kind != "S":
return values_values
By refactoring this line, the elif will fail if the values have dtype ="S" and creates a new array with dtype object as it used to. Before I proceed, could I know your thoughts on the suggestion I've made?
4b27a30
to
77138d5
Compare
….2.1 While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS S{value} dtype, which was not recognized by the function _dtype_to_subclass and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.
0785383
to
1bf53b3
Compare
…he function attempted to create a new index with a dtype of FLS S{value}. This dtype was not recognized by the function _dtype_to_subclass, which raised a NotImplementedError. To address this, I added a verification to the function asarray_tuplesafe that converts data to an array with object type, allowing the index to be created succes sfully. Additionally, I created a new test and simplified a previously created test. I also reverted the test file test_parquet.py to restore the intended FLS behavior.
1bf53b3
to
aaa4563
Compare
@jorisvandenbossche Hi, sorry to bother you, but would it be possible to get some feedback on this? |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/v3.0.0.rst
file if fixing a bug or adding a new feature.While using the function set_index with parameter inplace=True, the function would try and create a new index where its dtype would be a FLS
S{value}
dtype, which was not recognized by the function_dtype_to_subclass
and raised a NotImplementedError. That said , by adding a verification that recognizes FLS dtype , the index is created successfully and the function executes properly.