Bug Fix: #60343 Construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60383

tasfia8 · 2024-11-21T03:17:03Z

This PR fixes BUG (string): construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343 @jorisvandenbossche
The default behaviour (pd.Index(d.keys())) worked correctly, but explicitly setting dtype="str" raised a ValueError. The issue stemmed from dict_keys not being converted to a proper array-like structure before being passed to StringDtype, which couldn't handle such inputs.

To fix the issue:

KeyView was introduced to identify and preprocess dict_keys before passing them to pandas internals. The keys are now converted to a list for compatibility.
Updated the logic in Index and sanitize_array to map dtype="str" to StringDtype(storage="python").
Updated check_array_indexer to allow empty boolean indexers for StringArray.
Added a new test, test_index_from_dict_keys_with_dtype, to ensure that:
  Default inference (pd.Index(d.keys())) works.
  Explicit dtype="str" works, resulting in string[python].
Updated the existing tests test_is_object and test_empty_fancy to handle the new behaviours introduced by the fix.
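
The dict_keys preprocessing described in the first item can be sketched in plain Python. This is only an illustration, not the code from this PR: it assumes that the "KeyView" mentioned above refers to collections.abc.KeysView, and normalize_keys is a hypothetical helper name.

```python
from collections.abc import KeysView


def normalize_keys(data):
    """Hypothetical helper: materialize a dict keys view as a list.

    Any other input is returned unchanged.
    """
    if isinstance(data, KeysView):
        return list(data)
    return data


d = {"a": 1, "b": 2}
keys = d.keys()

# A keys view is iterable and sized, but it cannot be indexed by position,
# which is why array-oriented code may reject it.
supports_indexing = hasattr(keys, "__getitem__")

normalized = normalize_keys(keys)
print(supports_indexing)  # False
print(normalized)         # ['a', 'b']
```

Converting to a list keeps the insertion order of the dict, so the resulting labels appear in the same order as the keys.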

After the fix, both the default (pd.Index(d.keys())) and explicit (pd.Index(d.keys(), dtype="str")) cases work.
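
As a rough illustration of the two cases, the sketch below builds an Index from dict keys with and without an explicit dtype. Whether pd.Index(d.keys(), dtype="str") itself succeeds depends on the installed pandas version, so the explicit case converts the keys to a list first, mirroring the conversion the description mentions. The resulting dtype also varies between pandas versions, so it is not shown here.

```python
import pandas as pd

d = {"a": 1, "b": 2}

# Default inference: the keys view is passed directly.
default_idx = pd.Index(d.keys())

# Explicit dtype: the keys are materialized as a list first. This is an
# assumption of this sketch, so that it does not rely on the fix being present.
explicit_idx = pd.Index(list(d.keys()), dtype="str")

print(list(default_idx))
print(list(explicit_idx))
```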

tasfia8 · 2024-11-21T06:54:49Z

I was able to fix the bug, but I am a bit stuck on how to fix the checks, especially the unit tests. Could @jorisvandenbossche or anyone else help? I tried to fix the pre-commit failures (using ruff lint fix), but every time I fixed a formatting issue, running pre-commit reverted it to its state before the fix.

For the Doc build and upload check (it was giving an error for every declaration of ipython that didn't have import pandas as pd), I manually inserted it but don't know if there is an easy way.

jorisvandenbossche · 2024-11-23T09:38:29Z

> For the Doc build and upload check (it was giving an error for every declaration of ipython that didn't have import pandas as pd), I manually inserted it but don't know if there is an easy way.

That should normally not have been needed. Did you get those errors locally? (in that case maybe something with the set up was wrong)
On the CI build I see that there is an error specifically in the doc/source/getting_started/comparison/includes/nth_word.rst file.

jorisvandenbossche

Thanks for working on this!

I added a few comments. It seems you have made more changes than I think are needed to fix the bug. I would try to focus the PR a bit more (fixing only the Index or only the Series constructor would also be fine).

jorisvandenbossche · 2024-11-23T09:39:53Z

pandas/core/dtypes/common.py

+    return (
+        isinstance(arr_or_dtype, np.dtype)
+        and arr_or_dtype == "object"
+        or isinstance(arr_or_dtype, StringDtype)


We don't want to change the meaning of is_object_dtype to also include StringDtype. What was the reason you needed this change?

tasfia8 · 2024-11-26

This change was included to pass the test case "test_is_object[string-True]", because it did not recognize the string as a valid object type. It passed the 60343 test case I added but failed the existing ones. In that case, should I leave the existing test case as is, with no modification, even if it fails, so that I can make the PR more focused?

The main issue I am having is that changing something for this bug leads to failures in existing test cases, and I am wondering whether I should be passing only the test case I added for this bug or all of the test cases.

jorisvandenbossche · 2024-11-26

I don't think that test_is_object should change because of the fix that this PR is trying to make. The test already has an if using_infer_string and index.dtype == "string" check to change the expected result when using the string dtype.
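
To illustrate the concern about widening the predicate, here is a small sketch with stand-in classes. ObjectDtype and StringDtype below are hypothetical placeholders, not the pandas or NumPy types, and both predicates are simplified versions written only for this example.

```python
class ObjectDtype:
    """Hypothetical stand-in for an object dtype (not a pandas or NumPy class)."""


class StringDtype:
    """Hypothetical stand-in for a string dtype (not the pandas class)."""


def is_object_strict(dtype):
    # Original meaning: only object dtypes qualify.
    return isinstance(dtype, ObjectDtype)


def is_object_widened(dtype):
    # Widened meaning, analogous to the proposed diff: string dtypes also qualify.
    return isinstance(dtype, ObjectDtype) or isinstance(dtype, StringDtype)


def choose_path(dtype, is_object):
    # A caller that branches on the predicate, as existing code might.
    return "object path" if is_object(dtype) else "dtype-specific path"


string_dtype = StringDtype()
before = choose_path(string_dtype, is_object_strict)
after = choose_path(string_dtype, is_object_widened)
print(before)  # dtype-specific path
print(after)   # object path
```

Any caller that branches on the predicate would switch paths for string data in this way, which is presumably why the reviewer prefers to leave its meaning unchanged.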

jorisvandenbossche · 2024-11-23T09:42:37Z

pandas/tests/frame/test_query_eval.py

-        df = DataFrame(
-            {
-                "A": range(3),
-                "B": range(3),
-                "C": range(3)
-            }
-        ).rename(columns={"B": "A"})
+        df = DataFrame({"A": range(3), "B": range(3), "C": range(3)}).rename(
+            columns={"B": "A"}
+        )


It seems you included some unintended formatting changes. Maybe some setting in the IDE you are using conflicts with the formatting defaults in pandas?
I would recommend setting up the pre-commit hook, which ensures the code is formatted correctly when committing (see https://pandas.pydata.org/docs/development/contributing_codebase.html#pre-commit)

pandas/core/construction.py

tasfia8 · 2024-11-26T06:49:27Z

> For the Doc build and upload check (it was giving an error for every declaration of ipython that didn't have import pandas as pd), I manually inserted it but don't know if there is an easy way.

> That should normally not have been needed. Did you get those errors locally? (in that case maybe something with the set up was wrong) On the CI build I see that there is an error specifically in the doc/source/getting_started/comparison/includes/nth_word.rst file.

Yes, when I ran the pre-commit hook locally, I saw that without the import in the ipython blocks the hooks failed in each .rst file.

tasfia8 · 2024-11-28T12:47:34Z

Hi Joris! I will make another pull request. I wanted to work on a new branch and accidentally removed my changes.

tasfia8 mentioned this pull request Nov 21, 2024

BUG (string): construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343

Closed

jorisvandenbossche added this to the 2.3 milestone Nov 23, 2024

jorisvandenbossche added the Strings (String extension data type and string data) and Constructors (Series/DataFrame/Index/pd.array Constructors) labels Nov 23, 2024

jorisvandenbossche reviewed Nov 23, 2024

tasfia8 closed this Nov 28, 2024

tasfia8 force-pushed the bug60343v1 branch from f2041a0 to 1d809c3 Compare November 28, 2024 12:18

tasfia8 mentioned this pull request Nov 28, 2024

BUG: fix construction of Series / Index from dict keys when "str" dtype is specified explicitly #60436

Merged
