Skip to content

ENH: Series.str.get_dummies() raise on string type (follow up to PR #59577) #59786

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 47 commits into from
Jan 22, 2025
Merged
Show file tree
Hide file tree
Changes from 42 commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
e6f9527
Add prefix, prefix_sep, dummy_na, and dtype args to StringMethods get…
aaronchucarroll Aug 21, 2024
dafb61d
Fix import issue
aaronchucarroll Aug 21, 2024
bb79ef2
Fix typing of dtype
aaronchucarroll Aug 21, 2024
24be84f
Fix NaN type issue
aaronchucarroll Aug 21, 2024
09b2fad
Support categorical string backend
aaronchucarroll Aug 21, 2024
50ed90c
Fix dtype type hints
aaronchucarroll Aug 21, 2024
9e95485
Add dtype to get_dummies docstring
aaronchucarroll Aug 21, 2024
9a47768
Fix get_dummies dtype docstring
aaronchucarroll Aug 21, 2024
0c94bff
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Aug 22, 2024
9702bf7
remove changes for unnecessary args
aaronchucarroll Sep 3, 2024
8793516
Merge branch 'stringmethods-get-dummies' of https://github.com/aaronc…
aaronchucarroll Sep 3, 2024
bad1038
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 3, 2024
163fe09
parametrize dtype tests
aaronchucarroll Sep 5, 2024
3d75fdc
Merge branch 'stringmethods-get-dummies' of https://github.com/aaronc…
aaronchucarroll Sep 5, 2024
d68bece
support pyarrow and nullable dtypes
aaronchucarroll Sep 5, 2024
c2aa7d5
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 5, 2024
0fd2401
fix pyarrow import error
aaronchucarroll Sep 5, 2024
920c865
skip pyarrow tests when not present
aaronchucarroll Sep 5, 2024
800f787
split pyarrow tests
aaronchucarroll Sep 5, 2024
d8149e6
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 5, 2024
6cbc3e8
parametrize pyarrow tests
aaronchucarroll Sep 7, 2024
532e139
change var name to dummies_dtype
aaronchucarroll Sep 7, 2024
cd5c2ab
fix string issue
aaronchucarroll Sep 7, 2024
822b3f4
consolidate conditionals
aaronchucarroll Sep 7, 2024
ba05a8d
add tests for str and pyarrow strings
aaronchucarroll Sep 7, 2024
37dddb8
skip pyarrow string tests if not present
aaronchucarroll Sep 7, 2024
6fbe183
add info to whatsnew doc
aaronchucarroll Sep 9, 2024
87a1ee8
change func to meth in doc info
aaronchucarroll Sep 9, 2024
8706af6
proposed changes from review
aaronchucarroll Sep 9, 2024
9d7d7f8
raise ValueError on str types
aaronchucarroll Sep 12, 2024
fa41092
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 13, 2024
3d20d2b
fix typo from merge conflict
aaronchucarroll Sep 15, 2024
392782c
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 15, 2024
870457e
fix precommit issue
aaronchucarroll Sep 15, 2024
7ec0b5f
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Sep 16, 2024
8e18ee0
fix minor issues
aaronchucarroll Oct 7, 2024
798a8ea
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Oct 7, 2024
3ac320e
Merge branch 'main' into stringmethods-get-dummies
aaronchucarroll Oct 7, 2024
fbdddbb
revert previous string_dtype change
aaronchucarroll Oct 7, 2024
8eec58e
Merge branch 'stringmethods-get-dummies' of https://github.com/aaronc…
aaronchucarroll Oct 7, 2024
a912bd8
test current setup
aaronchucarroll Oct 8, 2024
bd8c059
test remove pyarrow import
aaronchucarroll Oct 8, 2024
961eb6c
Merge remote-tracking branch 'upstream/main' into stringmethods-get-d…
jorisvandenbossche Jan 13, 2025
151316d
update import and test
jorisvandenbossche Jan 13, 2025
f829533
Merge remote-tracking branch 'upstream/main' into stringmethods-get-d…
jorisvandenbossche Jan 14, 2025
171d381
Merge remote-tracking branch 'upstream/main' into stringmethods-get-d…
jorisvandenbossche Jan 22, 2025
8b26e8c
restrict to numeric or boolean
jorisvandenbossche Jan 22, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions doc/source/whatsnew/v3.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -54,9 +54,8 @@ Other enhancements
- :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`)
- :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`)
- :meth:`DataFrame.plot.scatter` argument ``c`` now accepts a column of strings, where rows with the same string are colored identically (:issue:`16827` and :issue:`16485`)
- :meth:`Series.map` can now accept kwargs to pass on to func (:issue:`59814`)
- :meth:`Series.str.get_dummies` now accepts a ``dtype`` parameter to specify the dtype of the resulting DataFrame (:issue:`47872`)
- :meth:`pandas.concat` will raise a ``ValueError`` when ``ignore_index=True`` and ``keys`` is not ``None`` (:issue:`59274`)
- :meth:`str.get_dummies` now accepts a ``dtype`` parameter to specify the dtype of the resulting DataFrame (:issue:`47872`)
- Multiplying two :class:`DateOffset` objects will now raise a ``TypeError`` instead of a ``RecursionError`` (:issue:`59442`)
- Restore support for reading Stata 104-format and enable reading 103-format dta files (:issue:`58554`)
- Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
Expand Down
2 changes: 0 additions & 2 deletions pandas/core/arrays/arrow/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -2409,8 +2409,6 @@ def _str_get_dummies(self, sep: str = "|", dtype: NpDtype | None = None):
else:
dummies_dtype = np.bool_
dummies = np.zeros(n_rows * n_cols, dtype=dummies_dtype)
if dtype == str:
dummies[:] = False
dummies[indices] = True
dummies = dummies.reshape((n_rows, n_cols))
result = type(self)(pa.array(list(dummies)))
Expand Down
6 changes: 5 additions & 1 deletion pandas/core/strings/accessor.py
Original file line number Diff line number Diff line change
Expand Up @@ -2482,12 +2482,16 @@ def get_dummies(
1 False False False
2 True False True
"""
from pandas.core.dtypes.common import is_string_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas.core.dtypes.common is already being imported from at the top of this file. Can you move this import up there.


from pandas.core.frame import DataFrame

if is_string_dtype(dtype):
raise ValueError("string dtype not supported, please use a numeric dtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of explicitly disallowing specifically string dtype, should we allow any numeric dtype? (so raise an error if not is_numeric_dtype(dtype))
Because right now this would also allow the user to specify they want datetime dtype, for example, which also does not make much sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach thought here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that we can restrict to just numeric here.

# we need to cast to Series of strings as only that has all
# methods available for making the dummies...
result, name = self._data.array._str_get_dummies(sep, dtype)
if is_extension_array_dtype(dtype) or isinstance(dtype, ArrowDtype):
if is_extension_array_dtype(dtype):
return self._wrap_result(
DataFrame(result, columns=name, dtype=dtype),
name=name,
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/strings/object_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -421,7 +421,7 @@ def _str_get_dummies(self, sep: str = "|", dtype: NpDtype | None = None):
dummies_dtype = _dtype
else:
dummies_dtype = np.bool_
dummies = np.empty((len(arr), len(tags2)), dtype=dummies_dtype)
dummies = np.empty((len(arr), len(tags2)), dtype=dummies_dtype, order="F")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronchucarroll @rhshadrach do you remember why this change was included?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see that is based on a comment from @mroeschke at #59577 (comment)


def _isin(test_elements: str, element: str) -> bool:
return element in test_elements
Expand Down
34 changes: 8 additions & 26 deletions pandas/tests/strings/test_get_dummies.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

import pandas.util._test_decorators as td

from pandas import (
Expand All @@ -13,11 +11,6 @@
_testing as tm,
)

try:
import pyarrow as pa
except ImportError:
pa = None


def test_get_dummies(any_string_dtype):
s = Series(["a|b", "a|c", np.nan], dtype=any_string_dtype)
Expand Down Expand Up @@ -98,30 +91,19 @@ def test_get_dummies_with_pyarrow_dtype(any_string_dtype, dtype):


# GH#47872
@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)")
def test_get_dummies_with_str_dtype(any_string_dtype):
s = Series(["a|b", "a|c", np.nan], dtype=any_string_dtype)
result = s.str.get_dummies("|", dtype=str)
expected = DataFrame(
[["T", "T", "F"], ["T", "F", "T"], ["F", "F", "F"]],
columns=list("abc"),
dtype=str,
)
tm.assert_frame_equal(result, expected)
with pytest.raises(
ValueError, match="string dtype not supported, please use a numeric dtype"
):
s.str.get_dummies("|", dtype=str)


# GH#47872
@td.skip_if_no("pyarrow")
def test_get_dummies_with_pa_str_dtype(any_string_dtype):
s = Series(["a|b", "a|c", np.nan], dtype=any_string_dtype)
result = s.str.get_dummies("|", dtype="str[pyarrow]")
expected = DataFrame(
[
["true", "true", "false"],
["true", "false", "true"],
["false", "false", "false"],
],
columns=list("abc"),
dtype="str[pyarrow]",
)
tm.assert_frame_equal(result, expected)
with pytest.raises(
ValueError, match="string dtype not supported, please use a numeric dtype"
):
s.str.get_dummies("|", dtype="str[pyarrow]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test above is sufficient (so keep test_get_dummies_with_str_dtype but can remove test_get_dummies_with_pa_str_dtype