ENH: add NDFrame.select_str #27340

Closed
wants to merge 1 commit into from
1 change: 1 addition & 0 deletions doc/source/reference/frame.rst
@@ -32,6 +32,7 @@ Attributes and underlying data
DataFrame.get_dtype_counts
DataFrame.get_ftype_counts
DataFrame.select_dtypes
DataFrame.select_str
DataFrame.values
DataFrame.get_values
DataFrame.axes
1 change: 1 addition & 0 deletions doc/source/reference/series.rst
@@ -211,6 +211,7 @@ Reindexing / selection / label manipulation
Series.rename_axis
Series.reset_index
Series.sample
Series.select_str
Series.set_axis
Series.take
Series.tail
5 changes: 5 additions & 0 deletions pandas/core/frame.py
@@ -3347,6 +3347,11 @@ def select_dtypes(self, include=None, exclude=None):
        * To select Pandas datetimetz dtypes, use ``'datetimetz'`` (new in
          0.20.0) or ``'datetime64[ns, tz]'``

        See Also
        --------
        DataFrame.select_str
        DataFrame.loc

        Examples
        --------
        >>> df = pd.DataFrame({'a': [1, 2] * 3,
83 changes: 83 additions & 0 deletions pandas/core/generic.py
@@ -4640,6 +4640,89 @@ def _reindex_with_indexers(

        return self._constructor(new_data).__finalize__(self)

    def select_str(
        self, *, startswith=None, endswith=None, regex=None, flags=0, axis=None
    ):
        """
        Select rows or columns of dataframe from the string labels in the selected axis.

        Exactly one of the keyword arguments `startswith`, `endswith` and `regex`
        must be given.

        Parameters
        ----------
        startswith : str, optional
Member

Can we not just boil the signature down to regex?

Contributor Author

@topper-123 topper-123 Jul 11, 2019

Yeah, startswith and endswith are not cardinal points for me, just thought they are often used, so nicer for people who don't want/know how to use regexes.
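
For reference, the point raised above can be made concrete: a literal prefix or suffix test can be written as an anchored regex once the literal is escaped. The sketch below uses only the standard `re` module and the existing `DataFrame.filter(regex=...)`; the helper names `prefix_regex` and `suffix_regex` are hypothetical and not part of this PR.

```python
import re

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["mouse", "rabbit"],
    columns=["one", "two", "three"],
)


def prefix_regex(prefix):
    # Hypothetical helper: escape the literal, then anchor it at the start.
    return "^" + re.escape(prefix)


def suffix_regex(suffix):
    # Hypothetical helper: escape the literal, then anchor it at the end.
    # Note that "$" also matches just before a trailing newline.
    return re.escape(suffix) + "$"


# DataFrame.filter(regex=...) keeps the labels where re.search finds a match.
starts_with_t = df.filter(regex=prefix_regex("t"), axis=1)
ends_with_e = df.filter(regex=suffix_regex("e"), axis=1)
```

The escaping step is the part that is easy to forget when writing the regex by hand, which is one argument for dedicated `startswith`/`endswith` keywords.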

            Test if the start of each string element matches a pattern.
            Equivalent to :meth:`str.startswith`.
        endswith : str, optional
            Test if the end of each string element matches a pattern.
            Equivalent to :meth:`str.endswith`.
        regex : str, optional
            Keep labels from axis for which re.search(regex, label) finds a match.
        flags : int, default 0 (no flags)
            re module flags, e.g. re.IGNORECASE. Can only be used with parameter regex.
        axis : int or string axis name
            The axis to filter on. By default this is the info axis,
            'index' for Series, 'columns' for DataFrame.

        Returns
        -------
        same type as input object

        See Also
        --------
        DataFrame.loc
        DataFrame.select_dtypes

        Examples
        --------
        >>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
        ...                   index=['mouse', 'rabbit'],
        ...                   columns=['one', 'two', 'three'])

        >>> df.select_str(startswith='t')
                two  three
        mouse     2      3
        rabbit    5      6

        >>> # select columns by regular expression
        >>> df.select_str(regex=r'e$', axis=1)
                one  three
        mouse     1      3
        rabbit    4      6

        >>> # select rows containing 'bbi'
        >>> df.select_str(regex=r'bbi', axis=0)
                one  two  three
        rabbit    4    5      6
        """
        import re

        num_kw = com.count_not_none(startswith, endswith, regex)
        if num_kw != 1:
            raise TypeError(
                "Exactly one of the keyword arguments `startswith`, `endswith` and "
                "`regex` must be given."
            )
        if regex is None and flags != 0:
            raise ValueError("flags can only be used together with parameter 'regex'")

        if axis is None:
            axis = self._info_axis_name
        labels = self._get_axis(axis)

        if startswith is not None:
            mapped = labels.str.startswith(startswith)
        elif endswith is not None:
            mapped = labels.str.endswith(endswith)
        else:  # regex
            matcher = re.compile(regex, flags=flags)

            def f(x):
                return matcher.search(x) is not None

            mapped = labels.map(f)
        return self.loc(axis=axis)[mapped]
Contributor

so I would instead implement

df.loc(regex=)[....] as an api, which turns the indexers into regexes.
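
For readers who want the selection discussed in this thread without either proposed API, the following is a minimal sketch built only from existing public pandas calls (`Index.str.startswith`, `Index.str.endswith`, `Index.str.contains`, and boolean `.loc` indexing). The function name `select_str_sketch` is hypothetical, the axis handling is a simplifying assumption, and this is not the PR's implementation.

```python
import re

import pandas as pd


def select_str_sketch(
    obj, *, startswith=None, endswith=None, regex=None, flags=0, axis=None
):
    """Hypothetical emulation of the proposed method; not the PR's code."""
    given = [kw for kw in (startswith, endswith, regex) if kw is not None]
    if len(given) != 1:
        raise TypeError(
            "Exactly one of `startswith`, `endswith` and `regex` must be given."
        )
    if regex is None and flags != 0:
        raise ValueError("flags can only be used together with 'regex'")

    # Assumption: default to columns for a DataFrame and the index for a Series.
    is_frame = isinstance(obj, pd.DataFrame)
    on_columns = is_frame and axis in (None, 1, "columns")
    labels = obj.columns if on_columns else obj.index

    if startswith is not None:
        mask = labels.str.startswith(startswith)
    elif endswith is not None:
        mask = labels.str.endswith(endswith)
    else:
        # Index.str.contains with regex=True applies re.search to each label.
        mask = labels.str.contains(regex, flags=flags, regex=True)

    # Boolean masks select along the chosen axis through ordinary .loc indexing.
    return obj.loc[:, mask] if on_columns else obj.loc[mask]


df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["mouse", "rabbit"],
    columns=["one", "two", "three"],
)
cols_t = select_str_sketch(df, startswith="t")
rows_bbi = select_str_sketch(df, regex=r"bbi", axis=0)
```

The sketch assumes all labels on the chosen axis are strings; mixed or non-string labels would need extra handling.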


    def filter(self, items=None, like=None, regex=None, axis=None):
        """
        Subset rows or columns of dataframe according to labels in
18 changes: 18 additions & 0 deletions pandas/tests/frame/test_axis_select_reindex.py
@@ -806,6 +806,24 @@ def test_align_series_combinations(self):
        tm.assert_series_equal(res1, exp2)
        tm.assert_frame_equal(res2, exp1)

    def test_select_str(self, float_frame):
        fcopy = float_frame.copy()
        fcopy["AA"] = 1

        # regex
        selected = fcopy.select_str(regex="[A]+")
        assert len(selected.columns) == 2
        assert "AA" in selected

        # doesn't have to be at beginning
        df = DataFrame(
            {"aBBa": [1, 2], "BBaBB": [1, 2], "aCCa": [1, 2], "aCCaBB": [1, 2]}
        )

        result = df.select_str(regex="BB")
        exp = df[[x for x in df.columns if "BB" in x]]
        assert_frame_equal(result, exp)

    def test_filter(self, float_frame, float_string_frame):
        # Items
        filtered = float_frame.filter(["A", "B", "E"])