Skip to content

add Series.str.remove(pre|suf)fix #43328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Sep 6, 2021
Merged

add Series.str.remove(pre|suf)fix #43328

merged 13 commits into from
Sep 6, 2021

Conversation

janosh
Copy link
Contributor

@janosh janosh commented Aug 31, 2021

Based on work in #39226

Copy link
Member

@alimcmaster1 alimcmaster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an excellent first PR. Could we also add benchmarks as suggested by @simonjayhawkins here: #39226 (comment)

Ref: https://pandas.pydata.org/docs/development/contributing_codebase.html#running-the-performance-test-suite

@@ -535,6 +535,26 @@ def test_strip_lstrip_rstrip_args(any_string_dtype):
tm.assert_series_equal(result, expected)


def test_remove_suffix_prefix(any_string_dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we parameterize these tests cases using https://docs.pytest.org/en/6.2.x/parametrize.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean like this?

@pytest.mark.parametrize(
    "prefix,expected",
    [("x", ["xABCxx", "x BNSD", "LDFJH xx"]), ("xx ", ["xxABCxx", "BNSD", "LDFJH xx"])],
)
def test_removeprefix(any_string_dtype, prefix, expected):
    ser = Series(["xxABCxx", "xx BNSD", "LDFJH xx"], dtype=any_string_dtype)

    result = ser.str.removeprefix(prefix)
    ser_expected = Series(expected, dtype=any_string_dtype)
    tm.assert_series_equal(result, ser_expected)


@pytest.mark.parametrize(
    "suffix,expected",
    [("x", ["xxABCx", "xx BNSD", "LDFJH x"]), ("xx ", ["xxABCxx", "xx BNSD", "LDFJH"])],
)
def test_removesuffix(any_string_dtype, suffix, expected):
    ser = Series(["xxABCxx", "xx BNSD", "LDFJH xx"], dtype=any_string_dtype)

    result = ser.str.removesuffix(suffix)
    ser_expected = Series(expected, dtype=any_string_dtype)
    tm.assert_series_equal(result, ser_expected)

It's actually longer and I find it harder to read. But your call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure why the strings need to be so complex. A minimal case is surely isomorphic..

@pytest.mark.parametrize(
    "prefix, expected", [("a", ["b", " b c", "bc"]), ("ab", ["", "a b c", "bc"])],
)
def test_removeprefix(any_string_dtype, prefix, expected):
    ser = Series(["ab", "a b c", "bc], dtype=any_string_dtype)
    result = ser.str.removeprefix(prefix)
    ser_expected = Series(expected, dtype=any_string_dtype)
    tm.assert_series_equal(result, ser_expected)

@pytest.mark.parametrize(
    "suffix, expected", [("c", ["ab", "a b ", "b"]), ("bc", ["ab", "a b c", ""])],
)
def test_removesuffix(any_string_dtype, prefix, expected):
    ser = Series(["ab", "a b c", "bc"], dtype=any_string_dtype)
    result = ser.str.removesuffix(suffix)
    ser_expected = Series(expected, dtype=any_string_dtype)
    tm.assert_series_equal(result, ser_expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure why the strings need to be so complex.

Don't know either. I just copied those strings from the other test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets simplify them then - to what @attack68 suggested above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, already have. Just waiting for feedback on return value type hints and perf benchmark before pushing.

@@ -414,6 +414,33 @@ def _str_lstrip(self, to_strip=None):
def _str_rstrip(self, to_strip=None):
return self._str_map(lambda x: x.rstrip(to_strip))

def _str_removeprefix(self, prefix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add type hints for these newly added methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the inputs or return value as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both would be great :)

@janosh
Copy link
Contributor Author

janosh commented Aug 31, 2021

I tried asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b strings but got

· Unknown commit upstream/master

Would be good to add a hint to the docs to run git fetch upstream master in that case.

Also, the following hint in asv_bench/asv.conf.json appears to be wrong:

// The Pythons you'd like to test against. If not provided, defaults
// to the current version of Python used to run asv.
// "pythons": ["2.7", "3.4"],
"pythons": ["3.8"],

I got an error

· No executable found for python 3.8
· No environments selected

Changing to "3.9" and rerunning, I get this error

·· Failure creating environment for virtualenv-py3.9-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-scipy-sqlalchemy-tables-xlrd-xlsxwriter-xlwt
Traceback (most recent call last):
File "/Users/janosh/.venv/py39/bin/asv", line 8, in
sys.exit(main())
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/main.py", line 38, in main
result = args.func(args)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/init.py", line 49, in run_from_args
return cls.run_from_conf_args(conf, args)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/continuous.py", line 75, in run_from_conf_args
return cls.run(
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/continuous.py", line 114, in run
result = Run.run(
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/run.py", line 294, in run
Setup.perform_setup(environments, parallel=parallel)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/setup.py", line 89, in perform_setup
list(map(_create, environments))
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/commands/setup.py", line 21, in _create
env.create()
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/environment.py", line 704, in create
self._setup()
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/plugins/virtualenv.py", line 148, in _setup
self._install_requirements()
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/plugins/virtualenv.py", line 172, in _install_requirements
self._run_pip(args, timeout=self._install_timeout, env=env)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/plugins/virtualenv.py", line 177, in _run_pip
return self.run_executable('python', ['-mpip'] + list(args), **kwargs)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/environment.py", line 949, in run_executable
return util.check_output([exe] + args, **kwargs)
File "/Users/janosh/.venv/py39/lib/python3.9/site-packages/asv/util.py", line 754, in check_output
raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command '/Users/janosh/Repos/pandas/asv_bench/env/94b401bb918963c7d4d914702f131a3f/bin/python -mpip install -v --upgrade numpy Cython==0.29.21 matplotlib sqlalchemy scipy numba numexpr pyarrow tables openpyxl xlsxwriter xlrd xlwt odfpy jinja2' returned non-zero exit status 1

Does the Python interpreter have to be in a clean virtualenv or something like that?

@janosh
Copy link
Contributor Author

janosh commented Sep 4, 2021

I think I need some help here fixing this error:

+ pytest -m 'not slow and not network and not clipboard' pandas --junitxml=test-data.xml
ImportError while loading conftest '/pandas/pandas/conftest.py'.
pandas/__init__.py:46: in <module>
    from pandas.core.api import (
pandas/core/api.py:29: in <module>
    from pandas.core.arrays import Categorical
pandas/core/arrays/__init__.py:7: in <module>
    from pandas.core.arrays.categorical import Categorical
pandas/core/arrays/categorical.py:113: in <module>
    from pandas.core.strings.object_array import ObjectStringArrayMixin
pandas/core/strings/__init__.py:31: in <module>
    from pandas.core.strings.base import BaseStringArrayMethods
pandas/core/strings/base.py:11: in <module>
    from pandas.core.series import Series
pandas/core/series.py:91: in <module>
    from pandas.core import (
pandas/core/generic.py:111: in <module>
    from pandas.core import (
pandas/core/indexing.py:57: in <module>
    from pandas.core.indexes.api import (
pandas/core/indexes/api.py:11: in <module>
    from pandas.core.indexes.base import (
pandas/core/indexes/base.py:128: in <module>
    from pandas.core.arrays import (
E   ImportError: cannot import name 'Categorical' from partially initialized module 'pandas.core.arrays' (most likely due to a circular import) (/pandas/pandas/core/arrays/__init__.py)

@@ -8,6 +8,8 @@

from pandas._typing import Scalar

from pandas.core.series import Series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import of Series is added just for the type annotations and causing the circular import?

try

from typing import TYPE_CHECKING
...
if TYPE_CHECKING:
    from pandas import Series

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add to docs:

  • doc/source/reference/series.rst
  • doc/source/user_guide/text.rst

merge master and ping on green

janosh and others added 3 commits September 5, 2021 14:58
…es a space before the colon separating the parameter name and type
@janosh
Copy link
Contributor Author

janosh commented Sep 5, 2021

@jreback 3 tests timed out after 60 min. Is that equivalent to 'green' or anything still to do?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. ping on greenish

@@ -335,6 +335,19 @@ regular expression object will raise a ``ValueError``.
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex

``removeprefix`` and ``removesuffix`` have the same effect as ``str.removeprefix`` and ``str.removesuffix`` added in Python 3.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionadded 1.4. tag here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need this instead of the one on L349

@jreback jreback added this to the 1.4 milestone Sep 6, 2021
@jreback jreback merged commit 0a9f9ee into pandas-dev:master Sep 6, 2021
@jreback
Copy link
Contributor

jreback commented Sep 6, 2021

thanks @janosh very nice!

and @erfannariman for the original!

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add string method remove prefix and suffix, python 3.9
5 participants