base/test_unique.py: regression test for bad unicode string #34851

suvayu · 2020-06-17T13:27:02Z

closes BUG: Series.unique segfaults on invalid unicode #34550
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I ran the test against pandas 1.0.4, where it segfaults as expected (shown below); on master the test passes.

tests/test_mytest.py::test_unique2[idx_or_series_w_bad_unicode0] Fatal Python error: Segmentation fault

Current thread 0x00007f09d2dc4740 (most recent call first):
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pandas/core/algorithms.py", line 382 in unique
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pandas/core/base.py", line 1246 in unique
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 1974 in unique
  File "/home/suvali/tmp/pandas-tests/tests/test_mytest.py", line 16 in test_unique2
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/python.py", line 1477 in runtest
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 244 in from_call
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_runtest_hook
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/main.py", line 247 in _main
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/opt/conda/envs/py38/lib/python3.8/site-packages/_pytest/config/__init__.py", line 124 in main
  File "/opt/conda/envs/py38/bin/pytest", line 11 in <module>
Segmentation fault (core dumped)
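
For readers unfamiliar with the failure mode, here is a minimal sketch of what "invalid unicode" can mean for a Python string. It assumes the offending value is a lone surrogate such as "\ud83d" (the traceback above does not show the actual input), and the helper name is hypothetical. It deliberately avoids calling pandas, since on an affected version that call would crash the interpreter.

```python
# Assumption: the "bad unicode" value is a lone high surrogate.
BAD = "\ud83d"


def is_utf8_encodable(text: str) -> bool:
    """Return True if `text` can be encoded as UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Python can hold the string, but it is not valid for UTF-8 encoding.
bad_is_surrogate = 0xD800 <= ord(BAD) <= 0xDFFF
bad_encodable = is_utf8_encodable(BAD)
ok_encodable = is_utf8_encodable("\U0001F600")  # a complete code point
```

A string like this is a legal `str` object, which is why it can end up inside a Series or Index in the first place.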

dsaxton · 2020-06-18T14:38:09Z

pandas/tests/base/test_unique.py

+    unique_values = list(dict.fromkeys(obj.values))
+    if isinstance(obj, pd.Index):
+        expected = pd.Index(unique_values, dtype=obj.dtype)
+        if is_datetime64tz_dtype(obj.dtype):
+            expected = expected.normalize()
+        tm.assert_index_equal(result, expected)
+    else:
+        expected = np.array(unique_values, dtype=obj.dtype)
+        tm.assert_numpy_array_equal(result, expected)


I think this could be simpler. Since the sole unique value is hard-coded above we can just explicitly construct the expected output using that value and object dtype (shouldn't need to check for datetime).

Thanks for your comment @dsaxton. I actually wanted to replicate the basic test earlier in the same file, test_unique(..). It takes a fixture, and I wasn't sure how to pass a fixture and my examples simultaneously to pytest.mark.parametrize (my search led me to a feature request for pytest with lots of messy workarounds). So I guess the options are:

- I simplify this as you suggest, or
- we figure out a way to pass these examples along with the fixture.

This is my first PR here, so a bit of guidance would be great :)

Sure thing, testing PRs are always appreciated. It's probably fine to create this as a simple one-off rather than including invalid unicode in every test that uses the fixture.

Okay, I'll simplify this one and push an update. Thanks a lot :)
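
The simplification agreed on above might look roughly like the following sketch. It is not the code that was merged: the value "\ud83d", the helper name `check_unique_bad_unicode`, and the use of the public `pd.testing` / `np.testing` helpers in place of pandas' internal `tm` module are all assumptions made so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Assumption: a lone surrogate stands in for the hard-coded bad value.
BAD = "\ud83d"


def check_unique_bad_unicode(obj):
    """One-off check: unique() returns the single hard-coded value."""
    result = obj.unique()
    if isinstance(obj, pd.Index):
        # Index.unique() returns an Index
        expected = pd.Index([BAD], dtype=object)
        pd.testing.assert_index_equal(result, expected)
    else:
        # Series.unique() on object dtype returns a NumPy array
        expected = np.array([BAD], dtype=object)
        np.testing.assert_array_equal(result, expected)
    return result


# Duplicates are added "up top", when the inputs are built.
for make in (pd.Index, pd.Series):
    check_unique_bad_unicode(make([BAD] * 2, dtype=object))
```

On an affected release such as 1.0.4 the `unique()` call is where the segfault occurs, so this only runs to completion on a fixed version.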

dsaxton · 2020-06-18T14:44:19Z

pandas/tests/base/test_unique.py

+def test_unique_bad_unicode(idx_or_series_w_bad_unicode):
+    # regression test for #34550
+    obj = idx_or_series_w_bad_unicode
+    obj = np.repeat(obj, range(1, len(obj) + 1))


Are we trying to add duplicates here? If so, it probably makes sense to add them up top.
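
To make the question concrete, this small sketch shows what `np.repeat` does when given one repeat count per element. The sample values are made up for illustration; only `np.repeat` and the `range(1, len(obj) + 1)` pattern come from the diff.

```python
import numpy as np

# Illustrative values, not the ones used in the PR.
values = np.array(["a", "b", "c"], dtype=object)

# Element i is repeated i + 1 times: 1 x "a", 2 x "b", 3 x "c".
repeated = np.repeat(values, range(1, len(values) + 1))

as_list = repeated.tolist()  # ['a', 'b', 'b', 'c', 'c', 'c']
```

With an input that already holds two equal entries, the same call yields 1 + 2 = 3 copies, so it does add duplicates, just less visibly than building them into the parametrized inputs.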

dsaxton · 2020-06-18T18:28:16Z

pandas/tests/base/test_unique.py

+    result = obj.unique()
+
+    # dict.fromkeys preserves the order
+    unique_values = list(dict.fromkeys(obj.values))


You can remove this and then simply use pd.Index(["\ud83d"], dtype=object) below.

Sorry I missed that part of your comment, fixed now :)
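
For reference, the idiom in the removed lines is the standard order-preserving way to deduplicate a sequence in plain Python. The sample list below is illustrative only.

```python
# Illustrative input with repeats in a non-sorted order.
values = ["b", "a", "b", "c", "a"]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so this keeps the first occurrence of each value.
unique_values = list(dict.fromkeys(values))  # ['b', 'a', 'c']
```

When the input contains a single known value, as in this test, hard-coding the expected result is simpler and does not depend on the data under test.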

suvayu · 2020-06-18T20:05:50Z

The test failures are in pandas/tests/io/json/test_pandas.py: a warning from contextlib has changed from FutureWarning to ResourceWarning.

I didn't touch this test. Should I fix it here, or is that for another PR?
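
As a rough illustration of how a change in warning category can fail a test nobody touched, the sketch below records the warnings a function emits and checks their category. The function `leaky` is a hypothetical stand-in; the actual cause of the CI failure is not shown in this thread.

```python
import warnings


def emitted_categories(func):
    """Run `func` and return the categories of all warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide anything
        func()
    return [w.category for w in caught]


def leaky():
    # Hypothetical stand-in for code that now emits a different category.
    warnings.warn("unclosed resource", ResourceWarning)


categories = emitted_categories(leaky)

# A check written for FutureWarning no longer matches what is emitted.
saw_future = any(issubclass(c, FutureWarning) for c in categories)
saw_resource = any(issubclass(c, ResourceWarning) for c in categories)
```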

dsaxton · 2020-06-18T22:01:31Z

> I didn't touch this test. Should I fix it here, or is that for another PR?

Seems unrelated (could be a flaky test); try merging master to restart CI and see if it goes away.

closes #34550

jreback · 2020-06-18T23:08:46Z

thanks @suvayu

dsaxton reviewed Jun 18, 2020

dsaxton added the Testing pandas testing functions or related to the test suite label Jun 18, 2020

dsaxton reviewed Jun 18, 2020

base/test_unique.py: regression test for bad unicode string

8069b0c

closes #34550

jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jun 18, 2020

jreback added this to the 1.1 milestone Jun 18, 2020

jreback merged commit e6e0889 into pandas-dev:master Jun 18, 2020
