Skip to content

Series.is_unique has errors on objects with __ne__ defined #20661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Dr-Irv opened this issue Apr 11, 2018 · 5 comments · Fixed by #20691
Closed

Series.is_unique has errors on objects with __ne__ defined #20661

Dr-Irv opened this issue Apr 11, 2018 · 5 comments · Fixed by #20691
Labels
Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 11, 2018

Code Sample, a copy-pastable example if possible

In [2]: import pandas as pd
   ...:
   ...: class Foo(object):
   ...:     def __init__(self, val):
   ...:         self._value = val
   ...:
   ...:     def __ne__(self, other):
   ...:         raise Exception("NEQ not supported")
   ...:

In [3]: li = [Foo(i) for i in range(5)]

In [4]: s = pd.Series(li, index=[i for i in range(5)])

In [5]: s.is_unique
Exception ignored in: 'util._checknan'
Traceback (most recent call last):
  File "<ipython-input-2-91ad489145cd>", line 8, in __ne__
Exception: NEQ not supported
Exception ignored in: 'util._checknan'
Traceback (most recent call last):
  File "<ipython-input-2-91ad489145cd>", line 8, in __ne__
Exception: NEQ not supported
Exception ignored in: 'util._checknan'
Traceback (most recent call last):
  File "<ipython-input-2-91ad489145cd>", line 8, in __ne__
Exception: NEQ not supported
Exception ignored in: 'util._checknan'
Traceback (most recent call last):
  File "<ipython-input-2-91ad489145cd>", line 8, in __ne__
Exception: NEQ not supported
Exception ignored in: 'util._checknan'
Traceback (most recent call last):
  File "<ipython-input-2-91ad489145cd>", line 8, in __ne__
Exception: NEQ not supported
Out[5]: True

Problem description

I'm working with a third party library that has a class that, for good reasons, raises an Exception when the method __ne__ is called. I want to put that class into a pandas Series. Eventually, I hope to do this with the new ExtensionArray feature. Anyhow, I uncovered an issue in my debugger (pydev) which was throwing an exception when I looked into the corresponding Series.

So if you have an object that defines __ne__, then is_unique will fail (as shown above). I can't figure out where util._checknan is being called, but I think it is deep in the cython area.

I think this will need to work correctly if I want to use ExtensionArray to hold these kinds of objects.

Expected Output

Out[5]: True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 402ad45
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+683.g402ad45da.dirty
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Copy link
Member

First, if you would implement it via the ExtensionArray interface, you have the control over the unique() method (which is_unique relies upon under the hood), and so you can make sure that does the right thing.
That should solve your specific issue (but please try if that is indeed the case).

So if you have an object that defines ne, then is_unique will fail (as shown above).

It doesn't "really" seem to fail: you actually get the correct result (True in this case), the "exceptions" you see are only printed, not actually raised (they are ignored, not sure how that works).

The errors are coming from inside the PyObjectHashTable, which is used for custom python objects:

if not _checknan(val):

The _checknan basically does val != val to check it is a nan value or not. Whether it would be easy to let that function deal with such objects properly, I don't know.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 12, 2018

I understand the idea of handling this by implementing the ExtensionArray interface and creating a unique() method. But other people who use this third party library might just create a Series of these objects, and would run into the same issue as reported above.

So I found this pandas issue #18111 that mentioned _checknan and so I implemented the suggestion of @jreback there to replace _checknan with checknull. It passes all the tests on my local machine.

What I don't know how to do is to use my example code above to create a test for this particular change, as the exception info is printed and the exception is caught in the example. Any suggestions for that?

@jorisvandenbossche
Copy link
Member

But other people who use this third party library might just create a Series of these objects, and would run into the same issue as reported above.

Yes, it is true that it still is an annoying (although rathe corner case I think) issue that would be nice to solve for actual object dtype series.

If the _checknull works fine, certainly welcome to do a PR. Might be good to do some performance checks (a %timeit before/after should be enough I think)

What I don't know how to do is to use my example code above to create a test for this particular change, as the exception info is printed and the exception is caught in the example. Any suggestions for that?

I think we should be able to run the test in a context capturing the stdout (and then check there is nothing outputted). Can you check if https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout works?
We will need to have our own implementation as that is python 3 only feature, but I guess we will find some example on the internet.

@jorisvandenbossche
Copy link
Member

We will need to have our own implementation as that is python 3 only feature, but I guess we will find some example on the internet.

For example https://eli.thegreenplace.net/2015/redirecting-all-kinds-of-stdout-in-python/

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 13, 2018

@jorisvandenbossche I found this pytest method for capturing output. https://docs.pytest.org/en/2.8.7/capture.html . So I've written a test that fails with master, and works with the change from _checknan to checknull. No time difference observed. PR coming very soon.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Compat pandas objects compatability with Numpy or Python functions labels Apr 14, 2018
@jreback jreback added this to the 0.23.0 milestone Apr 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants