Unexpected results when filtering with .isin (some fields contain python datastructures) #20883

Closed
Tracked by #3
atc0m opened this issue Apr 30, 2018 · 12 comments · Fixed by #50019
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@atc0m

atc0m commented Apr 30, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

data = [
    {'id': 1, 'content': [{'values': 3}]},
    {'id': 2, 'content': u'whats going on'},
    {'id': 3, 'content': u'whaaaaaaaaat'},
    {'id': 4, 'content': [{'values': 4}]}
]

if __name__ == '__main__':
    df = pd.DataFrame.from_dict(data)
    v = [u'whats going on', u'whaaaaaat']
    print df[df.content.isin(v)]
    v = [u'whats going on', u'what']
    print df[df.content.isin(v)]

Problem description

The first print statement executes successfully, filtering to the single row 'id': 2, 'content': u'whats going on'. The second filter, however, throws an error, even though the only difference is the length of one of the elements in the list v.

Output for the code snippet above:

          content  id
1  whats going on   2
/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/indexes/range.py:473: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
  return max(0, -(-(self._stop - self._start) // self._step))
Traceback (most recent call last):
  File "test_pandas.py", line 15, in <module>
    print df[df.content.isin(v)]
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 2804, in isin
    return self._constructor(result, index=self.index).__finalize__(self)
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 264, in __init__
    raise_cast_failure=True)
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/series.py", line 3269, in _sanitize_array
    if len(subarr) != len(index) and len(subarr) == 1:
  File "/home/attila/digital/env/local/lib/python2.7/site-packages/pandas/core/indexes/range.py", line 473, in __len__
    return max(0, -(-(self._stop - self._start) // self._step))
TypeError: unhashable type: 'list'

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 2.9.2
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.14.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I have a different output:

In [7]: df.content.isin(v)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: unhashable type: 'list'

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-7-5a60788e7bc7> in <module>()
----> 1 df.content.isin(v)

~/sandbox/pandas/pandas/core/series.py in isin(self, values)
   3576         Name: animal, dtype: bool
   3577         """
-> 3578         result = algorithms.isin(self, values)
   3579         return self._constructor(result, index=self.index).__finalize__(self)
   3580

~/sandbox/pandas/pandas/core/algorithms.py in isin(comps, values)
    444             comps = comps.astype(object)
    445
--> 446     return f(comps, values)
    447
    448

~/sandbox/pandas/pandas/core/algorithms.py in <lambda>(x, y)
    419
    420     # faster for larger cases to use np.in1d
--> 421     f = lambda x, y: htable.ismember_object(x, values)
    422
    423     # GH16012

~/sandbox/pandas/pandas/_libs/hashtable_func_helper.pxi in pandas._libs.hashtable.ismember_object()
    470
    471     kh_destroy_pymap(table)
--> 472     return result.view(np.bool_)
    473
    474

SystemError: <built-in method view of numpy.ndarray object at 0x1078a93f0> returned a result with an error set

In general, nested data like this aren't well supported at the moment. The upcoming 0.23 release is laying some groundwork to better-support this, but it'll take some time.
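
The `TypeError: unhashable type: 'list'` at the root of both tracebacks is what Python raises when a list is hashed, and the traceback above shows `isin` going through a hash-table routine (`htable.ismember_object`). A minimal sketch in plain Python, not pandas internals, showing which of the reported `content` values can be hashed (`is_hashable` is a hypothetical helper introduced for illustration):

```python
def is_hashable(obj):
    """Hypothetical helper: report whether hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        # Lists (and dicts) define no __hash__, so hashing them raises TypeError.
        return False
    return True


# The 'content' values from the original report.
content = [[{'values': 3}], 'whats going on', 'whaaaaaaaaat', [{'values': 4}]]

hashable_flags = [is_hashable(item) for item in content]
print(hashable_flags)  # [False, True, True, False]
```

The two nested rows are the ones a hash-based membership test cannot process directly.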

@mroeschke mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves DataFrame DataFrame data structure labels Jan 13, 2019
@apiszcz

apiszcz commented Apr 24, 2019

Similar issue: with a single value in sdf.id.values, the following error occurs; with 2 or more values there is no error.

(Pdb) df.isin(sdf.id.values)
*** SystemError: <built-in method view of numpy.ndarray object at 0x..........> returned a result with an error set

@JavierClearImageAI

JavierClearImageAI commented Jun 10, 2019

Still not working in pandas version '0.24.2'. I am getting the same error as @TomAugspurger using Python 3.7.3. It worked perfectly in Python 2.7.15.

Any idea how to sort this out?

@TomAugspurger
Contributor

I don't think anyone has investigated deeply. Could you @javi-clear-image-ai?

@JavierClearImageAI

JavierClearImageAI commented Jun 27, 2019

I don't think anyone has investigated deeply. Could you @javi-clear-image-ai?

I did (a bit), but without much luck. I ended up moving from pandas to numpy (df.values) and working with the numpy array. It worked for me, so that is the workaround I would suggest for the moment.

@toobaz
Member

toobaz commented Jun 29, 2019

Simpler test case:

pd.Series([0, [1, 2]]).isin(['a', 'b'])

(so unrelated to indexing, or DataFrame).

@toobaz toobaz added Series Series data structure and removed DataFrame DataFrame data structure Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 29, 2019
@fclesio

fclesio commented Aug 5, 2019

In my case, the problem was that my column held generic pandas objects rather than str values, i.e. my data was an array and I was comparing it directly against a list.

I just cast each value to str and compare them one by one:

# Iterate on the top of words in a column
for textract_value in textract_keywords:
    textract_value = str(textract_value).lower()
    for handwerkskammer in handwerkskammer_name:
        handwerkskammer = str(handwerkskammer).lower()

        if textract_value == handwerkskammer:
            print(f'Contains: {handwerkskammer}') 

This solved my problem.
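
A vectorized variant of the same idea, offered only as a sketch: cast the whole column to `str` before calling `isin`, so every element is a hashable string. The sample values and the names `column`, `wanted`, and `normalized` are assumptions for illustration, not the commenter's data. Note that nested values are compared by their string representation, which may or may not be what you want.

```python
import pandas as pd

# Illustrative data: two strings and one nested value.
column = pd.Series(["Whats Going On", [{"values": 3}], "whaaaaaaaaat"])
wanted = ["whats going on", "what"]

# Cast every element to str (lists become their repr), then lower-case,
# mirroring the str(...).lower() calls in the loop above.
normalized = column.astype(str).str.lower()
mask = normalized.isin([w.lower() for w in wanted])

print(mask.tolist())  # [True, False, False]
```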

@simonjayhawkins simonjayhawkins added Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Series Series data structure labels Jul 26, 2020
@jbrockmendel jbrockmendel added the isin isin method label Oct 30, 2020
@matthewmturner

I think I'm having a similar issue. I have the following filter, filter = {'ListingType': ['coachListing']}, to apply to a df with this structure:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ListingId                11 non-null     object
 1   MaverickListingId        0 non-null      object
 2   CreatedAt                11 non-null     object
 3   ListingState             11 non-null     object
 4   AutoApproveTransactions  0 non-null      object
 5   Biography                0 non-null      object
 6   Geolocation              0 non-null      object
 7   ListingTitle             11 non-null     object
 8   ListingDescription       4 non-null      object
 9   ListingActivities        8 non-null      object
 10  UserId                   0 non-null      object
 11  ListingType              11 non-null     string

But when using df.isin(filter) I get the following:

TypeError: unhashable type: 'list'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/matth/Versd/MetaVersd/metaversd/setup_db_load.py", line 41, in <module>
    main()
  File "/home/matth/Versd/MetaVersd/metaversd/setup_db_load.py", line 33, in main
    run_sharetribe_jobs(env, config, today, ST_SOURCES)
  File "/home/matth/Versd/MetaVersd/metaversd/etl/sources/sharetribe.py", line 276, in run_sharetribe_jobs
    data.process()
  File "/home/matth/Versd/MetaVersd/metaversd/etl/sources/source.py", line 64, in process
    self.upload_db()
  File "/home/matth/Versd/MetaVersd/metaversd/etl/sources/source.py", line 49, in upload_db
    data = self.clean()
  File "/home/matth/Versd/MetaVersd/metaversd/etl/sources/sharetribe.py", line 74, in clean
    mask = clean_data.isin(filters)
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/frame.py", line 9204, in isin
    axis=1,
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 284, in concat
    sort=sort,
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 328, in __init__
    objs = list(objs)
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/frame.py", line 9202, in <genexpr>
    for i, col in enumerate(self.columns)
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/frame.py", line 9222, in isin
    algorithms.isin(self.values.ravel(), values).reshape(self.shape),
  File "/home/matth/miniconda3/envs/metaversd/lib/python3.7/site-packages/pandas/core/algorithms.py", line 465, in isin
    return f(comps, values)
  File "pandas/_libs/hashtable_func_helper.pxi", line 454, in pandas._libs.hashtable.ismember_object
SystemError: <built-in method view of numpy.ndarray object at 0x7f7f67d1e9e0> returned a result with an error set

Using isin on the series directly works without issue, but that doesn't really fit my use case, where different data sources will be filtering on different columns/values, so I wanted to be able to use isin on the full df with different dicts.

Also of note: for the sake of it, I tried changing the filter value from a list to just the string I wanted, "coachListing", and received the same TypeError: unhashable type: 'list'

Finally, this column was originally an "object" dtype and I converted it to a string. The error happened in both cases.

Any help / insight is much appreciated!
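
Since the comment reports that `isin` on a single series works, one possible workaround is to apply the filter dict column by column and combine the results. This is a sketch under assumptions: `mask_from_filters` is a hypothetical helper, the two-row frame is made-up sample data, and the result is a row mask (all filters ANDed together) rather than the frame-shaped mask that `DataFrame.isin` returns.

```python
import pandas as pd


def mask_from_filters(df, filters):
    """Hypothetical helper: AND together one Series.isin mask per filtered column."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in filters.items():
        # Only the filtered column is inspected, so nested values in
        # other columns never reach isin.
        mask &= df[column].isin(allowed)
    return mask


# Made-up sample data: one plain string column, one column with nested values.
df = pd.DataFrame({
    "ListingType": ["coachListing", "otherListing"],
    "ListingActivities": [["yoga"], None],
})
filters = {"ListingType": ["coachListing"]}

mask = mask_from_filters(df, filters)
print(df[mask])
```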

@panda-byte

Using pandas 1.2.4 with Python 3.9.2, it's still not possible to use Series.isin() / DataFrame.isin() with unhashable types, which I don't think is to be expected from the documentation.

I think most people would expect isin() to behave like applying the Python keyword in elementwise, which is not the case:
pd.Series([1, [2]]).isin([1]) throws the above-mentioned error, while pd.Series([1, [2]]).apply(lambda x: x in [1]) works and returns the expected result.
In my opinion, the latter should be a fallback option, or the documentation should state that only hashable types are allowed for isin().
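
The elementwise fallback described above can be sketched as a small helper. `isin_elementwise` is a hypothetical name, not a pandas API; it relies on Python's `in` operator on a list, which compares with `==` and therefore does not need hashable elements. It will be slower than a hash-based lookup on large data.

```python
import pandas as pd


def isin_elementwise(series, values):
    """Hypothetical fallback: test membership with Python's `in`, element by element."""
    values = list(values)
    # `x in values` uses identity/equality comparisons, so lists are fine here.
    return series.apply(lambda x: x in values)


s = pd.Series([1, [2]])
mask = isin_elementwise(s, [1])
print(mask.tolist())  # [True, False]
```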

@mroeschke
Member

The original issue appears to work on master. It could use a test.

In [3]: import pandas as pd
   ...:
   ...: data = [
   ...:     {'id': 1, 'content': [{'values': 3}]},
   ...:     {'id': 2, 'content': u'whats going on'},
   ...:     {'id': 3, 'content': u'whaaaaaaaaat'},
   ...:     {'id': 4, 'content': [{'values': 4}]}
   ...: ]

In [4]:     df = pd.DataFrame.from_dict(data)
   ...:     v = [u'whats going on', u'whaaaaaat']

In [5]: df[df.content.isin(v)]
Out[5]:
   id         content
1   2  whats going on

In [6]: v = [u'whats going on', u'what']

In [7]: df[df.content.isin(v)]
Out[7]:
   id         content
1   2  whats going on
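
A regression test along these lines might look like the sketch below. It is not the test that was eventually merged; the function names are illustrative, and it assumes a pandas version in which the behavior is fixed (a later comment in this thread reports 1.4.1 working).

```python
import pandas as pd
import pandas.testing as tm


def test_isin_with_nested_values():
    # GH 20883: a Series mixing a scalar and a list should not raise.
    ser = pd.Series([1, [2]])
    result = ser.isin([1])
    expected = pd.Series([True, False])
    tm.assert_series_equal(result, expected)


def test_isin_original_report():
    # Data from the original report in this issue.
    data = [
        {'id': 1, 'content': [{'values': 3}]},
        {'id': 2, 'content': 'whats going on'},
        {'id': 3, 'content': 'whaaaaaaaaat'},
        {'id': 4, 'content': [{'values': 4}]},
    ]
    df = pd.DataFrame(data)
    # Both value lists from the report should select only the row with id 2.
    for values in (['whats going on', 'whaaaaaat'], ['whats going on', 'what']):
        result = df[df.content.isin(values)]
        assert result['id'].tolist() == [2]
```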

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). isin isin method labels Jun 19, 2021
@panda-byte

Using pandas 1.4.1 with Python 3.10.2, the issue seems to be resolved. The original example code as well as the minimal example I posted earlier (pd.Series([1, [2]]).isin([1])) work now as expected. I assume this issue can be closed.

@ltoniazzi
Contributor

take

13 participants