
pd.Series.equals() fails for Series containing iterable objects #20676

Closed
bfollinprm opened this issue Apr 13, 2018 · 5 comments · Fixed by #35237
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), good first issue, Needs Tests (unit tests needed to prevent regressions)

Comments

@bfollinprm

bfollinprm commented Apr 13, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

pd.Series(
    [np.array([1, 2]), np.array([1, 2])]
).equals(pd.Series(
    [np.array([1, 2]), np.array([1, 2])]
))

# Evaluates to False on pandas 0.22.0

Problem description

For objects that overload equality with elementwise equality (this includes np.array and pd.Series objects), pd.Series.equals() catches

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

and returns False, a behavior inherited from np.equal applied to object arrays. You could argue this is a numpy bug, but numpy equality is purposefully strict for dtype=='object'. In pandas it is common to store arrays and lists in a single column (for example a named (time)series, an image array, or an unpacked NLP document), so this equality behavior is unintuitive.
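
To illustrate where the ValueError quoted above comes from, here is a minimal sketch using only numpy. The variable names are illustrative and not taken from the pandas source; it only shows that elementwise == yields an array whose truth value is ambiguous, while np.array_equal collapses the comparison to a single bool.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([1, 2])

# == on ndarrays is elementwise, so the result is an array, not a single bool.
elementwise = a == b

# Forcing a multi-element array into one truth value is what raises the error.
try:
    bool(elementwise)
    raised = False
except ValueError:
    raised = True

# np.array_equal reduces the comparison to one unambiguous bool.
same = np.array_equal(a, b)
```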

Expected Output

True
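
Until equals() behaves this way, one possible workaround is to compare the cells one by one. The helper below is a hypothetical sketch (object_series_equal is not a pandas function); it assumes every cell is something np.array_equal can compare, and unlike Series.equals it does not treat NaN cells as equal to each other.

```python
import numpy as np
import pandas as pd


def object_series_equal(left: pd.Series, right: pd.Series) -> bool:
    """Hypothetical helper: compare two Series cell by cell with np.array_equal."""
    # Different lengths or different indexes can never be equal.
    if len(left) != len(right) or not left.index.equals(right.index):
        return False
    # Each pair of cells is reduced to a single bool, so nothing is ambiguous.
    return all(np.array_equal(x, y) for x, y in zip(left, right))


s1 = pd.Series([np.array([1, 2]), np.array([1, 2])])
s2 = pd.Series([np.array([1, 2]), np.array([1, 2])])
s3 = pd.Series([np.array([1, 2]), np.array([1, 3])])
```

With these inputs, object_series_equal(s1, s2) is True and object_series_equal(s1, s3) is False.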

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1048-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.2
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 13, 2018

Storing lists or arrays in a Series/DataFrame is generally not fully supported: it works for basic things but may fail in corner cases (e.g. indexing, setting values), it is not well tested, and it is not something the core devs will typically prioritize.

But if you have a relatively straightforward fix, we will be happy to take a PR.

Note that numpy also does not handle equality well for object arrays of arrays:

In [62]: s = pd.Series([np.array([1, 2]), np.array([1, 2])])

In [63]: s.values
Out[63]: array([array([1, 2]), array([1, 2])], dtype=object)

In [64]: s.values == s.values
/home/joris/miniconda3/envs/dev/bin/ipython:1: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[64]: True
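
Since the result of == on an object array of arrays depends on the numpy version (the warning above says it will become an error), a per-cell comparison avoids the question. This is only a sketch with made-up data: it builds two object arrays explicitly and computes one bool per cell without ever calling bool() on an inner array.

```python
import numpy as np

# Build 1-D object arrays whose cells are themselves arrays.
values = np.empty(2, dtype=object)
values[0] = np.array([1, 2])
values[1] = np.array([1, 2])

other = np.empty(2, dtype=object)
other[0] = np.array([1, 2])
other[1] = np.array([9, 9])

# One bool per cell; np.array_equal handles each inner comparison.
mask = np.fromiter(
    (np.array_equal(x, y) for x, y in zip(values, other)),
    dtype=bool,
    count=len(values),
)
```

Here mask marks the first cell as equal and the second as different; mask.all() would give the overall answer.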

@jorisvandenbossche
Member

Just a thought: exploring an ExtensionArray (new in master: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#extending-pandas-with-custom-types) might be a way to fix this.

@jreback
Contributor

jreback commented Apr 13, 2018

Yeah, this is really a numpy issue, as that is how we compare object arrays. If we need to peer inside those arrays, we have to do an inference check, which could be done here. Storing list-likes inside a pandas cell is not supported at all and causes lots of pain in our codebase (think indexing). Using an extension array would be a really nice use case here.
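
As a rough idea of what such an inference check could look like from user code, the sketch below uses the public is_list_like and is_object_dtype helpers from pandas.api.types. The function has_list_like_cells is hypothetical and is not how pandas does inference internally; it only detects whether an object-dtype Series holds any list-like cell.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like, is_object_dtype


def has_list_like_cells(s: pd.Series) -> bool:
    """Hypothetical check: True if an object-dtype Series holds any list-like cell."""
    # Only object-dtype columns can store arbitrary Python objects per cell.
    if not is_object_dtype(s.dtype):
        return False
    # is_list_like is True for lists and arrays but False for strings.
    return any(is_list_like(v) for v in s)


nested = pd.Series([np.array([1, 2]), [3, 4]])
flat = pd.Series(["a", "b"], dtype=object)
numeric = pd.Series([1, 2])
```

A caller could use this to decide whether a cell-by-cell comparison is needed before falling back to the normal path.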

@jreback added the Reshaping, Compat, and ExtensionArray labels Apr 13, 2018
@jreback added this to the Someday milestone Apr 13, 2018
@mroeschke
Member

This appears to work on master. I guess it could use a test.

In [50]: pd.Series(
    ...:     [np.array([1, 2]), np.array([1, 2])]
    ...: ).equals(pd.Series(
    ...:     [np.array([1, 2]), np.array([1, 2])]
    ...: ))
Out[50]: True

In [51]: pd.__version__
Out[51]: '1.1.0.dev0+1216.gd4d58f960'
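
A regression test for this could look roughly like the following pytest-style sketch. The test names are hypothetical and this is not the test that was eventually merged; it assumes a pandas version where the behavior shown above holds (1.1 or later).

```python
import numpy as np
import pandas as pd


def test_equals_series_of_arrays():
    # Mirrors the example from the original report.
    left = pd.Series([np.array([1, 2]), np.array([1, 2])])
    right = pd.Series([np.array([1, 2]), np.array([1, 2])])
    assert left.equals(right)


def test_equals_series_of_arrays_mismatch():
    # A differing inner element should make the Series unequal.
    left = pd.Series([np.array([1, 2]), np.array([1, 2])])
    right = pd.Series([np.array([1, 2]), np.array([1, 3])])
    assert not left.equals(right)
```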

@mroeschke added the good first issue and Needs Tests labels and removed the Compat, ExtensionArray, and Reshaping labels Apr 10, 2020
@avinashpancham
Contributor

take

@simonjayhawkins modified the milestones: Someday, 1.1 Jul 12, 2020
@jreback added the Algos label Jul 13, 2020