DOC: Improve DataFrame.equals docstring comparing index & column with extension dtypes #46507

Closed
3 tasks done
kukushking opened this issue Mar 25, 2022 · 2 comments · Fixed by #56458
Labels
Docs; NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments

kukushking commented Mar 25, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.__version__
Out[27]: '1.3.5'

df = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype=pd.Int64Dtype())})
df2 = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="int64")})
df.equals(df2)
Out[10]: False
df.dtypes
Out[11]: 
i    Int64
dtype: object
df2.dtypes
Out[12]: 
i    int64
dtype: object
df.index.dtype
Out[13]: dtype('int64')
df2.index.dtype
Out[14]: dtype('int64')

df = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype=pd.Int64Dtype())}).set_index("i")
df2 = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="int64")}).set_index("i")
df.equals(df2)
Out[15]: True    # <<< the DataFrames compare equal in 1.3.5
df.dtypes
Out[16]: Series([], dtype: object)
df2.dtypes
Out[17]: Series([], dtype: object)
df.index.dtype
Out[18]: dtype('O')
df2.index.dtype
Out[19]: dtype('int64')




pd.__version__
Out[70]: '1.4.1'

df = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype=pd.Int64Dtype())})
df2 = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="int64")})
df.equals(df2)
Out[71]: False
df.dtypes
Out[72]: 
i    Int64
dtype: object
df2.dtypes
Out[73]: 
i    int64
dtype: object
df.index.dtype
Out[74]: dtype('int64')
df2.index.dtype
Out[75]: dtype('int64')

df = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype=pd.Int64Dtype())}).set_index("i")
df2 = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="int64")}).set_index("i")
df.equals(df2)
Out[76]: False    # <<< the DataFrames no longer compare equal in 1.4.1
df.dtypes
Out[77]: Series([], dtype: object)
df2.dtypes
Out[78]: Series([], dtype: object)
df.index.dtype
Out[79]: Int64Dtype()
df2.index.dtype
Out[80]: dtype('int64')

Issue Description

DataFrame.equals behaves differently in 1.3.5 and 1.4.1, and the difference looks related to the use of extension vs. non-extension dtypes in the index. Please see the example above: after set_index, the index built from the Int64 column has dtype object in 1.3.5 but Int64 in 1.4.1, and the two frames stop comparing equal.

I found similar issues, but each seems to cover only a narrow part of what looks like a broader change. For example, #46164 is only about equality between EAs.

The changelog has no clear statement about this change, so I am opening an issue.

For reference, my library writes data to storage, and the DataFrame I read back is no longer equal to the one I wrote.
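One way to make such a round-trip comparison independent of the index dtype is to cast both indexes to a common dtype before calling equals. This is only a minimal sketch: the helper name `equals_ignoring_index_dtype`, the frame names, and the extra value column "v" are assumptions for illustration and are not part of pandas or of my library.

```python
import pandas as pd


def equals_ignoring_index_dtype(
    left: pd.DataFrame, right: pd.DataFrame, dtype: str = "int64"
) -> bool:
    """Compare two frames after casting both indexes to one dtype.

    Hypothetical helper, not part of pandas. The cast raises if an
    index holds missing values that `dtype` cannot represent.
    """
    left_norm = left.copy()
    right_norm = right.copy()
    left_norm.index = left_norm.index.astype(dtype)
    right_norm.index = right_norm.index.astype(dtype)
    return left_norm.equals(right_norm)


# Frame as written: the index comes from a nullable Int64 column.
written = pd.DataFrame(
    {"i": pd.Series([1, 2, 2], dtype=pd.Int64Dtype()), "v": [10, 20, 30]}
).set_index("i")

# Frame as read back: the index comes from a NumPy int64 column.
read_back = pd.DataFrame(
    {"i": pd.Series([1, 2, 2], dtype="int64"), "v": [10, 20, 30]}
).set_index("i")

same_after_cast = equals_ignoring_index_dtype(written, read_back)
print(same_after_cast)
```

The cast hides a real dtype difference, so this is only appropriate when the index dtype is not something the comparison is supposed to verify.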

Expected Behavior

I expect equality to not change unless it's stated in the changelog.

Installed Versions

1.3.5

commit : 66e3805
python : 3.8.8.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_GB.UTF-8
pandas : 1.3.5
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.0
IPython : 7.32.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fsspec : 2022.02.0
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

1.4.1

commit : 06d2301
python : 3.8.8.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_GB.UTF-8
pandas : 1.4.1
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.0
IPython : 7.32.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyreadstat : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@mroeschke (Member)

Unfortunately, the related whatsnew note was not general enough. The applicable note was:

Bug in FloatingArray.equals() failing to consider two arrays equal if they contain np.nan values (GH44382)

I think the DataFrame.equals docstring could be improved by changing

Corresponding columns must be of the same dtype.

to

Corresponding columns and index must be of the same dtype.

and by adding an example.
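As a rough idea of what such an example could show (a sketch based on the report above, not the wording of the actual docstring; the variable names are illustrative):

```python
import pandas as pd

nullable = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="Int64")})
numpy_backed = pd.DataFrame({"i": pd.Series([1, 2, 2], dtype="int64")})

# Same values, but the column dtypes differ (Int64 vs int64).
# The report above shows False for this on both 1.3.5 and 1.4.1.
columns_equal = nullable.equals(numpy_backed)
print(columns_equal)

# Casting the column to a common dtype removes the difference.
columns_equal_after_cast = nullable.astype("int64").equals(numpy_backed)
print(columns_equal_after_cast)

# The same data with "i" moved into the index. Per the report above,
# this comparison gave True on 1.3.5 and False on 1.4.1, so no single
# expected value is stated here.
index_equal = nullable.set_index("i").equals(numpy_backed.set_index("i"))
print(index_equal)
```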

mroeschke added the Docs and NA - MaskedArrays labels and removed the Bug and Needs Triage labels on Jul 6, 2022
mroeschke changed the title from "BUG: Df equality changes when indexing using extension types" to "DOC: Improve DataFrame.equals docstring comparing index & column with extension dtypes" on Jul 6, 2022
@kukushking (Author)

Thanks @mroeschke
