BUG: merging DataFrames on a column containing just NaN values triggers address violation in safe_sort
#59421
Comments
Thanks for the report. I am not able to reproduce on main with Linux / Python 3.12. Can you double check on a clean branch?
Sorry, I haven't been able to reproduce on main using gcc with asan flags (previously I was using clang and a slightly odd setup). However, I can reliably reproduce by enabling Cython bounds checking in _libs/algos_take_helper.pxi.in:
I compiled inside a clean virtualenv using:
The repro code above then results in the traceback:
The |
Thanks - can reproduce on my end with the change. I think this might be due to this line: pandas/pandas/core/reshape/merge.py Line 2785 in d0cb205
When all values are NA, we end up trying to take from an empty array because
I agree that
PR #59489 fixes this.
…n-empty codes (#59489)
* Fix out-of-bounds violations in safe_sort for empty arrays. Previously we masked `codes` referring to out-of-bounds elements to 0 and then fixed them after to -1 using `np.putmask`. However, this results in out-of-bounds access in `take_nd` if the array is empty. Instead, set all out-of-bounds indices in `codes` to -1 immediately, as these can be handled by `take_nd`.
* Remove dead code. `use_na_sentinel` cannot be truthy inside an else branch where it is falsy.
* Add test based upon #59421
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Related to #55984
Merging DataFrames on a column containing all NaN values results in a
This was not present in 2.1.4 and I think was introduced in #55984 (which fixed other address violations).
Found using asan; this can also be seen by enabling bounds_checking on take_1d_* in algos_take_helper.pxi.in.
My understanding of the cause is:
- uniques is an empty array in _factorize_keys - https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L2706
- safe_sort assumes that the array being sorted is at least size 1 - https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531
- take_nd assumes the indexer contains no out-of-bounds indices, but an index of 0 is out of bounds in this case.
I am not familiar with pandas internals, but changing the mask on https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531 to
avoids this out-of-bounds access. Is this a suitable fix? If so, I can prepare a pull request.
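To make the failure mode concrete, here is a minimal pure-Python sketch of the two masking strategies, following the description in the commit message for #59489. It is not the pandas implementation: `toy_take`, `remap_old`, and `remap_new` are hypothetical names, plain lists stand in for NumPy arrays, and Python's bounds-checked list indexing stands in for what Cython bounds checking or asan would report.

```python
def toy_take(values, indexer, fill_value=-1):
    """Toy stand-in for a take: -1 is filled, any other index is read from values."""
    return [fill_value if i == -1 else values[i] for i in indexer]


def out_of_bounds_mask(codes, n):
    """True where a code cannot index a sequence of length n."""
    return [c < -n or c >= n for c in codes]


def remap_old(order, codes):
    """Old strategy (as described): mask bad codes to 0, take, then patch to -1."""
    mask = out_of_bounds_mask(codes, len(order))
    safe = [0 if m else c for c, m in zip(codes, mask)]
    # Index 0 is itself out of bounds when `order` is empty.
    taken = toy_take(order, safe)
    return [-1 if m else t for t, m in zip(taken, mask)]


def remap_new(order, codes):
    """New strategy (as described): mask bad codes to -1 before the take."""
    mask = out_of_bounds_mask(codes, len(order))
    safe = [-1 if m else c for c, m in zip(codes, mask)]
    return toy_take(order, safe)


if __name__ == "__main__":
    # Non-empty input: both strategies agree.
    print(remap_old([2, 0, 1], [0, 5, 1]), remap_new([2, 0, 1], [0, 5, 1]))
    # Empty input: only the new strategy avoids reading index 0.
    print(remap_new([], [0, 0]))
    try:
        remap_old([], [0, 0])
    except IndexError as exc:
        print("old strategy failed:", exc)
```

In this sketch the old strategy fails with an IndexError on an empty input, whereas without bounds checking the equivalent read would silently touch memory outside the array.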
Expected Behavior
No array bounds access errors, should produce
Installed Versions
INSTALLED VERSIONS
commit : 642d244
python : 3.11.9
python-bits : 64
OS : Linux
OS-release : 6.6.15-2rodete2-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2rodete2 (2024-03-19)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1287.g642d244606.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.0
Cython : 3.0.11
sphinx : 8.0.2
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.1
html5lib : 1.1
hypothesis : 6.108.8
gcsfs : 2024.6.1
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : 3.9.0
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.2
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.6.1
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.7.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None