BUG: pandas is using absolute from pandas._libs cimport and is shipping broken private .pxd files #51875

Open
3 tasks done
rgommers opened this issue Mar 9, 2023 · 0 comments
Labels
Bug Internals Related to non-user accessible pandas implementation

rgommers commented Mar 9, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Put this in `tmp.pyx`:

# In Cython code - any use of `_libs.khash` will trigger this
from pandas._libs.khash cimport kh_int64_t

Then run cython tmp.pyx. That will result in:

Error compiling Cython file:
------------------------------------------------------------
...
    bint kh_exist_strbox(kh_strbox_t*, khiter_t) nogil

    khuint_t kh_needed_n_buckets(khuint_t element_n) nogil


include "khash_for_primitive_helper.pxi"
^
------------------------------------------------------------

/home/rgommers/mambaforge/envs/pandas-dev/lib/python3.8/site-packages/pandas/_libs/khash.pxd:129:0: 'khash_for_primitive_helper.pxi' not found

Issue Description

I found this when following up on #49115 (comment):

Cython.Compiler.Errors.InternalError: Internal compiler error: 'khash_for_primitive_helper.pxi' not found

There are a couple of related issues that interact here:

  1. pandas ships many files in its wheels that should not be there, in particular the .pxd and .pyx files in pandas/_libs.
  2. Use of absolute cimports, which should probably be relative.
  3. Use of include "<name>.pxi" in .pxd files. This should be replaced by shared declarations in a common .pxd file (see the warning in http://docs.cython.org/en/latest/src/userguide/language_basics.html#the-include-statement-and-include-files).

For (1), if you download any pandas 1.5.3 wheel, you'll see in pandas/_libs:

khash.pxd
khash_for_primitive_helper.pxi.in

And, notably, khash.pxd contains include "khash_for_primitive_helper.pxi", and that file is not present (only the .pxi.in template is). So the wheel ships a broken private .pxd file, which is then picked up during the build in gh-49115 because of the absolute from pandas._libs.khash cimport ... statements inside pandas itself.
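To check which Cython sources a given wheel ships, something like the following may help. It is a minimal sketch that relies only on the standard-library zipfile module; the function name `cython_sources_in_wheel`, the suffix list, and the default `pandas/_libs/` prefix are choices made for illustration, not anything pandas provides.

```python
import zipfile

# Suffixes treated as "Cython source" for this check (an assumption, adjust as needed).
CYTHON_SUFFIXES = (".pxd", ".pyx", ".pxi", ".pxi.in")


def cython_sources_in_wheel(wheel, prefix="pandas/_libs/"):
    """Return the sorted archive members under `prefix` that look like Cython sources.

    `wheel` may be a path to a .whl file or an open binary file object,
    since a wheel is an ordinary zip archive.
    """
    with zipfile.ZipFile(wheel) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith(prefix) and name.endswith(CYTHON_SUFFIXES)
        )
```

Running it on a downloaded wheel and looking for a .pxd file whose included .pxi has no counterpart in the listing would show the mismatch described above.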

That particular issue probably shows up in the Meson build but not in the setup.py-based build, because the latter generates the .pxi file in-place rather than in the build directory. However, as my reproducer above shows, this is a bit of a house of cards, because the absolute from pandas._libs cimports are actually broken.
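A rough way to detect this kind of dangling include in an installed package or source tree is sketched below. It assumes the include target is looked up only next to the .pxd file itself, which is a simplification (Cython can also search additional include directories), and the helper name `missing_includes` is hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as: include "khash_for_primitive_helper.pxi"
INCLUDE_RE = re.compile(r'^\s*include\s+["\']([^"\']+)["\']', re.MULTILINE)


def missing_includes(directory):
    """Return (pxd_name, include_target) pairs whose target is absent.

    Simplifying assumption: the target is resolved relative to the
    directory containing the .pxd file, and nowhere else.
    """
    missing = []
    for pxd in sorted(Path(directory).rglob("*.pxd")):
        text = pxd.read_text(encoding="utf-8", errors="replace")
        for target in INCLUDE_RE.findall(text):
            if not (pxd.parent / target).exists():
                missing.append((pxd.name, target))
    return missing
```

Pointed at a site-packages/pandas/_libs directory from a 1.5.3 wheel, this would be expected to flag khash.pxd, given the file listing above.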

Expected Behavior

The expected behavior is that the .pxd files aren't shipped, so anyone trying to access private .pxd files gets a clear error. This will be fixed automatically when the Meson build is merged. However, that still leaves potential issues in any environments that already have pandas installed.

My suggestion is to:

  • Use relative cimports for accessing anything within pandas (needs testing, because Cython's cimport mechanism is very fragile all around).
  • Get rid of the .pxi.in and replace it with the recommended .pxd method.
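As a starting point for the first suggestion, the absolute cimports could be located with a small scan like the one below. This is only a sketch: the regular expression covers the from pandas.<module> cimport ... form used in the reproducer and would not catch other spellings (for example a plain cimport pandas._libs.khash), and the function name `absolute_cimports` is made up for illustration.

```python
import re

# Matches e.g. "from pandas._libs.khash cimport kh_int64_t"
ABS_CIMPORT_RE = re.compile(r"^\s*from\s+(pandas(?:\.\w+)*)\s+cimport\b")


def absolute_cimports(source):
    """Return (line_number, module) for each absolute pandas cimport in `source`.

    `source` is the text of a .pyx or .pxd file; line numbers are 1-based.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ABS_CIMPORT_RE.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

Each hit is a candidate for rewriting as a relative cimport, subject to the testing caveat in the first bullet.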

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.1-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Sun, 26 Feb 2023 03:39:23 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.1
hypothesis : 6.68.2
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : 1.3.6
brotli :
fastparquet : 2023.2.0
fsspec : 2023.1.0
gcsfs : 2023.1.0
matplotlib : 3.6.3
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.1.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.4
tables : 3.7.0
tabulate : 0.9.0
xarray : 2023.1.0
xlrd : 2.0.1
xlwt : None
zstandard : 0.19.0
tzdata : None

@rgommers rgommers added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 9, 2023
@rgommers rgommers mentioned this issue Mar 10, 2023
5 tasks
@lithomas1 lithomas1 added Internals Related to non-user accessible pandas implementation and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 13, 2023