Skip to content

Fix compilation under cython 3. #41530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

rainwoodman
Copy link

Internal clean up to improve Cython 3 compatibility.

See #34213.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label May 17, 2021
@@ -23,6 +23,15 @@ from pandas._libs.util cimport (

from pandas._libs.lib import is_scalar

# Accessing the data member of ndarray is deprecated, but we depend on it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this adapts to the letter of the deprecation but not the spirit?

@WillAyd
Copy link
Member

WillAyd commented May 18, 2021

There is more background to this in #34014 where the discussion seems to ideally want to get us out of this modification altogether. Are there any alternatives to this you can think of?

@rainwoodman
Copy link
Author

The assumption here is Cython 3.0 support lands before numpy's removal of data member access. This is very likely the case.

To fix the dependency on deprecated of data member access (which becomes orthogonal to cython 3.0 support after this PR, so perhaps we shall start a new issue), either approaches described the bug is viable.

Along pathway 1, a quick idea that may work is to rewrite reduction.pyx using cython memoryview objects. The python side (if there is any user to these delegate ndarray objects), we can add an accessor to create ndarrays on the fly. The cython side can swap to operating on memoryview, or some form of customized surrogate that allows resetting the pointer. This way we can do the port without needing to modify / understand too much of the magic inside reduction.pyx.

@jbrockmendel
Copy link
Member

The assumption here is Cython 3.0 support lands before numpy's removal of data member access. This is very likely the case.

is there reason for optimism that cython3 is coming anytime soon?

Along pathway 1, a quick idea that may work is to rewrite reduction.pyx using cython memoryview objects. The python side (if there is any user to these delegate ndarray objects), we can add an accessor to create ndarrays on the fly

Option 2 here is much simpler: just create new ndarrays by slicing; the perf penalty isn't all that bad.

The cython side can swap to operating on memoryview,

it isn't clear to me that this is feasible. we're dealing with User Defined Functions

@rainwoodman
Copy link
Author

rainwoodman commented May 18, 2021 via email

@pep8speaks
Copy link

pep8speaks commented May 18, 2021

Hello @rainwoodman! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-19 17:27:36 UTC

@rainwoodman
Copy link
Author

Not sure if the workflow is added correctly.

Current list of failures on my local machine:

FAILED pandas/tests/io/test_pickle.py::test_pickles[/home/feyu/source/pandas/pandas/tests/io/data/legacy_pickle/1.1.0/1.1.0_x86_64_darwin_3.8.5.pickle]
FAILED pandas/tests/plotting/frame/test_frame_color.py::TestDataFrameColor::test_invalid_colormap - AssertionError:...
ERROR pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_s3_roundtrip_explicit_fs
ERROR pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_s3_roundtrip
ERROR pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_s3_roundtrip
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext0]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext1]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext2]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext3]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext4]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext5]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext6]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext7]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext0]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext1]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext2]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext3]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext4]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext5]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext6]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext7]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[None]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[gzip]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[bz2]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[zip]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[xz]
ERROR pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_read_s3_jsonl
ERROR pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_s3
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3n_bucket
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3a_bucket
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_nrows
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_chunked
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_chunked_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_infer_s3_compression
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_nrows_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_s3_fails
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_csv_handles_boto_s3_object
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_csv_chunked_download
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_s3_with_hash_in_key

None of them look relevant to the cython-3 compiler change.

@jbrockmendel
Copy link
Member

I was not familiar with the internals of pandas -- Is there a Cython/C-API for User Defined Functions, or they must go through python?

There is no such API. The f passed to SeriesGrouper, SeriesBinGrouper, and apply_frame_axis0 is a UDF that we pass either Series or DataFrame objects to. The whole Slider business is to do faster slicing effectively obj.iloc[start:end], but not actually creating new Series/DataFrame objects at each iteration

@WillAyd
Copy link
Member

WillAyd commented May 19, 2021

FWIW I am slightly -1 on this as is. I would prefer we figure out a way to fix now rather than waiting for this to become an issue with a future numpy release

@rainwoodman
Copy link
Author

Thanks @jbrockmendel for the explanation.

@WillAyd I have gather enough confidence to take a closer look over modernizing reduction.pyx, picking up from the last attempt with segfaults at #34014. I'll give it a go.

@rainwoodman
Copy link
Author

Is there a way to mutate a cached_index's indices without creating the index? The equivalence to cached_series._mgr.set_values. (I'll keep looking in indexes/base.py, but if someone already know..)

If we have that function then we can likely write something along these lines:

class Slicer:
   def get_buf(self):
      return np.array(self.values_memory_view[self.start: self.start+end])

def update_cached_objs(...):
   cached_index._mgr.set_values(islicer.get_buf()))
   cached_series._mgr.set_values(vslicer.get_buf()))

This way we avoid the cost of creating cached_index and cached_series, which are likely more expensive than creating a ndarray from a memoryview.

@jbrockmendel
Copy link
Member

Is there a way to mutate a cached_index's indices without creating the index? The equivalence to cached_series._mgr.set_values. (I'll keep looking in indexes/base.py, but if someone already know..)

that's what we're doing when we set _index_data in reduction.pyx

@rainwoodman
Copy link
Author

rainwoodman commented May 24, 2021

Thanks. Now I see the example resettting _index_data in apply_frame_axis_0 (via BlockSlider):

        object.__setattr__(self.index, '_index_data', self.idx_slider.buf)      
        self.index._engine.clear_mapping()                                      
        self.index._cache.clear()  # e.g. inferred_freq must go       

SeriesGrouper and SeriesBinGrouper are the ones that reuse the ndarray by tweaking the data pointer.

I am trying to understand how _index_data works, but I don't see any consumers of _index_data other than reduction.pyx. Here is what I found:

$ grep -R _index_data pandas/*
pandas/core/indexes/base.py:        # _index_data is a (temporary?) fix to ensure that the direct data
pandas/core/indexes/base.py:        result._index_data = values
pandas/core/indexes/extension.py:        # For groupby perf. See note in indexes/base about _index_data
pandas/core/indexes/extension.py:        result._index_data = values._ndarray
... (reductions.pyx and tests)

So it appears that that line of setattr in BlockSlider has no effect on the other states of the Index object; -- or I am missing some code logic that uses _index_data?

@jbrockmendel
Copy link
Member

So it appears that that line of setattr in BlockSlider has no effect on the other states of the Index object; -- or I am missing some code logic that uses _index_data?

_index_data is basically just an alias for _data at this point; for a while it was used in some Index subclasses but i think those have all been taken out

@rainwoodman
Copy link
Author

After setattr, _index_data and _data diverges to two objects, and they are no longer aliases.

So to reset the index (index.set_values), I shall look into a "state-consistent" way of resetting the _data attribute, which is the one actually consumed as an internal state of an index.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 24, 2021
@jbrockmendel
Copy link
Member

@rainwoodman are you still working on this?

@mroeschke
Copy link
Member

Thanks for the PR, but it appears this PR has gone stale. It may be easier now that apply_frame_axis0 has been recently removed. Closing due to inactivity but please let us know if you're still interested in working on this and we can reopen.

@mroeschke mroeschke closed this Aug 17, 2021
@rainwoodman
Copy link
Author

rainwoodman commented Aug 17, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants