Fix compilation under cython 3. #41530

rainwoodman · 2021-05-17T22:56:43Z

Internal clean up to improve Cython 3 compatibility.

pandas/_libs/util.pxd

jbrockmendel · 2021-05-18T01:30:45Z

pandas/_libs/reduction.pyx

@@ -23,6 +23,15 @@ from pandas._libs.util cimport (

 from pandas._libs.lib import is_scalar

+# Accessing the data member of ndarray is deprecated, but we depend on it.


i guess this adapts to the letter of the deprecation but not the spirit?

WillAyd · 2021-05-18T15:00:10Z

There is more background to this in #34014 where the discussion seems to ideally want to get us out of this modification altogether. Are there any alternatives to this you can think of?

rainwoodman · 2021-05-18T17:00:01Z

The assumption here is Cython 3.0 support lands before numpy's removal of data member access. This is very likely the case.

To fix the dependency on the deprecated data member access (which becomes orthogonal to Cython 3.0 support after this PR, so perhaps we should start a new issue), either approach described in the bug is viable.

Along pathway 1, a quick idea that may work is to rewrite reduction.pyx using Cython memoryview objects. On the Python side (if there are any users of these delegate ndarray objects), we can add an accessor to create ndarrays on the fly. The Cython side can switch to operating on a memoryview, or on some form of customized surrogate that allows resetting the pointer. This way we can do the port without needing to modify or understand too much of the magic inside reduction.pyx.
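A rough sketch of the surrogate idea, written in plain Python with NumPy rather than Cython. WindowSurrogate, move, and as_ndarray are hypothetical names invented for illustration; nothing here is taken from reduction.pyx, and a real port would presumably use typed memoryviews on the Cython side.

```python
import numpy as np


class WindowSurrogate:
    """Hypothetical stand-in for a resettable window over a fixed buffer."""

    def __init__(self, values):
        # One memoryview over the whole buffer; it is created only once.
        self._view = memoryview(values)
        self.start = 0
        self.end = 0

    def move(self, start, end):
        # "Resetting the pointer" amounts to updating two integers.
        self.start, self.end = start, end

    def as_ndarray(self):
        # Accessor that builds an ndarray on the fly. np.asarray wraps the
        # sliced memoryview through the buffer protocol without copying.
        return np.asarray(self._view[self.start:self.end])


values = np.arange(10, dtype=np.int64)
window = WindowSurrogate(values)
window.move(2, 5)
chunk = window.as_ndarray()
print(chunk.tolist())
```

The ndarray handed to Python code is a fresh object each time, but it still points at the original buffer, so no ndarray struct members need to be mutated.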

jbrockmendel · 2021-05-18T17:34:40Z

> The assumption here is Cython 3.0 support lands before numpy's removal of data member access. This is very likely the case.

is there reason for optimism that cython3 is coming anytime soon?

> Along pathway 1, a quick idea that may work is to rewrite reduction.pyx using cython memoryview objects. The python side (if there is any user to these delegate ndarray objects), we can add an accessor to create ndarrays on the fly

Option 2 here is much simpler: just create new ndarrays by slicing; the perf penalty isn't all that bad.
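A minimal sketch of what "create new ndarrays by slicing" could look like. The starts and ends lists are made-up group boundaries, and the sum stands in for whatever function is applied; this is an assumption-laden illustration, not code from the PR.

```python
import numpy as np

values = np.arange(12, dtype=np.float64)
starts = [0, 4, 9]   # hypothetical group boundaries
ends = [4, 9, 12]

results = []
for start, end in zip(starts, ends):
    # Basic slicing allocates a new ndarray object but does not copy the
    # underlying data; the slice is a view onto `values`.
    chunk = values[start:end]
    assert chunk.base is values
    results.append(chunk.sum())

print(results)
```

Each iteration pays for one small ndarray object, but never touches the deprecated data member.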

> The cython side can swap to operating on memoryview,

it isn't clear to me that this is feasible. we're dealing with User Defined Functions

rainwoodman · 2021-05-18T18:15:35Z

> is there reason for optimism that cython3 is coming anytime soon?

I am not a cython maintainer, but it certainly would add to the evidence 3.0 is ready to be out of alpha if it can compile most downstream packages cleanly.

> it isn't clear to me that this is feasible. we're dealing with User Defined Functions

Sorry, I am not familiar with the internals of pandas. Is there a Cython/C API for User Defined Functions, or must they go through Python?

pep8speaks · 2021-05-18T18:24:44Z

Hello @rainwoodman! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-19 17:27:36 UTC

rainwoodman · 2021-05-18T18:38:20Z

Not sure if the workflow is added correctly.

Current list of failures on my local machine:

FAILED pandas/tests/io/test_pickle.py::test_pickles[/home/feyu/source/pandas/pandas/tests/io/data/legacy_pickle/1.1.0/1.1.0_x86_64_darwin_3.8.5.pickle]
FAILED pandas/tests/plotting/frame/test_frame_color.py::TestDataFrameColor::test_invalid_colormap - AssertionError:...
ERROR pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_s3_roundtrip_explicit_fs
ERROR pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_s3_roundtrip
ERROR pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_s3_roundtrip
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext0]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext1]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext2]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext3]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext4]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext5]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext6]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_url[engine_and_read_ext7]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext0]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext1]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext2]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext3]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext4]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext5]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext6]
ERROR pandas/tests/io/excel/test_readers.py::TestReaders::test_read_from_s3_object[engine_and_read_ext7]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[None]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[gzip]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[bz2]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[zip]
ERROR pandas/tests/io/json/test_compression.py::test_with_s3_url[xz]
ERROR pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_read_s3_jsonl
ERROR pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_s3
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3n_bucket
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3a_bucket
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_nrows
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_chunked
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_chunked_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_infer_s3_compression
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_parse_public_s3_bucket_nrows_python
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_s3_fails
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_csv_handles_boto_s3_object
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_csv_chunked_download
ERROR pandas/tests/io/parser/test_network.py::TestS3::test_read_s3_with_hash_in_key

None of them look relevant to the cython-3 compiler change.

jbrockmendel · 2021-05-18T20:19:49Z

> I was not familiar with the internals of pandas -- Is there a Cython/C-API for User Defined Functions, or they must go through python?

There is no such API. The f passed to SeriesGrouper, SeriesBinGrouper, and apply_frame_axis0 is a UDF that we pass either Series or DataFrame objects to. The whole Slider business exists to do faster slicing, effectively obj.iloc[start:end], without actually creating new Series/DataFrame objects at each iteration.
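For reference, the straightforward behavior that the Slider machinery is described as emulating might be sketched in plain pandas as below. apply_by_slices, starts, and ends are hypothetical names chosen for illustration and do not appear in reduction.pyx. The sketch deliberately creates a new Series on every iteration, which is exactly the cost the Slider is meant to avoid.

```python
import pandas as pd


def apply_by_slices(obj, starts, ends, f):
    """Call the UDF `f` on obj.iloc[start:end] for each (start, end) pair."""
    # Every iloc slice builds a brand-new Series (or DataFrame) object.
    return [f(obj.iloc[start:end]) for start, end in zip(starts, ends)]


ser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list("abcde"))

# The UDF is arbitrary Python that expects a real Series.
out = apply_by_slices(ser, [0, 2], [2, 5], lambda s: s.max() - s.min())
print(out)
```

Because the UDF can call any Series method, it cannot simply be handed a bare memoryview.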

WillAyd · 2021-05-19T13:28:17Z

FWIW I am slightly -1 on this as is. I would prefer we figure out a way to fix this now rather than waiting for it to become an issue with a future numpy release.

rainwoodman · 2021-05-19T17:32:26Z

Thanks @jbrockmendel for the explanation.

@WillAyd I have gathered enough confidence to take a closer look at modernizing reduction.pyx, picking up from the last attempt with segfaults at #34014. I'll give it a go.

rainwoodman · 2021-05-24T01:22:12Z

Is there a way to mutate a cached_index's indices without creating a new index, i.e. the equivalent of cached_series._mgr.set_values? (I'll keep looking in indexes/base.py, but if someone already knows...)

If we have that function then we can likely write something along these lines:

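class Slicer:
    def get_buf(self):
        # Copy the current window of the memoryview into a fresh ndarray.
        return np.array(self.values_memory_view[self.start:self.end])

def update_cached_objs(...):
    # set_values on the index manager is the function being asked about above.
    cached_index._mgr.set_values(islicer.get_buf())
    cached_series._mgr.set_values(vslicer.get_buf())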

This way we avoid the cost of creating cached_index and cached_series, which is likely higher than the cost of creating an ndarray from a memoryview.
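One detail that affects the cost of building an ndarray from a memoryview is whether the data is copied. The following sketch, using made-up values, contrasts np.array (copies by default) with np.asarray (wraps the existing buffer); which one the proposal would want is an open question here.

```python
import numpy as np

values = np.arange(6, dtype=np.int64)
view = memoryview(values)

copied = np.array(view[1:4])     # np.array copies the window by default
wrapped = np.asarray(view[1:4])  # np.asarray wraps the buffer, no copy

# Mutate the original buffer to show which result still tracks it.
values[1] = 100

print(copied.tolist())
print(wrapped.tolist())
```

A copy isolates the cached object from later changes to the source buffer, while a wrapped view stays cheap but shares memory with it.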

jbrockmendel · 2021-05-24T01:52:04Z

> Is there a way to mutate a cached_index's indices without creating the index? The equivalence to cached_series._mgr.set_values. (I'll keep looking in indexes/base.py, but if someone already know..)

that's what we're doing when we set _index_data in reduction.pyx

rainwoodman · 2021-05-24T16:15:51Z

Thanks. Now I see the example resetting _index_data in apply_frame_axis0 (via BlockSlider):

        object.__setattr__(self.index, '_index_data', self.idx_slider.buf)      
        self.index._engine.clear_mapping()                                      
        self.index._cache.clear()  # e.g. inferred_freq must go

SeriesGrouper and SeriesBinGrouper are the ones that reuse the ndarray by tweaking the data pointer.

I am trying to understand how _index_data works, but I don't see any consumers of _index_data other than reduction.pyx. Here is what I found:

$ grep -R _index_data pandas/*
pandas/core/indexes/base.py:        # _index_data is a (temporary?) fix to ensure that the direct data
pandas/core/indexes/base.py:        result._index_data = values
pandas/core/indexes/extension.py:        # For groupby perf. See note in indexes/base about _index_data
pandas/core/indexes/extension.py:        result._index_data = values._ndarray
... (reduction.pyx and tests)

So it appears that the setattr line in BlockSlider has no effect on the rest of the Index object's state. Or am I missing some code that uses _index_data?

jbrockmendel · 2021-05-24T16:20:54Z

> So it appears that that line of setattr in BlockSlider has no effect on the other states of the Index object; -- or I am missing some code logic that uses _index_data?

_index_data is basically just an alias for _data at this point; for a while it was used in some Index subclasses but i think those have all been taken out

rainwoodman · 2021-05-24T16:35:00Z

After setattr, _index_data and _data diverge into two objects, and they are no longer aliases.

So to reset the index (index.set_values), I shall look into a "state-consistent" way of resetting the _data attribute, which is the one actually consumed as an internal state of an index.
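The divergence can be illustrated with a toy class. FakeIndex is a hypothetical stand-in that only mimics the two attribute names; it is not the pandas Index and says nothing about how pandas actually consumes either attribute.

```python
class FakeIndex:
    """Hypothetical stand-in with two names bound to the same values."""

    def __init__(self, values):
        self._data = values
        self._index_data = values  # alias: same object as _data


idx = FakeIndex([1, 2, 3])
same_before = idx._index_data is idx._data

# Rebinding one attribute, as BlockSlider does with object.__setattr__,
# leaves the other attribute pointing at the old object.
object.__setattr__(idx, "_index_data", [2, 3])
same_after = idx._index_data is idx._data

print(same_before, same_after, idx._data)
```

Any code that reads _data after the rebinding still sees the old values, which is why a state-consistent reset would have to update _data itself.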

github-actions · 2021-06-24T00:02:58Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jbrockmendel · 2021-07-26T19:08:13Z

@rainwoodman are you still working on this?

mroeschke · 2021-08-17T01:51:34Z

Thanks for the PR, but it appears this PR has gone stale. It may be easier now that apply_frame_axis0 has been recently removed. Closing due to inactivity but please let us know if you're still interested in working on this and we can reopen.

rainwoodman · 2021-08-17T03:10:17Z

Thanks for closing. If by the time I revisit this the issue still persists, I'll file a new PR. ;)

Fix compilation under cython 3.

05d3fe2

jreback reviewed May 17, 2021

pandas/_libs/util.pxd (outdated)

jreback added the Compat label (pandas objects compatibility with NumPy or Python functions) May 17, 2021

revert pandas._libs removal

4f30113

jbrockmendel reviewed May 18, 2021

Workaround cython/cython#4172

88d6839

rainwoodman force-pushed the cython-3 branch from eb61cca to 88d6839 Compare May 18, 2021 18:27

Add github workflow for testing on cython 3(alpha)

3000692

rainwoodman added 2 commits May 19, 2021 10:26

fix format with pre-commit command.

cccdfd7

update pip install syntax.

b9ddb29

github-actions bot added the Stale label Jun 24, 2021

mroeschke closed this Aug 17, 2021
