-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Update setting data pointers for Cython 3 #34014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Cython alpha was released today: https://pypi.org/project/Cython/#history Locally I get below error when building the extensions
cc @scoder |
My guess is this is the result of cython/cython#3571 |
So IIUC assigning the data pointer has been deprecated since 2013. Quick fix might be to not use Cython alpha in CI, but in any case we probably need to fix this. Not sure there is a true replacement in the more modern API that allows for assigning the data pointer at least according to the following mailing list: https://mail.python.org/pipermail/numpy-discussion/2013-November/068172.html ref cython/cython#2498 and ping @rgommers as well |
That looks like the one evil hack that smells a bit when you implement it but otherwise works perfectly, and that drops on your feet as soon as you've forgotten about it. You already CC-ed NumPy, they are the right people to ask here. CC-ing @mattip as well, who came up with the property wrapper for these fields for Cython. I'm happy that you didn't ask for an upstream revert of the change. :) |
pandas/pandas/_libs/reduction.pyx Lines 102 to 106 in 18a7e97
is also very fishy. It also seems to be at the root of a very strange behavior when group by apply is used in statsmodels/statsmodels#6603 . In this example, the wrong data was ending up in the OLS regression because it looks like it has the wrong buffer somehow. This seems to be not completely safe in the case of a deferred operation. |
Maybe rewrite that to use |
The |
Of course you could just write |
ok I think we need to fix the cython version again, @alimcmaster1 would you mind doing a PR for this. We should certainly address this older code paths. IIRC @jbrockmendel was almost able to remove this entire module, but has some perf issues. |
If we don't fix these for 1.1.0 (or 1.0.4), we might consider putting an upper bound on Cython in our pyproject.toml: https://github.com/pandas-dev/pandas/blob/master/pyproject.toml. I think that's the only situation where this would affect users. People installing pandas from source after Cython 3.0 is released. |
Yah, we need to decide how big a perf penalty we're willing to accept in exchange for losing the maintenance burdens in this module. |
In the case of dummy_buf it is whether the code path is correct for any functions that defer evaluation outside of the apply, which can result in incorrect results. This seems like a bigger issue. |
Do you have a code sample to reproduce that issue? Can see if #34080 fixes it once green |
Return
The first return may vary since the data seems to be from an empty array which has random values. |
Not that simple, but it was how this issue was discovered. |
What's the status here? We have a few places to update (can we compile a list in the original post? I'll start it) that we need to update before we can build pandas with Cython 3.x? Is there anything that has to be done here for pandas 1.1.0, other than perhaps setting an upper bound for Cython in our pyproject.toml? |
I think we need to invest some time to get things to work with Cython 3. Specifically this comment #34014 (comment) I started down that path in #34080 just never saw through |
Re-upping this. Just ripping out libreduction entirely asvs affected are:
Updated Shown asvs removing all of libreduction instead of just Reducer |
Can't you use |
Not immediately. It looks to affect |
AFAICT this code
should be equivalent (ex the creation of a new ndarray object) to
but when I try this I get a bunch of test failures and a segfault that i cant reproduce in isolation. Am I misunderstanding what |
Is this a blocker for 1.1? |
I don’t think it needs to block a release as it’s a development dependency, though obviously one we want to keep current
…Sent from my iPhone
On Jul 6, 2020, at 10:53 AM, jbrockmendel ***@***.***> wrote:
Is this a blocker for 1.1?
—
You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
I think if we don't update this for 1.1, then we'll want to set an upper bound on Cython in our pyproject.toml so that users can compile pandas 1.1 from source after Cython 3.0 is released. |
To advocate for fixing sooner rather than later, in addition to being a technical issue for future Cython, it is also a bug when the function being applied has any deferred operations. |
We know that pandas doesn't work with Cython 3.0 (pandas-dev#34213, pandas-dev#34014) This sets the maximum supported version of Cython in our pyproject.toml to ensure that pandas 1.1.0 can continue to be built from source without Cython pre-installed after Cython 3.0 is released.
This reverts commit 687a0ce.
Here is one example that does not set data pointer in the cdef class Slider:
cdef:
ndarray values, buf
Py_ssize_t stride
char *orig_data
def __init__(self, ndarray values, ndarray buf):
assert values.ndim == 1
assert values.dtype == buf.dtype
if not values.flags.contiguous:
values = values.copy()
self.values = values.copy() # the original values would be modified if not copied
self.buf = buf
self.stride = values.strides[0]
self.orig_data = self.buf.data
self.buf.data = self.values.data
self.buf.strides[0] = self.stride
cdef move(self, int start, int end):
# self.buf.data = self.values.data + self.stride * start
self.buf.shape[0] = end - start # length must be set first
self.buf[:] = self.values[start:end]
cdef reset(self):
self.buf.data = self.orig_data
self.buf.shape[0] = 0 The two pointer setting statements in the Unfortunately I don't know enough about cython and other parts of the code to figure this out. We might need to look at how the Slider's buf is used to find another way for this. |
I spent some more time looking at what this is doing today and came up with the following notes (I apologize if I am saying really obvious things) The
The mutated array is then used to update "cached objects" in their calling class by updating the pandas side block manager details. I suspect that this is the source of the stats model issues mentioned above as the code is aggressively changing things underneath the eventual user-exposed objects. The change that has broken things is than cython now disallows relpacing the guts of a numpy array (which seem fair!). My guess is that the performance gains come from both not memory thrashing and not falling back to the python layer. The cython docs says that when you do I am not super clear how the numpy nbiter interface works, but it looks like it is focused on getting an iterator over single elements, or at least fixed steps through the array, where as for this code we need iteration over variable size windows. It looks like the way to do this with memory views ( https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html#efficient-indexing-with-memoryviews ) but those seem to require knowing what the type is up front. My suspicion is that the right solution here is to do something like what @mattip suggested above and in These classes appear to only be used internally to the reduction module so I do not think there are any back-compatibility with completely re-writing them. attn @scoder for guidance on which of these methods (or one I do not see) is the best path. |
also, I think this should be milestoned to an actual release. |
w/o a dedicated resource to do this it's impossible to actually say it will be in a particular release |
This would certainly be nice. Is there any indication that cython will be releasing 3.0 anytime soon? The mailing list has been quiet; i havent been following the issue tracker recently. I still think #34997 was a reasonable approach, but i never got it working. More eyeballs may be helpful. IIRC one of the hangups was that the non-cython path behaves slightly differently from the cython path when it comes to unwrapping results at each step in the loop. If that were resolved, then we would at least have the option of ripping out the cython version altogether. |
closed #43505 |
Well, "rip it all out" is one way to fix it ! |
Need to update the following locations
bufarr.data
chunk.data
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34888&view=logs&j=33ccdd52-c922-5ef2-8209-78215e36d994&t=046ffbec-62c2-54e3-e88a-7745f4292bb6&l=457
just started failing.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage
if anyone has insights
The text was updated successfully, but these errors were encountered: