-
Notifications
You must be signed in to change notification settings - Fork 21
groupby sorting - don't specify, or forbid? #102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks, @MarcoGorelli -- I agree with everything you've said here. I do think that a user-facing dataframe API has much more value than a purely developer-facing API. And while I recognize that Pandas will likely always provide a superset of that API, I also think that it should be at minimum an implementation of the standard, and as flexible an implementation of the standard as possible. So I think this would be the best outcome here:
|
Hey - when I wrote
I meant if pandas were to implement the standard in the separate namespace - it would be too disruptive of a change for the pandas main namespace. I think this is probably the opposite of what you want - in which case, would making |
It would be great if the standard explicitly made no promises or guarantees regarding the output ordering of operations like groupby and didn't include a |
Sure, but if the standard makes no promises about groupby sorting, then there's nothing stopping pandas from keeping its current groupby behaviour. And then that doesn't solve your issue that people come to cuDF expecting to also sort in groupby, which you said you've like to change |
My takeaway from last week is that "The Standard" is completely irrelevant and all anyone really cares about is the pandas behavior, where cudf wants an API change. The place to ask for that is on the pandas tracker, and the way to convince people is to argue it will benefit pandas users. |
From my perspective, the challenge is that there's a parameter with a default value and documented behavior from Pandas as opposed to it being an implementation detail. I.E. the docs for the
This dictates a few different things:
In an ideal world, the API we're trying to define wouldn't have a |
The conversation from last week was a bit all over the place, but my main takeaway was that the API we're trying to build now is targeted at library / applications developers much moreso than end users and that the first dataframe implementation of the standard that library / applications developers will build against will likely be Pandas. This leads to a concern that anywhere that Pandas supports more than the standard, that developers may use those features while thinking they are targeting the standard. They may then be disappointed when they don't get the implementation portability they were looking for. For targeting end users, we threw around a lot of frustrations, ideas, concerns, and more. A lot of that surrounded the existing inertia of Pandas and how we can give Pandas users a happy path towards alternative dataframe implementations that are mainly offering enhanced performance and scalability as compared to Pandas. I agree that if our conversations lead down to "we want to change Pandas" then those conversations should then move to the Pandas issue tracker. I don't think we're there yet and have just really started the "end user" conversations. |
OK, sure. Let's say the standard doesn't have a Pandas would then implement a superset of the API (like Ashwin said) and would keep its How does this help cuDF? |
If it's all going to be in the main namespace then I think ultimately it causes problems and makes the work we're doing here less valuable. I think it would be an unreasonable expectation to have people double checking all of their work to see if they're using parameters that are in Pandas but aren't in the standard. People would likely just continue writing Pandas code as they always have and other dataframe implementations will continue to get pressure to match Pandas API and behaviors. If it's a separate namespace and Pandas doesn't implement the To be clear, this is beyond just cuDF. My personal experience is with cuDF, Dask (both as a producer and consumer of dataframe APIs), Spark, Arrow, and Ibis (again both as a producer and consumer), all of which can be multi-threaded and/or distributed. This often makes following existing behaviors of Pandas challenging and a significant detriment to performance in many situations. |
Sorry, I'm confused, and I feel like I'm hearing contradictory statements. On the call, Ashwin said
And here, you're saying that you want a separate namespace in which groupby doesn't have a sort keyword. Which is it? Do you want the main pandas namespace to change so that the transition to cuDF is smoother, or do you want a separate namespace in which everything can be compatible from the get-go? (I'm not sure how my tone is coming across, so just to be clear: I'm perfectly calm, and don't mean anything I've written in an aggressive/sarcastic way. I'm genuinely confused 😄 , and trying my best to make sense of what we're trying to achieve here) |
I can't speak for Ashwin. If there isn't a separate namespace in Pandas, I could only imagine having some package or something to validate you're following the standard and not using functions / keywords that aren't part of the standard. Without that, it will put undue burden on library / application developers to build against the standard effectively. I completely understand and think it's completely reasonable if the Pandas community did not want to support a separate namespace. In an ideal world purely from the cuDF and other libraries perspectives, Pandas only supports the standard API and all dataframe implementations only support the standard API so users get full portability. In the real world, that isn't in Pandas users' best interests and isn't a feasible thing to do.
Your tone has been perfectly cordial 😄. I appreciate the pushback to make sure that we're actually going to drive impact as opposed to being a perfect example of the xkcd comic (https://xkcd.com/927/). |
I applaud @MarcoGorelli for not being a pessimistic Pete but I have no such compunction. We are getting nowhere and designing an API no one will use. This is a waste of time and we should cut our losses. Put a bow on what we have now and declare this endeavor a rousing success. |
@kkraus14 @jbrockmendel @MarcoGorelli
So, IMHO, not the perfect standard is better than no standard. We should not pick on something small, like sort, as there are more significant issues to agree upon. For Pandas - I think having a standard package -> Pandas namespace -> main namespace path is good enough, as long the transition from the first step to the second is not years. From my team side, I commit to helping build, support, and maintain the Standard API namespace in Pandas. For other libraries, cuDF, Modin, Vaex, Koalas, etc., targeting performance sort is less critical, but trying to be compatible with Pandas can be a problem, so Pandas as a big bear can keep the current semantics, as there are tons of code already written, meanwhile stating that sort is not compliant, but you can use |
FYI @aregm is my boss at Intel. By complete coincidence, I 100% agree with him. [An imperfect] standard is better than no standard. So let's punt on sorting and be done. |
Right, let's try to get back on-topic. First, I need to address
This isn't my decision, and it's not Brock's either - it's a decision for the pandas community. Unless a compelling case with strong use cases it made, this will get voted down. If I am to try to make such a compelling case, then first I need to be sold on the initiative myself, and that's why I'm asking all these questions.
I chose the Right, back to
Maximally explicit answers please, there's already too much confusion and contradictory statements. Thanks 🙏 |
A separate package / namespace is a desire but not an explicit requirement. The idea being to ease developer experience in building against the standard. Assuming Pandas is the first / main dataframe development target, without a separate namespace, folks can easily write code that deviates from the standard and loses the dataframe implementation portability they were aiming / looking for. Alternatively, instead of a separate package / namespace for Pandas, we could imagine having some type of stubs or other validation type of package where someone can use it during their development cycle to ensure they're not deviating from the standard. This stub or validation package could then be a community effort built as part of defining the standard or something like that? Ultimately, I don't think it's reasonable that developers check every function call and every parameter against documentation to ensure they aren't using Pandas specific features. Without some guard rails in place, I think it will be a very poor developer experience trying to build something that follows the standard without using implementation specific functionality.
Assuming that there isn't a separate package / namespace / option / stub / validation, it still defines and documents a standard API and set of behaviors that developers could be pointed to as a way to more or less guarantee portability between Pandas and cuDF, and therefore, CPU and NVIDIA GPU. The common case that folks have talked about is wanting to be able to develop on something like an M1/M2 Macbook which can't have an NVIDIA GPU and then move to a workstation or server with NVIDIA GPU with some level of confidence. |
To add to this: if the desired end goal is to also reach this third step (inclusion in the main pandas namespace), I would recommend that we reconsider our current approach for the standard API (design new methods vs define a subset of pandas, this has come up before ..). Now, I personally don't have a strong opinion on how important this desire of inclusion in the main namespace actually is, and not going the pandas-subset way gives more freedom to design a better developer-oriented API (eg #89). Note on the actual |
Some missing context: Areg wants an implementation of the standard to live within a separate namespace within pandas (short-term). The pandas team thinks this might as well live in a separate package, which wouldn't be their (our) responsibility to maintain. Areg says "well what if I promise to have Intel maintain it?". I suggested a compromise that the pandas team might agree to would be a) implement it as a separate package that b) pandas will document as the suggested standard-api-with-pandas then c) demonstrate that it will be maintained and d) if it gets sufficient adoption then e) we will consider upstreaming it to a separate namespace within pandas. I personally would be OK with this (I think (d) is incredibly unlikely, but if it comes true...) but the pandas team has not agreed to anything. The "-> main namespace" part is pure fantasy on Areg's part. Last time he described it to me the path was something like "the standard is so overwhelmingly popular that pandas is compelled adopt it in the main namespace," and I made a joke about that being as likely as me dating Jennifer Lawrence. |
I would push for there to be documentation as part of the standard that explicitly makes ordered-ness an implementation detail and not guaranteed behavior as part of the standard (at least to start, I think we could always add ordered-ness parameters / guarantees at a later evolution of the standard). |
The validation package is the idea I like the most out of the ones I've heard so far. The way I'm imagining it, you would run it on your codebase and get some errors like:
and so on. Once you satisfy all of its pedantic checks, you should get something which is portable. Is this the kind of thing you had in mind Keith? |
Yup, something like this is exactly what I imagined. Minus the |
This goes back to the original topic of this issue - if the standard doesn't guarantee some behaviour, but pandas happens to have it, then other libraries will still feel the need to copy pandas to meet users' expectations (I believe you said words to this effect). To guarantee portability, I'm suggesting that this should go stronger: If the expectation is that users will read the docs to see what's guaranteed and what's not, then you don't need a validation package, because the docs will already have told them everything. But realistically, people won't read the docs, and that's where the validation package would come into play. But for it to work, it may be necessary to force a little extra explicitness from users *only in the standard / validation package. The top namespaces wouldn't need to change |
I believe the wording I used was that if the standard doesn't guarantee and/or document some behavior then it will cause pressure on downstream libraries to just match Pandas semantics regardless of the maintenance burden or performance implications of doing so. I believe that if we documented the lack of guarantee in the standard that would give something for libraries that can't support certain ordered-ness guarantees something to point to in order to explain and justify their behavior. Especially for distributed implementations, the cost of maintaining ordered-ness can be extremely high and most of the time people aren't dependent on that ordered-ness. |
Ok, clear thanks - if all we need is documentation, then we don't need a validation package, right? All you need is to define what's part of the standard, and then document how each package differs from it |
I think there's two separate things here: APIs and behaviors. For APIs, I think a validation package would make a huge difference for developers trying to build against the standard. Otherwise, they'll have to manually go through their code extremely carefully to make sure they didn't deviate from the standard as far as the functions and parameters they use. For behaviors, I don't think there's any easy ways for us to build tooling to guarantee developers that they're following the behavior guarantees of the standard, where documentation is the best answer that I see that at least gives folks something to guide them. |
If the validation package would make a huge difference, then why not take it one step further and also have the validation package require them to be explicit about whether they expect Documentation could then inform them that |
Because, from a purely logical results perspective, there's no difference between doing One argument that I could see is that exposing the |
It wouldn't require every DataFrame library to support sorting in groupby - the standard could just say that they all need to have By not saying anything about |
I vaguely remember early discussions we had regarding wanting to avoid "optional" functions and parameters because many of the dataframe library maintainers on the calls felt that "optional" pieces would quickly become effectively required pieces based on other libraries supporting the "optional" pieces.
Even if we have the |
Just for my understanding, what's the difference between pandas implementing a superset of the standard and something being optional? |
I think my main issue is that I haven't understood whether the objective is:
I've added this to Thursday's agenda |
I think the way forward here is to keep the API as it is and ask Keith to make a PR describing sortedness-expectations in the GroupBy class-level docstring https://github.com/data-apis/dataframe-api/blob/main/spec/API_specification/dataframe_api/groupby_object.py#L7 |
Plenty of the API's methods aren't a subset of pandas (e.g. get_column_by_name), so I think that makes it pretty clear that it's option 2 that we're going for. That's fine, and many of the messages here make it clear that this would not be a way to influence changes in the pandas main namespace. As long as that expectation is clear, and that cuDF / modin / vaex / etc. would still derive value from the exercise, all is good I'll change my agenda item to focus more on use-cases then - I'd really like it if we could find a strong use-case, try implementing the standard in a separate package (even if in a quick and hacky way for now) and checking that it would allow that use-case to support multiple DataFrame libraries. This would really help with motivation and with determining what to include/exclude |
Some of my thoughts:
|
@kkraus14 would you be up for making a PR documenting the no-sortedness-assumption in the GroupBy docstring? |
How about a minimal reference implementation using pandas, but which lives in a separate package and which is maintained by the community? |
I think it is still useful to explicitly confirm with more people that we indeed want to focus on option 2 from #102 (comment), and not on option 1
Also if it is a subset, I assume there can be documentation about (non) guarantees for certain behaviours, like the whole discussion in this issue? The current proposal for |
+1. I'll note that, from the lessons we learned from the array API work, this strictly minimal reference implementation is necessary. It can live as a standalone package (we currently have it as
That sounds like an important exercise indeed. I'll note that, depending on the level of incompatibility, a content manager or some such thing may still make sense, to switch deviating APIs to "standard-compliant mode". That may be a usability thing - just like switching imports, it's a toggle to enable the standard. It may just be an easier one.
I will note that the willingness of the pandas team to break backwards compatibility seems to have increased over the past couple of years (probably directly correlated to having more developer bandwidth). Even if a standard will never live in pandas' main namespace, it would be a great outcome if they take away a few lessons from the standard and avoid/change semantics that are bad for parallel/GPU/distributed/JIT-compiled execution. |
All, I apologize for not responding earlier to this issue (have been out sick for the last few days - still am). I am happy that there is alignment on (2) as the end-goal for this standard. I feel that will be a valuable outcome for cuDF. |
Awesome, thanks. For posterity, I'll post one phrase from the last call, which I'd like to this make clear the initial goals:
|
closed by #104 |
From the last call, the cuDF folks stated that a goal of theirs is to let users prototype with pandas on CPU, and then switch to GPU with cuDF and have their code "just work".
One issue they currently face is
groupby
: pandas sorts by default, so that's what their users expectEven if pandas were to adopt the standard in its main namespace, this would not accomplish their goal. This is because whilst the standard doesn't specify whether
groupby
should sort, it doesn't forbid it either - thus, pandas could continue sorting ingroupby
by default whilst respecting the standard, but thencuDF
would be no better off.cuDF also said that if pandas were to implement the standard in a separate namespace, then users wouldn't necessarily want to look up the standard and would expect things to just work as they're used to (i.e. pandas main namespace).
The only way I can think of of accomplishing cuDF's goal, without changes in cuDF, would be:
groupby
I don't mean to be a "Pessimistic Pete", but the second one seems unlikely to land. Some deviations from pandas in cuDF may be warranted.
My suggestion is:
groupby
, thus forcing users to specify a value forsort
. Users would be required to type an extra 2 words, but they'd probably be better off because of itgroupby
. It's been mentioned that developers might only test their code using pandas and then expect it to work with other DataFrame libraries, so this would reduce the chances of surprises. And so long as it's in a separate namespace, there's no risk of breaking millions of users' codeEDIT: on the last point - maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default
The text was updated successfully, but these errors were encountered: