Skip to content

groupby sorting - don't specify, or forbid? #102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MarcoGorelli opened this issue Feb 18, 2023 · 43 comments
Closed

groupby sorting - don't specify, or forbid? #102

MarcoGorelli opened this issue Feb 18, 2023 · 43 comments

Comments

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Feb 18, 2023

From the last call, the cuDF folks stated that a goal of theirs is to let users prototype with pandas on CPU, and then switch to GPU with cuDF and have their code "just work".

One issue they currently face is groupby: pandas sorts by default, so that's what their users expect

Even if pandas were to adopt the standard in its main namespace, this would not accomplish their goal. This is because whilst the standard doesn't specify whether groupby should sort, it doesn't forbid it either - thus, pandas could continue sorting in groupby by default whilst respecting the standard, but then cuDF would be no better off.

cuDF also said that if pandas were to implement the standard in a separate namespace, then users wouldn't necessarily want to look up the standard and would expect things to just work as they're used to (i.e. pandas main namespace).

The only way I can think of of accomplishing cuDF's goal, without changes in cuDF, would be:

  1. the Standard forbids sorting in groupby
  2. pandas adopts the standard in its main namespace

I don't mean to be a "Pessimistic Pete", but the second one seems unlikely to land. Some deviations from pandas in cuDF may be warranted.

My suggestion is:

  • for cuDF's end-user groupby issue, they could remove the default from groupby, thus forcing users to specify a value for sort. Users would be required to type an extra 2 words, but they'd probably be better off because of it
  • the Standard stays developer-oriented, rather end-user-oriented
  • to minimise surprise to users of the Standard, that the standard explicitly forbid sorting in groupby. It's been mentioned that developers might only test their code using pandas and then expect it to work with other DataFrame libraries, so this would reduce the chances of surprises. And so long as it's in a separate namespace, there's no risk of breaking millions of users' code

EDIT: on the last point - maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default

@MarcoGorelli
Copy link
Contributor Author

cc @shwina @kkraus14 for visibility (I don't know if you receive notifications from this repo, so tagging you just in case)

@shwina
Copy link
Contributor

shwina commented Feb 22, 2023

Thanks, @MarcoGorelli -- I agree with everything you've said here.

I do think that a user-facing dataframe API has much more value than a purely developer-facing API. And while I recognize that Pandas will likely always provide a superset of that API, I also think that it should be at minimum an implementation of the standard, and as flexible an implementation of the standard as possible.

So I think this would be the best outcome here:

maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default

@MarcoGorelli
Copy link
Contributor Author

Hey - when I wrote

the pandas implementation of it shouldn't sort by default

I meant if pandas were to implement the standard in the separate namespace - it would be too disruptive of a change for the pandas main namespace.

I think this is probably the opposite of what you want - in which case, would making sort a required argument in cudf.groupby be feasible?

@kkraus14
Copy link
Collaborator

  • to minimise surprise to users of the Standard, that the standard explicitly forbid sorting in groupby. It's been mentioned that developers might only test their code using pandas and then expect it to work with other DataFrame libraries, so this would reduce the chances of surprises. And so long as it's in a separate namespace, there's no risk of breaking millions of users' code

EDIT: on the last point - maybe the standard doesn't need to forbid sorting, but the pandas implementation of it shouldn't sort by default

It would be great if the standard explicitly made no promises or guarantees regarding the output ordering of operations like groupby and didn't include a sort parameter in the groupby api. Implementers of the standard could choose to sort or provide ordering semantics in any way that they wish within the details of their implementation, but the standard shouldn't require or guarantee any of that behavior.

@MarcoGorelli
Copy link
Contributor Author

Sure, but if the standard makes no promises about groupby sorting, then there's nothing stopping pandas from keeping its current groupby behaviour.

And then that doesn't solve your issue that people come to cuDF expecting to also sort in groupby, which you said you've like to change

@jbrockmendel
Copy link
Contributor

My takeaway from last week is that "The Standard" is completely irrelevant and all anyone really cares about is the pandas behavior, where cudf wants an API change. The place to ask for that is on the pandas tracker, and the way to convince people is to argue it will benefit pandas users.

@kkraus14
Copy link
Collaborator

And then that doesn't solve your issue that people come to cuDF expecting to also sort in groupby, which you said you've like to change

From my perspective, the challenge is that there's a parameter with a default value and documented behavior from Pandas as opposed to it being an implementation detail. I.E. the docs for the sort parameter say:

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

This dictates a few different things:

  • Default behavior is to sort by the grouping keys with sort=True
  • Pandas guarantees that the input dataframe order is maintained within each group for "payload" values regardless of the sort value
  • The fact that the input dataframe order is maintained for grouping keys when sort=False isn't documented behavior
    • It's unclear to me if this is an implementation detail or if this is guaranteed behavior from Pandas

In an ideal world, the API we're trying to define wouldn't have a sort keyword and would explicitly make no promises regarding the ordering of grouping keys or "payload" values as many dataframe implementations don't / can't support these behaviors.

@kkraus14
Copy link
Collaborator

My takeaway from last week is that "The Standard" is completely irrelevant and all anyone really cares about is the pandas behavior, where cudf wants an API change. The place to ask for that is on the pandas tracker, and the way to convince people is to argue it will benefit pandas users.

The conversation from last week was a bit all over the place, but my main takeaway was that the API we're trying to build now is targeted at library / applications developers much moreso than end users and that the first dataframe implementation of the standard that library / applications developers will build against will likely be Pandas. This leads to a concern that anywhere that Pandas supports more than the standard, that developers may use those features while thinking they are targeting the standard. They may then be disappointed when they don't get the implementation portability they were looking for.

For targeting end users, we threw around a lot of frustrations, ideas, concerns, and more. A lot of that surrounded the existing inertia of Pandas and how we can give Pandas users a happy path towards alternative dataframe implementations that are mainly offering enhanced performance and scalability as compared to Pandas. I agree that if our conversations lead down to "we want to change Pandas" then those conversations should then move to the Pandas issue tracker. I don't think we're there yet and have just really started the "end user" conversations.

@MarcoGorelli
Copy link
Contributor Author

In an ideal world, the API we're trying to define wouldn't have a sort keyword and would explicitly make no promises regarding the ordering of grouping keys or "payload" values as many dataframe implementations don't / can't support these behaviors.

OK, sure. Let's say the standard doesn't have a sort keyword.

Pandas would then implement a superset of the API (like Ashwin said) and would keep its sort keyword with its current behaviour.

How does this help cuDF?

@kkraus14
Copy link
Collaborator

OK, sure. Let's say the standard doesn't have a sort keyword.

Pandas would then implement a superset of the API (like Ashwin said) and would keep its sort keyword with its current behaviour.

How does this help cuDF?

If it's all going to be in the main namespace then I think ultimately it causes problems and makes the work we're doing here less valuable. I think it would be an unreasonable expectation to have people double checking all of their work to see if they're using parameters that are in Pandas but aren't in the standard. People would likely just continue writing Pandas code as they always have and other dataframe implementations will continue to get pressure to match Pandas API and behaviors.

If it's a separate namespace and Pandas doesn't implement the sort keyword in that namespace then I think that there would be enough justification to get library / application developers interested in having dataframe portability to not be reliant on the implementation details of Pandas with regards to its ordering behaviors.

To be clear, this is beyond just cuDF. My personal experience is with cuDF, Dask (both as a producer and consumer of dataframe APIs), Spark, Arrow, and Ibis (again both as a producer and consumer), all of which can be multi-threaded and/or distributed. This often makes following existing behaviors of Pandas challenging and a significant detriment to performance in many situations.

@MarcoGorelli
Copy link
Contributor Author

Sorry, I'm confused, and I feel like I'm hearing contradictory statements.

On the call, Ashwin said

You're right in that maintaining two namespaces is not tenable for the long run. The real canary is pandas. What pandas does, everyone will generally follow. Ideally, pandas would exclusively support the standard API. My hope is that, even if not 1,2,3 years, that eventually it would happen.

And here, you're saying that you want a separate namespace in which groupby doesn't have a sort keyword.

Which is it? Do you want the main pandas namespace to change so that the transition to cuDF is smoother, or do you want a separate namespace in which everything can be compatible from the get-go?

(I'm not sure how my tone is coming across, so just to be clear: I'm perfectly calm, and don't mean anything I've written in an aggressive/sarcastic way. I'm genuinely confused 😄 , and trying my best to make sense of what we're trying to achieve here)

@kkraus14
Copy link
Collaborator

On the call, Ashwin said

You're right in that maintaining two namespaces is not tenable for the long run. The real canary is pandas. What pandas does, everyone will generally follow. Ideally, pandas would exclusively support the standard API. My hope is that, even if not 1,2,3 years, that eventually it would happen.

And here, you're saying that you want a separate namespace in which groupby doesn't have a sort keyword.

Which is it? Do you want the main pandas namespace to change so that the transition to cuDF is smoother, or do you want a separate namespace in which everything can be compatible from the get-go?

I can't speak for Ashwin.

If there isn't a separate namespace in Pandas, I could only imagine having some package or something to validate you're following the standard and not using functions / keywords that aren't part of the standard. Without that, it will put undue burden on library / application developers to build against the standard effectively. I completely understand and think it's completely reasonable if the Pandas community did not want to support a separate namespace.

In an ideal world purely from the cuDF and other libraries perspectives, Pandas only supports the standard API and all dataframe implementations only support the standard API so users get full portability. In the real world, that isn't in Pandas users' best interests and isn't a feasible thing to do.

(I'm not sure how my tone is coming across, so just to be clear: I'm perfectly calm, and don't mean anything I've written in an aggressive/sarcastic way. I'm genuinely confused 😄 , and trying my best to make sense of what we're trying to achieve here)

Your tone has been perfectly cordial 😄. I appreciate the pushback to make sure that we're actually going to drive impact as opposed to being a perfect example of the xkcd comic (https://xkcd.com/927/).

@jbrockmendel
Copy link
Contributor

I applaud @MarcoGorelli for not being a pessimistic Pete but I have no such compunction. We are getting nowhere and designing an API no one will use. This is a waste of time and we should cut our losses. Put a bow on what we have now and declare this endeavor a rousing success.

@aregm
Copy link

aregm commented Feb 24, 2023

@kkraus14 @jbrockmendel @MarcoGorelli
I want to reiterate the original goals when we started this WG:

  • Provide developer-facing APIs to standardize among many dataframe library developers
  • Keep it minimal and simple
  • Agree on the core parts and do not rathole on the disagreements - document them and provide reasonable expectations and exclusions.

So, IMHO, not the perfect standard is better than no standard. We should not pick on something small, like sort, as there are more significant issues to agree upon.

For Pandas - I think having a standard package -> Pandas namespace -> main namespace path is good enough, as long the transition from the first step to the second is not years. From my team side, I commit to helping build, support, and maintain the Standard API namespace in Pandas.

For other libraries, cuDF, Modin, Vaex, Koalas, etc., targeting performance sort is less critical, but trying to be compatible with Pandas can be a problem, so Pandas as a big bear can keep the current semantics, as there are tons of code already written, meanwhile stating that sort is not compliant, but you can use pandas.standard_df.sort for compliant one.
We can't produce an ideal standard that will satisfy everyone - C++ is a good example. It took implementers ~5 years to accept ARM, ~10 years to produce the first standard from ARM, and another ~10 to bring everyone to the same turf. And there are still incompatibilities. Luckily, we are wiser and less fragmented, so I suggest focusing on pushing the standard out.

@jbrockmendel
Copy link
Contributor

FYI @aregm is my boss at Intel. By complete coincidence, I 100% agree with him. [An imperfect] standard is better than no standard. So let's punt on sorting and be done.

@MarcoGorelli
Copy link
Contributor Author

Right, let's try to get back on-topic. First, I need to address

I think having a standard package -> Pandas namespace -> main namespace path is good enough, as long the transition from the first step to the second is not years

This isn't my decision, and it's not Brock's either - it's a decision for the pandas community. Unless a compelling case with strong use cases it made, this will get voted down. If I am to try to make such a compelling case, then first I need to be sold on the initiative myself, and that's why I'm asking all these questions.

We should not pick on something small, like sort

I chose the sort argument in groupby as this one was specifically brought up by cuDF as one of their top 3 desired changes in pandas.


Right, back to groupby-sort. Let's say the standard has groupby without a sort argument. As per one of the messages above, pandas implements a superset of the standard, in which groupby has a sort argument. I then have the following outstanding questions:

  • why would this require a separate package / namespace / option?
  • how would this help cuDF / modin / any other package?

Maximally explicit answers please, there's already too much confusion and contradictory statements. Thanks 🙏

@kkraus14
Copy link
Collaborator

  • why would this require a separate package / namespace / option?

A separate package / namespace is a desire but not an explicit requirement. The idea being to ease developer experience in building against the standard. Assuming Pandas is the first / main dataframe development target, without a separate namespace, folks can easily write code that deviates from the standard and loses the dataframe implementation portability they were aiming / looking for.

Alternatively, instead of a separate package / namespace for Pandas, we could imagine having some type of stubs or other validation type of package where someone can use it during their development cycle to ensure they're not deviating from the standard. This stub or validation package could then be a community effort built as part of defining the standard or something like that?

Ultimately, I don't think it's reasonable that developers check every function call and every parameter against documentation to ensure they aren't using Pandas specific features. Without some guard rails in place, I think it will be a very poor developer experience trying to build something that follows the standard without using implementation specific functionality.

  • how would this help cuDF / modin / any other package?

Assuming that there isn't a separate package / namespace / option / stub / validation, it still defines and documents a standard API and set of behaviors that developers could be pointed to as a way to more or less guarantee portability between Pandas and cuDF, and therefore, CPU and NVIDIA GPU. The common case that folks have talked about is wanting to be able to develop on something like an M1/M2 Macbook which can't have an NVIDIA GPU and then move to a workstation or server with NVIDIA GPU with some level of confidence.

@jorisvandenbossche
Copy link
Member

I think having a standard package -> Pandas namespace -> main namespace path is good enough, as long the transition from the first step to the second is not years

This isn't my decision, and it's not Brock's either - it's a decision for the pandas community. Unless a compelling case with strong use cases it made, this will get voted down.

To add to this: if the desired end goal is to also reach this third step (inclusion in the main pandas namespace), I would recommend that we reconsider our current approach for the standard API (design new methods vs define a subset of pandas, this has come up before ..).
Unless the standard API is a subset of the current pandas namespace (plus some (behavioral) changes that we deem important and have a realistic chance of getting support from the pandas community), I think the chance of inclusion in the main namespace (ever) is quite small.

Now, I personally don't have a strong opinion on how important this desire of inclusion in the main namespace actually is, and not going the pandas-subset way gives more freedom to design a better developer-oriented API (eg #89).


Note on the actual sort discussion: even if pandas (or the standard api namespace in pandas) would no longer sort for groupby (sort=False), pandas still gives a deterministic order, i.e. the order of appearance in your dataframe. So there is still some form of ordered-ness that people might rely on and that can break when moving to cudf, etc.

@jbrockmendel
Copy link
Contributor

I think having a standard package -> Pandas namespace -> main namespace path is good enough, as long the transition from the first step to the second is not years

This isn't my decision, and it's not Brock's either - it's a decision for the pandas community. Unless a compelling case with strong use cases it made, this will get voted down. If I am to try to make such a compelling case, then first I need to be sold on the initiative myself, and that's why I'm asking all these questions.

Some missing context: Areg wants an implementation of the standard to live within a separate namespace within pandas (short-term). The pandas team thinks this might as well live in a separate package, which wouldn't be their (our) responsibility to maintain. Areg says "well what if I promise to have Intel maintain it?". I suggested a compromise that the pandas team might agree to would be a) implement it as a separate package that b) pandas will document as the suggested standard-api-with-pandas then c) demonstrate that it will be maintained and d) if it gets sufficient adoption then e) we will consider upstreaming it to a separate namespace within pandas.

I personally would be OK with this (I think (d) is incredibly unlikely, but if it comes true...) but the pandas team has not agreed to anything.

The "-> main namespace" part is pure fantasy on Areg's part. Last time he described it to me the path was something like "the standard is so overwhelmingly popular that pandas is compelled adopt it in the main namespace," and I made a joke about that being as likely as me dating Jennifer Lawrence.

@kkraus14
Copy link
Collaborator

Note on the actual sort discussion: even if pandas (or the standard api namespace in pandas) would no longer sort for groupby (sort=False), pandas still gives a deterministic order, i.e. the order of appearance in your dataframe. So there is still some form of ordered-ness that people might rely on and that can break when moving to cudf, etc.

I would push for there to be documentation as part of the standard that explicitly makes ordered-ness an implementation detail and not guaranteed behavior as part of the standard (at least to start, I think we could always add ordered-ness parameters / guarantees at a later evolution of the standard).

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Feb 27, 2023

The validation package is the idea I like the most out of the ones I've heard so far. The way I'm imagining it, you would run it on your codebase and get some errors like:

t.py:1:4: found DataFrame.groupby without `sort` argument, which is not guaranteed to work the same way across
         DataFrame libraries. Pass `sort=False` to accept this behaviour, or `sort=True` if you need sorted data
t.py:24:154: found DataFrame.sort_index. Note that the Index is not guaranteed to be available in all DataFrame
        libraries - make a column with your current index and sort by that instead
t.py:185:1: found DataFrame.update, which is not guaranteed to be available in all DataFrame libraries

and so on. Once you satisfy all of its pedantic checks, you should get something which is portable. Is this the kind of thing you had in mind Keith?

@kkraus14
Copy link
Collaborator

Yup, something like this is exactly what I imagined.

Minus the sort argument in DataFrame.groupby, which I'd like to exclude from the standard (at least initially) and document that there's no guarantees about ordered-ness (at least initially).

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Feb 27, 2023

This goes back to the original topic of this issue - if the standard doesn't guarantee some behaviour, but pandas happens to have it, then other libraries will still feel the need to copy pandas to meet users' expectations (I believe you said words to this effect).

To guarantee portability, I'm suggesting that this should go stronger: sort is a required* argument to groupby, so that users have to be explicit about what their expectation is. And the validation package would flag if DataFrame.groupby is called without a sort argument - like this, you would actually get portability

If the expectation is that users will read the docs to see what's guaranteed and what's not, then you don't need a validation package, because the docs will already have told them everything.

But realistically, people won't read the docs, and that's where the validation package would come into play. But for it to work, it may be necessary to force a little extra explicitness from users


*only in the standard / validation package. The top namespaces wouldn't need to change

@kkraus14
Copy link
Collaborator

I believe the wording I used was that if the standard doesn't guarantee and/or document some behavior then it will cause pressure on downstream libraries to just match Pandas semantics regardless of the maintenance burden or performance implications of doing so.

I believe that if we documented the lack of guarantee in the standard that would give something for libraries that can't support certain ordered-ness guarantees something to point to in order to explain and justify their behavior. Especially for distributed implementations, the cost of maintaining ordered-ness can be extremely high and most of the time people aren't dependent on that ordered-ness.

@MarcoGorelli
Copy link
Contributor Author

Ok, clear thanks - if all we need is documentation, then we don't need a validation package, right? All you need is to define what's part of the standard, and then document how each package differs from it

@kkraus14
Copy link
Collaborator

I think there's two separate things here: APIs and behaviors.

For APIs, I think a validation package would make a huge difference for developers trying to build against the standard. Otherwise, they'll have to manually go through their code extremely carefully to make sure they didn't deviate from the standard as far as the functions and parameters they use.

For behaviors, I don't think there's any easy ways for us to build tooling to guarantee developers that they're following the behavior guarantees of the standard, where documentation is the best answer that I see that at least gives folks something to guide them.

@MarcoGorelli
Copy link
Contributor Author

If the validation package would make a huge difference, then why not take it one step further and also have the validation package require them to be explicit about whether they expect groupby to sort or not?

Documentation could then inform them that sort=True is expensive and might not be what they really want or need

@kkraus14
Copy link
Collaborator

Because, from a purely logical results perspective, there's no difference between doing df.groupby(...).sort_values(...) and df.groupby(..., sort=True), except that every DataFrame that implements the standard will now need to support the sort keyword in their groupby implementation as well.

One argument that I could see is that exposing the sort keyword could allow some libraries to possibly take a more efficient codepath for their actual groupby implementation knowing that they want the output sorted, like using a sort based groupby instead of a hash table based groupby, but I'm not sure if we want things like that in scope for the standard.

@MarcoGorelli
Copy link
Contributor Author

It wouldn't require every DataFrame library to support sorting in groupby - the standard could just say that they all need to have sort, and anything other than an explicit sort=False would be flagged by the validation package. DataFrame libraries not wanting to implement sorting in groupby could just raise NotImplementedError for sort=True.

By not saying anything about sort, we risk negatively affecting portability - is it realistic to assume that all users of the standard will read the docs, as opposed to just prototyping with pandas and then declaring themselves to be done when the validation package passes the checks?

@kkraus14
Copy link
Collaborator

It wouldn't require every DataFrame library to support sorting in groupby - the standard could just say that they all need to have sort, and anything other than an explicit sort=False would be flagged by the validation package. DataFrame libraries not wanting to implement sorting in groupby could just raise NotImplementedError for sort=True.

I vaguely remember early discussions we had regarding wanting to avoid "optional" functions and parameters because many of the dataframe library maintainers on the calls felt that "optional" pieces would quickly become effectively required pieces based on other libraries supporting the "optional" pieces.

By not saying anything about sort, we risk negatively affecting portability - is it realistic to assume that all users of the standard will read the docs, as opposed to just prototyping with pandas and then declaring themselves to be done when the validation package passes the checks?

Even if we have the sort parameter, Pandas has behavior of when sort=False that it maintains the input ordering whereas other libraries would presumably not want that restriction. No matter what there's going to be inherent behaviors of Pandas and other libraries that someone could erroneously build against. The best we can do in these situations is to do our best to explicitly document the behavior guarantees and non-guarantees.

@MarcoGorelli
Copy link
Contributor Author

Just for my understanding, what's the difference between pandas implementing a superset of the standard and something being optional?

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Feb 28, 2023

I think my main issue is that I haven't understood whether the objective is:

  • 1:

    • a clearly documented standard, which would be a subset of pandas;
    • a validation package which tells developers of when they're using
      something which is not part of the standard (e.g. .index) and thus
      might not be portable.
  • or 2:

    • a clearly documented standard, which some incompatibilities with pandas;
    • implementations of the standard for each library, possibly in separate
      packages / namespaces (in particular, the pandas one would be separate, at least initially).
    • (there'd be no need for a validation package here because each package, being separate, would implement
      only the standard, and so everything should be portable)

I've added this to Thursday's agenda

@jbrockmendel
Copy link
Contributor

I think the way forward here is to keep the API as it is and ask Keith to make a PR describing sortedness-expectations in the GroupBy class-level docstring https://github.com/data-apis/dataframe-api/blob/main/spec/API_specification/dataframe_api/groupby_object.py#L7

@MarcoGorelli
Copy link
Contributor Author

keep the API as it is

Plenty of the API's methods aren't a subset of pandas (e.g. get_column_by_name), so I think that makes it pretty clear that it's option 2 that we're going for.

That's fine, and many of the messages here make it clear that this would not be a way to influence changes in the pandas main namespace. As long as that expectation is clear, and that cuDF / modin / vaex / etc. would still derive value from the exercise, all is good

I'll change my agenda item to focus more on use-cases then - I'd really like it if we could find a strong use-case, try implementing the standard in a separate package (even if in a quick and hacky way for now) and checking that it would allow that use-case to support multiple DataFrame libraries. This would really help with motivation and with determining what to include/exclude

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 1, 2023

I think my main issue is that I haven't understood whether the objective is:

  • 1:

    • a clearly documented standard, which would be a subset of pandas;
    • a validation package which tells developers of when they're using
      something which is not part of the standard (e.g. .index) and thus
      might not be portable.
  • or 2:

    • a clearly documented standard, which some incompatibilities with pandas;
    • implementations of the standard for each library, possibly in separate
      packages / namespaces (in particular, the pandas one would be separate, at least initially).
    • (there'd be no need for a validation package here because each package, being separate, would implement
      only the standard, and so everything should be portable)

I've added this to Thursday's agenda

Some of my thoughts:

  • I don't have strong opinions about whether the API is a subset of Pandas or some incompatibilities with Pandas as long as we're not locked into Pandas defaults or behaviors.
  • For option 2: I don't think each package would implement only the standard. I.E. you could imagine a new DataFrame library wanting to use the standard as its primary API and then introduce some domain specific functionality on top of it that doesn't make sense to standardize. In order to allow developers to productively build against the standard, we either need a minimal reference implementation that implements only the standard or a validation package. If we go the minimal reference implementation route, Pandas would probably make the most sense in a separate namespace, but it would obviously depend on what the Pandas community wants to do because it would be a non-trivial amount of work. If the Pandas community understandably doesn't want to take on the effort, then I think a validation package would still be needed.

@jbrockmendel
Copy link
Contributor

@kkraus14 would you be up for making a PR documenting the no-sortedness-assumption in the GroupBy docstring?

@MarcoGorelli
Copy link
Contributor Author

If we go the minimal reference implementation route, Pandas would probably make the most sense in a separate namespace, but it would obviously depend on what the Pandas community wants to do because it would be a non-trivial amount of work. If the Pandas community understandably doesn't want to take on the effort, then I think a validation package would still be needed.

How about a minimal reference implementation using pandas, but which lives in a separate package and which is maintained by the community?

kkraus14 added a commit to kkraus14/dataframe-api that referenced this issue Mar 1, 2023
@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 1, 2023

@kkraus14 would you be up for making a PR documenting the no-sortedness-assumption in the GroupBy docstring?

PR opened here: #104

@jorisvandenbossche
Copy link
Member

I'll change my agenda item to focus more on use-cases then

I think it is still useful to explicitly confirm with more people that we indeed want to focus on option 2 from #102 (comment), and not on option 1
(because as a starter, I think you can then be sure this will never live in pandas main namespace, which is something some people still see as an end goal)

I don't have strong opinions about whether the API is a subset of Pandas or some incompatibilities with Pandas as long as we're not locked into Pandas defaults or behaviors.

Also if it is a subset, I assume there can be documentation about (non) guarantees for certain behaviours, like the whole discussion in this issue? The current proposal for groupby() is (I think) basically a subset of the current pandas API, and at the same time we want to document that the sorting behaviour of pandas is not part of the guarantees the standard provides.
So I think both options should give that freedom to deviate from certain aspects of pandas (although option 1 of course limits that freedom a bit more).

@rgommers
Copy link
Member

rgommers commented Mar 2, 2023

In order to allow developers to productively build against the standard, we either need a minimal reference implementation that implements only the standard or a validation package. If we go the minimal reference implementation route, Pandas would probably make the most sense in a separate namespace, but it would obviously depend on what the Pandas community wants to do because it would be a non-trivial amount of work. If the Pandas community understandably doesn't want to take on the effort, then I think a validation package would still be needed.

+1. I'll note that, from the lessons we learned from the array API work, this strictly minimal reference implementation is necessary. It can live as a standalone package (we currently have it as numpy.array_api, but it could be split out - doesn't matter too much). As long as it's available and with a clear purpose of validating that you're only using standardized API and that therefore your code is very likely to be portable.

I think it is still useful to explicitly confirm with more people that we indeed want to focus on option 2 from #102 (comment), and not on option 1

That sounds like an important exercise indeed.

I'll note that, depending on the level of incompatibility, a content manager or some such thing may still make sense, to switch deviating APIs to "standard-compliant mode". That may be a usability thing - just like switching imports, it's a toggle to enable the standard. It may just be an easier one.

(because as a starter, I think you can then be sure this will never live in pandas main namespace, which is something some people still see as an end goal)

I will note that the willingness of the pandas team to break backwards compatibility seems to have increased over the past couple of years (probably directly correlated to having more developer bandwidth). Even if a standard will never live in pandas' main namespace, it would be a great outcome if they take away a few lessons from the standard and avoid/change semantics that are bad for parallel/GPU/distributed/JIT-compiled execution.

@shwina
Copy link
Contributor

shwina commented Mar 2, 2023

All, I apologize for not responding earlier to this issue (have been out sick for the last few days - still am).

I am happy that there is alignment on (2) as the end-goal for this standard. I feel that will be a valuable outcome for cuDF.

rgommers pushed a commit that referenced this issue Mar 2, 2023
#104)

Update docs for groupby based on the discussion in gh-102
@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Mar 4, 2023

Awesome, thanks.

For posterity, I'll post one phrase from the last call, which I'd like to this make clear the initial goals:

If we manage to develop some standard which allows for fully portable code, people would still consider this to be a success even if it doesn't also make it into the main pandas namespace

@MarcoGorelli
Copy link
Contributor Author

closed by #104

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants