DOC: missed behavior explanation of sort=False for groupby #47529

Closed
1 task done
Tracked by #6
easysam opened this issue Jun 28, 2022 · 4 comments · Fixed by #52736

@easysam

easysam commented Jun 28, 2022

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Documentation problem

The docs do not explain what sort=False does for groupby. Do the groups follow the order in which their keys first appear in the original data frame, or can the groups come out in an arbitrary order?

Suggested fix for documentation

When setting sort=False for groupby, one may want the groups to follow the order in which their keys first appear in the original data frame. Can this be guaranteed?

@easysam added the Docs and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jun 28, 2022
@datapythonista removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Jun 28, 2022
@datapythonista
Member

Thanks for reporting this @easysam. I'm not sure whether sort=False means that you'll get the keys in the order in which they are found, or in an arbitrary order. Do you mind running some tests to see the behavior and updating the docstring accordingly? That would be helpful for others with the same question. Thanks!

@rhshadrach
Member

What happens to the keys is also missing from the User Guide; I think it would be good to add a description there:

https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#groupby-sorting

@rhshadrach added this to the Contributions Welcome milestone Jun 29, 2022
@easysam
Author

easysam commented Jun 29, 2022

@datapythonista @rhshadrach
I tried to answer this question by reading the source code.

codes, uniques = algorithms.factorize(

It seems that algorithms.factorize is used to compute the unique keys, and that algorithms.factorize in turn uses the hashtable.
_hashtables = {
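
To make the role of factorization concrete, here is a minimal sketch using the public pd.factorize function. It assumes pd.factorize behaves like the internal algorithms.factorize call quoted above, and the sample keys are made up for illustration.

```python
import pandas as pd

# Hypothetical sample keys, chosen so that appearance order differs from sorted order.
keys = pd.Series(["b", "a", "c", "a", "b"])

# sort=False is the default: codes are assigned in order of first appearance.
codes, uniques = pd.factorize(keys)
print(codes.tolist())    # [0, 1, 2, 1, 0]
print(uniques.tolist())  # ['b', 'a', 'c']

# sort=True orders the uniques and renumbers the codes to match.
sorted_codes, sorted_uniques = pd.factorize(keys, sort=True)
print(sorted_codes.tolist())    # [1, 0, 2, 0, 1]
print(sorted_uniques.tolist())  # ['a', 'b', 'c']
```

The codes say which unique value each row maps to, so the order of the uniques is what would determine the order of the groups.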

However, I ran into several ".pxi.in" files in the hashtable source code, for example: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/hashtable_class_helper.pxi.in
How are .pxi.in files used to generate .pxi files? Are there any tutorials or docs?

I also posted this question on Stack Overflow, hoping it helps others.
https://stackoverflow.com/questions/72798626/what-is-a-pxi-in-file-and-how-to-use-it

@rhshadrach
Member

rhshadrach commented Jun 29, 2022

The pxi.in files are built here:

pandas/setup.py

Lines 77 to 97 in f4ca4d3

class build_ext(_build_ext):
    @classmethod
    def render_templates(cls, pxifiles):
        for pxifile in pxifiles:
            # build pxifiles first, template extension must be .pxi.in
            assert pxifile.endswith(".pxi.in")
            outfile = pxifile[:-3]

            if (
                os.path.exists(outfile)
                and os.stat(pxifile).st_mtime < os.stat(outfile).st_mtime
            ):
                # if .pxi.in is not updated, no need to output .pxi
                continue

            with open(pxifile) as f:
                tmpl = f.read()
            pyxcontent = Tempita.sub(tmpl)

            with open(outfile, "w") as f:
                f.write(pyxcontent)
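
The same "render only if the template is newer than the output" pattern can be tried outside the pandas build with a rough standalone sketch. Note the assumptions: string.Template from the standard library is only a stand-in for Tempita (its `${name}` syntax differs from Tempita's), and the function name render_if_stale is hypothetical, not part of pandas.

```python
import os
from string import Template  # stand-in for Tempita, used only for illustration


def render_if_stale(template_path, **values):
    """Render `name.pxi.in` to `name.pxi` unless the output is already newer.

    Returns True if the output file was (re)written, False if it was skipped.
    """
    assert template_path.endswith(".pxi.in")
    out_path = template_path[:-3]  # strip the trailing ".in"

    if (
        os.path.exists(out_path)
        and os.stat(template_path).st_mtime < os.stat(out_path).st_mtime
    ):
        return False  # output is newer than the template, nothing to do

    with open(template_path) as f:
        rendered = Template(f.read()).substitute(**values)
    with open(out_path, "w") as f:
        f.write(rendered)
    return True
```

Calling it on a file such as example.pxi.in writes example.pxi next to it; a second call is skipped as long as the template has an older modification time than the output.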

But algorithms.factorize will code the unique values by order of appearance when sort=False; we have a lot of testing on it in pandas.tests.test_algos as well as the groupby/indexing tests. The one exception is null values (they are always given the largest code regardless of appearance), but that should be fixed by #46601.
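
For anyone who wants to check the order-of-appearance behavior on their own data, a small sketch along these lines may help. The frame and column names are invented for illustration, and null keys are deliberately left out because of the exception mentioned above.

```python
import pandas as pd

# Hypothetical data: keys first appear in the order b, a, c.
df = pd.DataFrame(
    {"key": ["b", "a", "c", "a", "b"], "value": [1, 2, 3, 4, 5]}
)

unsorted_result = df.groupby("key", sort=False)["value"].sum()
sorted_result = df.groupby("key", sort=True)["value"].sum()

print(unsorted_result.index.tolist())  # ['b', 'a', 'c']
print(sorted_result.index.tolist())    # ['a', 'b', 'c']
print(unsorted_result.tolist())        # [6, 6, 3]
```

With sort=False the groups come back in the order their keys first appear; only the order of the groups changes, not the aggregated values.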


4 participants