Skip to content

DOC/DEV: document approach to bisect regressions #35685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Aug 12, 2020 · 11 comments · Fixed by #48968
Closed

DOC/DEV: document approach to bisect regressions #35685

jorisvandenbossche opened this issue Aug 12, 2020 · 11 comments · Fixed by #48968

Comments

@jorisvandenbossche
Copy link
Member

@simonjayhawkins you have been doing great work on finding the causes of bugs/regressions!
To make it easier for other contributors to also do this, I think it would be nice to document this is the contributing docs (even though it is relying on git, using it with pandas (which needs to be rebuilt for each commit) requires a bit of custom setup I think?)

@MarcoGorelli
Copy link
Member

This would be good...I've got a local script mybisect.sh :

python setup.py build_ext --inplace -j 8;
python pandas/tests/myscript.py

and then put whatever example I'm trying to check in pandas/tests/myscript.py. Then I do

git bisect start
git bisect bad
git bisect good <some older commit>
git bisect run bash mybisect.sh

. It took me more than I'm proud to admit to figure this out, so making a note of this (or of whatever the proper way of doing it is) would be useful IMO

@simonjayhawkins
Copy link
Member

good idea. I'll use this issue to post a few notes in the first instance.

as well as finding bug/regressions it can also be useful to know when issues are fixed, e.g. #35650

In that issue the old behaviour raised ValueError and the new behaviour didn't. (more detail on exit status will be needed in doc)

so the script needed to exit with a non-zero exit code for the new behaviour and with a zero exit code for the old behaviour

so in this case, this script was used (print statements not needed)

import sys
import numpy as np
import pandas as pd

print(pd.__version__)

arr = np.array([np.datetime64("2015-11-20T15:06:58.000")])
arr.flags.writeable = False
try:
    res = pd.factorize(arr)
    print(res)
except:
    pass
else:
    sys.exit(1)

@dsaxton
Copy link
Member

dsaxton commented Aug 13, 2020

I've been wondering how exactly @simonjayhawkins does this but was too afraid to ask 😊. Great idea to document this.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Aug 14, 2020

for an regression which is not fixed on master and now raises an Exception can just use code sample (this will give a non-zero exit status for a bad commit and zero exit status for good commit eg. #35700 (comment)

(don't need to assert correct result but probably better to document a better practice)

~/test.py

import pandas as pd
print(pd.__version__)
df = pd.DataFrame({"a": [1, 2, 3] * 10000, "b": ["x", "y", "z"] * 10000})
res = df == "x"
print(res)

and execute the following commands in the shell (git bash on Windows) see examples in https://git-scm.com/docs/git-bisect#_examples

git bisect start HEAD v1.0.5
git bisect run /bin/bash -c "python setup.py build_ext -i;python ~/test.py"
git bisect reset
python setup.py build_ext -i  # need to rebuild after reset

@jbrockmendel
Copy link
Member

We could even put something like this into a script in scripts/, possibly even accessible in via make

@simonjayhawkins
Copy link
Member

possibly even accessible in via make

if it for dev use that's probably ok. but if we also want to encourage users to bisect the issues themselves we need to be Windows friendly or have a guide that includes Linux/Mac and Windows instructions.

@mzeitlin11
Copy link
Member

Thanks for writing these examples up so clearly here @simonjayhawkins, made it very easy to get started! A potentially useful addition is prepending CFLAGS="-O0" to the run command, which made bisection much faster for me because of greatly decreased build time.

@mroeschke mroeschke added the Admin Administrative tasks related to the pandas project label Aug 8, 2021
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Sep 21, 2022

A potentially useful addition is prepending CFLAGS="-O0" to the run command

Is this only true for bisecting? Any reason to not use it all the time?


Anyway, here's a fully worked notebook I use to find regressions: https://www.kaggle.com/code/marcogorelli/pandas-regression-example

@lithomas1 lithomas1 removed the Admin Administrative tasks related to the pandas project label Sep 21, 2022
@MarcoGorelli
Copy link
Member

Right, let's try to move this forwards - I think this could be a nice addition to this page https://github.com/pandas-dev/pandas/blob/0dadc71dd4653e5b858d7b4153df1d7aded4ba46/doc/source/development/maintaining.rst

Doesn't need to be in-depth, I think it just needs:

  • a link to the git bisect documentation
  • a little example like the one Simon gave, e.g.:

If a user reported that pd.Series([1,2,3]).sum() returned 9 in version pandas 1.5.0 and 6 in version 1.4.0, then you could find out which commit changed the result by making a file t.py with

import pandas as pd
assert pd.Series([1,2,3]).sum() == 6

and then running

git bisect start
git bisect good v1.4.0
git bisect bad v1.5.0
git bisect run bash -c "python setup.py build_ext -j 4; python t.py"
git bisect reset
python setup.py build_ext -j 4  # need to rebuild after reset

Marking as good-first-issue then, if anyone wants to work on this feel free to open a pull request and to tag me

@mzeitlin11
Copy link
Member

Is this only true for bisecting? Any reason to not use it all the time?

Gives a tradeoff between compile time and performance. No optimizations gives much slower code, but for a case like bisecting when (usually) there's a bunch of compiling going on (since cython files often change across bisection steps), it can be much faster provided that the bisection test case still runs quickly

@seljaks
Copy link
Contributor

seljaks commented Sep 29, 2022

Hi @MarcoGorelli I'd like to take a crack at writing this up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants