Tips and tricks for pandas devs #3156

Closed
ghost opened this issue Mar 24, 2013 · 14 comments
Labels
Admin Administrative tasks related to the pandas project Docs

Comments

@ghost

ghost commented Mar 24, 2013

Having worked on pandas for a while now, I use a bunch of tools and tricks.
Here's a list to help pandas devs slip into the zone:

Use ipdb rather than pdb with nose: --ipdb --ipdb-fail

https://github.com/flavioamieiro/nose-ipdb

Because tab-completion is not optional.

Re-running only failed tests

nosetests --with-id --failed will rerun only the tests which failed the last
time you ran nosetests --with-id. If you use test_fast.sh,

test_fast.sh --failed

will do what you expect after some tests have failed.

Better integration of github and git commandline flow

hub is a wrapper around git with GitHub
sugar. First and foremost:

hub checkout https://github.com/pydata/pandas/pull/1134

adds a remote, fetches it, creates a branch for it, and generally puts you right there.

Note: see comment below for a way to do this with pure git, if you don't
mind thousands of remote branches.

GH issues from the command line

ghi
lets you open and manipulate GitHub issues from the command line.

I use it to open issues when I hit a bug and want to quickly
open a reminder to fix, without breaking my focus.

Testing across python version locally

tox lets you run the test suites across all python versions using virtualenvs.
Everything is set up in the repo, just install and run.
detox parallelizes tox.

Faster pandas builds/testing

Note: the build cache was baked into setup.py from roughly 0.9.1. As of 0.11.0
it's been factored out into scripts/use_build_cache.py, which rewrites setup.py
to use the build cache. The script has been tested as far back as 0.7.0.

Put the following in your .bashrc:

# Use the pandas build cache
export BUILD_CACHE_DIR="$HOME/tmp/.pandas_build_cache/"
if [ ! -e "$BUILD_CACHE_DIR" ]; then
    mkdir -p "$BUILD_CACHE_DIR"
fi

echo "$BUILD_CACHE_DIR" > [pandas repo root dir]/.build_cache_dir
function cdev {
    # any recent commit should do
    git checkout c69e3aa scripts/use_build_cache.py vb_suite/test_perf.py
    scripts/use_build_cache.py $1  # rewire setup.py with build_cache
    # only use sudo for setup.py when not inside a virtualenv;
    # "local" keeps a stale value from an earlier call from leaking in
    local _SUDO=""
    if [ -z "$VIRTUAL_ENV" ]; then
        _SUDO="sudo"
    fi

    sudo chown -R "$USER" .
    $_SUDO python ./setup.py clean
    $_SUDO python ./setup.py develop
    sudo chown -R "$USER" .
    echo "Restoring setup.py"
    git checkout setup.py  # restore setup.py
}

c69e3aa can be any recent commit; it needs to be bumped if there are updates
to the script.

The pandas build cache code caches cythonization, compilation and
2to3 artifacts for reuse in subsequent builds.
To compile, use "git reset --hard" to get the commit you're after, then use cdev
to build pandas. setup.py will reuse what it can to speed this up.
Note that setup.py gets overwritten, but is restored when the build completes.
With a warm cache, moving to a given commit takes just a few seconds rather than
the several minutes of a full compile.

You may also run scripts/use_build_cache.py prior to launching tox to speed up testing.

Use ccache

The build cache just described caches things at a very coarse level: if there's
any change to .pyx (cython) files, all the files will be recythonized and rebuilt.
Using ccache (an apt-get plus an environment variable away on most distros these days) can speed
up the compilation part by caching the gcc compilation results. Yes, this overlaps
with the caching from the previous section, though that one also caches the cythonized
C files.
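
The difference in granularity can be sketched roughly as follows. This is only an illustration, not the code in scripts/use_build_cache.py or in ccache: the function names, file names and the in-memory `sources` dict are all made up for the example. A coarse cache derives a single key from every source file, so one edit invalidates everything; a finer-grained cache keys each file on its own.

```python
import hashlib


def coarse_key(sources):
    """One key for the whole source set: any change invalidates everything."""
    digest = hashlib.sha1()
    for name in sorted(sources):
        digest.update(name.encode("utf-8"))
        digest.update(sources[name])
    return digest.hexdigest()


def fine_keys(sources):
    """One key per file: only the edited file misses the cache."""
    return {name: hashlib.sha1(data).hexdigest() for name, data in sources.items()}


# Hypothetical file names and contents, for illustration only.
before = {"a.pyx": b"cdef int x = 1", "b.pyx": b"cdef int y = 2"}
after = dict(before)
after["a.pyx"] = b"cdef int x = 3"

# Files whose per-file key changed between the two snapshots.
stale = [n for n in before if fine_keys(before)[n] != fine_keys(after)[n]]
print(coarse_key(before) != coarse_key(after))  # the whole-set key changed
print(stale)  # only the edited file needs rebuilding
```

With per-file keys, b.pyx would still be served from the cache after the edit to a.pyx.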

Benchmarking commits

test_perf.sh lets you compare the performance of one commit against
another or benchmark the current HEAD.
It produces a table of results suitable for posting in a PR, and can serialize
the results dataframe into a pickle file, for analysis in pandas.

It can print summary stats over multiple runs and all sorts of things.
See test_perf.sh --help.
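
The idea of comparing two commits over several runs can be sketched with the standard library. This is not how test_perf.sh is implemented: `summarize`, the key names and the timings below are assumptions made up for the example.

```python
import statistics


def summarize(head_ms, baseline_ms):
    """Reduce repeated timings (in ms) for one benchmark to a small summary."""
    t_head = statistics.median(head_ms)
    t_base = statistics.median(baseline_ms)
    return {
        "t_head": t_head,
        "t_baseline": t_base,
        # ratio > 1 means HEAD is slower than the baseline commit
        "ratio": t_head / t_base,
    }


# Made-up timings from three runs of the same benchmark on each commit.
result = summarize(head_ms=[12.0, 11.0, 13.0], baseline_ms=[10.0, 9.0, 11.0])
print(result)
```

Using the median rather than the mean keeps a single noisy run from dominating the comparison.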

Easily generate dataframes of different kinds

mkdf lets you easily fabricate dataframes of varying dimensions
and arbitrary data:

from pandas.util.testing import makeCustomDataframe as mkdf
In [12]: mkdf(3,2)
Out[12]: 
C0      C_l0_g0 C_l0_g1
R0                     
R_l0_g0    R0C0    R0C1
R_l0_g1    R1C0    R1C1
R_l0_g2    R2C0    R2C1

# or even...
In [11]: mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=3,data_gen_f=lambda r,c: r*2+c)
Out[11]: 
C0               C_l0_g0  C_l0_g1  C_l0_g2
C1               C_l1_g0  C_l1_g1  C_l1_g2
C2               C_l2_g0  C_l2_g1  C_l2_g2
R0      R1                                
R_l0_g0 R_l1_g0        0        1        2
R_l0_g1 R_l1_g1        2        3        4
R_l0_g2 R_l1_g2        4        5        6
R_l0_g3 R_l1_g3        6        7        8
R_l0_g4 R_l1_g4        8        9       10

# or even
In [19]: mkdf(8,3,r_idx_nlevels=3,r_ndupe_l=[4,2])
Out[19]: 
C0                      C_l0_g0 C_l0_g1 C_l0_g2
R0      R1      R2                             
R_l0_g0 R_l1_g0 R_l2_g0    R0C0    R0C1    R0C2
                R_l2_g1    R1C0    R1C1    R1C2
        R_l1_g1 R_l2_g2    R2C0    R2C1    R2C2
                R_l2_g3    R3C0    R3C1    R3C2
R_l0_g1 R_l1_g2 R_l2_g4    R4C0    R4C1    R4C2
                R_l2_g5    R5C0    R5C1    R5C2
        R_l1_g3 R_l2_g6    R6C0    R6C1    R6C2
                R_l2_g7    R7C0    R7C1    R7C2
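
The labelling scheme visible in the outputs above can be reproduced in a few lines of plain Python. This is a sketch inferred from those outputs, not the actual makeCustomDataframe source; `make_labels` and `make_cells` are hypothetical names, and the real function may handle more cases.

```python
def make_labels(prefix, n, nlevels=1, ndupe=None):
    """Return one tuple of level labels per row (or column).

    ndupe[i] is how many consecutive entries share a label at level i;
    levels without an entry get unique labels.
    """
    ndupe = list(ndupe or [])
    ndupe += [1] * (nlevels - len(ndupe))
    return [
        tuple(f"{prefix}_l{lvl}_g{pos // ndupe[lvl]}" for lvl in range(nlevels))
        for pos in range(n)
    ]


def make_cells(nrows, ncols, data_gen_f=lambda r, c: f"R{r}C{c}"):
    """Build the grid of cell values, one list per row."""
    return [[data_gen_f(r, c) for c in range(ncols)] for r in range(nrows)]


rows = make_labels("R", 8, nlevels=3, ndupe=[4, 2])
print(rows[2])              # ('R_l0_g0', 'R_l1_g1', 'R_l2_g2')
print(make_cells(2, 2)[1])  # ['R1C0', 'R1C1']
```

The third row of the last example above has exactly these labels; the repeated outer labels are simply not printed again in the dataframe display.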

ipython startup file

Your ipython installation has a ~/.ipython/profile_default/startup directory.
Put your imports, monkey-patches and utility functions there and have them
always available.
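
A startup file might look something like the sketch below. The file name 00-dev.py and the `timed` helper are made up for the example; the imports are guarded so that a missing or broken package doesn't get in the way when ipython starts.

```python
# Hypothetical ~/.ipython/profile_default/startup/00-dev.py
import time
from contextlib import contextmanager

# Guarded imports: skip quietly if the packages aren't importable.
try:
    import numpy as np
    import pandas as pd
except ImportError:
    pass


@contextmanager
def timed(label="block"):
    """Print how long the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")
```

After that, something like `with timed("sort"): sorted(range(10**6))` should work in any new session without importing anything first.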

Spell checking GitHub issues

Issues can quickly become a stream-of-consciousness thing once
you start doing a lot of them. If you'd like an easy way to get red squigglies
when your comment contains silly mistakes, you might consider installing
After the Deadline, available as an extension for Firefox and Chrome.

Handy git commands

There are too many git tricks to cover, but the following are both useful and less commonly known:

Generate a new hash for the current commit, without any other changes to repo state:

git commit --amend -C HEAD
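
Why this yields a new hash: as far as I understand it, a commit id is a SHA-1 over the commit object, and that object includes the committer timestamp. -C HEAD reuses the message and author information, but the amend records a fresh committer time. A rough sketch of the hashing, with a placeholder tree id, made-up names and made-up timestamps:

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 that git assigns to an object: header, NUL byte, then raw body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(committer_time):
    # Placeholder tree id, identity and timestamps, for illustration only.
    return (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        "author A Dev <dev@example.com> 1364083200 +0000\n"
        f"committer A Dev <dev@example.com> {committer_time} +0000\n"
        "\n"
        "Same message\n"
    ).encode("utf-8")


old = git_object_id("commit", commit_body(1364083200))
new = git_object_id("commit", commit_body(1364083260))
print(old != new)  # only the committer timestamp differs
```

This also suggests that amending twice within the same second could produce the same hash again.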

Report author of given commit hash:

function gauthor {
         git show --format='%an <%ae>' "$@" | head -n 1
}

and properly assign authorship of a commit:

git commit --author="$(gauthor foohash)"

where foohash is any previous commit authored by that contributor.

To locate the merge commit that introduced a commit into the branch:
https://github.com/jianli/git-get-merge

@jankatins
Contributor

Adding

[remote "origin"]
   ...
   fetch = +refs/pull/*/head:refs/remotes/origin/pr/*

to a remote will pull all pull requests and make them available as branch "pr/xyz"
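
How that fetch line maps pull request refs to local names can be sketched in plain Python. `apply_refspec` is a hypothetical helper written for this example; it only handles the simple "+src:dst" form with one "*" on each side, which is far less than git itself supports.

```python
def apply_refspec(refspec, remote_ref):
    """Map a ref on the remote to its local name, or None if it doesn't match.

    Handles only the simple "+src:dst" form with a single "*" on each side.
    """
    src, dst = refspec.lstrip("+").split(":")
    prefix, suffix = src.split("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # Whatever the "*" matched on the source side is substituted on the destination side.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched)


spec = "+refs/pull/*/head:refs/remotes/origin/pr/*"
print(apply_refspec(spec, "refs/pull/1134/head"))  # refs/remotes/origin/pr/1134
print(apply_refspec(spec, "refs/heads/master"))    # None
```

So pull request 1134 ends up as the remote-tracking branch origin/pr/1134, while ordinary branches are left to the remote's other fetch lines.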

@ghost
Author

ghost commented Mar 24, 2013

Very cool. That goes in .git/config, under 'upstream' rather than 'origin' depending on how you set up your fork remotes.
Also very convenient, with

function gcopr {
  git fetch upstream
  git checkout upstream/pr/$1
}

in your .bashrc

@jreback
Contributor

jreback commented Apr 2, 2013

maybe create a milestone, something like info for items like this?

@ghost
Author

ghost commented Apr 2, 2013

I'll move it to the docs when I get a chance.

@cpcloud
Member

cpcloud commented Apr 30, 2013

Can we add a way to make tox nosetests line configurable? None of the tests I add get run :(

@ghost
Author

ghost commented Apr 30, 2013

would like to consolidate all the test_*.sh stuff to a single python script with bells and perhaps whistles.
not a priority though.

@cpcloud
Member

cpcloud commented May 4, 2013

FYI, ghi now allows you to see the issues that you created.

@cpcloud
Member

cpcloud commented May 19, 2013

@y-p should tox.ini and tox_prll.ini be changed to {envbindir}/nosetests --exe pandas -A "not network" instead of {envbindir}/nosetests --exe pandas.tests -A "not network"? then new tests will be run. prolly those tox config files should be added to git ignore as well.

@cpcloud
Member

cpcloud commented May 19, 2013

another useful tool is nose-progressive, which runs your tests with much cleaner output and gives a nice if somewhat superfluous progress bar in the terminal.

also scm_breeze gives you a bunch of aliases for common git stuff

@cpcloud
Member

cpcloud commented Jun 15, 2013

nice way to time tests so that you know which ones are running slow

@cpcloud
Member

cpcloud commented Jun 18, 2013

now there's a set of commands in the Makefile in the top-level pandas directory with the following functionality

  • make clean will delete the build and dist directories + all *.pyc and *.so files
  • make clean_pyc just removes *.pyc files
  • make build will build extensions inplace
  • make develop will install pandas in your environment but will place a link to the dev dir so that you can make changes and they will show up immediately
  • make doc will build the documentation from scratch (erases generated and build directories)

@jreback
Contributor

jreback commented Jun 18, 2013

like this could go on the tips page

@ghost ghost mentioned this issue Jun 20, 2013
@cpcloud
Member

cpcloud commented Jun 29, 2013

should note that ipdb doesn't work with ipython master

@ghost
Author

ghost commented Jan 10, 2014

Stale, wiki pages replace this (somewhat). Also, slight whiff of unbecoming hubris.

This issue was closed.