Tips and tricks for pandas devs #3156

ghost · 2013-03-24T13:03:56Z

Having worked on pandas for a while now, I've picked up a bunch of tools and
tricks. Here's a list to help pandas devs slip into the zone:

Use ipdb rather than pdb with nose: --ipdb --ipdb-fail

https://github.com/flavioamieiro/nose-ipdb

Because tab-completion is not optional.

Re-running only failed tests

nosetests --with-id --failed reruns only the tests that failed the last
time you ran nosetests --with-id. If you use test_fast.sh, then

test_fast.sh --failed

does what you'd expect after some tests have failed.

Better integration of github and git commandline flow

hub is a wrapper around git, with GitHub
sugar. First and foremost:

hub checkout https://github.com/pydata/pandas/pull/1134

adds a remote, fetches it, creates a branch for it, and generally puts you right there.

Note: see comment below for a way to do this with pure git, if you don't
mind thousands of remote branches.

GH issues from the command line

ghi lets you open and manipulate GitHub issues from the command line.

I use it to open issues when I hit a bug and want to quickly
open a reminder to fix, without breaking my focus.

Testing across Python versions locally

tox lets you run the test suite across all Python versions using virtualenvs.
Everything is set up in the repo, just install and run.
detox parallelizes tox.

Faster pandas builds/testing

Note: the build cache was baked into setup.py from roughly 0.9.1. As of 0.11.0
it's been factored out into scripts/use_build_cache.py, which rewrites setup.py
to use the build cache. The script has been tested as far back as 0.7.0.

Put the following in your .bashrc:

# Use the pandas build cache
export BUILD_CACHE_DIR="$HOME/tmp/.pandas_build_cache/"
if [ ! -e "$BUILD_CACHE_DIR" ]; then
    mkdir -p "$BUILD_CACHE_DIR"
fi

# replace [pandas repo root dir] with the path to your pandas checkout
echo "$BUILD_CACHE_DIR" > [pandas repo root dir]/.build_cache_dir

function cdev {
    # any recent commit should do
    git checkout c69e3aa scripts/use_build_cache.py vb_suite/test_perf.py
    scripts/use_build_cache.py $1  # rewire setup.py to use the build cache

    # only use sudo when installing outside a virtualenv
    local _SUDO=""
    if [ -z "$VIRTUAL_ENV" ]; then
        _SUDO="sudo"
    fi

    sudo chown -R "$USER" .
    $_SUDO python ./setup.py clean
    $_SUDO python ./setup.py develop
    sudo chown -R "$USER" .
    echo "Restoring setup.py"
    git checkout setup.py  # restore setup.py
}

c69e3aa can be any recent commit; bump it whenever the script gets updated.

The pandas build cache code caches cythonization, compilation and
2to3 artifacts for reuse in subsequent builds.
To compile, use "git reset --hard" to get to the commit you're after, then use cdev
to build pandas. setup.py will reuse what it can to speed this up.
Note that setup.py gets overwritten, but is restored when the build completes.
With a warm cache, moving to a given commit takes just a few seconds rather than
the several minutes of a full compile.

You may also run scripts/use_build_cache.py prior to launching tox to speed up testing.
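
For anyone curious why a warm cache makes switching commits so cheap, here is a rough sketch of the general idea: key each expensive build output on a hash of its input and reuse it when the hash is already known. This is a per-file simplification for illustration only, not the code in scripts/use_build_cache.py; the names cached_build and fake_build are made up, and the real script may key and group artifacts differently.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_build(source: Path, cache_dir: Path, build) -> Path:
    """Return the artifact for `source`, reusing a cached copy when possible.

    `build` is a hypothetical callable taking (source, output_path); it
    stands in for an expensive step such as cythonization or compilation.
    """
    # Key the cache on the file's contents, so checking out an old commit
    # with identical sources hits the cache again.
    digest = hashlib.sha1(source.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}{source.suffix}.out"
    if not cached.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        build(source, cached)
    return cached


calls = []


def fake_build(source: Path, output: Path) -> None:
    # Stand-in for the slow step; records that it actually ran.
    calls.append(source.name)
    output.write_bytes(source.read_bytes().upper())


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "lib.pyx"
    cache = Path(tmp) / "cache"
    src.write_text("cdef int x = 1\n")
    first = cached_build(src, cache, fake_build)   # cold cache: builds
    second = cached_build(src, cache, fake_build)  # warm cache: reused
    src.write_text("cdef int x = 2\n")
    third = cached_build(src, cache, fake_build)   # changed source: rebuilds
    build_count = len(calls)
    same_artifact = first == second
    rebuilt = third != first
```

Only two of the three calls run the slow step; the second is served from the cache.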

Use ccache

The build cache just described caches things at a very coarse level: if there's
any change to the .pyx (cython) files, all the files will be recythonized and rebuilt.
Using ccache (an apt-get + env var away on most distros these days) can speed
up the compilation part by caching the gcc compilation results. Yes, this overlaps
with the caching from the previous section, only the build cache also caches the
cythonized C files.

Benchmarking commits

test_perf.sh lets you compare the performance of one commit against
another or benchmark the current HEAD.
It produces a table of results suitable for posting in a PR, and can serialize
the results dataframe into a pickle file, for analysis in pandas.

It can print summary stats over multiple runs and all sorts of other things.
See test_perf.sh --help.

Easily generate dataframes of different kinds

mkdf lets you easily fabricate dataframes of varying dimensions
and arbitrary data:

from pandas.util.testing import makeCustomDataframe as mkdf
In [12]: mkdf(3,2)
Out[12]: 
C0      C_l0_g0 C_l0_g1
R0                     
R_l0_g0    R0C0    R0C1
R_l0_g1    R1C0    R1C1
R_l0_g2    R2C0    R2C1

# or even...
In [11]: mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=3,data_gen_f=lambda r,c: r*2+c)
Out[11]: 
C0               C_l0_g0  C_l0_g1  C_l0_g2
C1               C_l1_g0  C_l1_g1  C_l1_g2
C2               C_l2_g0  C_l2_g1  C_l2_g2
R0      R1                                
R_l0_g0 R_l1_g0        0        1        2
R_l0_g1 R_l1_g1        2        3        4
R_l0_g2 R_l1_g2        4        5        6
R_l0_g3 R_l1_g3        6        7        8
R_l0_g4 R_l1_g4        8        9       10

# or even
In [19]: mkdf(8,3,r_idx_nlevels=3,r_ndupe_l=[4,2])
Out[19]: 
C0                      C_l0_g0 C_l0_g1 C_l0_g2
R0      R1      R2                             
R_l0_g0 R_l1_g0 R_l2_g0    R0C0    R0C1    R0C2
                R_l2_g1    R1C0    R1C1    R1C2
        R_l1_g1 R_l2_g2    R2C0    R2C1    R2C2
                R_l2_g3    R3C0    R3C1    R3C2
R_l0_g1 R_l1_g2 R_l2_g4    R4C0    R4C1    R4C2
                R_l2_g5    R5C0    R5C1    R5C2
        R_l1_g3 R_l2_g6    R6C0    R6C1    R6C2
                R_l2_g7    R7C0    R7C1    R7C2
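
To make the role of data_gen_f in the outputs above more concrete, here is a plain-Python sketch of the naming scheme. It only mimics what the printed frames show for single-level labels (the R_l0_gN / C_l0_gN labels and the "R{r}C{c}" default cells); make_grid is a made-up name and this is not how pandas implements mkdf.

```python
def make_grid(nrows, ncols, data_gen_f=None):
    """Build row labels, column labels and cell values for a small table."""
    if data_gen_f is None:
        # Default cells encode their own position, e.g. "R1C0".
        data_gen_f = lambda r, c: f"R{r}C{c}"
    rows = [f"R_l0_g{r}" for r in range(nrows)]
    cols = [f"C_l0_g{c}" for c in range(ncols)]
    # data_gen_f receives the zero-based (row, column) position of each cell.
    cells = [[data_gen_f(r, c) for c in range(ncols)] for r in range(nrows)]
    return rows, cols, cells


rows, cols, cells = make_grid(3, 2)
_, _, numeric = make_grid(5, 3, data_gen_f=lambda r, c: r * 2 + c)
```

With data_gen_f=lambda r, c: r * 2 + c, the last row comes out as 8, 9, 10, matching the second frame above.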

ipython startup file

Your IPython installation has a ~/.ipython/profile_default/startup directory.
Put your imports, monkey-patches and utility functions there to have them
always available.
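
As a rough idea of what such a file might look like, here is a small sketch. The filename 00-dev.py and the timed helper are made up for illustration; IPython runs the .py files in that directory in lexicographic order at startup.

```python
# Hypothetical file: ~/.ipython/profile_default/startup/00-dev.py
import importlib
import time
from contextlib import contextmanager

# Import optional heavy libraries only if they are installed, so a missing
# package never prevents the shell from starting.
for _alias, _name in [("np", "numpy"), ("pd", "pandas")]:
    try:
        globals()[_alias] = importlib.import_module(_name)
    except ImportError:
        pass


@contextmanager
def timed(label="block"):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")
```

After that, something like "with timed('groupby'): ..." works in every new session without any imports.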

Speel checking github issues

Issues can quickly become a stream-of-consciousness thing once
you start doing a lot of them. If you'd like an easy way to get red squigglies
when your comment contains silly mistaces, you might consider installing
After the Deadline, available as an extension for Firefox and Chrome.

Handy git commands

There are too many git tricks to cover, but the following are both useful and less commonly known:

Generate a new hash for the current commit, without any other changes to repo state:

git commit --amend -C HEAD

Report author of given commit hash:

function gauthor {
    # print "Name <email>" for the given commit; -s suppresses the diff
    git show -s --format='%an <%ae>' "$@" | head -n 1
}

and properly assign authorship of a commit:

git commit --author="$(gauthor foohash)"

where foohash is any previous commit authored by that contributor.

To locate the merge commit that introduced a commit into the branch:
https://github.com/jianli/git-get-merge


jankatins · 2013-03-24T20:29:31Z

Adding

[remote "origin"]
   ...
   fetch = +refs/pull/*/head:refs/remotes/origin/pr/*

to a remote makes git fetch all pull requests and expose them as remote-tracking branches named "pr/xyz".
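
For readers not used to refspecs, here is a small Python sketch of what that fetch line does to ref names. It covers only the single-`*` form shown above (real git refspecs have more rules), and apply_refspec is a made-up helper, not a git API. The leading + tells git to update the local ref even when the update is not a fast-forward.

```python
def apply_refspec(refspec, remote_ref):
    """Map a ref on the remote to its local name, or None if it doesn't match."""
    src, dst = refspec.lstrip("+").split(":")
    prefix, suffix = src.split("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # Whatever the * matched on the remote side is substituted on the local side.
    wildcard = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", wildcard)


spec = "+refs/pull/*/head:refs/remotes/origin/pr/*"
local = apply_refspec(spec, "refs/pull/1134/head")
skipped = apply_refspec(spec, "refs/pull/1134/merge")
```

Here local is "refs/remotes/origin/pr/1134", while the refs/pull/1134/merge ref that GitHub also publishes is not matched and so is not fetched.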

ghost · 2013-03-24T20:40:01Z

Very cool. That goes in .git/config, under 'upstream' rather than 'origin' depending on how you set up your fork remotes.
It's also very convenient to have

function gcopr {
  git fetch upstream
  git checkout upstream/pr/$1
}

in your .bashrc

jreback · 2013-04-02T13:51:36Z

maybe create a milestone, something like info for items like this?

ghost · 2013-04-02T15:06:42Z

I'll move it to the docs when I get a chance.

cpcloud · 2013-04-30T00:20:06Z

Can we add a way to make tox nosetests line configurable? None of the tests I add get run :(

ghost · 2013-04-30T02:00:58Z

would like to consolidate all the test_*.sh stuff to a single python script with bells and perhaps whistles.
not a priority though.

cpcloud · 2013-05-04T21:32:53Z

FYI, ghi now allows you to see the issues that you created.

cpcloud · 2013-05-19T02:38:00Z

@y-p should tox.ini and tox_prll.ini be changed to {envbindir}/nosetests --exe pandas -A "not network" instead of {envbindir}/nosetests --exe pandas.tests -A "not network"? then new tests will be run. prolly those tox config files should be added to git ignore as well.

cpcloud · 2013-05-19T03:46:23Z

another useful tool is nose-progressive, which runs your tests with much cleaner output and gives a nice if somewhat superfluous progress bar in the terminal.

also scm_breeze gives you a bunch of aliases for common git stuff

cpcloud · 2013-06-15T22:45:19Z

nice way to time tests so that you know which ones are running slow

cpcloud · 2013-06-18T22:24:59Z

now there's a set of commands in the Makefile in the top-level pandas directory with the following functionality

make clean will delete the build and dist directories + all *.pyc and *.so files
make clean_pyc just removes *.pyc files
make build will build extensions inplace
make develop will install pandas in your environment but will place a link to the dev dir so that you can make changes and they will show up immediately
make doc will build the documentation from scratch (erases generated and build directories)

jreback · 2013-06-18T23:57:02Z

like this could go on the tips page

cpcloud · 2013-06-29T23:15:58Z

should note that ipdb doesn't work with ipython master

ghost · 2014-01-10T12:18:32Z

Stale, wiki pages replace this (somewhat). Also, slight whiff of unbecoming hubris.


ghost closed this as completed Jan 10, 2014
