ENH/DOC: stability guide #5027

Closed
dengemann opened this issue Sep 28, 2013 · 26 comments

@dengemann
Contributor

I've recently had the experience of implementing and maintaining several downstream applications that support or build upon pandas. One thing I've learned is that it's quite painful to write code that runs, e.g., from pandas 0.7.3 up to current master. Although 0.7.3 seems rather old given the rapid dev cycle and the vibrant community, one should not forget that 0.7.3 is only 1.5 - 2 years old and hence still counts as the stable version in quite a few distros (and user studies show that many people, let alone institutions, are often years behind recent versions ...).
Which guidelines to follow to make life easier in such use cases does not seem too well documented. What would people think about adding such a -- possibly growing -- collection of hints and tips to the docs?

@jtratner
Contributor

@dengemann Two points:

  1. This seems like a good idea.
  2. Should we leave deprecated methods/classes around longer? (i.e., changes to save/load/io.parsers, etc.)

@dengemann
Contributor Author

This seems like a good idea.

Ok, great!

Should we leave deprecated methods/classes around longer? (i.e., changes to save/load/io.parsers, etc.)

Good question. The bad news: figuring out what works and what doesn't is tedious and time-consuming. Often, for example, I had a hard time convincing people that pandas is a suitable backend for production code, simply because things just did not work the way they expected across platforms / versions.
The good news is: mastering the task I outlined is quite doable, as far as I can tell from my use cases.

Before we write polished docs, here are a few blockers / solutions / the most pertinent points:

  1. Parsing lines without a unique index -- see #385 ("resolve issue 368") -- use a list comprehension + enumerate like:
from distutils.version import LooseVersion

import pandas as pd


def check_pandas_version(min_version):
    """Check that the minimum required pandas version is installed.

    Parameters
    ----------
    min_version : str
        The version string. Anything that matches
        ``'(\\d+ | [a-z]+ | \\.)'``
    """
    return LooseVersion(pd.__version__) >= LooseVersion(min_version)


def check_line_index(lines):
    """Check whether lines are safe for parsing.

    Parameters
    ----------
    lines : list of str
        A list of strings as returned from a file object.

    Returns
    -------
    lines : list of str
        The edited list of strings in case the pandas version
        is not recent enough.
    """
    if check_pandas_version('0.8'):
        return lines
    else:  # prepend a running index; ~92 us -- fastest, please don't change
        return [str(x) + ' ' + y for x, y in enumerate(lines)]

It took me quite some time to figure out what was wrong. In one application this even led me to drop support for older pandas versions. But that is not good; in science, update cycles are slower ...

  2. In-place operations: if you want to be on the safe side, never rely on in-place arguments; overwrite identifiers instead, e.g. for sort, reset_index, etc. (see the sketch after this list).
  3. iloc, ix, loc, etc. ... if you can make it work with ix, it's good. Too often I found myself accessing the underlying arrays + casting, doing nan masking manually ....
  4. Jumping between pandas versions for testing ...
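
A minimal sketch of the reassignment pattern from point 2 (the tidy_frame helper is hypothetical; exactly which methods accept inplace arguments, and with what defaults, has varied across old releases):

import pandas as pd


def tidy_frame(df):
    # instead of e.g. df.sort_index(inplace=True) or
    # df.reset_index(inplace=True), reassign the returned object --
    # this behaves the same on old and new pandas versions
    df = df.sort_index()
    df = df.reset_index(drop=True)
    return df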

I know this may sound slightly accusing, but my goal is to help people use pandas for production code and to convince others that this is a good idea (which I think it is).

@jtratner
Contributor

That's helpful. Minor note - are you sure #385 (which resolves #368) is the issue you mean? (That's the issue linked to - I just switched the link from wes to pydata.) Wes' comment there was that it was fixed.

@cpcloud
Member

cpcloud commented Sep 29, 2013

@dengemann

iloc, ix, loc, etc. ... if you can make it work with ix, it's good. Too often I found myself accessing the underlying arrays + casting, doing nan masking manually ....

Not sure what you mean by this. ix will be supported for the foreseeable future. The others are clearer and slightly faster (since they don't have fallbacks).

@dengemann
Contributor Author

... A more systematic approach might be to establish a core set of API tests that are validated across versions (unit tests that pass for, let's say, the last five releases). Just throwing out thoughts ...
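
For illustration, a rough sketch of what such a version-gated API test could look like (pytest is assumed here; the 0.11 cutoff reflects when loc/iloc were introduced, and the tested behaviour is only an example):

from distutils.version import LooseVersion

import pandas as pd
import pytest


def test_column_label_access():
    # behaviour downstream code relies on across releases
    df = pd.DataFrame({'a': [1, 2, 3]}, index=list('xyz'))
    assert df['a']['y'] == 2


@pytest.mark.skipif(LooseVersion(pd.__version__) < LooseVersion('0.11'),
                    reason='loc/iloc only exist in pandas >= 0.11')
def test_positional_access():
    df = pd.DataFrame({'a': [1, 2, 3]})
    assert df.iloc[1, 0] == 2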

@jtratner
Contributor

@dengemann a couple of other questions:

  1. Do you have any (other) examples of things that don't work across versions/platforms? That would be helpful to know. (and, if it were really important, we could consider doing a bugfix release, though I don't know whether platforms take those up faster than new versions).
  2. I'm not totally clear on your example. Where do those lines end up?

@dengemann
Contributor Author

Minor note - are you sure #385 (which resolves #368) is the issue you mean? (That's the issue linked to - I just switched the link from wes to pydata.)

Yes, thanks.

@jtratner
Contributor

@dengemann so are the issues in #368 actually resolved or not?

@dengemann
Contributor Author

Do you have any (other) examples of things that don't work across versions/platforms? That would be helpful to know. (and, if it were really important, we could consider doing a bugfix release, though I don't know whether platforms take those up faster than new versions).

I'm not sure we really need bugfix releases. It's more about making it easy to see what did not change over time ...

I'm not totally clear on your example. Where do those lines end up?

Sorry. I'm parsing discrete events from files, assembling lists of lines, and then using fast read_table parsing on StringIO objects. These functions serve as preprocessors to guarantee the functionality for users on older pandas versions.
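
For concreteness, a rough sketch of that pipeline, reusing the check_line_index helper from the earlier comment (the separator, the missing header, and the parse_events name are illustrative assumptions, not the actual application code):

from io import StringIO  # on Python 2: from StringIO import StringIO

import pandas as pd


def parse_events(lines):
    # prepend a synthetic index column when running on an old pandas version
    lines = check_line_index(lines)
    return pd.read_table(StringIO(u'\n'.join(lines)), sep=' ', header=None)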

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann

throwing in my 2c here.

pandas changed quite substantially in 0.8, so supporting anything older than that is going to be quite nightmarish. In the scientific community, HDF5 support became much more integrated starting in 0.10.1.

Can you elaborate on who your target audience is here? (for, say, < 0.10.1)

@jtratner
Contributor

@dengemann if you're saying you need just one function to do compatibility, that doesn't seem too bad. Again, it'd be nice if you could offer more examples. We could put together a gist with instructions for use.

@dengemann
Contributor Author

@dengemann so are the issues in #368 actually resolved or not?

Sorry, I've been wrong twice in succession; there was a typo and a misreading. I was referring to #835 -- this is fixed with pandas > 0.8, I think.

@dengemann
Contributor Author

pandas changed quite substantially in 0.8, so supporting anything older than that is going to be quite nightmarish.

@jreback I witnessed this ;-) This is also my reasoning. But still, many people run Debian stable or EPD 7.3, which ship pandas 0.7.3, IIRC/AFAIK.

if you're saying you need just one function to do compatibility, that doesn't seem too bad.

Absolutely, this is good news. But I don't want other people to lose a day or two finding that out ;-)

Again, it'd be nice if you could offer more examples.

I'll keep you posted. This will be an ongoing issue.

We could put together a gist with instructions for use.

Yes, this was my idea.

@jtratner
Contributor

@dengemann well, you should really talk to @yarikoptic to see what can be done to get newer versions of pandas into Debian stable and/or if there are any blockers.

@cpcloud
Member

cpcloud commented Sep 29, 2013

It's more about making it easy to see what did not change over time ...

Our release notes are quite extensive. While this doesn't tell you what hasn't changed, it's a useful starting point.

@dengemann
Contributor Author

Not sure what you mean by this.

This was a rather dense description of accumulated small experiences. Let me try to unpack it.

  1. iloc / loc are great. Unfortunately they are not backported, and it will take 1-2 years until I can rely on them in applications.

  2. So, ix: this is good to go in most cases. But my feeling is that the API / multi-indexers behave slightly differently across versions, e.g. with regard to partial indexing.
    I need to look into this more. Also, its semantics depend on the index values.

  3. If you want to be really safe, get your ndarrays out of the DataFrame and carry on manually (see the sketch below).
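
As a hedged sketch of points 1-3 (the hasattr-based fallback and the manual ndarray route; select_rows and finite_values are made-up names for illustration):

import numpy as np


def select_rows(df, labels):
    # prefer .loc where it exists (pandas >= 0.11), otherwise fall back to .ix
    indexer = df.loc if hasattr(df, 'loc') else df.ix
    return indexer[labels]


def finite_values(df, column):
    # the "really safe" route: pull out the ndarray and mask nans by hand
    values = np.asarray(df[column].values, dtype=float)
    return values[~np.isnan(values)]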

Our release notes are quite extensive. While this doesn't tell you what hasn't changed, it's a useful starting point.

Definitely, we should include a related pointer in a forthcoming doc.

@jtratner
Contributor

Would Debian take things up faster if they were backported? We use new version numbers when we make major changes to the public API.

@dengemann
Contributor Author

Would Debian take things up faster if they were backported?

@jtratner good question. --> ping @yarikoptic.

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann you bring up some good points, but the very fact that ix had some issues led to extensive discussions and the introduction of loc/iloc to make things easier. Even now, ix has issues. Here's one example: in 0.13, all of the indexers change to behave sanely when working on a floating-point index. Useful for some people.

At the same time as we introduce new features and squash bugs, I personally (and, I know, all of the other core devs) go to great lengths to provide backward compatibility and notice of changes in the API.

Here's another example: @jtratner and I worked extensively on making sure that prior-version pickles will work in 0.13, even though Series is no longer an ndarray subclass.

Another example: @cpcloud harassed the numexpr/PyTables people about a bug in their newly released version (numexpr 2.2.1) that completely broke PyTables 2.4! (which they acknowledged, and they are now issuing a new version).

@dengemann
Contributor Author

@jreback thanks, this is all great and I'm with you. I'm also aware of the efforts taken to maintain and stabilize functionality while promoting development. I think you all have done an awesome job with this; that's beyond question. But the task is highly non-trivial too, and naturally issues remain. My point was rather to start a discussion on how to add some documentation for long-term-support use cases and to share a few experiences. I'm happy to make a DOC PR once all points are settled. I'd especially like to produce some more concrete reports + recommendations on dealing with ix, an issue which I have not fully grasped at this point. There will be more news from my side, since I'm currently developing a new project for which I chose pandas as the backend.

@yarikoptic
Contributor

re Debian "taking it faster": I guess there is a bit of misunderstanding. Let me describe the life cycle of pandas in Debian:

  • Just a few days (or right on the day) after a new pandas release, I upload it to Debian unstable (or to experimental if Debian is in freeze or unstable is lacking some core dependency, which IIRC hasn't happened before). I think uploading even "faster" than that is not really needed ;)
  • Then, at least 10 days after the upload to unstable, if pandas builds fine across all architectures and no critical bugs are reported, it migrates to testing. ATM I would still need to address issues on armel and s390x (https://buildd.debian.org/status/package.php?p=pandas) to make it migrate to testing.
  • The version in testing becomes a candidate for the NEXT Debian stable and a candidate for a backports repository for the CURRENT Debian stable.
  • When a stable Debian releases, the version of pandas in it is THE version for that Debian stable, and only security and critical functionality patches are allowed afterwards. There is no way to push a new release of pandas into Debian stable.

But also, besides official Debian: as soon as I upload to Debian unstable, I upload backport builds of pandas for EVERY compatible Debian and Ubuntu release to NeuroDebian; see http://neuro.debian.net/pkgs/python-pandas.html. So technically speaking, a fresh pandas release is nearly always available only a few days later for every recent Debian and Ubuntu release, while the official 'stable' releases of those distributions come with the 'matured' versions.

To help with the migration of pandas from unstable to testing, which is currently usually stalled by problems with building/testing on various less common architectures, I have set up a buildbot on my sparc box for pandas: http://nipy.bic.berkeley.edu/waterfall?category=pandas. As you can see, it still needs some attention, and I believe there are a few outstanding issues here in the tracker to address failing tests on sparc.

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann

fair enough

@jtratner
Contributor

@yarikoptic that's what I thought, thanks for making that explicit :) That said, sparc issues are definitely frustrating.

@yarikoptic
Contributor

What could actually help is if there were at least some releases of pandas maintained for critical bug fixes, in addition to the bleeding-edge new-functionality releases. E.g. if there were a 0.12.x branch on top of the 0.12.0 release that absorbed all applicable critical fixes, leading to 0.12.1, etc. releases. Then, foreseeing an upcoming Debian stable release, I might have preferred to make sure it shipped this "stable" release rather than the more featureful 0.13.0.

@jtratner
Contributor

I guess the problem is that maintaining a separate stable branch requires quite a bit of time (as does trying to get things to work on sparc).

@mroeschke
Member

I think this is largely handled by the new policies doc.

https://github.com/pandas-dev/pandas/blob/master/doc/source/development/policies.rst

If there is anything not sufficiently described in that document, a new issue can be opened to clarify those points.
