ENH/DOC: stability guide #5027

Closed
dengemann opened this issue Sep 28, 2013 · 26 comments

@dengemann
Contributor

I've recently had the experience of implementing and maintaining several downstream applications that support or build upon pandas. One thing I've learned is that it's quite painful to write code that runs, e.g., from pandas 0.7.3 up to current master. Although 0.7.3 seems rather old given the rapid dev cycle and the vibrant community, one should not forget that 0.7.3 is only 1.5 - 2 years old and hence still counts as the stable version in quite a few distros (and user studies show that many people, let alone institutions, are often years behind recent versions ...).
Which guidelines to follow to make life easier in such use cases does not seem too well documented. What would people think about adding such a -- possibly growing -- collection of hints and tips to the docs?

@jtratner
Contributor

@dengemann Two points:

  1. This seems like a good idea.
  2. Should we leave deprecated methods/classes around longer? (i.e., changes to save/load/io.parsers, etc.)

@dengemann
Contributor Author

This seems like a good idea.

Ok, great!

Should we leave deprecated methods/classes around longer? (i.e., changes to save/load/io.parsers, etc.)

Good question. The bad news: figuring out what works and what doesn't is tedious and time-consuming. Often, for example, I had a hard time convincing people that pandas is a suitable backend for production code, simply because things just did not work the way they expected across platforms / versions.
The good news is: mastering the task I outlined is quite doable, as far as I can tell from my use cases.

Before we write polished docs, here are a few blockers / solutions / the most pertinent points:

  1. Parsing lines without a unique index -- see #385 ("resolve issue 368") -- use a list comprehension + enumerate like:
from distutils.version import LooseVersion

import pandas as pd


def check_pandas_version(min_version):
    """Check that the minimum required pandas version is installed.

    Parameters
    ----------
    min_version : str
        The version string. Anything that matches
        ``'(\\d+ | [a-z]+ | \\.)'``
    """
    return LooseVersion(pd.__version__) >= LooseVersion(min_version)


def check_line_index(lines):
    """Check whether lines are safe for parsing.

    Parameters
    ----------
    lines : list of str
        A list of strings as returned from a file object.

    Returns
    -------
    lines : list of str
        The edited list of strings in case the pandas version
        is not recent enough.
    """
    if check_pandas_version('0.8'):
        return lines
    else:  # prepend a running index; ~92 us -- fastest, please don't change
        return [str(x) + ' ' + y for x, y in enumerate(lines)]

It took me quite some time to figure out what was wrong. In one application this even led me to drop support for older pandas versions. But that is not good; in science, update cycles are slower ...

  2. In-place operations: if you want to be on the safe side, never rely on in-place arguments; overwrite identifiers instead, e.g. for sort, reset_index, etc. (see the sketch after this list).
  3. iloc, ix, loc, etc. ... if you can make it work with ix, it's good. Too often I found myself accessing the underlying arrays + casting, doing nan masking manually ....
  4. Jumping between pandas versions for testing ...
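
A minimal sketch of the reassignment pattern from point 2 (the tidy_frame helper is hypothetical; exactly which methods accept inplace arguments, and with what defaults, has varied across old releases):

import pandas as pd


def tidy_frame(df):
    # instead of e.g. df.sort_index(inplace=True) or
    # df.reset_index(inplace=True), reassign the returned object --
    # this behaves the same on old and new pandas versions
    df = df.sort_index()
    df = df.reset_index(drop=True)
    return df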

I know this may sound slightly accusing, but my goal is to help people use pandas for production code and to convince others that this is a good idea (which I think it is).

@jtratner
Contributor

That's helpful. Minor note - are you sure #385 (which resolves #368) is the issue you mean? (That's the issue linked to - I just switched the link from wes to pydata.) Wes' comment there was that it was fixed.

@cpcloud
Member

cpcloud commented Sep 29, 2013

@dengemann

iloc, ix, loc, etc. ... if you can make it work with ix, it's good. Too often I found myself accessing the underlying arrays + casting, doing nan masking manually ....

Not sure what you mean by this. ix will be supported for the foreseeable future. The others are clearer and slightly faster (since they don't have fallbacks).

@dengemann
Contributor Author

... A more systematic approach might be to establish a core set of API tests that are validated across versions (unit tests that pass for, let's say, the last five releases). Just throwing out thoughts ...
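
For illustration, a rough sketch of what such a version-gated API test could look like (pytest is assumed here; the 0.11 cutoff reflects when loc/iloc were introduced, and the tested behaviour is only an example):

from distutils.version import LooseVersion

import pandas as pd
import pytest


def test_column_label_access():
    # behaviour downstream code relies on across releases
    df = pd.DataFrame({'a': [1, 2, 3]}, index=list('xyz'))
    assert df['a']['y'] == 2


@pytest.mark.skipif(LooseVersion(pd.__version__) < LooseVersion('0.11'),
                    reason='loc/iloc only exist in pandas >= 0.11')
def test_positional_access():
    df = pd.DataFrame({'a': [1, 2, 3]})
    assert df.iloc[1, 0] == 2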

@jtratner
Contributor

@dengemann a couple of other questions:

  1. Do you have any (other) examples of things that don't work across versions/platforms? That would be helpful to know. (and, if it were really important, we could consider doing a bugfix release, though I don't know whether platforms take those up faster than new versions).
  2. I'm not totally clear on your example. Where do those lines end up?

@dengemann
Contributor Author

Minor note - are you sure #385 (which resolves #368) is the issue you mean? (That's the issue linked to - I just switched the link from wes to pydata.)

Yes, thanks.

@jtratner
Contributor

@dengemann so are the issues in #368 actually resolved or not?

@dengemann
Contributor Author

Do you have any (other) examples of things that don't work across versions/platforms? That would be helpful to know. (and, if it were really important, we could consider doing a bugfix release, though I don't know whether platforms take those up faster than new versions).

I'm not sure we really need bugfix releases. It's more about making it easy to see what did not change over time ...

I'm not totally clear on your example. Where do those lines end up?

Sorry. I'm parsing discrete events from files, assembling lists of lines, and then using fast read_table parsing on StringIO objects. These functions serve as preprocessors to guarantee the functionality for users on older pandas versions.
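
For concreteness, a rough sketch of that pipeline, reusing the check_line_index helper from the earlier comment (the separator, the missing header, and the parse_events name are illustrative assumptions, not the actual application code):

from io import StringIO  # on Python 2: from StringIO import StringIO

import pandas as pd


def parse_events(lines):
    # prepend a synthetic index column when running on an old pandas version
    lines = check_line_index(lines)
    return pd.read_table(StringIO(u'\n'.join(lines)), sep=' ', header=None)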

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann

throwing in my 2c here.

pandas changed quite substantially in 0.8, so supporting anything older than that is going to be quite nightmarish. In the scientific community, HDF5 support became much more integrated starting in 0.10.1.

Can you elaborate on who your target audience is here? (for, say, < 0.10.1)

@jtratner
Contributor

@dengemann if you're saying you need just one function to do compatibility, that doesn't seem too bad. Again, it'd be nice if you could offer more examples. We could put together a gist with instructions for use.

@dengemann
Contributor Author

@dengemann so are the issues in #368 actually resolved or not?

Sorry, I've been wrong twice in succession; there was a typo and a misreading. I was referring to #835 -- this is fixed with pandas > 0.8, I think.

@dengemann
Contributor Author

pandas changed quite substantially in 0.8, so supporting anything older than that is going to be quite nightmarish.

@jreback I witnessed this ;-) This is also my reasoning. But still, many people run Debian stable or EPD 7.3, which ship pandas 0.7.3, IIRC/AFAIK.

if you're saying you need just one function to do compatibility, that doesn't seem too bad.

Absolutely, this is good news. But I don't want other people to lose a day or two finding that out ;-)

Again, it'd be nice if you could offer more examples.

I'll keep you posted. This will be an ongoing issue.

We could put together a gist with instructions for use.

Yes, this was my idea.

@jtratner
Contributor

@dengemann well, you should really talk to @yarikoptic to see what can be done to get newer versions of pandas into Debian stable and/or if there are any blockers.

@cpcloud
Member

cpcloud commented Sep 29, 2013

It's more about making it easy to see what did not change over time ...

Our release notes are quite extensive. While this doesn't tell you what hasn't changed, it's a useful starting point.

@dengemann
Contributor Author

Not sure what you mean by this.

This was a rather dense description of accumulated small experiences. Let me try to unpack it.

  1. iloc / loc are great. Unfortunately they are not backported, and it will take 1-2 years until I can rely on them in applications.

  2. So, ix: this is good to go in most cases. But my feeling is that the API / multi-indexers behave slightly differently across versions, e.g. with regard to partial indexing.
    I need to look into this more. Also, its semantics depend on the index values.

  3. If you want to be really safe, get your ndarrays out of the DataFrame and carry on manually (see the sketch below).
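
As a hedged sketch of points 1-3 (the hasattr-based fallback and the manual ndarray route; select_rows and finite_values are made-up names for illustration):

import numpy as np


def select_rows(df, labels):
    # prefer .loc where it exists (pandas >= 0.11), otherwise fall back to .ix
    indexer = df.loc if hasattr(df, 'loc') else df.ix
    return indexer[labels]


def finite_values(df, column):
    # the "really safe" route: pull out the ndarray and mask nans by hand
    values = np.asarray(df[column].values, dtype=float)
    return values[~np.isnan(values)]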

Our release notes are quite extensive. While this doesn't tell you what hasn't changed, it's a useful starting point.

Definitely, we should include a related pointer in a forthcoming doc.

@jtratner
Contributor

Would Debian take things up faster if they were backported? We use new version numbers when we make major changes to the public API.

@dengemann
Contributor Author

Would Debian take things up faster if they were backported?

@jtratner good question. --> ping @yarikoptic.

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann you bring up some good points, but the very fact that ix had some issues led to extensive discussions and the introduction of loc/iloc to make things easier. Even now, ix has issues. Here's one example: in 0.13, all of the indexers change to behave sanely when working on a floating-point index. Useful for some people.

At the same time as we introduce new features and squash bugs, I personally (and, I know, all of the other core devs) go to great lengths to provide backward compatibility and notice of changes in the API.

Here's another example: @jtratner and I worked extensively on making sure that prior-version pickles will work in 0.13, even though Series is no longer an ndarray subclass.

Another example: @cpcloud harassed the numexpr/PyTables people about a bug in their newly released version (numexpr 2.2.1) that completely broke PyTables 2.4! (which they acknowledged, and they are now issuing a new version).

@dengemann
Contributor Author

@jreback thanks, this is all great and I'm with you. I'm also aware of the efforts taken to maintain and stabilize functionality while promoting development. I think you all have done an awesome job with this; that's beyond question. But the task is highly non-trivial too, and naturally issues remain. My point was rather to start a discussion on how to add some documentation for long-term-support use cases and to share a few experiences. I'm happy to make a DOC PR once all points are settled. I'd especially like to produce some more concrete reports + recommendations on dealing with ix, an issue which I have not fully grasped at this point. There will be more news from my side, since I'm currently developing a new project for which I chose pandas as the backend.

@yarikoptic
Contributor

re Debian "taking it faster": I guess there is a bit of misunderstanding. Let me describe the life cycle of pandas in Debian:

  • Just a few days (or right on the day) after a new pandas release, I upload it to Debian unstable (or to experimental if Debian is in freeze or unstable is lacking some core dependency, which IIRC hasn't happened before). I think uploading even "faster" than that is not really needed ;)
  • Then, at least 10 days after the upload to unstable, if pandas builds fine across all architectures and no critical bugs are reported, it migrates to testing. ATM I would still need to address issues on armel and s390x (https://buildd.debian.org/status/package.php?p=pandas) to make it migrate to testing.
  • The version in testing becomes a candidate for the NEXT Debian stable and a candidate for a backports repository for the CURRENT Debian stable.
  • When a stable Debian releases, the version of pandas in it is THE version for that Debian stable, and only security and critical functionality patches are allowed afterwards. There is no way to push a new release of pandas into Debian stable.

But also, besides official Debian: as soon as I upload to Debian unstable, I upload backport builds of pandas for EVERY compatible Debian and Ubuntu release to NeuroDebian; see http://neuro.debian.net/pkgs/python-pandas.html. So technically speaking, a fresh pandas release is nearly always available only a few days later for every recent Debian and Ubuntu release, while the official 'stable' releases of those distributions come with the 'matured' versions.

To help with the migration of pandas from unstable to testing, which is currently usually stalled by problems with building/testing on various less common architectures, I have set up a buildbot on my sparc box for pandas: http://nipy.bic.berkeley.edu/waterfall?category=pandas. As you can see, it still needs some attention, and I believe there are a few outstanding issues here in the tracker to address failing tests on sparc.

@jreback
Contributor

jreback commented Sep 29, 2013

@dengemann

fair enough

@jtratner
Contributor

@yarikoptic that's what I thought, thanks for making that explicit :) That said, sparc issues are definitely frustrating.

@yarikoptic
Contributor

What could actually help is if there were at least some releases of pandas maintained for critical bug fixes, in addition to the bleeding-edge new-functionality releases. E.g. if there were a 0.12.x branch on top of the 0.12.0 release that absorbed all applicable critical fixes, leading to 0.12.1, etc. releases. Then, foreseeing an upcoming Debian stable release, I might have preferred to make sure it shipped this "stable" release rather than the more featureful 0.13.0.

@jtratner
Contributor

I guess the problem is that maintaining a separate stable branch requires quite a bit of time (as does trying to get things to work on sparc).

@mroeschke
Member

I think this is largely handled by the new policies doc.

https://github.com/pandas-dev/pandas/blob/master/doc/source/development/policies.rst

If there is anything not sufficiently described in that document, a new issue can be opened to clarify those points.
