Skip to content

ENG/BUG: Changed NDFrame.shift to work with arbitrary axis/ndims. #4874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dalejung
Copy link
Contributor

closes #4867. First part of changing Panel.shift is to make NDFrame.shift work with any axis and number of dimensions.

Until Panel uses it, the only case of this is DataFrame.shift(axis='columns') which is added as a test.

@jreback
Copy link
Contributor

jreback commented Sep 18, 2013

looks good...can you add a test matrix, e.g. cycle thru all axes for each object, then test if it works (ideally also with different types of axes, e.g. string, datetime, period, integer)?

you can use the test harness in tests/test_generic.py (just leave the exisitng tests where they are), but need a comprehensive test ....

@dalejung
Copy link
Contributor Author

@jreback So was thinking about the test matrix, was thinking something like this:

targets = {}
targets['dataframe'] = tm.makeDataFrame
targets['time_dataframe'] = tm.makeTimeDataFrame

class ShiftBattery(object):
    def test_shift(self):
        obj = self._construct()
        obj.shift(1)

    def test_shift_negative(self):
        obj = self._construct()
        obj.shift(-3)

def setup_battery(targets, battery):
    battery_name = battery.__name__
    # create a unittest.TestCase subclass for each target
    for target, maker in targets.items():
        cls_name = "Test" + battery_name + '_' + target
        cls = makeClass(cls_name, battery, maker)
        globals()[cls_name] = cls

def makeClass(cls_name, battery, maker):
    cls = type(cls_name, (unittest.TestCase, battery), {})
    cls._construct = lambda self: maker()
    return cls

setup_battery(targets, ShiftBattery)

The idea being that we setup a bunch of targets with different configurations (dtypes, multiindex, etc) and then run them through a battery of tests that should be common to all of them.

The output would look like:

test_shift (__main__.TestShiftBattery_dataframe) ... ok
test_shift_negative (__main__.TestShiftBattery_dataframe) ... ok
test_shift (__main__.TestShiftBattery_time_dataframe) ... ok
test_shift_negative (__main__.TestShiftBattery_time_dataframe) ... ok

yay nay?

edit: modified code

@jreback
Copy link
Contributor

jreback commented Sep 19, 2013

that's fancy! you can also just do it in a function, where you run thru the index types, something like this: https://github.com/pydata/pandas/pull/4850/files#L12R1557

but what you are doing is fine too (as you create separate tests so easy to see where stuff fails)

@jtratner
Copy link
Contributor

Constructing test cases that dynamically makes the tests harder to follow
and harder to maintain. It's really important that test cases should be
simple to trace. If you can't just use a few for loops with arguments (or
helper functions), you could try is creating a baseclass like ShiftBase,
and then subclassing it with different setups for Series, Panel, etc. (or
whatever you're looking at).

@dalejung
Copy link
Contributor Author

I was apprehensive about the indirection too, but I couldn't get test generators/for loops to output enough useful info for each sub-test. Take this example:

ind = pd.date_range(start="2000", freq="D", periods=10)
targets = {}
targets['int_series'] = lambda : pd.Series(range(10))
targets['bool_series'] = lambda : pd.Series(np.random.randn(10) > 0, index=ind)
targets['float_series'] = lambda : pd.Series(np.random.randn(10))
targets['string_series'] = lambda : pd.Series(list('asdfqwerzx'), index=ind)

class ShiftBattery(object):
    def test_check(self):
        obj = self._construct()
        if obj.is_time_series:
            assert False, "Don't support time series"

setup_battery(targets, ShiftBattery)

class TestFor(unittest.TestCase):
    def test_for(self):
        def check(obj):
            if obj.is_time_series:
                assert False, "Don't support time series"

        for obj in targets.values():
            check(obj())

class TestGenerator(object):
    def test_for(self):
        def check(obj):
            if obj.is_time_series:
                assert False, "Don't support time series"

        for obj in targets.values():
            yield check, obj()

For loop:

======================================================================
FAIL: test_for (pandas.tests.test_examples.TestFor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 34, in test_for
    check(obj())
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 31, in check
    assert False, "Don't support time series"
AssertionError: Don't support time series

Generator

======================================================================
FAIL: pandas.tests.test_examples.TestGenerator.test_for(2000-09-19    a
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/nose/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 40, in check
    assert False, "Don't support time series"
AssertionError: Don't support time series

======================================================================
FAIL: pandas.tests.test_examples.TestGenerator.test_for(2000-09-19    False
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/nose/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 40, in check
    assert False, "Don't support time series"
AssertionError: Don't support time series

Battery:

======================================================================
FAIL: test_check (pandas.tests.test_examples.TestShiftBattery_bool_series)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 23, in test_check
    assert False, "Don't support time series"
AssertionError: Don't support time series

======================================================================
FAIL: test_check (pandas.tests.test_examples.TestShiftBattery_string_series)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/datacaliber/Dropbox/Projects/trading/python/externals/pandas/pandas/tests/test_examples.py", line 23, in test_check
    assert False, "Don't support time series"
AssertionError: Don't support time series

----------------------------------------------------------------------

The Battery test is the only one that runs all tests and outputs meaningful nose output. Plus, the Battery doesn't have to know about the targets and can remain simple.

@jtratner
Copy link
Contributor

I would prefer not to inject into globals, when you can use regular inheritance to do the same exact thing.

class ShiftBattery(object):
    def test_shift(self):
        self.obj.shift(1)

    def test_shift_negative(self):
        self.obj.shift(-3)

class TestShiftDataFrame(ShiftBattery, unittest.TestCase):
    def setUp(self):
        self.obj = DataFrame(range(10))

class TestShiftPanel(ShiftBattery, unittest.TestCase):
    def setUp(self):
        self.obj = tm.makePanel()

Sometimes it's nice to make it clear that something works the same way across all objects:

class TestShift(unittest.TestCase):
    def setUp(self):
        self.objects = [tm.makePanel(), tm.makeCustomDataFrame(5, 5)]
    def test_shift1(self):
        for obj in self.objects:
           obj.shift(1)

@jtratner
Copy link
Contributor

As an added bonus, if later you realize that there's an edge case for Panel or Series or something, you can add a test just to that class.

@jtratner
Copy link
Contributor

Alternatively, you can create a helper method for everything

class TestShift(unittest.TestCase):
    def _check_int_shift(self, obj):
        self.obj.shift(1)
        self.obj.shift(-3)
    def test_int_series(self):
        self._check_int_shift(self, int_series)
    def test_bool_series(self):
        self._check_bool_shift(self, bool_series)

@dalejung
Copy link
Contributor Author

@jreback can you take a look at https://gist.github.com/dalejung/6705707. It's just a small util I've been using when generating tests, but I'm not sure if it should go into pandas and if so where.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2013

@dalejung

you could put that in core/indexing.py.

that idiom is used in lots of locations in the library....

if you want to split this off, and then search and replace where something like this is used would be great

here's some places its used now (might be more):

./pandas/core/indexing.py:
  864 :         tup = [slice(None, None) for _ in range(self.ndim)]
./pandas/core/internals.py:
 2298 :         slicer = [slice(None, None) for _ in range(self.ndim)]
 2403 :         slicer = [slice(None, None) for _ in range(self.ndim)]
./pandas/io/pytables.py:
 3793 :         slicer = [ slice(None, None) ] * obj.ndim

also in pandas/tests/testing_index.py, called _axify

@dalejung
Copy link
Contributor Author

cool cool. Is slicer the correct term for this? I couldn't figure out what to call a tuple of slices. I like slicer.

On Sep 25, 2013, at 4:51 PM, jreback [email protected] wrote:

@dalejung

you could put that in core/indexing.py.

that idiom is used in lots of locations in the library....

if you want to split this off, and then search and replace where something like this is used would be great

here's some places its used now (might be more):

./pandas/core/indexing.py:
864 : tup = [slice(None, None) for _ in range(self.ndim)]
./pandas/core/internals.py:
2298 : slicer = [slice(None, None) for _ in range(self.ndim)]
2403 : slicer = [slice(None, None) for _ in range(self.ndim)]
./pandas/io/pytables.py:
3793 : slicer = [ slice(None, None) ] * obj.ndim

Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2013

_axis_slicer? or _slicer ok...

you may want to change the args a bit as sometimes I just pass in an indexer directly for the axis in question

(this is an internal function, so can prob just add a kw, or maybe make it the first arg)

e.g.

def _slicer(indexer=None, start=None, stop=None, axis=None)

@jtratner
Copy link
Contributor

You should change _missing to Ellipsis. Avoids an additional global.
On Sep 25, 2013 5:13 PM, "jreback" [email protected] wrote:

_axis_slicer? or _slicer ok...

you may want to change the args a bit as sometimes I just pass in an
indexer directly for the axis in question

(this is an internal function, so can prob just add a kw, or maybe make it
the first arg)

e.g.

def _slicer(indexer=None, start=None, stop=None, axis=None)


Reply to this email directly or view it on GitHubhttps://github.com//pull/4874#issuecomment-25125266
.

@jtratner
Copy link
Contributor

Also, are you sure that works under Python 3? Some comparisons were
disallowed in Py3

@dalejung
Copy link
Contributor Author

Wouldn't using Ellipsis as a missing sentinel be confusing considering its only real use is in numerical python?

@jtratner
Copy link
Contributor

@dalejung I think it will produce the same thing if you set stop to None (because you no longer need to test for start <= 0) and if you pass start > 0, then you want to be slicing [:3] so you swap. Is there ever a time where None would not have that meaning?

def slicer(start, stop=None, axis=None):
    assert axis is not None, "Must pass an axis to slicer"
    assert isinstance(axis, int), "Must pass an integer axis to slicer"
    if isinstance(start, slice):
       key = start
   else:
       if stop is None and start > 0:
           stop, start = start, stop
       key = slice(start, stop)
   slices = [slice(None) for x in range(axis+1)]
   slices[axis] = key
   return tuple(slices)

@jreback
Copy link
Contributor

jreback commented Sep 25, 2013

@jtratner this is a purleey internal method simply to make easy constructing multiple axis slices (rather than 'manually' doing this in places)

@dalejung
Copy link
Contributor Author

@jtratner the original behaved a bit differently,

slicer(3, axis=0) => (slice(None, 3, None,)
slicer(3, None, axis=0) => (slice(3, None, None),) # explict None doesn't flip.

I'm not really tied to that behavior. Going back to missing sentinels, are we standardized on Ellipsis? AFAIK, its use is predicated on no one using it outside of numerical python. Wouldn't it be confusing with pandas, especially within an indexing related file.

@cpcloud
Copy link
Member

cpcloud commented Sep 25, 2013

I'm weakly -1 on using Ellipsis. What's wrong with None?

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

@dalejung you saw my comment above? this has to be able to take an indexer (that doesn't happen to be a slice)
(

@dalejung
Copy link
Contributor Author

@jreback yup.

@cpcloud Using a missing sentinel is for when func(x) is semantically different than func(x, None). Basically when None is itself a valid argument.

@cpcloud
Copy link
Member

cpcloud commented Sep 26, 2013

@dalejung Yep .. i see .. hm NotImplemented could be a possibility

@dalejung
Copy link
Contributor Author

@cpcloud well, it's not meant for unimplemented paths. It's for setting default behaviors.

def func(x=None):
    # if x is missing, default to 10
    if x is None:
        x = 10
    ...

The problem with the above setup is that you can't tell the difference between func() and func(None) which is a problem if None is valid for x. In the above, you can never pass in None and have it pass the check. In most cases, a regular if x is None: check is fine because None is not valid and is itself the missing sentinel.

def func(x=_missing):
    if x is _missing:
        x = 10

handles both None and allows default missing behavior. _missing is basically any object that's guaranteed to never be passed in; object(), Ellipsis, etc.

@dalejung dalejung mentioned this pull request Sep 27, 2013
@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

@dalejung how's this coming? waiting for axis slicer, IIRC?

@jreback
Copy link
Contributor

jreback commented Oct 11, 2013

@dalejung hows this coming along?

@jreback jreback closed this Jan 3, 2014
@jreback jreback reopened this Jan 3, 2014
@jreback
Copy link
Contributor

jreback commented Jan 3, 2014

@dalejung want to rebase this?

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@dalejung this should be pretty easy to do now (the underlying NDFrame.shift is fixed to handle axes better)....

pls rebase

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@dalejung still working on this?

@dalejung
Copy link
Contributor Author

dalejung commented Mar 9, 2014

@jreback Actually, the underlying NDFrame.shift change makes this PR redundant.

@dalejung dalejung closed this Mar 9, 2014
@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

hmm

I think the intent was to integrate Panel.shift
which has special handling ATM

@dalejung
Copy link
Contributor Author

@jreback Not you're right. I had split up the Panel.shift into two PRs since they addressed separate tangential issues. #6605 is the panel implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generalize shift infrastructure.
4 participants