CLN: refactor Series.reindex to core/generic #4610

jreback · 2013-08-19T22:16:17Z

Incorrect results

In [1]: s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
In [2]: s.reindex(['a', 'g', 'c', 'f'], method='ffill')
Out[2]: 
a     1
g     5
c   NaN
f     5
dtype: float64

Fixed (and happens to automatically cast as well back to the original dtype)
(this was broken in 0.12)

In [2]: s.reindex(['a', 'g', 'c', 'f'], method='ffill')
Out[2]: 
a    1
g    1
c    3
f    3
dtype: int64

This does NOT automatically downcast here as this is a major
performance bottleneck if we did (this is the same result as in 0.12)

In [3]: >>> s.reindex(['a', 'g', 'c', 'f']).ffill()
Out[3]: 
a    1
g    1
c    3
f    3
dtype: float64

You can force downcasting by passing 'infer' if you want
(this option was not available in 0.12 for Series)

In [4]: >>> s.reindex(['a', 'g', 'c', 'f']).ffill(downcast='infer')
Out[4]: 
a    1
g    1
c    3
f    3
dtype: int64

TST: GH4604, reindexing with a method of 'ffill' gives incorrect results

BUG/CLN: (GH4604) Refactor Series.reindex to core/generic.py allow method= in reindexing
on a Series to work

API/CLN: GH4604 Infer and downcast dtype if appropriate on ffill/bfill
this is for consistency when doing: df.reindex().ffill() and df.reindex(method='ffill')

CLN: allow backfill/pad/interpolate to operate on integers (by float conversion)
provide downcasting back to original dtype where needed core.internals.interpolate

ENH: provide core.index.identical method to compare values and attributes similar to .equals

API: changed back to pre-GH3482 where a reindex with no args will by default copy

jreback · 2013-08-19T22:19:58Z

cc @jtratner

I had to create a core.index.identical method (similar to .equals) but also compares attributes (e.g. name)

we don't have anything like this, right?

do you know python calls for x is y test? (aside from __eq__ or is that it?

cpcloud · 2013-08-19T22:22:56Z

There's an issue for that isn't there? Comparing indexes including meta data

jreback · 2013-08-19T22:26:01Z

prob is
can u link if there is (what I did prob wont solve it)

cpcloud · 2013-08-19T22:44:30Z

could only find #3317, with a just passing mention....should i open a new issue for it? really just a decision....i remember preserving name broke a few tests. can't remember if it was a bunch or just a few though...i might have it in a branch

jtratner · 2013-08-19T22:57:16Z

x is y tests for object identity, so it won't call any methods on the object. It's not (and shouldn't be) guaranteed that Index will remain the same object after transformations, even though it doesn't require copying any of the index attributes.

Example of is checks:

lst = [1, 2, 3]
lst2 = list(lst)
assert not (lst is lst2)

You could potentially add a np.may_share_memory check (but only for non-MultiIndex).

I'm not totally clear on what you're looking for here. I don't think it's a good idea to have identical test for object identity (though, granted, there is some overhead with views of MultiIndex, but not much [approx space complexity O(n+m) where n and m are length of the set of levels and length of the set of labels]).

jtratner · 2013-08-19T23:03:40Z

I'm wrong on space complexity of MI view - it's whatever space python needs to make a list pointing to all the lists of levels and a list pointing to all the lists of labels. (so a MI with 2 sets of levels and 2 lists of labels costs two lists of length 2 to shallow copy) very minimal.

jreback · 2013-08-19T23:05:11Z

@jtratner

its just a very simple issue

if you are reindexing with an index that is identical (and copy is not set), then just return the object

but this HAS to account for an equals index, but with a different name for example
(which relates to the semi-mutable index)

BUG/CLN: (GH4604) Refactor Series.reindex to core/generic.py allow method= in reindexing on a Series to work API/CLN: GH4604 Infer and downcast dtype if appropriate on ffill/bfill this is for consistency when doing: df.reindex().ffill() and df.reindex(method='ffill') CLN: allow backfill/pad/interpolate to operate on integers (by float conversion) provide downcasting back to original dtype where needed core.internals.interpolate ENH: provide core.index.identical method to compare values and attributes similar to .equals API: changed back to pre-GH3482 where a reindex with no args will by default copy

jreback · 2013-08-19T23:12:16Z

makes sense?

In [6]: index = pd.Index(list(range(5)))

In [7]: index2 = pd.Index(list(range(5))).copy(name='foo')

In [8]: index.equals(index)
Out[8]: True

In [9]: index.identical(index)
Out[9]: True

In [10]: index.identical(index2)
Out[10]: False

In [11]: index.equals(index2)
Out[11]: True

In [12]: index.identical(index2)
Out[12]: False

In [13]: df = DataFrame(randn(5,2),index=index)

In [17]: df.reindex(index2).index.identical(index2)
Out[17]: True

In [18]: df.reindex(index2).index.equals(index2)
Out[18]: True

In [20]: df.reindex(index2).index.identical(index)
Out[20]: False

In [21]: df.reindex(index2).index.equals(index)
Out[21]: True

cpcloud · 2013-08-19T23:13:23Z

Does identical also test the freq attribute of the appropriate indexes?

jreback · 2013-08-19T23:14:27Z

yep freq/tz on datetime index, freq on period

jreback · 2013-08-19T23:14:58Z

on MI, levels are already tested (and I think labels) via equals, so it just tests name,

cpcloud · 2013-08-19T23:15:35Z

👍

jreback · 2013-08-19T23:15:56Z

what if identical replaces __eq__ ?

so all current code will work, but then true equality is tested by index == index2 or is that confusing?

cpcloud · 2013-08-19T23:17:34Z

that will break tests... plus that might be kind of a nasty surprise to people who don't religiously keep track of tz and freq

jreback · 2013-08-19T23:18:50Z

fair enough...

cpcloud · 2013-08-19T23:20:27Z

for the record i would like that, but too much back-incompat. would actually fix #3317 i believe

jtratner · 2013-08-19T23:40:15Z

Oh okay, I wasn't totally clear on your question. Looks good :)

jreback · 2013-08-20T20:17:34Z

@jtratner @cpcloud pls give this a once over..thxs

cpcloud · 2013-08-20T20:25:23Z

pandas/core/index.py

+        Similar to equals, but check that other comparable attributes are also equal
+        """
+        return self.equals(other) and all(
+            [ getattr(self,c,None) == getattr(other,c,None) for c in self._comparables ])


minor point: if the attr checks are moved to the left hand side and u use a generator in all best case will be faster. totally minor though bc len(_comparables) == 1! u will return False faster than having to check the whole array if attrs are in fact not equal

I changed that...but for the most part the name attributes are usually None IME

jtratner · 2013-08-20T23:54:37Z

Will look over in a minute.

jtratner · 2013-08-21T00:35:23Z

@jreback this doesn't quite work for MultiIndex yet. Is it supposed to?

E.g.

mi = pd.util.testing.makeCustomDataframe(5, 5).stack().index
assert mi.identical(mi)
other = mi.from_tuples(tuple(mi))
assert not mi.identical(other), "Failed - other does not have same names"

On a related note - for some reason, MultIIndex appears to have a name attribute that returns None, which shadows the issue. (i.e., continuing previous)

In [16]: print(mi.name)
None

In [17]: print(mi.names)
FrozenList([u'R0', u'C0'])

In [18]: print(other.name)
None

In [19]: print(other.names)
FrozenList([None, None])

jtratner · 2013-08-21T00:37:41Z

Ah, it's because Index sets name=None explicitly. Is it worth adding a property to MultiIndex to mi.name fail?

jreback · 2013-08-21T00:41:16Z

I think there is prob some compatibility going on here

iow you need to know whether an object is an index or mi and then use name or names

jreback · 2013-08-21T01:02:58Z

@jtratner side issue - rename in index is broken for MultiIndex, and need some tests...thanks

jtratner · 2013-08-21T01:23:40Z

@jreback the real problem is that rename and name should not exist for MultiIndex (they only make sense for Index objects).

jtratner · 2013-08-21T01:24:43Z

That said, why can't you just set _comparables = ['names']?

jreback · 2013-08-21T01:24:52Z

rename does makes sense for MI, e.g.

mi.set_names(['level_0','level_1'])

?

jreback · 2013-08-21T01:26:31Z

@jtratner oh I did do that, I was trying to use rename in some tests; it breaks for MI....that's all (you can just use set_names, but the arguments are only built for a single name), and the rename for an Index should be able to take a list of a scalar (e.g. i.rename('foo') doesn't work, but i.rename(['foo']) does...(I know its a convenience function)....but ...

jtratner · 2013-08-21T01:27:50Z

@jreback You're right, that's definitely broken.

jtratner · 2013-08-21T01:28:40Z

@jreback I'll put up a fix tomorrow. Sorry about that!

jreback · 2013-08-21T01:29:49Z

no...thank you for testing.....when I find something breaking...I realized that I didn't test it!

jtratner · 2013-08-21T01:30:15Z

@jreback however, your example doesn't actually fail:

In [1]: from pandas import *
ind
In [2]: ind = Index(range(10))

In [3]: ind.name

In [4]: ind.rename('apple')
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [5]: ind = _

In [6]: ind.name
Out[6]: 'apple'

jtratner · 2013-08-21T01:31:32Z

The issue is that rename and name should only exist on Index (where they can only be used if you actually have an Index). Otherwise, you should use set_names and names. Not sure if there are other examples for "1d-ish" index vs. MultiIndex.

jtratner · 2013-08-21T01:33:04Z

And it works for MultiIndex too...can you show what you mean by failing?

In [8]: import pandas as pd

In [9]: mi = pd.util.testing.makeCustomDataframe(5, 5).stack().index

In [10]: mi
Out[10]:
MultiIndex
[(u'R_l0_g0', u'C_l0_g0'), (u'R_l0_g0', u'C_l0_g1'), (u'R_l0_g0', u'C_l0_g2'), (u'R_l0_g0', u'C_l0_g3'), (u'R_l0_g0', u'C_l0_g4'), (u'R_l0_g1', u'C_l0_g0'), (u'R_l0_g1', u'C_l0_g1'), (u'R_l0_g1', u'C_l0_g2'), (u'R_l0_g1', u'C_l0_g3'), (u'R_l0_g1', u'C_l0_g4'), (u'R_l0_g2', u'C_l0_g0'), (u'R_l0_g2', u'C_l0_g1'), (u'R_l0_g2', u'C_l0_g2'), (u'R_l0_g2', u'C_l0_g3'), (u'R_l0_g2', u'C_l0_g4'), (u'R_l0_g3', u'C_l0_g0'), (u'R_l0_g3', u'C_l0_g1'), (u'R_l0_g3', u'C_l0_g2'), (u'R_l0_g3', u'C_l0_g3'), (u'R_l0_g3', u'C_l0_g4'), (u'R_l0_g4', u'C_l0_g0'), (u'R_l0_g4', u'C_l0_g1'), (u'R_l0_g4', u'C_l0_g2'), (u'R_l0_g4', u'C_l0_g3'), (u'R_l0_g4', u'C_l0_g4')]

In [11]: mi.set_names(["foo", "bar"])
Out[11]:
MultiIndex
[(u'R_l0_g0', u'C_l0_g0'), (u'R_l0_g0', u'C_l0_g1'), (u'R_l0_g0', u'C_l0_g2'), (u'R_l0_g0', u'C_l0_g3'), (u'R_l0_g0', u'C_l0_g4'), (u'R_l0_g1', u'C_l0_g0'), (u'R_l0_g1', u'C_l0_g1'), (u'R_l0_g1', u'C_l0_g2'), (u'R_l0_g1', u'C_l0_g3'), (u'R_l0_g1', u'C_l0_g4'), (u'R_l0_g2', u'C_l0_g0'), (u'R_l0_g2', u'C_l0_g1'), (u'R_l0_g2', u'C_l0_g2'), (u'R_l0_g2', u'C_l0_g3'), (u'R_l0_g2', u'C_l0_g4'), (u'R_l0_g3', u'C_l0_g0'), (u'R_l0_g3', u'C_l0_g1'), (u'R_l0_g3', u'C_l0_g2'), (u'R_l0_g3', u'C_l0_g3'), (u'R_l0_g3', u'C_l0_g4'), (u'R_l0_g4', u'C_l0_g0'), (u'R_l0_g4', u'C_l0_g1'), (u'R_l0_g4', u'C_l0_g2'), (u'R_l0_g4', u'C_l0_g3'), (u'R_l0_g4', u'C_l0_g4')]

In [12]: _.names
Out[12]: FrozenList([u'foo', u'bar'])

jreback · 2013-08-21T01:34:32Z

ahh...sorry...what I meant was that rename fails for multi-index (which is a nice name actually); I know it just calls set_names....but rename is cooler, that's why I was using it

These all should yield the same (I know I am abusing the input), but I am user of your functions!

In [1]: i = Index(np.arange(10))

In [2]: i.set_names(['foo'])
Out[2]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [3]: i.set_names(['foo']).name
Out[3]: 'foo'

In [4]: i.set_names('foo').name
ValueError: Length of new names must be 1, got 3

In [5]: i.rename('foo').name
Out[5]: 'foo'

In [6]: i.rename(['foo']).name
Out[6]: ['foo']

jreback · 2013-08-21T01:37:08Z

rename is just conviennce for set_names?

In [12]: mi.set_names(['foo','bar']).names
Out[12]: FrozenList([u'foo', u'bar'])

In [13]: mi.rename(['foo','bar']).names
ValueError: Length of names (1) must be same as level (2)

jtratner · 2013-08-21T01:37:52Z

@jreback okay, well, trivial to enable rename for MultiIndex. I'm not sure I agree with your examples - especially because they lead to really strict type-sniffing.

rename is basically just lambda self, x: self.set_names([x])

jreback · 2013-08-21T01:38:42Z

@jtratner is rename different than set_names? (or is it supposed to be)?

cpcloud · 2013-08-21T01:39:31Z

R users are probably familiar with rename via the plyr package so I think we should keep it.

jtratner · 2013-08-21T01:39:54Z

@jreback It would be for Index (since set_names always expects some kind of iterable).

jtratner · 2013-08-21T01:40:24Z

@cpcloud heh, I introduced rename a few PRs ago so I'm not sure that it's compliant, would be good to check that though.

jtratner · 2013-08-21T01:41:57Z

@cpcloud uh, I think you're thinking of dataframe/series here. Having a plyr-like rename would be kinda weird on Index.

cpcloud · 2013-08-21T01:42:25Z

yep ur right, carry on good sirs

jreback · 2013-08-21T01:42:41Z

Thing is the set_* makes sense for the MI, while rename for the name attributes, but set_names is a natural extension, ok...myabe just make them exactly the same then (with or with/out iterable is ok by me)

what is plyr rename do? (is it not the obvious thing?)....like (attach)??

cpcloud · 2013-08-21T01:43:39Z

@jreback ha! no....it's like DataFrame.rename. I got mixed up for a sec. You guys are talking about rename on Index objects, I was thinking about the frame version.

jreback · 2013-08-21T01:45:10Z

@cpcloud oh...that's not so obvious

cpcloud · 2013-08-21T01:47:19Z

here's what rename does

R > df <- data.frame(x=rnorm(10), y=rnorm(10))
R > df
            x          y
 1:  1.196889  0.8105994
 2: -0.185365 -0.7262135
 3: -0.613755  1.0857286
 4: -1.459379 -0.4459683
 5:  0.348874  0.3673212
 6:  0.401074 -0.0862284
 7:  0.398706  0.9814990
 8: -0.935906  0.7453723
 9:  0.243090  0.6259889
10: -0.428978 -0.8742873

R > rename(df, c(x="a", y="b"))
            a          b
 1:  1.196889  0.8105994
 2: -0.185365 -0.7262135
 3: -0.613755  1.0857286
 4: -1.459379 -0.4459683
 5:  0.348874  0.3673212
 6:  0.401074 -0.0862284
 7:  0.398706  0.9814990
 8: -0.935906  0.7453723
 9:  0.243090  0.6259889
10: -0.428978 -0.8742873

but that's OT

jtratner · 2013-08-21T01:48:21Z

@jreback you can set names on index too:

ind.names = ['apple']
ind.name # 'apple'

I'd say the behavior that makes sense is something like:

Index.rename : takes single object
MultiIndex.set_names, Index.set_names - same type signature: take list of names [Index.set_names can only take iterable of length 1)
MultiIndex.rename - synonym for set_names

So if you want to do something generically, you should use set_names (just like you would with other attributes). If you expect a non-MI index, then you could use rename with a string, etc.

jtratner · 2013-08-21T01:55:51Z

@jreback well, I'll alias MultiIndex.rename to MultiIndex.set_names and add some test cases for rename, set_names, set_levels, and set_labels.

jreback · 2013-08-21T02:00:43Z

@jtratner that all looks good

…x if freq is not specified during _ensure_index PERF: remove automatic downcasting, instead do on-demand via 'downcast=infer' TST: fix up failing tests CLN: reindex/fill in internals moved to core/internals/Block/reindex_items_from PERF: don't automatically downcast with a float block BUG: GH3317 reverse prior fix in tseries/offsets, to change slightly the multi reindex TST: identical index tests BUG: GH4618 provide automatic downcasting on a reindexed with method Series (in this case a shifted boolean then filled series)

CLN: refactor Series.reindex to core/generic

jtratner · 2013-08-21T22:57:48Z

@jreback to be clear - the only 'bug' with rename now is that MultiIndex should set rename to set_names, right?

jreback · 2013-08-21T23:20:06Z

I think that is right

jreback mentioned this pull request Aug 19, 2013

CLN: Post Series subclass from NDFrame #4324

Closed

18 tasks

cpcloud reviewed Aug 20, 2013
View reviewed changes

jreback mentioned this pull request Aug 21, 2013

BUG: shift() on Series of dtype bool makes dtype into object, so HDFStore won't append #4618

Closed

jreback added a commit that referenced this pull request Aug 21, 2013

Merge pull request #4610 from jreback/series_reindex

ecca28d

CLN: refactor Series.reindex to core/generic

jreback merged commit ecca28d into pandas-dev:master Aug 21, 2013

jreback mentioned this pull request Aug 21, 2013

BUG: reindex by both index and columns loses DatetimeIndex frequency #3317

Closed

CLN: refactor Series.reindex to core/generic #4610

CLN: refactor Series.reindex to core/generic #4610

Conversation

jreback commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jtratner commented Aug 19, 2013

jtratner commented Aug 19, 2013

jreback commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jreback commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jreback commented Aug 19, 2013

cpcloud commented Aug 19, 2013

jtratner commented Aug 19, 2013

jreback commented Aug 20, 2013

cpcloud Aug 20, 2013

Choose a reason for hiding this comment

jreback Aug 20, 2013

Choose a reason for hiding this comment

jtratner commented Aug 20, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

cpcloud commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

cpcloud commented Aug 21, 2013

jreback commented Aug 21, 2013

cpcloud commented Aug 21, 2013

jreback commented Aug 21, 2013

cpcloud commented Aug 21, 2013

jtratner commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013

jtratner commented Aug 21, 2013

jreback commented Aug 21, 2013