Skip to content

CLN: refactor Series.reindex to core/generic #4610

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 21, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Aug 19, 2013

closes #4604, #3317, #4618

Incorrect results

In [1]: s = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
In [2]: s.reindex(['a', 'g', 'c', 'f'], method='ffill')
Out[2]: 
a     1
g     5
c   NaN
f     5
dtype: float64

Fixed (and happens to automatically cast as well back to the original dtype)
(this was broken in 0.12)

In [2]: s.reindex(['a', 'g', 'c', 'f'], method='ffill')
Out[2]: 
a    1
g    1
c    3
f    3
dtype: int64

This does NOT automatically downcast here as this is a major
performance bottleneck if we did (this is the same result as in 0.12)

In [3]: >>> s.reindex(['a', 'g', 'c', 'f']).ffill()
Out[3]: 
a    1
g    1
c    3
f    3
dtype: float64

You can force downcasting by passing 'infer' if you want
(this option was not available in 0.12 for Series)

In [4]: >>> s.reindex(['a', 'g', 'c', 'f']).ffill(downcast='infer')
Out[4]: 
a    1
g    1
c    3
f    3
dtype: int64

TST: GH4604, reindexing with a method of 'ffill' gives incorrect results

BUG/CLN: (GH4604) Refactor Series.reindex to core/generic.py allow method= in reindexing
on a Series to work

API/CLN: GH4604 Infer and downcast dtype if appropriate on ffill/bfill
this is for consistency when doing: df.reindex().ffill() and df.reindex(method='ffill')

CLN: allow backfill/pad/interpolate to operate on integers (by float conversion)
provide downcasting back to original dtype where needed core.internals.interpolate

ENH: provide core.index.identical method to compare values and attributes similar to .equals

API: changed back to pre-GH3482 where a reindex with no args will by default copy

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

cc @jtratner

I had to create a core.index.identical method (similar to .equals) but also compares attributes (e.g. name)

we don't have anything like this, right?

do you know python calls for x is y test? (aside from __eq__ or is that it?

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

There's an issue for that isn't there? Comparing indexes including meta data

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

prob is
can u link if there is (what I did prob wont solve it)

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

could only find #3317, with a just passing mention....should i open a new issue for it? really just a decision....i remember preserving name broke a few tests. can't remember if it was a bunch or just a few though...i might have it in a branch

@jtratner
Copy link
Contributor

x is y tests for object identity, so it won't call any methods on the object. It's not (and shouldn't be) guaranteed that Index will remain the same object after transformations, even though it doesn't require copying any of the index attributes.

Example of is checks:

lst = [1, 2, 3]
lst2 = list(lst)
assert not (lst is lst2)

You could potentially add a np.may_share_memory check (but only for non-MultiIndex).

I'm not totally clear on what you're looking for here. I don't think it's a good idea to have identical test for object identity (though, granted, there is some overhead with views of MultiIndex, but not much [approx space complexity O(n+m) where n and m are length of the set of levels and length of the set of labels]).

@jtratner
Copy link
Contributor

I'm wrong on space complexity of MI view - it's whatever space python needs to make a list pointing to all the lists of levels and a list pointing to all the lists of labels. (so a MI with 2 sets of levels and 2 lists of labels costs two lists of length 2 to shallow copy) very minimal.

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

@jtratner

its just a very simple issue

if you are reindexing with an index that is identical (and copy is not set), then just return the object

but this HAS to account for an equals index, but with a different name for example
(which relates to the semi-mutable index)

BUG/CLN: (GH4604) Refactor Series.reindex to core/generic.py allow method= in reindexing
  on a Series to work

API/CLN: GH4604 Infer and downcast dtype if appropriate on ffill/bfill
  this is for consistency when doing: df.reindex().ffill() and df.reindex(method='ffill')

CLN: allow backfill/pad/interpolate to operate on integers (by float conversion)
     provide downcasting back to original dtype where needed core.internals.interpolate

ENH: provide core.index.identical method to compare values and attributes similar to .equals

API: changed back to pre-GH3482 where a reindex with no args will by default copy
@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

makes sense?

In [6]: index = pd.Index(list(range(5)))

In [7]: index2 = pd.Index(list(range(5))).copy(name='foo')

In [8]: index.equals(index)
Out[8]: True

In [9]: index.identical(index)
Out[9]: True

In [10]: index.identical(index2)
Out[10]: False

In [11]: index.equals(index2)
Out[11]: True

In [12]: index.identical(index2)
Out[12]: False
In [13]: df = DataFrame(randn(5,2),index=index)

In [17]: df.reindex(index2).index.identical(index2)
Out[17]: True

In [18]: df.reindex(index2).index.equals(index2)
Out[18]: True

In [20]: df.reindex(index2).index.identical(index)
Out[20]: False

In [21]: df.reindex(index2).index.equals(index)
Out[21]: True

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

Does identical also test the freq attribute of the appropriate indexes?

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

yep freq/tz on datetime index, freq on period

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

on MI, levels are already tested (and I think labels) via equals, so it just tests name,

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

👍

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

what if identical replaces __eq__ ?

so all current code will work, but then true equality is tested by index == index2 or is that confusing?

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

that will break tests... plus that might be kind of a nasty surprise to people who don't religiously keep track of tz and freq

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2013

fair enough...

@cpcloud
Copy link
Member

cpcloud commented Aug 19, 2013

for the record i would like that, but too much back-incompat. would actually fix #3317 i believe

@jtratner
Copy link
Contributor

Oh okay, I wasn't totally clear on your question. Looks good :)

@jreback
Copy link
Contributor Author

jreback commented Aug 20, 2013

@jtratner @cpcloud pls give this a once over..thxs

Similar to equals, but check that other comparable attributes are also equal
"""
return self.equals(other) and all(
[ getattr(self,c,None) == getattr(other,c,None) for c in self._comparables ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor point: if the attr checks are moved to the left hand side and u use a generator in all best case will be faster. totally minor though bc len(_comparables) == 1! u will return False faster than having to check the whole array if attrs are in fact not equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed that...but for the most part the name attributes are usually None IME

@jtratner
Copy link
Contributor

Will look over in a minute.

@jtratner
Copy link
Contributor

@jreback this doesn't quite work for MultiIndex yet. Is it supposed to?

E.g.

mi = pd.util.testing.makeCustomDataframe(5, 5).stack().index
assert mi.identical(mi)
other = mi.from_tuples(tuple(mi))
assert not mi.identical(other), "Failed - other does not have same names"

On a related note - for some reason, MultIIndex appears to have a name attribute that returns None, which shadows the issue. (i.e., continuing previous)

In [16]: print(mi.name)
None

In [17]: print(mi.names)
FrozenList([u'R0', u'C0'])

In [18]: print(other.name)
None

In [19]: print(other.names)
FrozenList([None, None])

@jtratner
Copy link
Contributor

Ah, it's because Index sets name=None explicitly. Is it worth adding a property to MultiIndex to mi.name fail?

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

I think there is prob some compatibility going on here

iow you need to know whether an object is an index or mi and then use name or names

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

@jtratner side issue - rename in index is broken for MultiIndex, and need some tests...thanks

@jtratner
Copy link
Contributor

@jreback the real problem is that rename and name should not exist for MultiIndex (they only make sense for Index objects).

@jtratner
Copy link
Contributor

That said, why can't you just set _comparables = ['names']?

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

rename does makes sense for MI, e.g.

mi.set_names(['level_0','level_1'])

?

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

@jtratner oh I did do that, I was trying to use rename in some tests; it breaks for MI....that's all (you can just use set_names, but the arguments are only built for a single name), and the rename for an Index should be able to take a list of a scalar (e.g. i.rename('foo') doesn't work, but i.rename(['foo']) does...(I know its a convenience function)....but ...

@jtratner
Copy link
Contributor

@jreback You're right, that's definitely broken.

@jtratner
Copy link
Contributor

@jreback I'll put up a fix tomorrow. Sorry about that!

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

no...thank you for testing.....when I find something breaking...I realized that I didn't test it!

@jtratner
Copy link
Contributor

@jreback however, your example doesn't actually fail:

In [1]: from pandas import *
ind
In [2]: ind = Index(range(10))

In [3]: ind.name

In [4]: ind.rename('apple')
Out[4]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [5]: ind = _

In [6]: ind.name
Out[6]: 'apple'

@jtratner
Copy link
Contributor

The issue is that rename and name should only exist on Index (where they can only be used if you actually have an Index). Otherwise, you should use set_names and names. Not sure if there are other examples for "1d-ish" index vs. MultiIndex.

@jtratner
Copy link
Contributor

And it works for MultiIndex too...can you show what you mean by failing?

In [8]: import pandas as pd

In [9]: mi = pd.util.testing.makeCustomDataframe(5, 5).stack().index

In [10]: mi
Out[10]:
MultiIndex
[(u'R_l0_g0', u'C_l0_g0'), (u'R_l0_g0', u'C_l0_g1'), (u'R_l0_g0', u'C_l0_g2'), (u'R_l0_g0', u'C_l0_g3'), (u'R_l0_g0', u'C_l0_g4'), (u'R_l0_g1', u'C_l0_g0'), (u'R_l0_g1', u'C_l0_g1'), (u'R_l0_g1', u'C_l0_g2'), (u'R_l0_g1', u'C_l0_g3'), (u'R_l0_g1', u'C_l0_g4'), (u'R_l0_g2', u'C_l0_g0'), (u'R_l0_g2', u'C_l0_g1'), (u'R_l0_g2', u'C_l0_g2'), (u'R_l0_g2', u'C_l0_g3'), (u'R_l0_g2', u'C_l0_g4'), (u'R_l0_g3', u'C_l0_g0'), (u'R_l0_g3', u'C_l0_g1'), (u'R_l0_g3', u'C_l0_g2'), (u'R_l0_g3', u'C_l0_g3'), (u'R_l0_g3', u'C_l0_g4'), (u'R_l0_g4', u'C_l0_g0'), (u'R_l0_g4', u'C_l0_g1'), (u'R_l0_g4', u'C_l0_g2'), (u'R_l0_g4', u'C_l0_g3'), (u'R_l0_g4', u'C_l0_g4')]

In [11]: mi.set_names(["foo", "bar"])
Out[11]:
MultiIndex
[(u'R_l0_g0', u'C_l0_g0'), (u'R_l0_g0', u'C_l0_g1'), (u'R_l0_g0', u'C_l0_g2'), (u'R_l0_g0', u'C_l0_g3'), (u'R_l0_g0', u'C_l0_g4'), (u'R_l0_g1', u'C_l0_g0'), (u'R_l0_g1', u'C_l0_g1'), (u'R_l0_g1', u'C_l0_g2'), (u'R_l0_g1', u'C_l0_g3'), (u'R_l0_g1', u'C_l0_g4'), (u'R_l0_g2', u'C_l0_g0'), (u'R_l0_g2', u'C_l0_g1'), (u'R_l0_g2', u'C_l0_g2'), (u'R_l0_g2', u'C_l0_g3'), (u'R_l0_g2', u'C_l0_g4'), (u'R_l0_g3', u'C_l0_g0'), (u'R_l0_g3', u'C_l0_g1'), (u'R_l0_g3', u'C_l0_g2'), (u'R_l0_g3', u'C_l0_g3'), (u'R_l0_g3', u'C_l0_g4'), (u'R_l0_g4', u'C_l0_g0'), (u'R_l0_g4', u'C_l0_g1'), (u'R_l0_g4', u'C_l0_g2'), (u'R_l0_g4', u'C_l0_g3'), (u'R_l0_g4', u'C_l0_g4')]

In [12]: _.names
Out[12]: FrozenList([u'foo', u'bar'])

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

ahh...sorry...what I meant was that rename fails for multi-index (which is a nice name actually); I know it just calls set_names....but rename is cooler, that's why I was using it

These all should yield the same (I know I am abusing the input), but I am user of your functions!

In [1]: i = Index(np.arange(10))

In [2]: i.set_names(['foo'])
Out[2]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [3]: i.set_names(['foo']).name
Out[3]: 'foo'

In [4]: i.set_names('foo').name
ValueError: Length of new names must be 1, got 3

In [5]: i.rename('foo').name
Out[5]: 'foo'

In [6]: i.rename(['foo']).name
Out[6]: ['foo']

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

rename is just conviennce for set_names?

In [12]: mi.set_names(['foo','bar']).names
Out[12]: FrozenList([u'foo', u'bar'])

In [13]: mi.rename(['foo','bar']).names
ValueError: Length of names (1) must be same as level (2)

@jtratner
Copy link
Contributor

@jreback okay, well, trivial to enable rename for MultiIndex. I'm not sure I agree with your examples - especially because they lead to really strict type-sniffing.

rename is basically just lambda self, x: self.set_names([x])

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

@jtratner is rename different than set_names? (or is it supposed to be)?

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2013

R users are probably familiar with rename via the plyr package so I think we should keep it.

@jtratner
Copy link
Contributor

@jreback It would be for Index (since set_names always expects some kind of iterable).

@jtratner
Copy link
Contributor

@cpcloud heh, I introduced rename a few PRs ago so I'm not sure that it's compliant, would be good to check that though.

@jtratner
Copy link
Contributor

@cpcloud uh, I think you're thinking of dataframe/series here. Having a plyr-like rename would be kinda weird on Index.

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2013

yep ur right, carry on good sirs

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

Thing is the set_* makes sense for the MI, while rename for the name attributes, but set_names is a natural extension, ok...myabe just make them exactly the same then (with or with/out iterable is ok by me)

what is plyr rename do? (is it not the obvious thing?)....like (attach)??

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2013

@jreback ha! no....it's like DataFrame.rename. I got mixed up for a sec. You guys are talking about rename on Index objects, I was thinking about the frame version.

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

@cpcloud oh...that's not so obvious

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2013

here's what rename does

R > df <- data.frame(x=rnorm(10), y=rnorm(10))
R > df
            x          y
 1:  1.196889  0.8105994
 2: -0.185365 -0.7262135
 3: -0.613755  1.0857286
 4: -1.459379 -0.4459683
 5:  0.348874  0.3673212
 6:  0.401074 -0.0862284
 7:  0.398706  0.9814990
 8: -0.935906  0.7453723
 9:  0.243090  0.6259889
10: -0.428978 -0.8742873

R > rename(df, c(x="a", y="b"))
            a          b
 1:  1.196889  0.8105994
 2: -0.185365 -0.7262135
 3: -0.613755  1.0857286
 4: -1.459379 -0.4459683
 5:  0.348874  0.3673212
 6:  0.401074 -0.0862284
 7:  0.398706  0.9814990
 8: -0.935906  0.7453723
 9:  0.243090  0.6259889
10: -0.428978 -0.8742873

but that's OT

@jtratner
Copy link
Contributor

@jreback you can set names on index too:

ind.names = ['apple']
ind.name # 'apple'

I'd say the behavior that makes sense is something like:

  • Index.rename : takes single object
  • MultiIndex.set_names, Index.set_names - same type signature: take list of names [Index.set_names can only take iterable of length 1)
  • MultiIndex.rename - synonym for set_names

So if you want to do something generically, you should use set_names (just like you would with other attributes). If you expect a non-MI index, then you could use rename with a string, etc.

@jtratner
Copy link
Contributor

@jreback well, I'll alias MultiIndex.rename to MultiIndex.set_names and add some test cases for rename, set_names, set_levels, and set_labels.

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

@jtratner that all looks good

…x if freq is not specified during _ensure_index

PERF: remove automatic downcasting, instead do on-demand via 'downcast=infer'

TST: fix up failing tests

CLN: reindex/fill in internals moved to core/internals/Block/reindex_items_from

PERF: don't automatically downcast with a float block

BUG: GH3317 reverse prior fix in tseries/offsets, to change slightly the multi reindex

TST: identical index tests

BUG: GH4618 provide automatic downcasting on a reindexed with method Series
     (in this case a shifted boolean then filled series)
jreback added a commit that referenced this pull request Aug 21, 2013
CLN: refactor Series.reindex to core/generic
@jreback jreback merged commit ecca28d into pandas-dev:master Aug 21, 2013
@jtratner
Copy link
Contributor

@jreback to be clear - the only 'bug' with rename now is that MultiIndex should set rename to set_names, right?

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2013

I think that is right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: reindex strange behaviour
3 participants