Skip to content

BUG/CLN: Cleanup sanitize column and abstract broadcasting #5341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 26, 2013

Conversation

jtratner
Copy link
Contributor

Also fixes small bug from #5321

@@ -1417,6 +1417,18 @@ def test_getitem_boolean_missing(self):
def test_setitem_boolean_missing(self):
pass

def test_setitem_duplicate_columns_with_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this test with the other duplicate method tests (even though its a long method), test_columns_dups_operations, or at least right after it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely - was searching for 'duplicate' so I missed it :)

@jtratner
Copy link
Contributor Author

@jreback A few points here:

  1. This function never checked for column uniqueness previously, it always tried the lookup and then tested if it was a DataFrame or not. Thus it can't be a big performance hit (and my own perf testing shows that it isn't...)
  2. It can't check for uniqueness because of issues with multilevel columns. E.g. consider this example:
df = pd.DataFrame(zip(range(-5,1), range(-5,1)), columns = pd.MultiIndex.from_arrays([['a', 'a'], ['x', 'y']]))
df['a'] = df.index

@jreback
Copy link
Contributor

jreback commented Oct 26, 2013

regarding your last comment....

yeh..perf may not be a problem....I am always wary of heavily called functions that have to deal with unique/non-unique...almost always I separate these cases (because of perf)

if its a multi-index it doesn't matter(so self.index.is_unique and not not isinstance(self.index,MultiIndex) should suffice (fyi....I would maybe add a method to do exactly this in your refactor (or could even override is_unique for mi, maybe)

@jtratner
Copy link
Contributor Author

@jreback just thinking about it again, there's no reason to have a separate branch for Index, just needs to do an instance check for Index and ndarray first, then can use is_sequence.

@jtratner
Copy link
Contributor Author

Is there a reason to prefer is_sequence() over checking np.isscalar(val)?

@jreback
Copy link
Contributor

jreback commented Oct 26, 2013

@jtratner that's fine...but then need to fix is_sequence (for the perf hit of iter, e.g. you can special case Index/Series in there I think``

you can try is_scalar....there is a prob a case where they differ (e.g. a Timestamp would hit is_sequence and not isscalar)

@jtratner
Copy link
Contributor Author

@jreback wha? Timestamp is true for _is_sequence?

@jtratner
Copy link
Contributor Author

ah, I see, is_scalar is False for timestamp, got it.

@jreback
Copy link
Contributor

jreback commented Oct 26, 2013

yep strings/Timestamp looks like lists and isscalr fails

@jreback
Copy link
Contributor

jreback commented Oct 26, 2013

@jtratner if perf checks out ok...this looks good!

@jtratner
Copy link
Contributor Author

I ran the perf check 4 times on base commit and tip of this PR, everything came out about even on average, and the few that were particularly different were because 1 run of that particular benchmark took much more time than the other three, so I think it should be fine.

And only try to broadcast if columns aren't unique or are MI.

(also in the process fix small bug from pandas-dev#5321)
@jreback
Copy link
Contributor

jreback commented Oct 26, 2013

ok good to go then

jtratner added a commit that referenced this pull request Oct 26, 2013
BUG/CLN: Cleanup sanitize column and abstract broadcasting
@jtratner jtratner merged commit e96df8e into pandas-dev:master Oct 26, 2013
@jtratner jtratner deleted the tweak-sanitize-column branch October 26, 2013 23:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants