Skip to content

BUG: GroupBy.size created by TimeGrouper raises AttributeError #7600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 30, 2014

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Jun 28, 2014

Related to #7453 2nd issue:

  • GroupBy.size created by TimeGrouper raises AttributeError

@jreback jreback added this to the 0.14.1 milestone Jun 28, 2014
@@ -1665,6 +1666,22 @@ def levels(self):
def names(self):
return [self.binlabels.name]

def size(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not using value_counts? this should be very similar (if not call irectly), BaseGrouper.size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 reasons.

  • value_counts cannot fulfill intermediate timestamps which is not included in the group key. For example, if the TimeGrouper categorize [2011-01-01, 2011-01-02, 2011-01-04] with daily frequency, the resulted group keys must be [2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04]
  • BinGrouper doesn't retain original Index to be directly passed to value_counts. To use value_counts, it must be created using similar logic.
indices = self.indices
labels = []
for k, v in compat.iteritems(indices):
            labels.extend([k] * len(v))
value_counts(labels)

@jreback
Copy link
Contributor

jreback commented Jun 28, 2014

ok can u add a bench for size in that case?
for both a regular groupby and TimeGrouper?

@sinhrks
Copy link
Member Author

sinhrks commented Jun 29, 2014

OK, added benchmarks.

@jreback
Copy link
Contributor

jreback commented Jun 29, 2014

can I post benchmarks for those 2? thxs

@sinhrks
Copy link
Member Author

sinhrks commented Jun 30, 2014

Here it is. Cannot measure base performance of groupby_dt_timegrouper_size because of the bug, but it looks not slow comparing to groupby_dt_size

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_dt_size                              |  42.2956 |  55.4504 |   0.7628 |
groupby_dt_timegrouper_size                  |  34.2657 |  ------- |   ------ |

@jreback
Copy link
Contributor

jreback commented Jun 30, 2014

run then with -Hto test current perf (with your fix) with comparable samples (or do in ipython).

@sinhrks
Copy link
Member Author

sinhrks commented Jun 30, 2014

Yep, current perf of groupby_dt_timegrouper_size is attached above.

jreback added a commit that referenced this pull request Jun 30, 2014
BUG: GroupBy.size created by TimeGrouper raises AttributeError
@jreback jreback merged commit d21f44b into pandas-dev:master Jun 30, 2014
@sinhrks sinhrks deleted the bingrouper branch July 5, 2014 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants