PERF: performance improvement in MultiIndex.sortlevel #9445

Merged
merged 1 commit into from
Feb 16, 2015
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.16.0.txt
@@ -175,6 +175,7 @@ Performance
 - Performance improvement of up to 20x in ``DataFrame.count`` when using a ``MultiIndex`` and the ``level`` keyword argument (:issue:`9163`)
 - Performance and memory usage improvements in ``merge`` when key space exceeds ``int64`` bounds (:issue:`9151`)
 - Performance improvements in multi-key ``groupby`` (:issue:`9429`)
+- Performance improvements in ``MultiIndex.sortlevel`` (:issue:`9445`)

 Bug Fixes
 ~~~~~~~~~
21 changes: 8 additions & 13 deletions pandas/core/groupby.py
@@ -3584,21 +3584,15 @@ def decons_obs_group_ids(comp_ids, obs_ids, shape, labels):


 def _indexer_from_factorized(labels, shape, compress=True):
-    if _int64_overflow_possible(shape):
-        indexer = np.lexsort(np.array(labels[::-1]))
-        return indexer
-
-    group_index = get_group_index(labels, shape, sort=True, xnull=True)
+    ids = get_group_index(labels, shape, sort=True, xnull=False)

-    if compress:
-        comp_ids, obs_ids = _compress_group_index(group_index)
-        max_group = len(obs_ids)
+    if not compress:
+        ngroups = (ids.size and ids.max()) + 1
Member

Both ids.size and ids.max() are integers, yes?

I find using `and` for integers rather confusing. Perhaps this could be: (ids.max() if ids.size else 0) + 1

Contributor Author

From the Python documentation on boolean operations:

The expression x and y first evaluates x; if x is false, its value is returned; otherwise, y is evaluated and the resulting value is returned.

Note that the documentation follows with an example of s or 'foo', i.e. or-ing two strings, as a use case of boolean operations.

Contributor

@behzadnouri I agree with @shoyer here. We get the idiom, but I don't think it is intuitive at a casual glance. Please change to what @shoyer suggests.

Contributor Author

I gave an example from Python's documentation showing that this is an actual use case for boolean operations:

This is sometimes useful, e.g., if s is a string that should be replaced by a default value if it is empty, the expression s or 'foo' yields the desired value.

This is straight from Python's documentation, and I have seen it used a lot.

That said, if you would like to change it, please do.

Contributor

I get that it is from the Python docs, but not all of those idioms are useful. In the context of assignment with a value that could be None, I would agree with you, but these are integers and it's not obvious at all.

Please make the change.
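
For reference, a minimal sketch comparing the two spellings discussed in this thread. It uses plain Python lists with `len` and `max` as stand-ins for the NumPy array's `ids.size` and `ids.max()`; the function names are hypothetical and do not appear in the diff.

```python
def ngroups_with_and(ids):
    # Spelling from the diff: `and` returns len(ids) (which is 0) when ids
    # is empty, and otherwise evaluates and returns max(ids).
    return (len(ids) and max(ids)) + 1


def ngroups_with_conditional(ids):
    # Spelling suggested in review: the empty case is stated explicitly.
    return (max(ids) if len(ids) else 0) + 1


# The two spellings agree, including on empty input and on all-zero ids.
for ids in ([], [0], [3, 1, 2], [0, 0, 0]):
    assert ngroups_with_and(ids) == ngroups_with_conditional(ids)
```

Both forms give the same result; the disagreement in the thread is only about which one reads more clearly when both operands are integers.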

     else:
-        comp_ids = group_index
-        max_group = com._long_prod(shape)
+        ids, obs = _compress_group_index(ids, sort=True)
+        ngroups = len(obs)

-    indexer = _get_group_index_sorter(comp_ids.astype(np.int64), max_group)
-    return indexer
+    return _get_group_index_sorter(ids, ngroups)


 def _lexsort_indexer(keys, orders=None, na_position='last'):
@@ -3753,7 +3747,8 @@ def _compress_group_index(group_index, sort=True):
     (comp_ids) into the list of unique labels (obs_group_ids).
     """

-    table = _hash.Int64HashTable(min(1000000, len(group_index)))
+    size_hint = min(len(group_index), _hash._SIZE_HINT_LIMIT)
+    table = _hash.Int64HashTable(size_hint)

     group_index = com._ensure_int64(group_index)
12 changes: 12 additions & 0 deletions vb_suite/index_object.py
@@ -159,3 +159,15 @@
 datetime_index_repr = \
     Benchmark("dr._is_dates_only", setup,
               start_date=datetime(2012, 1, 11))
+
+setup = common_setup + """
+n = 3 * 5 * 7 * 11 * (1 << 10)
+low, high = - 1 << 12, 1 << 12
+f = lambda k: np.repeat(np.random.randint(low, high, n // k), k)
+
+i = np.random.permutation(n)
+mi = MultiIndex.from_arrays([f(11), f(7), f(5), f(3), f(1)])[i]
+"""
+
+multiindex_sortlevel_int64 = Benchmark('mi.sortlevel()', setup,
+                                       name='multiindex_sortlevel_int64')
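
A short worked check of the benchmark's constants, as a sketch only. The reading that the `_int64` suffix refers to the key space exceeding the int64 range is an assumption based on the numbers, not something stated in the PR; `levels`, `values_per_level`, `key_space` and `int64_max` are names introduced here for illustration.

```python
n = 3 * 5 * 7 * 11 * (1 << 10)
# Unary minus binds tighter than <<, so this is (-1) << 12 == -4096.
low, high = - 1 << 12, 1 << 12

# n is divisible by every repeat factor, so each f(k) has exactly n entries.
assert all(n % k == 0 for k in (11, 7, 5, 3, 1))

levels = 5                               # five arrays passed to from_arrays
values_per_level = high - low            # randint(low, high) excludes high
key_space = values_per_level ** levels   # upper bound on distinct label tuples
int64_max = 2**63 - 1

print(n, values_per_level, key_space > int64_max)
```

With up to 2**13 values per level across five levels, the upper bound on the key space is 2**65, which does not fit in a signed 64-bit integer.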