Skip to content

PERF: always slice when indexing on columns #33597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 63 commits into from

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Touched on in #32779, this makes all* DataFrame indexing-on-columns do slicing instead of taking, so doesn't make copies.

* Actually there is a different path in _slice_take_blocks_ax0 that we go down if we self._is_single_block, can update that later if we decide this is something we want to do.

In [3]: dti = pd.date_range("2016-01-01", periods=10**5, freq="S")                                                                                                                                                  
In [4]: df = pd.DataFrame._from_arrays([dti]*10 + [dti - dti] * 10 + [dti.to_period("D")]*10, columns=range(30), index=range(len(dti)))                                                                             

In [8]: arr = np.arange(30)                                                                                                                                                                                         
In [9]: np.random.shuffle(arr)                                                                                                                                                                                      

In [10]: %timeit df[arr]
8.35 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # <-- master                                                                                                                                                     
650 µs ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # <-- PR

The tradeoff is that we end up with less-consolidated results. I'm OK with that, but there may be downsides I'm not aware of, will wait for others to weigh in.

@jorisvandenbossche
Copy link
Member

@jbrockmendel can you maybe open an issue for this? As your questions seems to be more broad than the actual PR, and to not mix that discussion with code review of the actual PR

@jbrockmendel
Copy link
Member Author

i cant reproduce the test failures locally

@jbrockmendel
Copy link
Member Author

Opened #33780, mothballing this.

@jbrockmendel jbrockmendel reopened this Jul 24, 2021
@jbrockmendel jbrockmendel added Indexing Related to indexing on series/frames, not to indexes themselves Copy / view semantics and removed Mothballed Temporarily-closed PR the author plans to return to labels Jul 24, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

this is quite old, happen to reopen if actively worked on.

@jreback jreback closed this Nov 28, 2021
@jbrockmendel jbrockmendel added the Mothballed Temporarily-closed PR the author plans to return to label Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Mothballed Temporarily-closed PR the author plans to return to Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants