CLN: Make Series._values match Index._values #31182

Merged: 16 commits into pandas-dev:master on Jan 28, 2020

Conversation

jbrockmendel (Member)

Discussed in #31037

  • Note this is not an alternative to that, as this should not be considered for 1.0.
  • This does not implement any of the simplifications that it makes available.
  • There is a decent chance that the check in core.apply can be changed so that we can still use the fastpath for these dtypes.
  • The PandasDtype edit is not central to the change, just included for perf comparison against PERF: improve access of .array #31037.
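
As a hedged illustration of the goal stated in the title (inferred from the docstring diff further down, not taken verbatim from the PR): after this change, Series._values for tz-naive datetime64 data returns a DatetimeArray, matching what the corresponding DatetimeIndex._values already returns.

import pandas as pd

dti = pd.date_range("2012-01-01", periods=3)  # tz-naive DatetimeIndex
ser = pd.Series(dti)

print(type(dti._values))  # DatetimeArray (already the case before this PR)
print(type(ser._values))  # ndarray[M8[ns]] before this PR, DatetimeArray after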

jorisvandenbossche (Member) left a comment

Nice

-        obj = obj.array
+        arr = obj._values
+        if not extract_numpy and isinstance(arr, np.ndarray):
+            return obj.array
Member

Yes, while looking above at the .array implementation I was just thinking the same: doing it here, instead of going through the ".array -> wrap in PandasArray -> extract the numpy array again" route, will further reduce the overhead of extract_array(..., extract_numpy=True).

We could also do arr = PandasArray(arr) here to be explicit (it would not duplicate much from .array).
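
A minimal sketch of the shortcut being suggested here (my reading of the comment above, not the merged implementation; extract_array_sketch is a made-up name): use ._values directly and only wrap in PandasArray when the caller did not ask for a numpy result.

import numpy as np
import pandas as pd
from pandas.arrays import PandasArray

def extract_array_sketch(obj, extract_numpy=False):
    if isinstance(obj, (pd.Index, pd.Series)):
        arr = obj._values
        # ._values already yields an ExtensionArray for EA-backed dtypes,
        # so only plain-ndarray-backed dtypes reach this branch
        if not extract_numpy and isinstance(arr, np.ndarray):
            arr = PandasArray(arr)  # the explicit wrapping mentioned above
        return arr
    return obj

extract_array_sketch(pd.Series([1, 2, 3]), extract_numpy=True)  # plain ndarray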

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jan 22, 2020
jbrockmendel (Member, Author)

Rebased; this now removes array_values.

@jreback jreback added the Clean label Jan 24, 2020
jreback (Contributor) left a comment

lgtm.

jreback (Contributor) commented Jan 24, 2020

@jorisvandenbossche

jorisvandenbossche (Member) left a comment

You will need to update the Series._values docstring I wrote.

Also, can you show similar timings as you showed on the other PR but now compared to latest master?

@@ -1249,6 +1258,9 @@ def unique(self):
         if hasattr(values, "unique"):
 
             result = values.unique()
+            if self.dtype.kind in ["m", "M"]:
+                if getattr(self.dtype, "tz", None) is None:
+                    result = np.asarray(result)
Member

Can you add a comment here on why this is needed?
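
For context, a hedged explanation of why the conversion is needed (not from the PR itself): Series.unique() has long returned a plain datetime64[ns]/timedelta64[ns] ndarray for tz-naive data and an ExtensionArray only for tz-aware data, so the DatetimeArray/TimedeltaArray that values.unique() now produces has to be converted back in the naive case.

import numpy as np
import pandas as pd

naive = pd.Series(pd.date_range("2012-01-01", periods=3))
aware = pd.Series(pd.date_range("2012-01-01", periods=3, tz="UTC"))

print(type(naive.unique()))  # np.ndarray with dtype datetime64[ns]
print(type(aware.unique()))  # DatetimeArray; tz-aware keeps the EA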

@jorisvandenbossche jorisvandenbossche changed the title Make Series._values match Index._values CLN: Make Series._values match Index._values Jan 24, 2020
jbrockmendel (Member, Author)

Updated timings vs current master are a mixed bag:

import pandas as pd
from pandas.core.construction import extract_array

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))
s2 = pd.Series(pd.date_range("2012-01-01", periods=3))
s3 = pd.Series([1, 2, 3])

%timeit s1.array
442 ns ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
913 ns ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit s2.array
1.26 µs ± 40.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
972 ns ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit s3.array
1.89 µs ± 71.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
3.44 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

%timeit extract_array(s1, extract_numpy=True)
1.85 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.62 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s2, extract_numpy=True)
2.61 µs ± 29.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.74 µs ± 66.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s3, extract_numpy=True)
3.95 µs ± 32.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.6 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s3, extract_numpy=False)
3.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
5.29 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

jorisvandenbossche (Member)

Do we understand where the slowdown is coming from? From looking at the code, I wouldn't expect the extra if isinstance(result, np.ndarray) in the .array attribute to cause such a difference.

jbrockmendel (Member, Author)

Do we understand where the slowdown is coming from?

It looks like restoring array_values and Series.array does the trick. With those two changes, all of the timings become slightly better than master, with the exception of extract_array(s3, extract_numpy=False), which is ~30% slower.

jbrockmendel (Member, Author)

Restored array_values to avoid a perf hit; I'm ambivalent on whether it's worthwhile.

import pandas as pd 
from pandas.core.construction import extract_array 

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC')) 
s2 = pd.Series(pd.date_range("2012-01-01", periods=3)) 
s3 = pd.Series([1, 2, 3])                                                                                                        

In [2]: %timeit s1.array                                                                                                                 
391 ns ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
408 ns ± 17.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [3]: %timeit s2.array                                                                                                                 
1.1 µs ± 6.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
514 ns ± 43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [4]: %timeit s3.array                                                                                                                 
1.68 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
509 ns ± 4.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [5]: %timeit extract_array(s1, extract_numpy=True)                                                                                    
1.77 µs ± 35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
1.72 µs ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [6]: %timeit extract_array(s2, extract_numpy=True)                                                                                    
2.59 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.78 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [7]: %timeit extract_array(s3, extract_numpy=True)                                                                                    
3.89 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
2.58 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

In [8]: %timeit extract_array(s3, extract_numpy=False)                                                                                   
2.71 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.46 µs ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

jorisvandenbossche (Member)

Ah, I suppose the use of _simple_new was certainly part of the reason (I remember now that I changed to use that in the DatetimeBlock.array_values() at some point in the other PR)
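
A hedged illustration of that performance point (_simple_new is private API, so the exact signature can differ between pandas versions): it skips the validation and inference done by the public DatetimeArray constructor, which is why using it to wrap a block's raw M8[ns] ndarray is cheaper.

import numpy as np
from pandas.core.arrays import DatetimeArray

raw = np.array(["2012-01-01", "2012-01-02"], dtype="M8[ns]")

%timeit DatetimeArray(raw)              # public constructor: validates input
%timeit DatetimeArray._simple_new(raw)  # internal fastpath: just wraps the ndarray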

jorisvandenbossche (Member)

It might be that, now that you use _simple_new for internal_values as well, you can again combine internal_values and array_values as you did before, with less performance loss?

jbrockmendel (Member, Author)

OK, updated one more time, this time just changing DatetimeBlock/TimedeltaBlock.internal_values to return self.array_values(). This should have zero effect on array or extract_array, and local timeit results bear that out (not posting them, for brevity).

We can worry about optimizing array/extract_array further separately, as that should really be orthogonal to this.
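
A rough sketch of the shape of that change, using a stand-in class rather than the real pandas.core.internals Block hierarchy (names follow the internals of that era; this is illustrative only):

import numpy as np
from pandas.arrays import DatetimeArray

class FakeDatetimeBlock:
    # stand-in holding a raw M8[ns] ndarray, as a DatetimeBlock does
    def __init__(self, values):
        self.values = values

    def array_values(self):
        # wrap the raw ndarray in a DatetimeArray
        return DatetimeArray(self.values)

    def internal_values(self):
        # the change described above: delegate to array_values()
        # instead of returning the raw ndarray
        return self.array_values()

blk = FakeDatetimeBlock(np.arange(3).astype("M8[ns]"))
print(type(blk.internal_values()))  # DatetimeArray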

@@ -515,8 +515,9 @@ def _values(self):
     ----------- | ------------- | ------------- | ------------- | --------------- |
     Numeric     | ndarray       | ndarray       | PandasArray   | ndarray         |
     Category    | Categorical   | Categorical   | Categorical   | ndarray[int]    |
-    dt64[ns]    | ndarray[M8ns] | ndarray[M8ns] | DatetimeArray | ndarray[M8ns]   |
+    dt64[ns]    | ndarray[M8ns] | DatetimeArray | DatetimeArray | ndarray[M8ns]   |
Member

A bit above (beginning of the docstring), the sentence "This are the values as stored in the Block" is no longer adequate I think?

Member

The sentence I am quoting (the second line of the docstring) still needs to be updated.

jbrockmendel (Member, Author)

(btw, the comment "Disallow dtypes that have blocks backed by EAs" is not fully correct, as DatetimeBlock is still backed by an ndarray, no?)

jreback (Contributor) commented Jan 28, 2020

lgtm; needs a rebase.

@jorisvandenbossche jorisvandenbossche merged commit 4edcc55 into pandas-dev:master Jan 28, 2020
jorisvandenbossche (Member)

Thanks!
