CLN: Make Series._values match Index._values #31182

Merged: 16 commits into pandas-dev:master on Jan 28, 2020

Conversation

jbrockmendel (Member)

Discussed in #31037

  • Note this is not an alternative to that, as this should not be considered for 1.0.
  • This does not implement any of the simplifications that it makes available.
  • There is a decent chance that the check in core.apply can be changed so that we can still use the fastpath for these dtypes.
  • The PandasDtype edit is not central to the change, just included for perf comparison against PERF: improve access of .array #31037.
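
As a hedged illustration of the goal stated in the title (inferred from the docstring diff further down, not taken verbatim from the PR): after this change, Series._values for tz-naive datetime64 data returns a DatetimeArray, matching what the corresponding DatetimeIndex._values already returns.

import pandas as pd

dti = pd.date_range("2012-01-01", periods=3)  # tz-naive DatetimeIndex
ser = pd.Series(dti)

print(type(dti._values))  # DatetimeArray (already the case before this PR)
print(type(ser._values))  # ndarray[M8[ns]] before this PR, DatetimeArray after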

jorisvandenbossche (Member) left a comment

Nice

-        obj = obj.array
+        arr = obj._values
+        if not extract_numpy and isinstance(arr, np.ndarray):
+            return obj.array
Member

Yes, while looking above at the .array implementation I was just thinking the same: doing it here, instead of going through the ".array -> wrap in PandasArray -> extract the numpy array again" route, will further reduce the overhead of extract_array(..., extract_numpy=True).

We could also do arr = PandasArray(arr) here to be explicit (it would not duplicate much from .array).
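
A minimal sketch of the shortcut being suggested here (my reading of the comment above, not the merged implementation; extract_array_sketch is a made-up name): use ._values directly and only wrap in PandasArray when the caller did not ask for a numpy result.

import numpy as np
import pandas as pd
from pandas.arrays import PandasArray

def extract_array_sketch(obj, extract_numpy=False):
    if isinstance(obj, (pd.Index, pd.Series)):
        arr = obj._values
        # ._values already yields an ExtensionArray for EA-backed dtypes,
        # so only plain-ndarray-backed dtypes reach this branch
        if not extract_numpy and isinstance(arr, np.ndarray):
            arr = PandasArray(arr)  # the explicit wrapping mentioned above
        return arr
    return obj

extract_array_sketch(pd.Series([1, 2, 3]), extract_numpy=True)  # plain ndarray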

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jan 22, 2020
jbrockmendel (Member, Author)

Rebased; this now removes array_values.

@jreback jreback added the Clean label Jan 24, 2020
jreback (Contributor) left a comment

lgtm.

jreback (Contributor) commented Jan 24, 2020

@jorisvandenbossche

jorisvandenbossche (Member) left a comment

You will need to update the Series._values docstring I wrote.

Also, can you show similar timings as you showed on the other PR but now compared to latest master?

@@ -1249,6 +1258,9 @@ def unique(self):
         if hasattr(values, "unique"):
 
             result = values.unique()
+            if self.dtype.kind in ["m", "M"]:
+                if getattr(self.dtype, "tz", None) is None:
+                    result = np.asarray(result)
Member

Can you add a comment here on why this is needed?
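
For context, a hedged explanation of why the conversion is needed (not from the PR itself): Series.unique() has long returned a plain datetime64[ns]/timedelta64[ns] ndarray for tz-naive data and an ExtensionArray only for tz-aware data, so the DatetimeArray/TimedeltaArray that values.unique() now produces has to be converted back in the naive case.

import numpy as np
import pandas as pd

naive = pd.Series(pd.date_range("2012-01-01", periods=3))
aware = pd.Series(pd.date_range("2012-01-01", periods=3, tz="UTC"))

print(type(naive.unique()))  # np.ndarray with dtype datetime64[ns]
print(type(aware.unique()))  # DatetimeArray; tz-aware keeps the EA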

@jorisvandenbossche jorisvandenbossche changed the title Make Series._values match Index._values CLN: Make Series._values match Index._values Jan 24, 2020
jbrockmendel (Member, Author)

Updated timings vs current master are a mixed bag:

import pandas as pd
from pandas.core.construction import extract_array

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC'))
s2 = pd.Series(pd.date_range("2012-01-01", periods=3))
s3 = pd.Series([1, 2, 3])

%timeit s1.array
442 ns ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
913 ns ± 60.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit s2.array
1.26 µs ± 40.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
972 ns ± 33.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit s3.array
1.89 µs ± 71.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
3.44 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

%timeit extract_array(s1, extract_numpy=True)
1.85 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.62 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s2, extract_numpy=True)
2.61 µs ± 29.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.74 µs ± 66.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s3, extract_numpy=True)
3.95 µs ± 32.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.6 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

%timeit extract_array(s3, extract_numpy=False)
3.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
5.29 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

jorisvandenbossche (Member)

Do we understand where the slowdown is coming from? From looking at the code, I wouldn't expect the extra if isinstance(result, np.ndarray) in the .array attribute to cause such a difference.

jbrockmendel (Member, Author)

Do we understand where the slowdown is coming from?

It looks like restoring array_values and Series.array does the trick. With those two changes, all of the timings become slightly better than master, with the exception of extract_array(s3, extract_numpy=False), which is ~30% slower.

jbrockmendel (Member, Author)

Restored array_values to avoid a perf hit; I'm ambivalent on whether it's worthwhile.

import pandas as pd 
from pandas.core.construction import extract_array 

s1 = pd.Series(pd.date_range("2012-01-01", periods=3, tz='UTC')) 
s2 = pd.Series(pd.date_range("2012-01-01", periods=3)) 
s3 = pd.Series([1, 2, 3])                                                                                                        

In [2]: %timeit s1.array                                                                                                                 
391 ns ± 3.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
408 ns ± 17.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [3]: %timeit s2.array                                                                                                                 
1.1 µs ± 6.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
514 ns ± 43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [4]: %timeit s3.array                                                                                                                 
1.68 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
509 ns ± 4.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [5]: %timeit extract_array(s1, extract_numpy=True)                                                                                    
1.77 µs ± 35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- master
1.72 µs ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [6]: %timeit extract_array(s2, extract_numpy=True)                                                                                    
2.59 µs ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.78 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

In [7]: %timeit extract_array(s3, extract_numpy=True)                                                                                    
3.89 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
2.58 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR

In [8]: %timeit extract_array(s3, extract_numpy=False)                                                                                   
2.71 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- master
1.46 µs ± 16.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  # <-- PR

jorisvandenbossche (Member)

Ah, I suppose the use of _simple_new was certainly part of the reason (I remember now that I changed to use that in the DatetimeBlock.array_values() at some point in the other PR)
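
A hedged illustration of that performance point (_simple_new is private API, so the exact signature can differ between pandas versions): it skips the validation and inference done by the public DatetimeArray constructor, which is why using it to wrap a block's raw M8[ns] ndarray is cheaper.

import numpy as np
from pandas.core.arrays import DatetimeArray

raw = np.array(["2012-01-01", "2012-01-02"], dtype="M8[ns]")

%timeit DatetimeArray(raw)              # public constructor: validates input
%timeit DatetimeArray._simple_new(raw)  # internal fastpath: just wraps the ndarray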

jorisvandenbossche (Member)

It might be that, now that you use _simple_new for internal_values as well, you can again combine internal_values and array_values as you did before, with less performance loss?

jbrockmendel (Member, Author)

OK, updated one more time, this time just changing DatetimeBlock/TimedeltaBlock.internal_values to return self.array_values(). This should have zero effect on array or extract_array, and local timeit results bear that out (not posting them, for brevity).

We can worry about optimizing array/extract_array further separately, as that should really be orthogonal to this.
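
A rough sketch of the shape of that change, using a stand-in class rather than the real pandas.core.internals Block hierarchy (names follow the internals of that era; this is illustrative only):

import numpy as np
from pandas.arrays import DatetimeArray

class FakeDatetimeBlock:
    # stand-in holding a raw M8[ns] ndarray, as a DatetimeBlock does
    def __init__(self, values):
        self.values = values

    def array_values(self):
        # wrap the raw ndarray in a DatetimeArray
        return DatetimeArray(self.values)

    def internal_values(self):
        # the change described above: delegate to array_values()
        # instead of returning the raw ndarray
        return self.array_values()

blk = FakeDatetimeBlock(np.arange(3).astype("M8[ns]"))
print(type(blk.internal_values()))  # DatetimeArray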

@@ -515,8 +515,9 @@ def _values(self):
     ----------- | ------------- | ------------- | ------------- | --------------- |
     Numeric     | ndarray       | ndarray       | PandasArray   | ndarray         |
     Category    | Categorical   | Categorical   | Categorical   | ndarray[int]    |
-    dt64[ns]    | ndarray[M8ns] | ndarray[M8ns] | DatetimeArray | ndarray[M8ns]   |
+    dt64[ns]    | ndarray[M8ns] | DatetimeArray | DatetimeArray | ndarray[M8ns]   |
Member

A bit above (beginning of the docstring), the sentence "This are the values as stored in the Block" is no longer adequate I think?

Member

The sentence I am quoting (the second line of the docstring) still needs to be updated.

jbrockmendel (Member, Author)

(btw, the comment "Disallow dtypes that have blocks backed by EAs" is not fully correct, as DatetimeBlock is still backed by an ndarray, no?)

jreback (Contributor) commented Jan 28, 2020

lgtm; needs a rebase.

@jorisvandenbossche jorisvandenbossche merged commit 4edcc55 into pandas-dev:master Jan 28, 2020
jorisvandenbossche (Member)

Thanks!
