BUG: [ArrowStringArray] Recognize ArrowStringArray in infer_dtype #40725

simonjayhawkins · 2021-04-01T10:39:32Z

This PR is a draft. It is a follow-up to #40679 (see #40679 (comment)), which is not yet merged, and I still need to find a window to run benchmarks (see #40679 (review)).

on master

>>> pd.__version__
'1.3.0.dev0+1180.gdd697e1cca'
>>> 
>>> from pandas.core.arrays.string_arrow import ArrowStringDtype
>>> 
>>> arr = ArrowStringDtype.construct_array_type()._from_sequence(["B", pd.NA, "A"])
>>> 
>>> arr
<ArrowStringArray>
['B', <NA>, 'A']
Length: 3, dtype: arrow_string
>>> 
>>> pd._libs.lib.infer_dtype(arr)
'unknown-array'
>>>

…type

This reverts commit c095cd4.

pandas/conftest.py

jorisvandenbossche

I merged the precursor, so this can be updated now.

The rest looks good to me. Using str (the dtype.type) in the _TYPE_MAP is similar to what we did for PeriodDtype (which also has a parametrized name), so that seems an appropriate solution to me.
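To make the point about parametrized names concrete, here is a minimal sketch in plain Python. `ToyPeriod`, `ToyPeriodDtype`, `by_name` and `by_type` are hypothetical stand-ins invented for illustration, not the pandas classes or the real `_TYPE_MAP`; the sketch only assumes that a dtype exposes a `name` string and a scalar `type`.

```python
# Toy sketch: NOT the pandas implementation. All names are hypothetical.

class ToyPeriod:
    """Stand-in for the scalar type that a period dtype holds."""


class ToyPeriodDtype:
    """Dtype whose name depends on a parameter (here: the frequency)."""

    type = ToyPeriod

    def __init__(self, freq):
        self.name = f"period[{freq}]"


# A map keyed by name needs one entry per parametrization ...
by_name = {"period[D]": "period"}
# ... while a map keyed by the scalar type needs a single entry.
by_type = {ToyPeriod: "period"}

for freq in ("D", "M", "Q"):
    dtype = ToyPeriodDtype(freq)
    print(dtype.name, dtype.name in by_name, dtype.type in by_type)
```

Only the `"D"` dtype is found by name, while all three are found by type, which is the property the comment relies on when using `str` as a key.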

simonjayhawkins · 2021-04-01T14:38:12Z

> I merged the precursor, so this can be updated now.

Thanks @jorisvandenbossche, I will get the open PRs updated.

There are also a couple of PRs for astype failures (both ways) in the pipeline; one uses the fixture (#39908 (comment)),

and some more parameterization of existing tests where dtype="string" (#39908 (comment)).

jorisvandenbossche · 2021-04-02T07:06:20Z

@simonjayhawkins this can be undrafted?

simonjayhawkins · 2021-04-02T10:16:32Z

> @simonjayhawkins this can be undrafted?

The benchmarks are running, and I will have an answer later today.

jorisvandenbossche · 2021-04-02T14:27:41Z

(I don't think this will have any impact on performance, though. And we already did the same change for Period)

simonjayhawkins · 2021-04-02T14:32:09Z

> (I don't think this will have any impact on performance, though. And we already did the same change for Period)

#40679 (comment)

simonjayhawkins · 2021-04-02T15:15:49Z

hmm, not good.

       before           after         ratio
     [1367cacd]       [37571106]
     <master>         <follow-up>
+        400±50ns         500±50ns     1.25  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
+        400±50ns         500±50ns     1.25  index_cached_properties.IndexCache.time_is_unique('Int64Index')
+     2.69±0.01ms       3.18±0.4ms     1.18  inference.ToDatetimeISO8601.time_iso8601
+      2.20±0.2μs       2.60±0.4μs     1.18  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+     1.87±0.04μs      2.11±0.05μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 10000, datetime.timezone.utc)
+     2.00±0.05μs      2.25±0.03μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, None)
+     1.99±0.04μs      2.24±0.01μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, None)
+     2.00±0.02μs      2.26±0.05μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, None)
+     2.00±0.03μs      2.25±0.03μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, None)
+     1.87±0.06μs      2.10±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, datetime.timezone.utc)
+     2.00±0.03μs      2.24±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 8000, None)
+     2.01±0.02μs      2.25±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 6000, None)
+     2.01±0.05μs      2.26±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 12000, None)
+     2.01±0.05μs      2.25±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 3000, None)
+     1.99±0.05μs      2.23±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1011, None)
+     2.02±0.02μs      2.25±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, None)
+     2.02±0.02μs      2.25±0.05μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, None)
+      16.3±0.2μs       18.2±0.2μs     1.12  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     2.01±0.04μs      2.24±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, None)
+     2.00±0.02μs      2.23±0.03μs     1.12  tslibs.normalize.Normalize.time_normalize_i8_timestamps(0, tzlocal())
+     2.03±0.03μs      2.27±0.05μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 2011, tzlocal())
+     2.04±0.04μs      2.28±0.08μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, tzlocal())
+     2.00±0.03μs      2.23±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, None)
+     2.05±0.03μs      2.27±0.05μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, tzlocal())
+     2.01±0.06μs      2.23±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 7000, None)
+     2.05±0.02μs      2.27±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 9000, tzlocal())
+     2.01±0.05μs      2.23±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, None)
+     2.28±0.07μs      2.52±0.01μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(100, None)
+     2.04±0.03μs      2.26±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1000, None)
+     1.85±0.05μs      2.05±0.03μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, datetime.timezone.utc)
+     2.01±0.05μs      2.23±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, None)
+     1.97±0.05μs      2.19±0.03μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, None)
+     1.89±0.05μs      2.10±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone.utc)
+     1.87±0.06μs      2.07±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, datetime.timezone.utc)
+     2.04±0.03μs      2.25±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, tzlocal())
+     1.86±0.06μs      2.06±0.02μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(0, datetime.timezone.utc)
+     2.07±0.02μs      2.29±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, tzlocal())
+     2.06±0.04μs      2.27±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, tzlocal())
+     1.88±0.07μs      2.08±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 9000, datetime.timezone.utc)
+     1.89±0.05μs      2.09±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1011, datetime.timezone.utc)
+     2.02±0.06μs      2.24±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4006, None)
+     2.04±0.06μs      2.25±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, None)
+     1.89±0.06μs      2.09±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1000, datetime.timezone.utc)
+     1.89±0.05μs      2.08±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 7000, datetime.timezone.utc)
+     2.08±0.02μs      2.29±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, tzlocal())
+     2.05±0.04μs      2.26±0.04μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, tzlocal())
+     1.87±0.04μs      2.06±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 12000, datetime.timezone.utc)
+     2.05±0.01μs      2.26±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 9000, tzlocal())
+     2.05±0.03μs      2.26±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, tzlocal())
+     2.04±0.04μs      2.25±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, tzlocal())
+     2.03±0.04μs      2.23±0.06μs     1.10  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, tzlocal())
+     2.04±0.05μs      2.24±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, tzlocal())
+     2.05±0.03μs      2.25±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 5000, tzlocal())
-        55.0±5μs       49.6±0.3μs     0.90  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, False, 'int')
-      20.4±0.3ms       18.2±0.3ms     0.89  gil.ParallelGroupbyMethods.time_parallel(2, 'mean')
-        64.8±6μs       57.9±0.3μs     0.89  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, True, 'int')
-       54.3±10μs       46.3±0.8μs     0.85  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, False, 'float')
-       72.8±20μs       55.6±0.5μs     0.76  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, True, 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

will look in detail tomorrow.

jorisvandenbossche · 2021-04-02T15:35:41Z

@simonjayhawkins I think that is mostly within noise-bounds.
For example, all the TimeDT64ArrToPeriodArr are unrelated, AFAIK. The dt64arr_to_periodarr function doesn't use infer_dtype, as far as I see, and thus can't be affected by this change.

This _TYPE_MAP is only used for infer_dtype. We do have benchmarks for infer_dtype (InferDtype), but those don't show up in your summary above, which indicates they show no significant change.
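As a rough aid for reading the asv table earlier in the thread, the sketch below recomputes the ratio column and checks whether the two ± ranges touch. The helper `ratio_and_overlap` is hypothetical, and the overlap test is a crude heuristic chosen for illustration; it is not the statistic asv itself uses to flag a change.

```python
def ratio_and_overlap(before, before_err, after, after_err):
    """Return (after / before, True if the two ± ranges touch or overlap)."""
    ratio = after / before
    overlap = before + before_err >= after - after_err
    return ratio, overlap


# Two rows copied from the table (the unit cancels in the ratio).
rows = {
    "IndexCache.time_inferred_type('Int64Index')": (400, 50, 500, 50),     # ns
    "time_dt64arr_to_periodarr(0, 10000, utc)": (1.87, 0.04, 2.11, 0.05),  # μs
}

for name, values in rows.items():
    ratio, overlap = ratio_and_overlap(*values)
    print(f"{name}: ratio={ratio:.2f}, ranges overlap={overlap}")
```

The first row's ranges touch, while the second row's do not. A non-overlapping range does not by itself show that the PR caused the change, which is why the comment also checks whether the benchmarked function calls infer_dtype at all.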

We can also check the function itself specifically:

In [1]: arr1 = np.array([1, 2, 3])

In [2]: arr2 = pd.array(["a", "b"], dtype="string")

In [3]: %timeit pd.api.types.infer_dtype(arr1)
2.59 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
2.67 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- PR

In [5]: %timeit pd.api.types.infer_dtype(arr2)
1.04 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- master
1.4 µs ± 41.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- PR

So for a StringArray input, there is some change. But even then, the question is whether this change is significant for actual usage. We typically use infer_dtype when handling generic input (e.g. in constructors), and for those cases this change will not be significant, I think.

jorisvandenbossche · 2021-04-08T08:52:06Z

pandas/_libs/lib.pyx

@@ -1110,7 +1110,7 @@ _TYPE_MAP = {
    "complex64": "complex",
    "complex128": "complex",
    "c": "complex",
-    "string": "string",
+    str: "string",


Suggested change

-    str: "string",
+    "string": "string",
+    str: "string",

We could keep both, and that should address the small slowdown for inferring an array with dtype="string" (as it will first check the name, and only then the dtype.type).

(but again, infer_dtype is mostly used to infer actual lists or object-dtype arrays; inferring an array that already has a proper dtype is fast anyway, so I don't think this small difference matters much)

Changed to keep both for now, so as not to affect the performance of the object-backed StringArray.
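The effect of keeping both keys can be sketched in plain Python as follows. The toy dtype classes and `lookup_with_probes` are hypothetical and are not the pandas implementation; the only thing taken from the thread is the lookup order (the name is checked first, and only then the dtype.type). The dtype names `"string"` and `"arrow_string"` are the ones shown earlier in the thread.

```python
# Toy sketch: NOT the pandas implementation. All names are hypothetical.

class ToyStringDtype:
    name = "string"        # stands in for the object-backed string dtype
    type = str


class ToyArrowStringDtype:
    name = "arrow_string"  # stands in for the arrow-backed string dtype
    type = str


def lookup_with_probes(dtype, type_map):
    """Return (label, number of attributes probed before a hit)."""
    for probes, attr in enumerate(("name", "type"), start=1):
        key = getattr(dtype, attr, None)
        if key in type_map:
            return type_map[key], probes
    return "unknown-array", 2


only_type = {str: "string"}                  # the PR before this suggestion
both = {"string": "string", str: "string"}   # the suggested change

print(lookup_with_probes(ToyStringDtype, only_type))
print(lookup_with_probes(ToyStringDtype, both))
print(lookup_with_probes(ToyArrowStringDtype, both))
```

With only the `str` key, the `"string"` dtype is resolved on the second probe; with both keys it is resolved on the first probe again, while the arrow-backed dtype is still recognized through its type.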

jreback · 2021-04-09T01:41:48Z

thanks @simonjayhawkins

simonjayhawkins added 8 commits March 29, 2021 14:51

TST: [ArrowStringArray] more parameterised testing - part 1

3bb9750

Merge remote-tracking branch 'upstream/master' into nullable_string_d…

acfb5f5

…type

revert changes to pandas/tests/frame/methods/test_astype.py

98b3a5f

Merge remote-tracking branch 'upstream/master' into nullable_string_d…

56d3717

…type

undo inference change

c095cd4

Merge remote-tracking branch 'upstream/master' into nullable_string_d…

88b05e8

…type

Revert "undo inference change"

d02379d

This reverts commit c095cd4.

remove overlap with pandas-dev#40679

2715690

simonjayhawkins added the Bug and Strings (String extension data type and string data) labels Apr 1, 2021

simonjayhawkins added this to the 1.3 milestone Apr 1, 2021

simonjayhawkins commented Apr 1, 2021

jorisvandenbossche reviewed Apr 1, 2021

jorisvandenbossche changed the title ~~BUG: [ArrowStringArray] pd._libs.lib.infer_dtype(<ArrowStringArray>) returns 'unknown-array'~~ BUG: [ArrowStringArray] Recognize ArrowStringArray in infer_dtype Apr 1, 2021

Merge remote-tracking branch 'upstream/master' into follow-up

4e68c85

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Apr 1, 2021

remove changes to test_string_dtype - broken off in pandas-dev#40725

74dbf96

This was referenced Apr 2, 2021

Backport PR #40718 on branch 1.2.x (COMPAT: matplotlib 3.4.0) #40739

Merged

ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray #40708

Merged

jorisvandenbossche reviewed Apr 8, 2021

simonjayhawkins added 2 commits April 8, 2021 11:14

Merge remote-tracking branch 'upstream/master' into follow-up

225233c

keep both

842b0b1

simonjayhawkins marked this pull request as ready for review April 8, 2021 13:42

jorisvandenbossche approved these changes Apr 8, 2021

jreback merged commit d742094 into pandas-dev:master Apr 9, 2021

simonjayhawkins deleted the follow-up branch April 9, 2021 08:59

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: [ArrowStringArray] Recognize ArrowStringArray in infer_dtype (pa…

2bf0e8d

…ndas-dev#40725)
