Skip to content

BUG: [ArrowStringArray] Recognize ArrowStringArray in infer_dtype #40725

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 9, 2021

Conversation

simonjayhawkins
Copy link
Member

PR is draft. follow-up to #40679, #40679 (comment) which is not yet merged and need to make window to run benchmarks #40679 (review)

on master

>>> pd.__version__
'1.3.0.dev0+1180.gdd697e1cca'
>>> 
>>> from pandas.core.arrays.string_arrow import ArrowStringDtype
>>> 
>>> arr = ArrowStringDtype.construct_array_type()._from_sequence(["B", pd.NA, "A"])
>>> 
>>> arr
<ArrowStringArray>
['B', <NA>, 'A']
Length: 3, dtype: arrow_string
>>> 
>>> pd._libs.lib.infer_dtype(arr)
'unknown-array'
>>> 

@simonjayhawkins simonjayhawkins added Bug Strings String extension data type and string data labels Apr 1, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Apr 1, 2021
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged the pre-cursor, so this can be updated now.

For the rest looks good to me (using str (the dtype.type) in the _TYPE_MAP is a similar solution as what we did for PeriodDtype (which also has a parametrized name), so that's seems an appropriate solution to me)

@jorisvandenbossche jorisvandenbossche changed the title BUG: [ArrowStringArray] pd._libs.lib.infer_dtype(<ArrowStringArray>) returns 'unknown-array' BUG: [ArrowStringArray] Recognize ArrowStringArray in infer_dtype Apr 1, 2021
@simonjayhawkins
Copy link
Member Author

I merged the pre-cursor, so this can be updated now.

Thanks @jorisvandenbossche will get the open PRs updated.

and a couple of PRs for astype failures (both ways) in the pipeline. one uses the fixture #39908 (comment)

and some more parameterization of existing tests where dtype="string" #39908 (comment)

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Apr 1, 2021
@jorisvandenbossche
Copy link
Member

@simonjayhawkins this can be undrafted?

@simonjayhawkins
Copy link
Member Author

@simonjayhawkins this can be undrafted?

the benchmarks are running and will have a answer later today.

@jorisvandenbossche
Copy link
Member

(I don't think this will have any impact on performance, though. And we already did the same change for Period)

@simonjayhawkins
Copy link
Member Author

(I don't think this will have any impact on performance, though. And we already did the same change for Period)

#40679 (comment)

@simonjayhawkins
Copy link
Member Author

hmm, not good.

       before           after         ratio
     [1367cacd]       [37571106]
     <master>         <follow-up>
+        400±50ns         500±50ns     1.25  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
+        400±50ns         500±50ns     1.25  index_cached_properties.IndexCache.time_is_unique('Int64Index')
+     2.69±0.01ms       3.18±0.4ms     1.18  inference.ToDatetimeISO8601.time_iso8601
+      2.20±0.2μs       2.60±0.4μs     1.18  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+     1.87±0.04μs      2.11±0.05μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 10000, datetime.timezone.utc)
+     2.00±0.05μs      2.25±0.03μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, None)
+     1.99±0.04μs      2.24±0.01μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, None)
+     2.00±0.02μs      2.26±0.05μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, None)
+     2.00±0.03μs      2.25±0.03μs     1.13  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, None)
+     1.87±0.06μs      2.10±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, datetime.timezone.utc)
+     2.00±0.03μs      2.24±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 8000, None)
+     2.01±0.02μs      2.25±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 6000, None)
+     2.01±0.05μs      2.26±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 12000, None)
+     2.01±0.05μs      2.25±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 3000, None)
+     1.99±0.05μs      2.23±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1011, None)
+     2.02±0.02μs      2.25±0.01μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, None)
+     2.02±0.02μs      2.25±0.05μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, None)
+      16.3±0.2μs       18.2±0.2μs     1.12  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     2.01±0.04μs      2.24±0.02μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, None)
+     2.00±0.02μs      2.23±0.03μs     1.12  tslibs.normalize.Normalize.time_normalize_i8_timestamps(0, tzlocal())
+     2.03±0.03μs      2.27±0.05μs     1.12  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 2011, tzlocal())
+     2.04±0.04μs      2.28±0.08μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4000, tzlocal())
+     2.00±0.03μs      2.23±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, None)
+     2.05±0.03μs      2.27±0.05μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, tzlocal())
+     2.01±0.06μs      2.23±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 7000, None)
+     2.05±0.02μs      2.27±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 9000, tzlocal())
+     2.01±0.05μs      2.23±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, None)
+     2.28±0.07μs      2.52±0.01μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(100, None)
+     2.04±0.03μs      2.26±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1000, None)
+     1.85±0.05μs      2.05±0.03μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, datetime.timezone.utc)
+     2.01±0.05μs      2.23±0.03μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, None)
+     1.97±0.05μs      2.19±0.03μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, None)
+     1.89±0.05μs      2.10±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone.utc)
+     1.87±0.06μs      2.07±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, datetime.timezone.utc)
+     2.04±0.03μs      2.25±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, tzlocal())
+     1.86±0.06μs      2.06±0.02μs     1.11  tslibs.normalize.Normalize.time_normalize_i8_timestamps(0, datetime.timezone.utc)
+     2.07±0.02μs      2.29±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, tzlocal())
+     2.06±0.04μs      2.27±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, tzlocal())
+     1.88±0.07μs      2.08±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 9000, datetime.timezone.utc)
+     1.89±0.05μs      2.09±0.01μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1011, datetime.timezone.utc)
+     2.02±0.06μs      2.24±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 4006, None)
+     2.04±0.06μs      2.25±0.02μs     1.11  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, None)
+     1.89±0.06μs      2.09±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 1000, datetime.timezone.utc)
+     1.89±0.05μs      2.08±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 7000, datetime.timezone.utc)
+     2.08±0.02μs      2.29±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, tzlocal())
+     2.05±0.04μs      2.26±0.04μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 11000, tzlocal())
+     1.87±0.04μs      2.06±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 12000, datetime.timezone.utc)
+     2.05±0.01μs      2.26±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 9000, tzlocal())
+     2.05±0.03μs      2.26±0.01μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, tzlocal())
+     2.04±0.04μs      2.25±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, tzlocal())
+     2.03±0.04μs      2.23±0.06μs     1.10  tslibs.normalize.Normalize.time_normalize_i8_timestamps(1, tzlocal())
+     2.04±0.05μs      2.24±0.02μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, tzlocal())
+     2.05±0.03μs      2.25±0.03μs     1.10  tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 5000, tzlocal())
-        55.0±5μs       49.6±0.3μs     0.90  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, False, 'int')
-      20.4±0.3ms       18.2±0.3ms     0.89  gil.ParallelGroupbyMethods.time_parallel(2, 'mean')
-        64.8±6μs       57.9±0.3μs     0.89  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, True, 'int')
-       54.3±10μs       46.3±0.8μs     0.85  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, False, 'float')
-       72.8±20μs       55.6±0.5μs     0.76  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7fdbdd6368b0>, True, 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

will look in detail tomorrow.

@jorisvandenbossche
Copy link
Member

@simonjayhawkins I think that is mostly within noise-bounds.
For example, all the TimeDT64ArrToPeriodArr are unrelated, AFAIK. The dt64arr_to_periodarr function doesn't use infer_dtype, as far as I see, and thus can't be affected by this change.

This _TYPE_MAP is only used for infer_dtype. We do have benchmarks for infer_dtype (InferDtype), but those don't show up in your summary above, so indicating they show no significant change.

We can also check the function itself specifically:

In [1]: arr1 = np.array([1, 2, 3])

In [2]: arr2 = pd.array(["a", "b"], dtype="string")

In [3]: %timeit pd.api.types.infer_dtype(arr1)
2.59 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- master
2.67 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  <-- PR

In [5]: %timeit pd.api.types.infer_dtype(arr2)
1.04 µs ± 55.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- master
1.4 µs ± 41.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)  <-- PR

So for a StringArray input, there is some change. But even then, the question is whether this is change is significant for actual usage. We typically use infer_dtype when handling generic input (eg in constructors), and for those cases this change will not be significant, I think.

@@ -1110,7 +1110,7 @@ _TYPE_MAP = {
"complex64": "complex",
"complex128": "complex",
"c": "complex",
"string": "string",
str: "string",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
str: "string",
"string": "string",
str: "string",

We could keep both, and that should address the small slowdown for inferring an array with dtype="string" (as it will first check the name, and only then the dtype.type).

(but again, the infer_dtype is mostly used to infer actual lists or object dtype arrays, the inferring of an array with already a proper dtype is fast anyway, so I don't think this small difference matters much)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to keep both for now, so as not to affect performance of object backed StringArray.

@simonjayhawkins simonjayhawkins marked this pull request as ready for review April 8, 2021 13:42
@jreback jreback merged commit d742094 into pandas-dev:master Apr 9, 2021
@jreback
Copy link
Contributor

jreback commented Apr 9, 2021

thanks @simonjayhawkins

@simonjayhawkins simonjayhawkins deleted the follow-up branch April 9, 2021 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants