[ArrowStringArray] Use utf8_upper and utf8_lower functions from Apache Arrow #41056

Merged (1 commit) on Apr 20, 2021

simonjayhawkins
Member

data = ["Mouse", "dog", "house and parrot", "23", np.nan] * 100_000
s = pd.Series(data, dtype="string")
s1 = pd.Series(data, dtype="arrow_string")

%timeit s.str.upper()
# 63.8 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit s1.str.upper()
# 92 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 4.44 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

%timeit s.str.lower()
# 49.1 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit s1.str.lower()
# 78.2 ms ± 814 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 4.34 ms ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

@simonjayhawkins added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels Apr 20, 2021
@simonjayhawkins added this to the 1.3 milestone Apr 20, 2021
@jreback merged commit 9c43cd7 into pandas-dev:master Apr 20, 2021
@jreback
Contributor

jreback commented Apr 20, 2021

great, do we have sufficient asv's for these?

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
@simonjayhawkins deleted the String-transforms branch April 21, 2021 08:25
@simonjayhawkins
Member Author

great, do we have sufficient asv's for these?

have parameterised existing benchmarks for these in #41041

[ 25.00%] ··· strings.Methods.time_lower                                                                                                                  ok
[ 25.00%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        16.2±0ms 
                  string      12.6±0ms 
               arrow_string   2.67±0ms 
              ============== ==========

[ 50.00%] ··· strings.Methods.time_upper                                                                                                                  ok
[ 50.00%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        19.4±0ms 
                  string      16.4±0ms 
               arrow_string   2.71±0ms 
              ============== ==========

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
@jreback
Contributor

jreback commented Jul 3, 2021

@simonjayhawkins any idea why this is not timing well?

In [39]: s = pd.concat([pd.Series(list('abc'))] * 100_000)

In [40]: s1 = pd.concat([pd.Series(list('abc'), dtype='string[pyarrow]')] * 100_000)

In [41]: %timeit s.str.upper()
48.5 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %timeit s1.str.upper()
299 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: pd.__version__
Out[43]: '1.4.0.dev0+143.g5675cd8ab2'

In [44]: pyarrow.__version__
Out[44]: '4.0.1'

@simonjayhawkins
Member Author

does it create a chunked array with 100_000 chunks?

@jreback
Contributor

jreback commented Jul 3, 2021

ahh yes it does

@jreback
Contributor

jreback commented Jul 3, 2021

s1 = s.astype('string[pyarrow]')

%timeit s1.str.upper()
1.53 ms ± 29.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jreback
Contributor

jreback commented Jul 3, 2021

so actually this raises the question of whether we should be rechunking.

@jreback
Contributor

jreback commented Jul 3, 2021

let me open an issue
