
pd.Series.map is unreasonably slow. #21278


Closed
allComputableThings opened this issue May 31, 2018 · 2 comments

Comments

@allComputableThings

allComputableThings commented May 31, 2018

Code Sample, a copy-pastable example if possible

import timeit
import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range = domain + 10  # NB: shadows the built-in `range`
    maptable = pd.Series(range, index=domain).sort_index()

    query_vals = pd.Series([1, 2, 3])
    def f():
        query_vals.map(maptable)
    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)


10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453

Problem description

The above maps 3 values through a lookup table, much like looking up values in a Python dictionary.

n is the size of the table (not the query).

At n=10000000 it takes roughly (0.016/3) seconds per mapped value -- unbelievably, shockingly slow.

Expected Output

Costs appear to grow with O(len(maptable)).
My series' index is sorted. My expectation is that pandas costs far less than O(len(maptable)) for this type of operation.

Costs can scale equal to or less than:

  • O(len(query_vals) * log(len(maptable))), for a binary search on the index, or
  • O(len(query_vals)), for a hashed index.

... in other words, there shouldn't be a huge difference between the timings for n=10 and n=10000000
for a straightforward implementation.
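[Editor's note: the binary-search alternative mentioned above can be sketched with np.searchsorted; this is illustrative only, not how pandas implements .map.]

```python
import numpy as np

# Sorted lookup table: keys -> values, as two parallel arrays.
keys = np.arange(0, 10_000_000)
values = keys + 10

queries = np.array([1, 2, 3])

# Binary search on the sorted keys:
# O(len(queries) * log(len(keys))) comparisons in total.
pos = np.searchsorted(keys, queries)
mapped = values[pos]
print(mapped)  # -> [11 12 13]
```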

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.2.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jun 1, 2018

your results seem odd

for n = 10000000

in 0.23

   ...:     domain = np.arange(0, n)
   ...:     range = domain+10
   ...:     maptable = pd.Series(range, index=domain).sort_index()
   ...: 
   ...:     query_vals = pd.Series([1,2,3])
   ...:     def f():
   ...:         query_vals.map(maptable)
   ...: 
   ...: 

In [9]: %timeit query_vals.map(maptable)
145 us +- 15.9 us per loop (mean +- std. dev. of 7 runs, 1 loop each)

and in 0.22

   ...:     domain = np.arange(0, n)
   ...:     range = domain+10
   ...:     maptable = pd.Series(range, index=domain).sort_index()
   ...: 
   ...:     query_vals = pd.Series([1,2,3])
   ...:     def f():
   ...:         query_vals.map(maptable)
   ...:     
   ...:     

In [2]: %timeit query_vals.map(maptable)
351 us +- 71.2 us per loop (mean +- std. dev. of 7 runs, 1 loop each)

note that we build a hashtable for indexing. once it's built, these things are very cheap to do.

sure, you could take advantage of monotonic indexes, but there is a cost (in code complexity), and it is only sometimes worth it.
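[Editor's note: the warm-up effect described above can be seen by timing the first .map call, which pays for building the index's hash table, against a second call that reuses it; a rough sketch, not pandas internals.]

```python
import time
import numpy as np
import pandas as pd

n = 1_000_000
maptable = pd.Series(np.arange(n) + 10, index=np.arange(n))
query_vals = pd.Series([1, 2, 3])

t0 = time.perf_counter()
first = query_vals.map(maptable)   # first call: builds the index hash table
cold = time.perf_counter() - t0

t0 = time.perf_counter()
second = query_vals.map(maptable)  # later calls: reuse the cached table
warm = time.perf_counter() - t0

print(f"cold: {cold:.6f}s  warm: {warm:.6f}s")
print(first.tolist())  # [11, 12, 13]
```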

@allComputableThings
Author

Nice! Warm-up was the issue.

Actually, performance looks very reasonable with this in mind - there's basically no difference between n=10 and n=10000000, and the cost to build the index is not bad.

It also seems there's no benefit in calling sort_index.
Thanks!
