
Poor performance for .loc and .iloc compared to .ix #6683


Closed
brownmk opened this issue Mar 21, 2014 · 23 comments

Comments

@brownmk

brownmk commented Mar 21, 2014

When using indices, we are encouraged to use .loc instead of .ix, and the "SettingWithCopyWarning" message also recommends using .loc. But .loc and .iloc seem to be 20-30 times slower than .ix (I am using pandas 0.13.1):

.ix takes 4.54897093773 sec
.iloc takes 111.531260967 sec
.loc takes 92.8014230728 sec

The costs for .loc and .iloc seem too high. Thanks!
--- test code ---

#!/usr/bin/env python

import pandas as pd
import numpy as np
import time

t = pd.DataFrame(data=np.zeros([100000, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])
start = time.time()
for i in t.index:
    for j in ['A', 'B', 'C', 'D', 'E']:
        t.ix[i, j]
t1 = time.time()
print t1 - start

for i in xrange(100000):
    for j in xrange(6):
        t.iloc[i, j]
t2 = time.time()
print t2 - t1

for i in t.index:
    for j in ['A', 'B', 'C', 'D', 'E']:
        t.loc[i, j]
t3 = time.time()
print t3 - t2

@jreback
Contributor

jreback commented Mar 21, 2014

this is on the order of a few function calls; the difference is validating the input

what exactly are you trying to do?

If you are actually iterating thru the elements, then all of these approaches are wrong.

The indexers

@jreback
Contributor

jreback commented Mar 21, 2014

In [16]: %timeit t.ix[100,'A']
100000 loops, best of 3: 4.67 µs per loop

In [17]: %timeit t.loc[100,'A']
10000 loops, best of 3: 142 µs per loop

In [18]: %timeit t.iloc[100,0]
10000 loops, best of 3: 113 µs per loop

If you really really want to access individual elements, then use iat/at

In [23]: %timeit t.iat[100,0]
100000 loops, best of 3: 8.8 µs per loop

In [24]: %timeit t.at[100,'A']
100000 loops, best of 3: 5.8 µs per loop

@brownmk
Author

brownmk commented Mar 21, 2014

Thanks! I will use .at instead. Could you explain a bit why .loc is so much slower at accessing a single element (while it appears to work fine with a range selection like t.loc[:100, ['A','E']])? Why doesn't .ix suffer from that?

@brownmk
Author

brownmk commented Mar 21, 2014

Since .ix is the fastest in all cases, what is the argument against using .ix instead of .at? Thanks!

@jreback
Contributor

jreback commented Mar 21, 2014

ix can very subtly give wrong results (use an index of, say, even numbers)

you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location

but to my first point, if you are actually iterating over elements there are much better ways to do it.
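
A minimal sketch of that ambiguity (my own illustration, runnable only on the old pandas versions where .ix still existed):

import pandas as pd

# Integer index of even numbers: .ix treats 2 as a *label*.
s = pd.Series(['a', 'b', 'c'], index=[2, 4, 6])
s.ix[2]    # 'a'  -> label 2, the first element
s.loc[2]   # 'a'  -> label 2; .loc never reinterprets this as a position
s.iloc[2]  # 'c'  -> position 2, unambiguous

# Non-integer index: the very same .ix call now means *position* 2.
s2 = pd.Series(['a', 'b', 'c'], index=['x', 'y', 'z'])
s2.ix[2]   # 'c'  -> silent fallback to positional access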

@jreback jreback closed this as completed Mar 21, 2014
@csaladenes

I usually convert the pandas DataFrame to a Python dictionary (which for many applications is enough, once you have done the pandas-specific sorting/slicing). Iteration is then about 3-4 orders of magnitude faster.
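
For example, a rough sketch of that approach (illustrative only; the sizes mirror the benchmark above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros([100000, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])

# One-time conversion: {column -> {index label -> value}}.
d = df.to_dict()
for i in range(100000):
    for col in ['A', 'B', 'C', 'D', 'E']:
        d[col][i]          # plain dict lookups, far faster than df.loc[i, col]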

@sergeny

sergeny commented Dec 24, 2014

This does not really answer the question. While these approaches might be impractical, all of the documentation suggests using .loc, .iloc, .at, or .iat over .ix, yet .ix is consistently faster. Are there any advantages to not using .ix? Better type checking? What if I am completely sure that my indexes contain only integers and nothing else?

I've profiled the example above, inspired by the recent issue with .loc[list] (#9126), and indeed, even .iloc remains slower than .ix.

In fact, even .at is slower than .ix for retrieving a single element. This is because .at calls pd.core.indexing._AtIndexer._convert_key, which is the most time-consuming function call, whereas .ix simply does not do that.

 a=pd.DataFrame(data=np.zeros([10000,6]), columns=['A','B','C','D','E','F'])
In [58]: %timeit  [a.at[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 668 ms per loop

In [58]: %prun  [a.at[5235,'C'] for i in xrange(100000)]

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.226    0.000    0.393    0.000 indexing.py:1526(_convert_key)
   100000    0.147    0.000    0.569    0.000 frame.py:1623(get_value)
   100000    0.142    0.000    1.117    0.000 indexing.py:1498(__getitem__)
   100000    0.116    0.000    0.253    0.000 internals.py:3469(get_values)
        1    0.105    0.105    1.280    1.280 <string>:1(<module>)
   200000    0.083    0.000    0.095    0.000 index.py:606(is_integer)
   100000    0.065    0.000    0.065    0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

In [60]: %timeit  [a.ix[5235,'C'] for i in xrange(100000)]
1 loops, best of 3: 446 ms per loop


In [62]: %prun  [a.ix[5235,'C'] for i in xrange(100000)]

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.146    0.000    0.532    0.000 frame.py:1623(get_value)
   100000    0.110    0.000    0.704    0.000 indexing.py:61(__getitem__)
   100000    0.104    0.000    0.231    0.000 internals.py:3469(get_values)
        1    0.100    0.100    0.862    0.862 <string>:1(<module>)
   100000    0.060    0.000    0.060    0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}

@xgdgsc
Contributor

xgdgsc commented Oct 8, 2015

Are you sure this is fixed? I still see ix being about 30 times faster than loc when accessing an element with something like df.ix['row', 'col'], on pandas 0.16.2. When running %timeit df.ix['row', 'col'], I get the message:

The slowest run took 10.01 times longer than the fastest. This could mean that an intermediate result is being cached

When using loc, this message doesn't appear and it is much slower.

@csaladenes Thanks. Converting to a dict is indeed much faster!

@jreback
Contributor

jreback commented Oct 8, 2015

@xgdgsc yes this is fixed.

.loc does quite a bit of inference and is much more type-safe than .ix. You can still use .ix if you'd like. But if this perf diff actually matters (and we are talking microseconds here), then you are just doing things wrong: you shouldn't be repeatedly calling these functions in a loop. There are much, much better ways of doing things.

@xgdgsc
Contributor

xgdgsc commented Oct 8, 2015

Yes. I shouldn't be repeatedly calling these functions in a loop.

@jreback
Contributor

jreback commented Oct 8, 2015

Here's a notebook that I show sometimes that explains why this is not a good idea: https://github.com/jreback/StrataNYC2015/blob/master/performance/8.%20indexing.ipynb

@ys198918

I used .ix instead of .loc, but it is still very slow, almost the same. Do you know why? @jreback

@jreback
Contributor

jreback commented May 24, 2016

@ys198918 if this perf difference actually matters to you, then you are iterating, which is wrong. You need to do vectorized things; iterating is much less performant.
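
For the benchmark at the top of the thread, "vectorized" would mean operating on whole columns or the underlying array instead of visiting each cell from Python; a sketch (illustrative, assuming the goal is ultimately some aggregate of the values):

import numpy as np
import pandas as pd

t = pd.DataFrame(np.zeros([100000, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])

# One call per column instead of one indexer call per cell.
col_sums = t[['A', 'B', 'C', 'D', 'E']].sum()

# Or drop to the ndarray for bulk element access.
arr = t.values
total = arr[:, :5].sum()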

@markyoder

Why does PANDAS need so many different methods to access data elements? Given the esoteric complexity -- use .at vs .loc, or .xs, or .ix, and if you use the wrong mode of access, your code will run like a rocket sled plowing through a wall of jello -- I am still trying to figure out what is so great about PANDAS.

@carstenf

@jreback

I'm using .loc to pull the rows that share the same value in the 'name' column out of a larger dataframe and would like to speed it up. Currently one run needs 15 hours...

the pseudo code is like:

data_big = pd.read_csv(path)

for i in my_list.index:
    name = my_list.at[i, 'name']
    data = data_big.loc[data_big['name'] == name]

..... then continue with other stuff

data will then hold around 4000 rows.

The most time-consuming part is:
data = data_big.loc[data_big['name'] == name]

How can I speed that up?

Thank you
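
One way to avoid re-running the boolean filter data_big['name'] == name on every iteration is to group data_big by 'name' up front and then just look each group up. A rough sketch, reusing the names from the pseudo code above (path and my_list assumed to be defined as there; this is only one possible approach, not necessarily the fastest):

import pandas as pd

data_big = pd.read_csv(path)

# Split data_big into one sub-DataFrame per name, once, instead of
# evaluating data_big['name'] == name inside the loop.
groups = {name: g for name, g in data_big.groupby('name')}

for i in my_list.index:
    name = my_list.at[i, 'name']
    data = groups.get(name)      # ~4000-row DataFrame, or None if absent
    if data is not None:
        pass                     # ..... then continue with the other stuff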

@markyoder

markyoder commented May 22, 2020 via email

@carstenf

Thanks!

These should be the changes...

data_big = pd.read_csv(path)
data_big_np = data_big.to_records()

for i in my_list.index
    name  = my_list.at[i, 'name']    
    # data = data_big.loc[data_big['name'] == name ]
    data = data_big_np['name'] == name

    # check if not empty
    if data.size != 0:
        
        
       # vals will be send to a database
        vals = ",".join(["""('{}',{},{})""".format (
        data[i, 'date'],  
        data[i, 'something'],  
        security_id ) for i in data]) 

....but it looks like my iteration is not working.
Where did I miss it? I did not find any useful help via Google on "iterate recarray".

@quangtuan202

(quoting @carstenf's snippet above)

I don't really understand why you need the for loop here.

@carstenf

for i in my_list -> loops over the different items
for i in data -> each item has several rows
After each join, I write them to the database.

@markyoder

markyoder commented Jan 23, 2021 via email

@markyoder

markyoder commented Jan 23, 2021 via email

@lzell

lzell commented Apr 10, 2021

FYI, there is a built-in pandas method that converts DataFrames to numpy recarrays. I found this thread due to poor indexing times with .iloc and .loc. Converting to a numpy recarray first, as @markyoder suggested, improved my algo's execution time by an order of magnitude (the algo relies on random access).
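
That built-in method is DataFrame.to_records; a minimal sketch of the kind of random access involved (illustrative sizes):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros([100000, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])

rec = df.to_records()   # numpy record array; the index becomes an 'index' field
rec[5235]['C']          # random access: row by position, then field by name
rec['C'][5235]          # equivalent: pull the whole column as an ndarray first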

@markyoder

markyoder commented Apr 10, 2021 via email
