Poor performance for .loc and .iloc compared to .ix #6683
Comments
The difference is on the order of a few function calls for validating the input. What exactly are you trying to do? If you are actually iterating through the elements, then all of these approaches are wrong; the indexers do extra validation on every call. |
If you really, really want to access individual elements, then use .at |
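For reference, a minimal sketch of scalar access with .at (label-based) and .iat (position-based), the fast paths suggested above; the frame and values here are illustrative:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.zeros((1000, 3)), columns=["A", "B", "C"])

# .at: fast label-based scalar access (skips most .loc input validation)
t.at[0, "A"] = 1.5
value = t.at[0, "A"]

# .iat: fast position-based scalar access (the .iloc counterpart)
same_value = t.iat[0, 0]

print(value, same_value)  # 1.5 1.5
```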
Thanks! Will use .at instead. Could you explain a bit why .loc is so much slower at accessing an element (while it appears to work fine with a range selection, t.loc[:100, ['A','E']])? Why doesn't .ix suffer from that? |
Since .ix is the fastest in all cases, what is the argument against using .ix instead of .at? Thanks! |
you can use whatever function you want; but to my first point: if you are actually iterating over elements, there are much better ways to do it. |
I usually convert the pandas DataFrame to a Python dictionary (which for many applications is enough, once you have done the pandas-specific sorting/slicing). Iteration is then about 3-4 orders of magnitude faster. |
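The dict-conversion approach described above can be sketched like this; `to_dict("records")` is one of several possible orientations, and the data here is made up. After the one-time conversion, the loop touches only plain Python objects, with no pandas indexing machinery per element:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["A", "B"])

# One conversion up front, then iterate over plain dicts
records = t.to_dict("records")  # [{'A': 0.0, 'B': 1.0}, ...]

total = 0.0
for row in records:          # plain Python iteration, no pandas indexing
    total += row["A"] + row["B"]

print(total)  # 15.0
```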
This does not really answer the question. While these approaches might be impractical, all of the documentation suggests using .loc. I've profiled the example above, inspired by the recent issue with … In fact, even …
[profiler output truncated; only the header "ncalls tottime percall cumtime percall filename:lineno(function)" survives] |
Are you sure this is fixed? I still see .ix being about 30 times faster than .loc when accessing with something like df.ix['row', 'col'] on pandas 0.16.2. When running … when using .loc, this message doesn't appear and it is much slower. |
@csaladenes Thanks. Converting to dict is indeed much faster! |
@xgdgsc yes this is fixed.
|
Yes. I shouldn't be repeatedly calling these functions in a loop. |
Here's a notebook that I show sometimes that explains why this is not a good idea: https://github.com/jreback/StrataNYC2015/blob/master/performance/8.%20indexing.ipynb |
I used .ix instead of .loc, but it's still very slow, almost the same. Do you know why? @jreback |
@ys198918 if this perf difference actually matters to you, then you are iterating, which is wrong. You need to do vectorized things; iterating is much less performant. |
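A minimal sketch of the vectorized-vs-iterating point above, with illustrative data: the loop goes through pandas scalar indexing twice per row, while the vectorized form is a single whole-column operation:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"A": np.arange(5.0), "B": np.arange(5.0) * 2})

# Element-wise loop (slow; what the thread warns against)
slow = [t.at[i, "A"] + t.at[i, "B"] for i in t.index]

# Vectorized equivalent: one operation over whole columns
fast = (t["A"] + t["B"]).tolist()

assert slow == fast
print(fast)  # [0.0, 3.0, 6.0, 9.0, 12.0]
```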
Why does PANDAS need so many different methods to access data elements? Given the esoteric complexity of choosing .at vs .loc or .xs or .ix (and if you use the wrong mode of access, your code will run like a rocket sled plowing through a wall of jello), I am still trying to figure out what is so great about PANDAS. |
@jreback I'm using .loc to get several rows with the same column name out of a larger dataframe and would like to speed it up. Currently one run needs 15 hours... the pseudo code is like:
data_big = pd.read_csv(path)
for i in my_list.index:
    name = my_list.at[i, 'name']
    data = data_big.loc[data_big['name'] == name]
    # ...then continue other stuff
data will then hold around 4000 rows. The most time-consuming part is:
data = data_big.loc[data_big['name'] == name]
How can I speed that up? Thank you |
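One common way to avoid recomputing the boolean mask for every name is to partition the frame once with groupby; this is a sketch with made-up data, not the poster's actual frame:

```python
import pandas as pd

# Illustrative stand-in for data_big from the comment above
data_big = pd.DataFrame({
    "name": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Partition once instead of scanning with a boolean mask per name
groups = {name: grp for name, grp in data_big.groupby("name")}

data_a = groups["a"]           # same rows as data_big.loc[data_big["name"] == "a"]
print(len(data_a))             # 3
print(data_a["value"].sum())   # 9
```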
Can you use a numpy recarray instead? A while back, I benchmarked all of the PANDAS access methods and they all performed poorly compared to... well, compared to everything. |
|
Thanks, these should be the changes... but it looks like my iteration is not working. |
I don't really understand why you need a for loop here. |
For i in my_list -> runs the different items
for i in data -> each item has different rows
after each join, I write them to a database |
Let me look more closely, but iterating over a recarray (which might still be different from a structured array, which might be faster) is easy. By default, an integer index is interpreted as a row; a string as a column; two indices should be row, col.
If you are only interested in one column, it is best to work with it directly (i.e., data[col_name][k] is faster than data[k][col_name]).
Something like for rw in recarray: gives an iteration of mini-recarray rows.
Assignment can be tricky. Typically, I've found that values need to be assigned by column, which is not ideal, but not too bad. So if you want to assign values to all columns for rows [k:j], you assign to that range for each column separately.
M
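The recarray access patterns described above can be sketched like this (a small illustrative record array, not the thread's actual data):

```python
import numpy as np

# A small record array with two named columns
rec = np.rec.fromarrays(
    [np.arange(4), np.arange(4) * 10.0],
    names=["idx", "val"],
)

print(rec[1])          # integer index -> one row record
print(rec["val"])      # string index -> whole column
print(rec["val"][2])   # column-then-row access: 20.0

# Iterating yields row records
total = sum(rw["val"] for rw in rec)

# Assignment works by column over a row range
rec["val"][1:3] = -1.0
print(total, rec["val"].tolist())  # 60.0 [0.0, -1.0, -1.0, 30.0]
```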
|
Otherwise, it’s lots of googling and experimenting with syntax.
|
FYI, there is a built-in pandas method that converts dataframes to numpy recarrays: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_records.html. I found this thread due to poor indexing times with iloc and loc. Converting to a numpy recarray first, as @markyoder suggested, improved my algo's execution time by an order of magnitude (the algo relies on random access). |
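A minimal sketch of the to_records conversion mentioned above, followed by plain NumPy-style access (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10.0, 20.0, 30.0]})

# DataFrame -> numpy record array; drop the index to keep only the columns
rec = df.to_records(index=False)

first_b = rec["B"][0]      # column-then-row access
row = rec[1]               # one row record
print(first_b, row["A"])   # 10.0 2
```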
Generally, I rarely see the value in PANDAS, but it is my understanding that it performs much better than numpy for very large arrays. But mostly, I like talking smack about PANDAS, so consider that before taking my advice! |
|
When using indices, we are encouraged to use .loc instead of .ix. The SettingWithCopyWarning also recommends using .loc. But the performance of .loc and .iloc seems to be 20-30 times slower than .ix (I am using pandas 0.13.1):
.ix takes 4.54897093773 sec
.iloc takes 111.531260967 sec
.loc takes 92.8014230728 sec
The costs for .loc and .iloc seem too high. Thanks!
-- test code ---
#!/usr/bin/env python
import pandas as pd
import numpy as np
import time

t = pd.DataFrame(data=np.zeros([100000, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])

start = time.time()
for i in t.index:
    for j in ['A', 'B', 'C', 'D', 'E']:
        t.ix[i, j]
t1 = time.time()
print t1 - start

for i in xrange(100000):
    for j in xrange(6):
        t.iloc[i, j]
t2 = time.time()
print t2 - t1

for i in t.index:
    for j in ['A', 'B', 'C', 'D', 'E']:
        t.loc[i, j]
t3 = time.time()
print t3 - t2
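On modern pandas (where .ix has been removed), a rough version of the same comparison can be sketched with timeit; absolute numbers depend heavily on the machine and pandas version, so this is illustrative only, with a smaller frame than the original:

```python
import timeit

import numpy as np
import pandas as pd

t = pd.DataFrame(np.zeros((1000, 6)), columns=list("ABCDEF"))

def with_loc():
    # label-based scalar access via .loc (validated on every call)
    for i in t.index:
        for j in ["A", "B", "C", "D", "E"]:
            t.loc[i, j]

def with_at():
    # label-based scalar access via .at (the fast path)
    for i in t.index:
        for j in ["A", "B", "C", "D", "E"]:
            t.at[i, j]

loc_time = timeit.timeit(with_loc, number=1)
at_time = timeit.timeit(with_at, number=1)
print(".loc: %.3fs  .at: %.3fs" % (loc_time, at_time))
```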