Skip to content

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

Closed
@GregoryMorse

Description

@GregoryMorse

Code Sample, a copy-pastable example if possible

Updated without yfinance as it was a lousy choice of example - requires only standard numpy and datetime. A DateTimeIndex exacerbates the time slowness.

import timeit
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#3.553613900000073
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#0.5215671999999358
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#4.485623600000508
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#0.7185356999998476

Old code example:
#requires pip install yfinance for this example - could substitute with any data set
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
#5.178573199999846
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
#0.5886300999998184

Problem description

One would naturally expect iloc to have close to identical efficiency as column indexing. This is a dramatic blunder in a lot of code more than likely if it is not well documented and known. I would expect some strange inefficiency is present as logically it could be slightly slower to return a row rather than a data point. But 10 times slower is so dramatic that this function should be avoided.

Expected Output

Less than double the speed of single column value access. Or at least thorough documentation of this limitation.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    IndexingRelated to indexing on series/frames, not to indexes themselvesPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions