Performance difference between to_datetime & astype('datetime64[ns]') #17976

Closed
akansal1 opened this issue Oct 25, 2017 · 3 comments
Labels: Datetime (Datetime data dtype), Performance (Memory or execution speed performance)

Comments


akansal1 commented Oct 25, 2017

There is a huge performance difference between `to_datetime` and `astype` when converting a series of epoch timestamps:

Pandas to_datetime performance:

%timeit pd.to_datetime(pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)))

Output: 1.51 s ± 37.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas astype('datetime64[ns]') performance:

%timeit pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000)).astype('datetime64[ns]')

Output: 57.5 ms ± 987 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I am unable to find the reason for this performance difference. Any help would be great.

Thanks!
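For anyone reproducing this, it may help to first confirm that both calls return the same timestamps, so the timing gap is not explained by different results. A minimal sketch: the series is shortened to 100,000 values (an assumption, to keep the check quick), and the names `n_values`, `via_to_datetime` and `via_astype` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same start value and 6-second step as in the report, but only
# 100,000 values (assumption: shortened so the check runs quickly).
n_values = 100_000
start_ns = 1352500101000000000
step_ns = 6000000000
s = pd.Series(start_ns + step_ns * np.arange(n_values, dtype=np.int64))

# Both calls interpret the int64 values as nanoseconds since the epoch.
via_to_datetime = pd.to_datetime(s)
via_astype = s.astype('datetime64[ns]')

# Compare element-wise so the check does not depend on dtype details.
same_values = bool((via_to_datetime == via_astype).all())
print(same_values)
print(via_astype.iloc[0])
```

If `same_values` is true, the two paths differ only in how much work they do to get there.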

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-79-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.0
openpyxl: 2.5.0a2
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.0
pandas_gbq: None
pandas_datareader: 0.4.0

jorisvandenbossche (Member) commented

If you specify the unit, the difference is already much smaller:

In [15]: s = pd.Series(np.arange(1352500101000000000,1420438546000000000,6000000000))

In [16]: %timeit pd.to_datetime(s)
1.65 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %timeit s.astype('datetime64[ns]')
64 ms ± 965 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit pd.to_datetime(s, unit='ns')
259 ms ± 4.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(but still the difference seems larger than it should be)
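Outside IPython, the same three-way comparison can be sketched with the standard `timeit` module. This is only an illustration: the series length `N`, the `candidates` dict and the repeat counts are assumptions chosen to keep the run short, and absolute numbers will vary with the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: 1,000,000 values instead of the ~11 million in the report.
N = 1_000_000
s = pd.Series(1352500101000000000 + 6000000000 * np.arange(N, dtype=np.int64))

# The three conversions timed in this thread.
candidates = {
    "pd.to_datetime(s)": lambda: pd.to_datetime(s),
    "pd.to_datetime(s, unit='ns')": lambda: pd.to_datetime(s, unit='ns'),
    "s.astype('datetime64[ns]')": lambda: s.astype('datetime64[ns]'),
}

calls_per_repeat = 3
timings = {}
for label, fn in candidates.items():
    # Best of 3 repeats, averaged over the calls in that repeat.
    best = min(timeit.repeat(fn, number=calls_per_repeat, repeat=3))
    timings[label] = best / calls_per_repeat
    print(f"{label:32s} {timings[label] * 1e3:9.2f} ms per call")
```

Taking the minimum over repeats reduces noise from other processes; `%timeit` reports mean and standard deviation instead, so the figures are not directly comparable.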

jreback (Contributor) commented Oct 25, 2017

The rest of the difference is related to #17449: the data ends up being copied 3 times internally.

In [10]: %timeit pd.to_datetime(s, unit='ns')
107 ms +- 2.16 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)

In [11]: %timeit s.astype('datetime64[ns]')
20.3 ms +- 283 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

In [12]: %timeit s.copy()
22.3 ms +- 533 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
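To put these numbers in perspective, each timing can be expressed as a multiple of the plain `s.copy()` time. The sketch below only does arithmetic on the figures reported just above; the dict name `reported_ms` and the variable `ratios` are illustrative.

```python
# Timings in milliseconds, taken from the three %timeit results above.
reported_ms = {
    "pd.to_datetime(s, unit='ns')": 107.0,
    "s.astype('datetime64[ns]')": 20.3,
    "s.copy()": 22.3,
}

copy_ms = reported_ms["s.copy()"]

# Cost of each operation measured in "copies of the series".
ratios = {label: ms / copy_ms for label, ms in reported_ms.items()}

for label, ratio in ratios.items():
    print(f"{label:32s} {ratio:5.2f} x copy")
```

On these figures `astype` costs slightly less than one copy, while `to_datetime(s, unit='ns')` costs a little under five, so extra internal copies would plausibly account for a large share of the remaining gap.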

jreback added the Performance and Datetime labels Oct 25, 2017
jreback added this to the No action milestone Oct 25, 2017
jreback (Contributor) commented Oct 25, 2017

Closing, but if you want to help on that other issue, that would be great.
