PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754


Merged: 42 commits merged into pandas-dev:master from the datehelpers-performance-patch branch on May 12, 2019

Conversation

anmyachev
Contributor

@anmyachev anmyachev commented Mar 17, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Contributor

@jreback jreback left a comment

whoa, what benchmark are you trying to make better? this is adding a huge amount of complexity

@codecov

codecov bot commented Mar 17, 2019

Codecov Report

Merging #25754 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
+ Coverage   91.25%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52977    52971       -6     
==========================================
- Hits        48342    48338       -4     
+ Misses       4635     4633       -2
Flag Coverage Δ
#multiple 89.83% <100%> (ø) ⬆️
#single 41.74% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.39% <100%> (+0.04%) ⬆️
pandas/util/testing.py 89.08% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 707c720...846f0bf. Read the comment docs.

@codecov

codecov bot commented Mar 17, 2019

Codecov Report

Merging #25754 into master will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
- Coverage   92.04%   91.91%   -0.13%     
==========================================
  Files         175      174       -1     
  Lines       52291    50708    -1583     
==========================================
- Hits        48131    46610    -1521     
+ Misses       4160     4098      -62
Flag Coverage Δ
#multiple 90.42% <ø> (-0.17%) ⬇️
#single 41.18% <ø> (+0.31%) ⬆️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/core/arrays/categorical.py 95.95% <0%> (ø) ⬆️
pandas/core/indexes/range.py 97.97% <0%> (+0.01%) ⬆️
pandas/core/base.py 98% <0%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bfbd81...5dda33c. Read the comment docs.

@anmyachev
Contributor Author

anmyachev commented Mar 17, 2019

I'm trying to speed up the read_csv function. Manually, I ran asv as follows:
asv continuous -f 1.05 origin HEAD -b io.csv
And got these results:

before after ratio test name
[707c720] [846f0bf]
master datehelpers-performance-patch
2.98±0.06ms 2.78±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.56±0.04ms 2.38±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
1.94±0.03ms 1.79±0.04ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
3.27±0.08ms 3.02±0.04ms 0.92 io.csv.ReadUint64Integers.time_read_uint64
1.55±0.03ms 1.38±0.02ms 0.89 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')

During further performance testing, one of the datasets showed a 25% decrease in read_csv runtime.

@jreback
Contributor

jreback commented Mar 17, 2019

these are tiny numbers - and likely not the cause of an actual slowdown - read_csv is IO-bound for the most part

@anmyachev
Contributor Author

I have written a simple benchmark to demonstrate the impact.
Download dataset here: https://vincentarelbundock.github.io/Rdatasets/csv/mosaicData/Birthdays.csv
And run this simple script:

#!/usr/bin/env python
import sys
import time

import pandas as pd

print("pandas version - %s" % pd.__version__)

# Arguments: path to the csv dataset and, optionally, the test duration in minutes.
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
# Call read_csv repeatedly until the time budget is used up.
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)

Without the proposed patch:

pandas version - 0.25.0.dev0+266.g707c7201a
167 calls read_csv on .\Birthdays.csv per 60.212000 seconds
average time per call read_csv - 0.360551

With the patch:

pandas version - 0.25.0.dev0+267.g846f0bfa5
208 calls read_csv on .\Birthdays.csv per 60.246000 seconds
average time per call read_csv - 0.289644

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev can you show the effect on the asv benchmarks for csv

@vnlitvinov
Contributor

Here's the makeshift script that directly measures time spent in _concat_date_cols:

#!/usr/bin/env python
import sys
import time

import pandas as pd
import pandas.io.parsers

# Monkey-patch _concat_date_cols so the total time spent inside it is accumulated.
old_concat_date_cols = pandas.io.parsers._concat_date_cols
total = 0

def my_concat_date_cols(*a, **kw):
    global total
    start = time.time()
    try:
        return old_concat_date_cols(*a, **kw)
    finally:
        total += time.time() - start

pandas.io.parsers._concat_date_cols = my_concat_date_cols

print("pandas version - %s" % pd.__version__)

# Arguments: path to the csv dataset and, optionally, the test duration in minutes.
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
# Call read_csv repeatedly until the time budget is used up.
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)
print("average time for _concat_date_cols() - %f" % (total / count_call))

And here are its results:

pandas      read_csv average time (s)    _concat_date_cols average time (s)
Original    0.418585                     0.076364
Patched     0.310918                     0.007838

As you can see here, with the patch _concat_date_cols gets 10x faster.
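
For context, here is a minimal pure-Python sketch of what a date-column concatenation helper like _concat_date_cols conceptually does (an illustration, not the pandas source): it joins the string form of each date column row by row, which is why it sits on the hot path whenever parse_dates combines several columns.

import numpy as np

def concat_date_cols_sketch(date_cols):
    # Conceptual sketch: turn one or more columns of date parts into a single
    # object array of space-joined strings, ready for datetime parsing.
    if len(date_cols) == 1:
        return np.array([str(x) for x in date_cols[0]], dtype=object)
    # zip(*date_cols) walks the columns in lockstep; each row becomes one
    # space-separated string, e.g. ("2000", "1", "15") -> "2000 1 15".
    return np.array([" ".join(str(x) for x in row) for row in zip(*date_cols)],
                    dtype=object)

# Example: combining separate year/month/day columns into parseable strings.
years, months, days = ["2000", "2001"], ["1", "2"], ["15", "20"]
print(concat_date_cols_sketch([years, months, days]))  # ['2000 1 15' '2001 2 20']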

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev see the docs: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

you would need to see the effect on the current perf test suite & write a new benchmark in the same style.

@anmyachev
Contributor Author

The original csv benchmarks were mostly not measuring _concat_date_cols - most of their time was spent in the parsing itself. I have composed new benchmarks (see 61d8e6e) that highlight the speedup brought by this PR.

master datehelpers-performance-patch speedup test name
28.6±0.3ms 24.2±0.2ms 0.85 io.csv.ReadCSVConcatDatetime.time_read_csv
12.1±0.2ms 7.17±0.3ms 0.59 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.1±0.1ms 6.90±0.1ms 0.40 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.2±0.4ms 5.04±0.2ms 0.31 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
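
For illustration, an asv-style benchmark along these lines might look like the sketch below (the class name, data shape, and sizes are assumptions for illustration, not the actual code committed in 61d8e6e):

from io import StringIO

import pandas as pd

class ReadCSVConcatDatetimeSketch:
    # Hypothetical asv-style benchmark: read a csv whose date and time live in
    # separate columns, forcing read_csv to concatenate them before parsing.
    def setup(self):
        rng = pd.date_range("2000-01-01", periods=50000, freq="S")
        rows = "\n".join("%s,%s,%d" % (ts.date(), ts.time(), i)
                         for i, ts in enumerate(rng))
        self.data = "date,time,value\n" + rows

    def time_read_csv(self):
        pd.read_csv(StringIO(self.data), parse_dates=[["date", "time"]])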

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev those are not much of a speedup. This PR adds a huge amount of complexity / C code. I would rather you cythonize things if need be. I am -1 on adding C code.

@vnlitvinov
Contributor

vnlitvinov commented Mar 18, 2019

I would rather you cythonize things if need be.

I have implemented a small benchmark for _does_string_look_like_datetime. It clearly shows that Cython can lag behind C code quite a bit in certain cases, such as tight loops or heavy exception raising/handling:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
105±3ms 32.9±0.05ms 0.31 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
168±4ms 46.1±1ms 0.27 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
296±5ms 41.5±0.4ms 0.14 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
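
To make the exception overhead concrete, here is a rough pure-Python sketch of the kind of check _does_string_look_like_datetime performs (a simplification for illustration, not the exact pandas source); every non-numeric token pays the cost of raising and catching ValueError inside a tight loop, which is the cost an exception-free float conversion avoids:

def does_string_look_like_datetime_sketch(date_string):
    # Heuristic sketch: strings that parse as small numbers are rejected,
    # everything else is treated as a datetime candidate.
    if date_string.startswith("0"):
        # Leading zeros look more like datetimes than like plain numbers.
        return True
    try:
        # Raising and catching ValueError for every non-numeric token
        # ("2Q2005", "nan", "", ...) is the expensive part in a tight loop.
        if float(date_string) < 1000:
            return False
    except ValueError:
        pass
    return True

print(does_string_look_like_datetime_sketch("999"))     # False (small number)
print(does_string_look_like_datetime_sketch("10000"))   # True  (year-like)
print(does_string_look_like_datetime_sketch("2Q2005"))  # True  (not a number)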

@vnlitvinov
Contributor

Also note that the benchmarks currently work on a quite limited amount of data in order to stay fast; if I patch them to resemble real-life scenarios (using 5M lines instead of 50K), the results are:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
3.38±0s 2.77±0s 0.82 io.csv.ReadCSVConcatDatetime.time_read_csv
1.19±0.01s 584±9ms 0.49 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
1.73±0.01s 482±10ms 0.28 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
1.66±0.01s 414±2ms 0.25 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

That yields a saving of at least 0.5 s per read (or more than 1 second in the best case for our optimization).

@mroeschke
Member

Is there any way to align _does_string_look_like_datetime closer to what your implementation in C does, or are the speedups purely language-specific? I am not the biggest fan of the maintenance burden of this approach either.

@vnlitvinov
Contributor

@mroeschke we did our best to bring our improvements from C to Cython for _does_string_look_like_datetime (and the new benchmarks show it's now on par with the C version).

We're now trying to rewrite _concat_date_cols in Cython and we'll see how it impacts the benchmark numbers.

@anmyachev anmyachev force-pushed the datehelpers-performance-patch branch from d545d3a to 2e2d588 Compare March 22, 2019 09:05
@pep8speaks

pep8speaks commented Mar 22, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-09 21:12:33 UTC

@anmyachev
Contributor Author

@jreback, @mroeschke what do you say about this Cython version of _concat_date_cols?
This version loses some performance relative to C, but not by much.
Specific numbers will follow later.

@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev please remove all C code. Further, this is not passing.

@vnlitvinov vnlitvinov force-pushed the datehelpers-performance-patch branch from 2e2d588 to 7074b96 Compare March 22, 2019 11:53
@vnlitvinov
Contributor

The performance loss of Cython vs. C code is noticeable but not big:
asv continuous -f 1.05 datehelpers-performance-patch datehelpers-performance-patch-purec -b io.parsers -e -a sample_time=2 -a warmup_time=2

before after ratio test name
[7074b96] [6ba6fb91]
datehelpers-performance-patch datehelpers-performance-patch-purec
365±2μs 347±3μs 0.95 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
954±20μs 899±10μs 0.94 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
82.3±0.2μs 62.3±2μs 0.76 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)

@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev we are -1 on adding really any C code; it increases complexity exponentially. If you have a perf improvement with Cython, great, but even that must be clear code.

@vnlitvinov
Contributor

@jreback I've removed the C code and rebased onto the latest master.
Here are the latest benchmark results, run via asv continuous -f 1.05 upstream/master datehelpers-performance-patch -b io.parsers -e -a sample_time=2 -a warmup_time=2:

before after ratio test name
[f2bcb9f] [764aafd]
master datehelpers-performance-patch
105±0.6ms 36.5±1ms 0.35 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
1.74±0.01ms 601±8μs 0.35 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, <class 'list'>)
161±0.6ms 51.3±1ms 0.32 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
6.97±0.02ms 1.41±0.01ms 0.20 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, <class 'list'>)
291±2ms 50.6±2ms 0.17 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
4.92±0.1ms 722±6μs 0.15 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, )
13.7±0.6ms 1.75±0.02ms 0.13 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, )
3.10±0.2ms 372±4μs 0.12 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
9.97±0.4ms 919±10μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
5.91±0.04ms 540±6μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, <class 'list'>)
1.12±0.03ms 81.0±0.5μs 0.07 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)

P.S. Please note that while the gain might look small, please look at the ratio rather than at raw time units: we ran the benchmark on an Intel(R) Core(TM) i9-7900X, and raw timings would be, for example, several times bigger on a typical data scientist's laptop. Also, the gain scales linearly, so for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it saves several seconds.

@jreback
Contributor

jreback commented Mar 22, 2019

P.S. Please note that while the gain might look small, please look at the ratio rather than at raw time units: we ran the benchmark on an Intel(R) Core(TM) i9-7900X, and raw timings would be, for example, several times bigger on a typical data scientist's laptop. Also, the gain scales linearly, so for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it saves several seconds.

of course, the only time we care about the absolute numbers is to avoid the impact of function calls (e.g. some micro benchmarks are severely impacted by just 1 or 2 function-calling conventions). also we are mindful of the absolute time getting too big, as the benchmark suite is large.

@vnlitvinov
Contributor

@jreback I hope I fixed all the CI issues (we will see soon), and this patch is as compact as we can make it while keeping the same speedup.

@vnlitvinov
Contributor

@jreback pinging on green

@jreback
Contributor

jreback commented Mar 22, 2019

can you show the entire suite of csv parsing benchmark results?

@anmyachev
Contributor Author

@jreback, here are the latest benchmark results, run via asv continuous -f 1.05 origin/master datehelpers-performance-patch -b io.csv -e -a sample_time=2 -a warmup_time=2:

before after ratio test_name
[f2bcb9f] [8caea7f]
master datehelpers-performance-patch
1.76±0.03ms 1.65±0.02ms 0.94 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.40±0.01ms 1.30±0.01ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
1.47±0.01ms 1.35±0.01ms 0.92 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
27.4±0.3ms 22.0±0.2ms 0.80 io.csv.ReadCSVConcatDatetime.time_read_csv
12.0±0.1ms 6.95±0.07ms 0.58 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.6±0.2ms 6.37±0.08ms 0.36 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.0±0.1ms 4.83±0.1ms 0.30 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

@anmyachev anmyachev changed the title PERF: rewrited _concat_date_cols function on C with removing extra co… PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime Mar 22, 2019
@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev maybe you didn't see my comment. I want ALL of the csv asv benchmarks. I want to see if your changes actually matter outside of micro-benchmarking.

vnlitvinov and others added 23 commits May 8, 2019 10:30
@anmyachev anmyachev force-pushed the datehelpers-performance-patch branch from 713fb97 to c06a662 Compare May 8, 2019 07:31
@anmyachev
Contributor Author

@jreback rebase & green

Contributor

@jreback jreback left a comment

tiny comment, ping on green.

@anmyachev
Contributor Author

@jreback ping on green

@jreback jreback merged commit b48d1ff into pandas-dev:master May 12, 2019
@jreback
Contributor

jreback commented May 12, 2019

thanks @anmyachev

nice work!

Labels: Datetime (Datetime data dtype), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Performance (Memory or execution speed performance)