PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754


Merged: 42 commits merged into pandas-dev:master from the datehelpers-performance-patch branch on May 12, 2019

Conversation

anmyachev
Contributor

@anmyachev anmyachev commented Mar 17, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Contributor

@jreback jreback left a comment

whoa, what benchmark are you trying to make better? this is adding a huge amount of complexity

@codecov

codecov bot commented Mar 17, 2019

Codecov Report

Merging #25754 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
+ Coverage   91.25%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52977    52971       -6     
==========================================
- Hits        48342    48338       -4     
+ Misses       4635     4633       -2
Flag Coverage Δ
#multiple 89.83% <100%> (ø) ⬆️
#single 41.74% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.39% <100%> (+0.04%) ⬆️
pandas/util/testing.py 89.08% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 707c720...846f0bf. Read the comment docs.

@codecov

codecov bot commented Mar 17, 2019

Codecov Report

Merging #25754 into master will decrease coverage by 0.12%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25754      +/-   ##
==========================================
- Coverage   92.04%   91.91%   -0.13%     
==========================================
  Files         175      174       -1     
  Lines       52291    50708    -1583     
==========================================
- Hits        48131    46610    -1521     
+ Misses       4160     4098      -62
Flag Coverage Δ
#multiple 90.42% <ø> (-0.17%) ⬇️
#single 41.18% <ø> (+0.31%) ⬆️
Impacted Files Coverage Δ
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.01% <0%> (-0.12%) ⬇️
pandas/core/arrays/categorical.py 95.95% <0%> (ø) ⬆️
pandas/core/indexes/range.py 97.97% <0%> (+0.01%) ⬆️
pandas/core/base.py 98% <0%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bfbd81...5dda33c. Read the comment docs.

@anmyachev
Contributor Author

anmyachev commented Mar 17, 2019

I'm trying to speed up the read_csv function. Manually, I ran asv as follows:
asv continuous -f 1.05 origin HEAD -b io.csv
And got these results:

before after ratio test name
[707c720] [846f0bf]
master datehelpers-performance-patch
2.98±0.06ms 2.78±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.56±0.04ms 2.38±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
1.94±0.03ms 1.79±0.04ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
3.27±0.08ms 3.02±0.04ms 0.92 io.csv.ReadUint64Integers.time_read_uint64
1.55±0.03ms 1.38±0.02ms 0.89 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')

During further performance testing, one of the datasets showed a 25% decrease in read_csv runtime.

@jreback
Contributor

jreback commented Mar 17, 2019

these are tiny numbers - and likely not the cause of an actual slowdown - read_csv is IO-bound for the most part

@anmyachev
Contributor Author

I have written a simple benchmark to demonstrate the impact.
Download dataset here: https://vincentarelbundock.github.io/Rdatasets/csv/mosaicData/Birthdays.csv
And run this simple script:

#!/usr/bin/env python
import sys
import time

import pandas as pd

print("pandas version - %s" % pd.__version__)

# Arguments: path to the csv dataset and, optionally, the test duration in minutes.
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
# Call read_csv repeatedly until the time budget is used up.
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)

Without the proposed patch:

pandas version - 0.25.0.dev0+266.g707c7201a
167 calls read_csv on .\Birthdays.csv per 60.212000 seconds
average time per call read_csv - 0.360551

With the patch:

pandas version - 0.25.0.dev0+267.g846f0bfa5
208 calls read_csv on .\Birthdays.csv per 60.246000 seconds
average time per call read_csv - 0.289644

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev can you show the effect on the asv benchmarks for csv

@vnlitvinov
Contributor

Here's the makeshift script that directly measures time spent in _concat_date_cols:

#!/usr/bin/env python
import sys
import time

import pandas as pd
import pandas.io.parsers

# Monkey-patch _concat_date_cols so the total time spent inside it is accumulated.
old_concat_date_cols = pandas.io.parsers._concat_date_cols
total = 0

def my_concat_date_cols(*a, **kw):
    global total
    start = time.time()
    try:
        return old_concat_date_cols(*a, **kw)
    finally:
        total += time.time() - start

pandas.io.parsers._concat_date_cols = my_concat_date_cols

print("pandas version - %s" % pd.__version__)

# Arguments: path to the csv dataset and, optionally, the test duration in minutes.
if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
# Call read_csv repeatedly until the time budget is used up.
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1
print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)
print("average time for _concat_date_cols() - %f" % (total / count_call))

And here are its results:

pandas      read_csv average time (s)    _concat_date_cols average time (s)
Original    0.418585                     0.076364
Patched     0.310918                     0.007838

As you can see here, with the patch _concat_date_cols gets 10x faster.
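
For context, here is a minimal pure-Python sketch of what a date-column concatenation helper like _concat_date_cols conceptually does (an illustration, not the pandas source): it joins the string form of each date column row by row, which is why it sits on the hot path whenever parse_dates combines several columns.

import numpy as np

def concat_date_cols_sketch(date_cols):
    # Conceptual sketch: turn one or more columns of date parts into a single
    # object array of space-joined strings, ready for datetime parsing.
    if len(date_cols) == 1:
        return np.array([str(x) for x in date_cols[0]], dtype=object)
    # zip(*date_cols) walks the columns in lockstep; each row becomes one
    # space-separated string, e.g. ("2000", "1", "15") -> "2000 1 15".
    return np.array([" ".join(str(x) for x in row) for row in zip(*date_cols)],
                    dtype=object)

# Example: combining separate year/month/day columns into parseable strings.
years, months, days = ["2000", "2001"], ["1", "2"], ["15", "20"]
print(concat_date_cols_sketch([years, months, days]))  # ['2000 1 15' '2001 2 20']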

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev see the docs: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

you would need to see the effect on the current perf test suite & write a new benchmark in the same style.

@anmyachev
Contributor Author

The original csv benchmarks were mostly not measuring _concat_date_cols - most of their time was spent in the parsing itself. I have composed new benchmarks (see 61d8e6e) that highlight the speedup brought by this PR.

master datehelpers-performance-patch speedup test name
28.6±0.3ms 24.2±0.2ms 0.85 io.csv.ReadCSVConcatDatetime.time_read_csv
12.1±0.2ms 7.17±0.3ms 0.59 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.1±0.1ms 6.90±0.1ms 0.40 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.2±0.4ms 5.04±0.2ms 0.31 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
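
For illustration, an asv-style benchmark along these lines might look like the sketch below (the class name, data shape, and sizes are assumptions for illustration, not the actual code committed in 61d8e6e):

from io import StringIO

import pandas as pd

class ReadCSVConcatDatetimeSketch:
    # Hypothetical asv-style benchmark: read a csv whose date and time live in
    # separate columns, forcing read_csv to concatenate them before parsing.
    def setup(self):
        rng = pd.date_range("2000-01-01", periods=50000, freq="S")
        rows = "\n".join("%s,%s,%d" % (ts.date(), ts.time(), i)
                         for i, ts in enumerate(rng))
        self.data = "date,time,value\n" + rows

    def time_read_csv(self):
        pd.read_csv(StringIO(self.data), parse_dates=[["date", "time"]])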

@jreback
Contributor

jreback commented Mar 18, 2019

@anmyachev those are not much of a speedup. This PR adds a huge amount of complexity / C code. I would rather you cythonize things if need be. I am -1 on adding C code.

@vnlitvinov
Contributor

vnlitvinov commented Mar 18, 2019

I would rather you cythonize things if need be.

I have implemented a small benchmark for _does_string_look_like_datetime. It clearly shows that Cython can lag behind C code quite a bit in certain cases, such as tight loops or heavy exception raising/handling:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
105±3ms 32.9±0.05ms 0.31 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
168±4ms 46.1±1ms 0.27 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
296±5ms 41.5±0.4ms 0.14 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
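
To make the exception overhead concrete, here is a rough pure-Python sketch of the kind of check _does_string_look_like_datetime performs (a simplification for illustration, not the exact pandas source); every non-numeric token pays the cost of raising and catching ValueError inside a tight loop, which is the cost an exception-free float conversion avoids:

def does_string_look_like_datetime_sketch(date_string):
    # Heuristic sketch: strings that parse as small numbers are rejected,
    # everything else is treated as a datetime candidate.
    if date_string.startswith("0"):
        # Leading zeros look more like datetimes than like plain numbers.
        return True
    try:
        # Raising and catching ValueError for every non-numeric token
        # ("2Q2005", "nan", "", ...) is the expensive part in a tight loop.
        if float(date_string) < 1000:
            return False
    except ValueError:
        pass
    return True

print(does_string_look_like_datetime_sketch("999"))     # False (small number)
print(does_string_look_like_datetime_sketch("10000"))   # True  (year-like)
print(does_string_look_like_datetime_sketch("2Q2005"))  # True  (not a number)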

@vnlitvinov
Contributor

Also note that the benchmarks currently work on a quite limited amount of data in order to stay fast; if I patch them to resemble real-life scenarios (using 5M lines instead of 50K), the results are:

before after ratio test case
[707c720] [e66883e]
pre-patch datehelpers-performance-patch
3.38±0s 2.77±0s 0.82 io.csv.ReadCSVConcatDatetime.time_read_csv
1.19±0.01s 584±9ms 0.49 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
1.73±0.01s 482±10ms 0.28 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
1.66±0.01s 414±2ms 0.25 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

That yields a saving of at least 0.5 s per read (or more than 1 second in the best case for our optimization).

@mroeschke
Member

Is there any way to align _does_string_look_like_datetime closer to what your implementation in C does, or are the speedups purely language-specific? I am not the biggest fan of the maintenance burden of this approach either.

@vnlitvinov
Contributor

@mroeschke we did our best to bring our improvements from C to Cython for _does_string_look_like_datetime (and the new benchmarks show it's now on par with the C version).

We're now trying to rewrite _concat_date_cols in Cython and we'll see how it impacts the benchmark numbers.

@anmyachev anmyachev force-pushed the datehelpers-performance-patch branch from d545d3a to 2e2d588 Compare March 22, 2019 09:05
@pep8speaks

pep8speaks commented Mar 22, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-09 21:12:33 UTC

@anmyachev
Contributor Author

@jreback, @mroeschke what do you say about this Cython version of _concat_date_cols?
This version loses some performance relative to C, but not by much.
Specific numbers will follow later.

@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev please remove all C code. Further, this is not passing.

@vnlitvinov vnlitvinov force-pushed the datehelpers-performance-patch branch from 2e2d588 to 7074b96 Compare March 22, 2019 11:53
@vnlitvinov
Contributor

The performance loss of Cython vs. C code is noticeable but not big:
asv continuous -f 1.05 datehelpers-performance-patch datehelpers-performance-patch-purec -b io.parsers -e -a sample_time=2 -a warmup_time=2

before after ratio test name
[7074b96] [6ba6fb91]
datehelpers-performance-patch datehelpers-performance-patch-purec
365±2μs 347±3μs 0.95 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
954±20μs 899±10μs 0.94 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
82.3±0.2μs 62.3±2μs 0.76 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)

@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev we are -1 on adding really any C code; it increases complexity exponentially. If you have a perf improvement with Cython, great, but even that must be clear code.

@vnlitvinov
Contributor

@jreback I've removed the C code and rebased onto the latest master.
Here are the latest benchmark results, run via asv continuous -f 1.05 upstream/master datehelpers-performance-patch -b io.parsers -e -a sample_time=2 -a warmup_time=2:

before after ratio test name
[f2bcb9f] [764aafd]
master datehelpers-performance-patch
105±0.6ms 36.5±1ms 0.35 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('0.0')
1.74±0.01ms 601±8μs 0.35 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, <class 'list'>)
161±0.6ms 51.3±1ms 0.32 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('10000')
6.97±0.02ms 1.41±0.01ms 0.20 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, <class 'list'>)
291±2ms 50.6±2ms 0.17 io.parsers.DoesStringLookLikeDatetime.time_check_datetimes('2Q2005')
4.92±0.1ms 722±6μs 0.15 io.parsers.ConcatDateCols.time_check_concat(1234567890, 1, )
13.7±0.6ms 1.75±0.02ms 0.13 io.parsers.ConcatDateCols.time_check_concat(1234567890, 2, )
3.10±0.2ms 372±4μs 0.12 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, )
9.97±0.4ms 919±10μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, )
5.91±0.04ms 540±6μs 0.09 io.parsers.ConcatDateCols.time_check_concat('AAAA', 2, <class 'list'>)
1.12±0.03ms 81.0±0.5μs 0.07 io.parsers.ConcatDateCols.time_check_concat('AAAA', 1, <class 'list'>)

P.S. Please note that while the gain might look small, please look at the ratio rather than at raw time units: we ran the benchmark on an Intel(R) Core(TM) i9-7900X, and raw timings would be, for example, several times bigger on a typical data scientist's laptop. Also, the gain scales linearly, so for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it saves several seconds.

@jreback
Contributor

jreback commented Mar 22, 2019

P.S. Please note that while the gain might look small, please look at the ratio rather than at raw time units: we ran the benchmark on an Intel(R) Core(TM) i9-7900X, and raw timings would be, for example, several times bigger on a typical data scientist's laptop. Also, the gain scales linearly, so for more real-world cases with 10M lines in a csv file (multiplied by the number of date columns) it saves several seconds.

of course, the only time we care about the absolute numbers is to avoid the impact of function calls (e.g. some micro benchmarks are severely impacted by just 1 or 2 function-calling conventions). also we are mindful of the absolute time getting too big, as the benchmark suite is large.

@vnlitvinov
Contributor

@jreback I hope I fixed all the CI issues (we will see soon), and this patch is as compact as we can make it while keeping the same speedup.

@vnlitvinov
Contributor

@jreback pinging on green

@jreback
Contributor

jreback commented Mar 22, 2019

can you show the entire suite of csv parsing benchmark results?

@anmyachev
Contributor Author

@jreback, here are the latest benchmark results, run via asv continuous -f 1.05 origin/master datehelpers-performance-patch -b io.csv -e -a sample_time=2 -a warmup_time=2:

before after ratio test_name
[f2bcb9f] [8caea7f]
master datehelpers-performance-patch
1.76±0.03ms 1.65±0.02ms 0.94 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.40±0.01ms 1.30±0.01ms 0.93 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
1.47±0.01ms 1.35±0.01ms 0.92 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
27.4±0.3ms 22.0±0.2ms 0.80 io.csv.ReadCSVConcatDatetime.time_read_csv
12.0±0.1ms 6.95±0.07ms 0.58 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
17.6±0.2ms 6.37±0.08ms 0.36 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
16.0±0.1ms 4.83±0.1ms 0.30 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')

@anmyachev anmyachev changed the title PERF: rewrited _concat_date_cols function on C with removing extra co… PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime Mar 22, 2019
@jreback
Contributor

jreback commented Mar 22, 2019

@anmyachev maybe you didn't see my comment. I want ALL of the csv asv benchmarks. I want to see if your changes actually matter outside of micro-benchmarking.

vnlitvinov and others added 23 commits May 8, 2019 10:30
@anmyachev anmyachev force-pushed the datehelpers-performance-patch branch from 713fb97 to c06a662 Compare May 8, 2019 07:31
@anmyachev
Contributor Author

@jreback rebase & green

Contributor

@jreback jreback left a comment

tiny comment, ping on green.

@anmyachev
Contributor Author

@jreback ping on green

@jreback jreback merged commit b48d1ff into pandas-dev:master May 12, 2019
@jreback
Contributor

jreback commented May 12, 2019

thanks @anmyachev

nice work!

Labels: Datetime (Datetime data dtype), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Performance (Memory or execution speed performance)