PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754
Conversation
Whoa, what benchmark are you trying to make better? This is adding a huge amount of complexity.
Codecov Report
@@ Coverage Diff @@
## master #25754 +/- ##
==========================================
+ Coverage 91.25% 91.25% +<.01%
==========================================
Files 172 172
Lines 52977 52971 -6
==========================================
- Hits 48342 48338 -4
+ Misses 4635 4633 -2
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #25754 +/- ##
==========================================
- Coverage 92.04% 91.91% -0.13%
==========================================
Files 175 174 -1
Lines 52291 50708 -1583
==========================================
- Hits 48131 46610 -1521
+ Misses 4160 4098 -62
Continue to review full report at Codecov.
I'm trying to speed up the read_csv function. Manually, I ran asv as follows:
During further performance testing, one of the datasets showed a 25% decrease in read_csv runtime.
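For context, `_concat_date_cols` is exercised when read_csv parses date columns; the multi-column form below (a minimal, hypothetical example) is the case where several columns are concatenated into one datetime column before parsing:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column example: parse_dates=[["date", "time"]] asks read_csv
# to join both columns into a single datetime column, which is the code path
# that goes through _concat_date_cols before the actual datetime parsing.
buf = StringIO(
    "date,time,value\n"
    "2019-03-14,10:00:00,1.5\n"
    "2019-03-14,10:00:01,2.0\n"
)
df = pd.read_csv(buf, parse_dates=[["date", "time"]])
print(df.dtypes)  # the combined 'date_time' column has dtype datetime64[ns]
```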
These are tiny numbers, and likely not the cause of an actual slowdown - read_csv is IO-bound for the most part.
I have written a simple benchmark to demonstrate the impact.

```python
#!/usr/bin/env python
import os
import sys
import time

import pandas as pd

print("pandas version - %s" % pd.__version__)

if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1

print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)
```

Without the proposed patch:
With the patch:
@anmyachev can you show the effect on the asv benchmarks for csv?
Here's the makeshift script that directly measures time spent in `_concat_date_cols`:

```python
#!/usr/bin/env python
import os
import sys
import time

import pandas as pd
import pandas.io.parsers

old_concat_date_cols = pandas.io.parsers._concat_date_cols
total = 0

def my_concat_date_cols(*a, **kw):
    global total
    start = time.time()
    try:
        return old_concat_date_cols(*a, **kw)
    finally:
        total += time.time() - start

pandas.io.parsers._concat_date_cols = my_concat_date_cols

print("pandas version - %s" % pd.__version__)

if len(sys.argv) > 2:
    dataset_file = sys.argv[1]
    test_time = float(sys.argv[2]) * 60
elif len(sys.argv) == 2:
    dataset_file = sys.argv[1]
    test_time = 60
else:
    raise ValueError("please specify dataset")

count_call = 0
test_result_time = 0
while test_time > 0:
    time_call = time.time()
    df = pd.read_csv(dataset_file, sep=',', parse_dates=['date'])
    time_call = time.time() - time_call
    test_time -= time_call
    test_result_time += time_call
    count_call += 1

print("%d calls read_csv on %s per %f seconds" % (count_call, dataset_file, test_result_time))
average_time = test_result_time / count_call
print("average time per call read_csv - %f" % average_time)
print("average time for _concat_date_cols() - %f" % (total / count_call))
```

And here are its results:
As you can see here, with the patch
@anmyachev see the docs: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite. You would need to see the effect on the current perf test suite and write a new benchmark in the same style.
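For reference, such a benchmark in the asv style looks roughly like this (a minimal sketch; the file, class, and method names are hypothetical, not the ones that ended up in the pandas suite):

```python
# io_csv_dates.py -- hypothetical benchmark file in the asv style used by
# pandas' benchmark suite: setup() prepares data, time_* methods are timed.
from io import StringIO

import pandas as pd


class TimeReadCSVParseDates:
    def setup(self):
        rng = pd.date_range("2000-01-01", periods=50_000, freq="s")
        rows = "\n".join(f"{ts.date()},{ts.time()},{i}" for i, ts in enumerate(rng))
        self.csv = "date,time,value\n" + rows

    def time_read_csv_concat_dates(self):
        # re-create the buffer each call so repeated timings read from the start
        pd.read_csv(StringIO(self.csv), parse_dates=[["date", "time"]])
```

asv then times each `time_*` method and can compare the results between master and the PR branch.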
@anmyachev those are not much of a speedup. This PR adds a huge amount of complexity / C code. I would rather you cythonize things if need be. I am -1 on adding C code.
I have implemented a small benchmark for it.
Also note that the benchmarks currently work on a quite limited amount of data in order to be fast; if I patch them to resemble real-life scenarios (using 5M lines instead of 50K), the results would be:
That yields at least a 0.5s saving on reading (or more than 1 second in the best case for our optimization).
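A quick way to produce such a dataset locally is something like the following (a rough sketch; the file name and column layout are made up):

```python
import numpy as np
import pandas as pd

# Build a synthetic CSV with a parseable 'date' column; 5_000_000 rows matches
# the "real-life" size mentioned above (reduce it for a quick smoke test).
n = 5_000_000
dates = pd.date_range("2019-01-01", periods=n, freq="s")
pd.DataFrame({
    "date": dates.strftime("%Y-%m-%d %H:%M:%S"),
    "value": np.random.rand(n),
}).to_csv("big_dates.csv", index=False)
```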
Is there any way to align
@mroeschke we did our best to bring our improvements from C to Cython for this. We're now trying to re-write
Force-pushed from d545d3a to 2e2d588
Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-05-09 21:12:33 UTC
@jreback, @mroeschke what do you say about this Cython version of _concat_date_cols?
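For context, what `_concat_date_cols` does is roughly the pure-Python concatenation below (a simplified sketch for illustration, not the exact pandas implementation):

```python
import numpy as np

def concat_date_cols_reference(date_cols):
    """Simplified reference: join the string form of each date column
    row-wise with spaces, returning an object-dtype ndarray that the
    datetime parser then consumes."""
    if len(date_cols) == 1:
        return np.array([str(x) for x in date_cols[0]], dtype=object)
    return np.array(
        [" ".join(str(y) for y in row) for row in zip(*date_cols)],
        dtype=object,
    )

# concat_date_cols_reference((["2019-03-14"], ["10:00:00"]))
# -> array(['2019-03-14 10:00:00'], dtype=object)
```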
@anmyachev please remove all C code. Further, this is not passing.
Force-pushed from 2e2d588 to 7074b96
The performance loss of Cython vs the C code is noticeable but not big:
@anmyachev we are -1 on adding really any C code; this increases complexity exponentially. If you have a perf improvement with Cython, great, but even that must be clear code.
@jreback I've removed the C code and rebased onto the latest master.
P.S. Please note that while the gain might look small, please look at the ratio more than at raw time units, as we run the benchmark on
Of course. The only time we care about the absolute numbers is to avoid the impact of function calls (e.g. some micro benchmarks are severely impacted by just 1 or 2 function calling conventions). Also, we are mindful of not making the absolute time too big, as the benchmark suite is large.
@jreback I hope I fixed all the CI issues (we'll see soon), and this patch is as compact as we can make it while keeping the same speedup.
@jreback pinging on green
Can you show the entire suite of csv parsing benchmark results?
@jreback, here are the latest benchmark results, run via
@anmyachev maybe you didn't see my comment. I want ALL of the csv asv benchmarks. I want to see if your changes actually matter outside of micro-benchmarking.
Force-pushed from 713fb97 to c06a662
@jreback rebased & green
tiny comment, ping on green.
…date_string' -> 'py_string'
@jreback ping on green
thanks @anmyachev nice work!
git diff upstream/master -u -- "*.py" | flake8 --diff