PERF: Improve performance of hash sets in read_csv #25804


Merged
11 commits merged into pandas-dev:master from khash-na-sets on Mar 22, 2019

Conversation

@vnlitvinov (Contributor) commented Mar 20, 2019

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Improved the performance of checking string presence in hash sets by:

  1. implementing a lookup table for the first symbol of each entry, which speeds up negative answers (see the sketch after this description)
  2. increasing the size of the hash table to reduce the probability of hash collisions

The larger buckets are needed because of the khash search algorithm: it computes a hash of the key being searched, uses its last N bits (where N corresponds to the bucket size) to pick a bucket, and, if a collision happened, does an open-ended probe over the table, calling a comparison function on candidate keys. When the keys are strings (or, more precisely, char* values), the comparison function is strcmp(), whose complexity is O(length), so we want collisions to be rare.
Most lookups in these sets happen when values are checked against the N/A table, which by default has 17 entries. Without the change the bucket size is 32, so the probability that a lookup hashes into an occupied bucket is 17/32 (~53%); after the patch the bucket size is 256 and the collision probability drops to 17/256 (~6.6%).
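
To illustrate item 1, here is a minimal, self-contained sketch of the first-symbol filter in C. It is not pandas' actual code: the real implementation wraps a khash string set in kh_str_starts_t (its signature is visible in the build log further down this thread), and the toy linear scan below merely stands in for the khash probe.

#include <string.h>

/* A 256-entry table records which first bytes occur among the stored keys,
 * so a lookup whose first byte was never seen is rejected without hashing
 * the key or calling strcmp() at all. */
typedef struct {
    const char *keys[32];       /* toy storage; the PR keeps a khash set here */
    int n_keys;
    unsigned char starts[256];  /* starts[c] != 0 iff some key begins with byte c */
} str_starts_sketch_t;

static void sketch_add(str_starts_sketch_t *t, const char *key) {
    t->keys[t->n_keys++] = key;
    t->starts[(unsigned char)key[0]] = 1;
}

static int sketch_contains(const str_starts_sketch_t *t, const char *key) {
    if (!t->starts[(unsigned char)key[0]])
        return 0;                        /* fast negative answer, the common case */
    for (int i = 0; i < t->n_keys; i++)  /* stand-in for the khash lookup */
        if (strcmp(t->keys[i], key) == 0)
            return 1;
    return 0;
}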

I will add asv results and whatsnew later.

@vnlitvinov changed the title from "Khash na sets" to "Improve performance of hash sets in read_csv" Mar 20, 2019
codecov bot commented Mar 20, 2019

Codecov Report

Merging #25804 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25804      +/-   ##
==========================================
+ Coverage   91.27%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       53002    53002              
==========================================
+ Hits        48375    48376       +1     
+ Misses       4627     4626       -1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.76% <ø> (-0.01%) ⬇️

Impacted Files           Coverage Δ
pandas/util/testing.py   89.4% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...da9d4e6. Read the comment docs.


codecov bot commented Mar 20, 2019

Codecov Report

Merging #25804 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25804   +/-   ##
=======================================
  Coverage   91.27%   91.27%           
=======================================
  Files         173      173           
  Lines       53002    53002           
=======================================
  Hits        48375    48375           
  Misses       4627     4627
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.77% <ø> (ø) ⬆️

Impacted Files             Coverage Δ
pandas/io/excel/_base.py   92.48% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85c3f82...2ceb850. Read the comment docs.

@vnlitvinov (Contributor, Author) commented:

Results of running asv continuous -f 1.05 upstream/master HEAD -b io.csv -a sample_time=1 -a warmup_time=1:

before         after            ratio  test name
[fbe2523]      [da9d4e6]
(master)       (khash-na-sets)
1.34±0.02ms 1.26±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
13.7±0.2ms 12.8±0.3ms 0.94 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.41±0.01ms 1.31±0.02ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.42±0.01ms 1.32±0.01ms 0.93 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
32.9±1ms 28.8±0.2ms 0.87 io.csv.ReadCSVCategorical.time_convert_direct
12.7±0.2ms 10.9±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands(',', ',')
13.2±0.08ms 11.3±0.1ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.7±0.07ms 9.06±0.09ms 0.84 io.csv.ReadCSVThousands.time_thousands('|', None)
10.9±0.1ms 9.17±0.2ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)

@vnlitvinov changed the title from "Improve performance of hash sets in read_csv" to "PERF: Improve performance of hash sets in read_csv" Mar 20, 2019
@vnlitvinov (Contributor, Author) commented:

More precise benchmarking:
asv continuous -f 1.01 upstream/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2

before         after            ratio  test name
[fbe2523]      [da9d4e6]
(master)       (khash-na-sets)
1.55±0.02ms 1.49±0ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.55±0.01ms 1.48±0.01ms 0.96 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
13.2±0.4ms 12.5±0.1ms 0.95 io.csv.ReadCSVSkipRows.time_skipprows(None)
1.56±0.02ms 1.48±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.84±0.02ms 1.74±0.01ms 0.95 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
1.81±0.02ms 1.72±0.01ms 0.95 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
1.56±0.02ms 1.47±0.01ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
1.84±0.01ms 1.73±0.02ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
1.57±0.03ms 1.47±0.03ms 0.94 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.34±0.01ms 1.24±0.03ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.35±0.02ms 1.24±0.01ms 0.92 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.43±0.02ms 1.30±0.01ms 0.91 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.43±0.03ms 1.29±0.01ms 0.90 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
31.5±0.6ms 28.3±0.3ms 0.90 io.csv.ReadCSVCategorical.time_convert_direct
2.95±0.07ms 2.61±0.06ms 0.89 io.csv.ReadUint64Integers.time_read_uint64
13.1±0.1ms 11.2±0.05ms 0.86 io.csv.ReadCSVThousands.time_thousands('|', ',')
10.8±0.04ms 9.12±0.08ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', None)
12.8±0.04ms 10.8±0.1ms 0.84 io.csv.ReadCSVThousands.time_thousands(',', ',')
11.1±0.2ms 9.11±0.05ms 0.82 io.csv.ReadCSVThousands.time_thousands('|', None)

@jreback added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 22, 2019
@jreback (Contributor) commented Mar 22, 2019

lgtm. @chris-b1 @gfyoung

@jreback added this to the 0.25.0 milestone Mar 22, 2019
@chris-b1 (Contributor) left a comment

good stuff! looks good to me

@@ -71,9 +73,6 @@ cdef:
float64_t NEGINF = -INF


cdef extern from "errno.h":
Contributor:

Pulling out errno could ideally have been a separate PR, but I'm fine doing it here.

# Resize the hash table to make it almost empty; this
# reduces the number of hash collisions on lookup, so the
# "key not in table" case is faster.
# Note that this trades table memory footprint for lookup speed.
Member:

Is there a ballpark estimate for how much more memory this might use? If anything, we get more read_csv issues about high memory usage than about slow parsing, so I just want to be careful about the tradeoffs we make, especially since we don't have any memory benchmarks set up.

@vnlitvinov (Contributor, Author) commented Mar 22, 2019:

The worst case is when bucket_size is increased from 128 to 1024: a 128-slot table takes ~3.6 KB (on x86_64), while a 1024-slot table takes ~21.04 KB.

In the default case the bucket size grows from 32 to 256, so memory increases from 1.67 KB to 6.04 KB.

I think it's acceptable to trade about 15 more KB (as there are typically three such maps: N/A values, true-ish values, false-ish values) for up to 15% more parsing speed.

P.S. The memory size formula, where B is bucket_size:

sizeof(khint32_t) * 4 + sizeof(void*) * 3 + B * (sizeof(khint32_t) + sizeof(char*) + sizeof(size_t)) + 256 * sizeof(int)

which gives 1064 + B * 20 bytes (again on x86_64).
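
As a sanity check, the formula can be evaluated directly. A small standalone C program (assuming x86_64 sizes as above; uint32_t stands in for khash's khint32_t):

#include <stdio.h>
#include <stdint.h>

/* Footprint of one string hash set, per the formula above (B = bucket_size). */
static size_t table_bytes(size_t B) {
    return sizeof(uint32_t) * 4 + sizeof(void *) * 3
         + B * (sizeof(uint32_t) + sizeof(char *) + sizeof(size_t))
         + 256 * sizeof(int);            /* the first-symbol lookup table */
}

int main(void) {
    printf("B=32:   %zu bytes\n", table_bytes(32));    /* 1704  (~1.67 KB)  */
    printf("B=256:  %zu bytes\n", table_bytes(256));   /* 6184  (~6.04 KB)  */
    printf("B=128:  %zu bytes\n", table_bytes(128));   /* 3624  (~3.6 KB)   */
    printf("B=1024: %zu bytes\n", table_bytes(1024));  /* 21544 (~21.04 KB) */
    return 0;
}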

Member:

My math might be fuzzy, but assuming someone is parsing, say, 10M records and this increases the size of each record by 15 KB, would that require roughly 150 MB more memory?

@vnlitvinov (Contributor, Author):

It is 15 KB per parser, not per record :)
So in a typical case (when one parser exists at a time) it's only 15 KB more.

Member:

Ha, thanks for clarifying. I suppose that's quite the difference.

@jreback jreback merged commit cd057c7 into pandas-dev:master Mar 22, 2019
@jreback (Contributor) commented Mar 22, 2019

thanks @vnlitvinov

@gfyoung (Member) commented Mar 22, 2019

@vnlitvinov: Nice work!

@vnlitvinov deleted the khash-na-sets branch Mar 25, 2019
@jreback (Contributor) commented Mar 27, 2019

@vnlitvinov we are getting tons of build warnings, like https://travis-ci.org/pandas-dev/pandas/jobs/511269649

Please see if you can do anything about this.

In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘struct kh_str_starts_t *’ but argument is of type ‘const struct kh_str_starts_t *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c:26118:7: warning: passing argument 2 of ‘kh_get_str_starts_item’ discards ‘const’ qualifier from pointer target type [enabled by default]
       __pyx_t_1 = (kh_get_str_starts_item(__pyx_v_na_hashset, __pyx_v_word) != 0);
       ^
In file included from pandas/_libs/parsers.c:663:0:
pandas/_libs/src/klib/khash_python.h:109:27: note: expected ‘char *’ but argument is of type ‘const char *’
 khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key) {
                           ^
pandas/_libs/parsers.c: In function ‘__pyx_f_6pandas_5_libs_7parsers__try_bool_flex_nogil’:
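
The compiler notes point at the likely fix: the generated parsers.c passes const-qualified pointers, while the helper in khash_python.h takes non-const ones. A plausible const-correct declaration (my assumption, not something settled in this thread) would be:

/* khash_python.h currently declares (per the log above):
 *   khint_t PANDAS_INLINE kh_get_str_starts_item(kh_str_starts_t* table, char* key);
 * Lookup never mutates the table or the key, so marking both const would
 * let the generated code pass const pointers without warnings: */
khint_t PANDAS_INLINE kh_get_str_starts_item(const kh_str_starts_t* table,
                                             const char* key);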

anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019