
Allow parsing in skipped row for C engine #12900


Closed

Conversation

gfyoung (Member) commented Apr 14, 2016

Changes the behaviour of the C engine parser so that parsing is done on skipped rows, ensuring that they are properly skipped.

Closes #10911.
Closes #12775.
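A minimal, hypothetical reproduction of the kind of input this targets (the data below is illustrative, not taken from the linked issues): a skipped row whose quoted field contains an embedded newline.

```python
from io import StringIO
import pandas as pd

# The row to be skipped contains a quoted field with an embedded newline.
# Without quote-aware skipping, the parser treats that newline as the end
# of the skipped row, and the remainder of the record leaks into the output.
data = 'a,b\n"this row\nshould be skipped",1\n2,3\n'

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # expected with this fix: a single row (2, 3)
```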

gfyoung (Member, Author) commented Apr 14, 2016

The issue with allowing parsing on skipped rows is that there is a performance hit, since we're spending more time processing rows that don't really matter. How significant do you think it is, though?

jreback (Contributor) commented Apr 16, 2016

This for sure needs performance testing. You are allowing a special case (that of a terminator character IN the skipped rows) to influence the workhorse parsing. I am not sure this is the correct solution.

jreback added the IO CSV and Bug labels on Apr 16, 2016
gfyoung (Member, Author) commented Apr 16, 2016

@jreback : it's actually a little more general than that. I am addressing all of the cases that occur in tokenizer.c, which goes beyond just the newline-characters-in-quoted-fields bug. The other option is to add new logic in the SKIP_LINE state to just check for quotes, but I wasn't sure how it would handle quotes within quotes.
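For illustration, a small hypothetical example of the "quotes within quotes" case: a skipped row whose quoted field contains doubled (escaped) quotes as well as an embedded newline (the data is made up for demonstration).

```python
from io import StringIO
import pandas as pd

# The skipped record contains a doubled quote ("") and an embedded newline,
# so naive line-based skipping would split it in the wrong place.
data = 'a,b\n"she said ""hi""\nand left",1\n2,3\n'

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # only the row (2, 3) should survive
```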

jreback (Contributor) commented Apr 16, 2016

Well, you can run the benchmarks in the test suite and see what shows up. Prob need some additional ones that have embedded terminators and such. I would special case thinks for a situation like this.

gfyoung (Member, Author) commented Apr 16, 2016

"special case thinks" ?

jreback (Contributor) commented Apr 16, 2016

I think

gfyoung force-pushed the skiprows-newlines-patch branch from 081daa9 to 159aff6 on April 16, 2016 13:14
gfyoung (Member, Author) commented Apr 16, 2016

Added some additional unit tests that my current fixes should be robust against. Let me know if there are any additional unit tests that should be included/written.

Also, which benchmarks should I run my changes against? I had some difficulty finding relevant benchmarks in the code base.

jorisvandenbossche (Member) commented
@gfyoung There is one specific asv bench for skiprows (io_bench/read_csv_skiprows), but you can, e.g., run all tests that contain read_csv, or all tests in the io_bench.py file.

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : how exactly do I run only one of the benchmark tests (i.e. just io_bench.py)? The vb_suite/run_suite.py file doesn't seem to have any way to configure which benchmarks to run.

jorisvandenbossche (Member) commented
See the contributing docs on performance testing: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite (if anything is not clear, please indicate so we can improve the explanation)

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : Thank you! That link was very helpful. I was looking on the github.io page for vbench, which was not as detailed as this.

@everyone: I ran the benchmark for io_bench.csv_skiprows. master currently runs it in 18.62 ms, while my PR runs it in 26.76 ms.

jreback (Contributor) commented Apr 17, 2016

Please run all of the CSV benchmarks.

gfyoung (Member, Author) commented Apr 17, 2016

Benchmark                                     master (ms)   PR (ms)
time_frame_to_csv                                  220.30    227.45
time_frame_to_csv2                                 377.33    364.85
time_frame_to_csv_date_formatting                   17.01     15.51
time_frame_to_csv_mixed                            269.36    257.73
time_read_csv_infer_datetime_format_custom          18.10     18.30
time_read_csv_infer_datetime_format_iso8601          2.97      3.10
time_read_csv_infer_datetime_format_ymd              3.30      3.21
time_read_csv_skiprows                              18.23     25.74
time_read_csv_standard                              14.95     15.65
time_read_parse_dates_iso8601                        2.05      2.14
time_write_csv_standard                             58.79     59.01

According to asv, the performance has not been significantly impacted by my changes, AFAICT.

gfyoung force-pushed the skiprows-newlines-patch branch from 159aff6 to 9fd5ee3 on April 17, 2016 13:55
gfyoung (Member, Author) commented Apr 17, 2016

If there are no complaints about the performance or the changes (they currently pass on Travis; I added [no skip] when changing the commit message), this PR is ready to merge.

break;
}
else if (c == '\n') {
if (c == '\n') {
Contributor:

why is this not lineterminator?

gfyoung (Member, Author) Apr 17, 2016:

Because the lineterminator is not custom in this case? It never was lineterminator AFAICT, as you can see here.

Contributor:

I get that you changed the code; I am looking at why.

gfyoung (Member, Author) Apr 17, 2016:

There is no change here. This function was always checking for \n or \r.

Contributor:

And why is that (and not lineterminator)?

gfyoung (Member, Author):

Because of this logic here

Contributor:

I see; can you add some comments to that effect then?

gfyoung (Member, Author):

Done.
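For context, a hypothetical usage illustration (not code from this PR) of the distinction discussed above: the default code path ends records at \n or \r, while passing a custom lineterminator to read_csv sends the C engine down a separate code path.

```python
from io import StringIO
import pandas as pd

# Default behaviour: records end at \n (or \r\n / \r).
print(pd.read_csv(StringIO("a,b\n1,2\n3,4\n")))

# Custom single-character terminator: handled by a different code path
# in the C engine.
print(pd.read_csv(StringIO("a,b~1,2~3,4~"), lineterminator="~"))
```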

jreback (Contributor) commented Apr 17, 2016

Please have a look through any open CSV parsing issues. You changed some things which may allow some of them to be fixed, e.g. things related to lineterminator and such.

gfyoung force-pushed the skiprows-newlines-patch branch from 9fd5ee3 to 0f8bea3 on April 17, 2016 14:42
gfyoung (Member, Author) commented Apr 17, 2016

I took a look through all of the issues, and I think the two that I mentioned so far are the only ones that can be closed with this PR.

jreback (Contributor) commented Apr 17, 2016

https://github.com/pydata/pandas/search?q=csv+lineterminator&type=Issues&utf8=%E2%9C%93

Please have a look at these and see if any are relevant.

gfyoung (Member, Author) commented Apr 17, 2016

@jreback : I looked through all of the open issues tagged with CSV before I made that comment above.

FYI, you also should have received emails about several of the older CSV issues that should be closed, except for one that I think can be patched.

gfyoung force-pushed the skiprows-newlines-patch branch 2 times, most recently from 6e19a8e to 8e198ac, on April 17, 2016 16:08
gfyoung (Member, Author) commented Apr 17, 2016

@jreback : if there is nothing else, I think this is ready to merge.

jorisvandenbossche (Member) commented
For the skiprows benchmark, there is a considerable slowdown, which is of course logical since there is now more parsing, and the exact slowdown of course depends on the number of skipped rows (this is just the consequence of wanting to solve the bug).
But the question is a bit whether we are OK with this performance hit for such a corner case. I can imagine people use skiprows exactly to get faster parsing when they need only part of the file?
However, I also can't think of a nice way around this (apart from yet another keyword ...).

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : Yes, that slowdown seems to be an unfortunate necessity at this point for the sake of correctness. As I mentioned earlier, the other option was to add more logic under the SKIP_LINE block, but attempts to do so were not robust against additional testing and only appeared to duplicate existing code.

gfyoung force-pushed the skiprows-newlines-patch branch 3 times, most recently from a51d59e to 5a936d6, on April 19, 2016 18:47
gfyoung (Member, Author) commented Apr 21, 2016

@jorisvandenbossche , @jreback : I'm going to table this one for now until #12939 gets merged in. It will be a lot easier to merge a fix for this with only one (hopefully) tokenizing function to work with.

I also realized that an issue with my current fix is that we will be penalizing people if the fields in the skipped row are malformed (since we are parsing them as if they were rows to be included), which is not what we want IINM. Thus, I might need to go back to the drawing board (even if the correctness > the performance as the PR currently stands) to see whether I can build in such safety mechanisms or go back to my other strategy of introducing some additional states for skipped rows.
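To make that concern concrete, a hypothetical example (the data is made up): a skipped row with the wrong number of fields should not cause a tokenization error merely because the skipping path fully parses it.

```python
from io import StringIO
import pandas as pd

# The second line is malformed (3 fields instead of 2), but the caller asked
# for it to be skipped -- skipping it should not raise a parser error.
data = "a,b\n1,2,EXTRA\n3,4\n"

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # expected: a single row (3, 4)
```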

gfyoung force-pushed the skiprows-newlines-patch branch from 5a936d6 to 293d525 on April 22, 2016 00:11
gfyoung (Member, Author) commented Apr 22, 2016

I knew there had to be a better way. Just needed some fresh eyes on the issue; I managed to get the addition of a couple of new states to work, so there won't be any performance hit now like before. Let's see what Travis thinks.
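Not the actual tokenizer.c change, but a rough Python sketch of the idea behind the extra skip states: while skipping a record, track whether we are inside a quoted field so that line terminators inside quotes do not end the skipped record (all names here are illustrative).

```python
def skip_one_record(text, start=0, quotechar='"'):
    """Return the index just past the end of one CSV record starting at
    `start`, treating newlines inside quoted fields as part of the record.
    """
    in_quotes = False
    i, n = start, len(text)
    while i < n:
        c = text[i]
        if c == quotechar:
            if in_quotes and i + 1 < n and text[i + 1] == quotechar:
                i += 2              # doubled quote inside a quoted field
                continue
            in_quotes = not in_quotes
        elif c == "\n" and not in_quotes:
            return i + 1            # unquoted newline ends the record
        i += 1
    return n


data = '"skip me,\nwith an embedded newline",1\nkeep,2\n'
print(data[skip_one_record(data):])   # -> 'keep,2\n'
```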

jreback (Contributor) commented Apr 22, 2016

OK, please rerun the perf tests as well (and maybe run vs. 0.18.0 as a comparison).

gfyoung (Member, Author) commented Apr 22, 2016

PR Branch

io_bench.frame_to_csv.time_frame_to_csv           212.25ms
io_bench.frame_to_csv2.time_frame_to_csv2         356.49ms
...formatting.time_frame_to_csv_date_formatting    15.38ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   253.26ms
...m.time_read_csv_infer_datetime_format_custom    18.20ms
...time_read_csv_infer_datetime_format_iso8601     3.00ms
..._ymd.time_read_csv_infer_datetime_format_ymd     3.21ms
...nch.read_csv_skiprows.time_read_csv_skiprows    20.18ms
...nch.read_csv_standard.time_read_csv_standard    16.00ms
..._dates_iso8601.time_read_parse_dates_iso8601     2.07ms
...h.write_csv_standard.time_write_csv_standard    59.00ms

Master Branch

io_bench.frame_to_csv.time_frame_to_csv           216.67ms
io_bench.frame_to_csv2.time_frame_to_csv2         353.74ms
...formatting.time_frame_to_csv_date_formatting    15.28ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   254.94ms
...m.time_read_csv_infer_datetime_format_custom    17.82ms
....time_read_csv_infer_datetime_format_iso8601     3.03ms
..._ymd.time_read_csv_infer_datetime_format_ymd     3.16ms
...nch.read_csv_skiprows.time_read_csv_skiprows    19.74ms
...nch.read_csv_standard.time_read_csv_standard    16.08ms
..._dates_iso8601.time_read_parse_dates_iso8601     2.11ms
...h.write_csv_standard.time_write_csv_standard    59.25ms

gfyoung (Member, Author) commented Apr 22, 2016

FYI (tangent): I think you have an old build from your most recent PR that you just merged running on Travis right now.

Patches bug in C engine CSV parser in
which quotation marks were not being
respected in skipped rows.

Closes pandas-devgh-10911.
Closes pandas-devgh-12775.
gfyoung force-pushed the skiprows-newlines-patch branch from 293d525 to 858c673 on April 22, 2016 07:48
gfyoung (Member, Author) commented Apr 22, 2016

@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else.

jreback added this to the 0.18.1 milestone on Apr 22, 2016
@@ -124,6 +124,8 @@ typedef enum {
EAT_LINE_COMMENT,
WHITESPACE_LINE,
SKIP_LINE,
QUOTE_IN_SKIP_LINE,
Contributor:

At some point maybe document what these states actually mean a bit more (I know it's mostly self-explanatory, but for future readers).

jreback closed this in 5688d27 on Apr 22, 2016
gfyoung deleted the skiprows-newlines-patch branch on April 22, 2016 15:22
jreback (Contributor) commented Apr 22, 2016

thanks!
