
Allow parsing in skipped row for C engine #12900


Closed

Conversation

gfyoung (Member) commented Apr 14, 2016

Changes the behaviour of the C engine parser so that parsing is done on skipped rows, ensuring that they are properly skipped.

Closes #10911.
Closes #12775.
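A minimal, hypothetical reproduction of the kind of input this targets (the data below is illustrative, not taken from the linked issues): a skipped row whose quoted field contains an embedded newline.

```python
from io import StringIO
import pandas as pd

# The row to be skipped contains a quoted field with an embedded newline.
# Without quote-aware skipping, the parser treats that newline as the end
# of the skipped row, and the remainder of the record leaks into the output.
data = 'a,b\n"this row\nshould be skipped",1\n2,3\n'

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # expected with this fix: a single row (2, 3)
```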

gfyoung (Member, Author) commented Apr 14, 2016

The issue with allowing parsing on skipped rows is that there is a performance hit, since we're spending more time processing rows that don't really matter. How significant do you think it is, though?

jreback (Contributor) commented Apr 16, 2016

This for sure needs performance testing. You are allowing a special case (that of a terminator character IN the skipped rows) to influence the workhorse parsing. I am not sure this is the correct solution.

jreback added the IO CSV and Bug labels on Apr 16, 2016
gfyoung (Member, Author) commented Apr 16, 2016

@jreback : it's actually a little more general than that. I am addressing all of the cases that occur in tokenizer.c, which goes beyond just the newline-characters-in-quoted-fields bug. The other option is to add new logic in the SKIP_LINE state to just check for quotes, but I wasn't sure how it would handle quotes within quotes.
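For illustration, a small hypothetical example of the "quotes within quotes" case: a skipped row whose quoted field contains doubled (escaped) quotes as well as an embedded newline (the data is made up for demonstration).

```python
from io import StringIO
import pandas as pd

# The skipped record contains a doubled quote ("") and an embedded newline,
# so naive line-based skipping would split it in the wrong place.
data = 'a,b\n"she said ""hi""\nand left",1\n2,3\n'

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # only the row (2, 3) should survive
```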

jreback (Contributor) commented Apr 16, 2016

Well, you can run the benchmarks in the test suite and see what shows up. Prob need some additional ones that have embedded terminators and such. I would special case thinks for a situation like this.

gfyoung (Member, Author) commented Apr 16, 2016

"special case thinks" ?

jreback (Contributor) commented Apr 16, 2016

I think

gfyoung force-pushed the skiprows-newlines-patch branch from 081daa9 to 159aff6 on April 16, 2016 13:14
gfyoung (Member, Author) commented Apr 16, 2016

Added some additional unit tests that my current fixes should be robust against. Let me know if there are any additional unit tests that should be included/written.

Also, which benchmarks should I run my changes against? I had some difficulty finding relevant benchmarks in the code base.

jorisvandenbossche (Member) commented
@gfyoung There is one specific asv bench for skiprows (io_bench/read_csv_skiprows), but you can, e.g., run all tests that contain read_csv, or all tests in the io_bench.py file.

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : how exactly do I run only one of the benchmark tests (i.e. just io_bench.py)? The vb_suite/run_suite.py file doesn't seem to have any way to configure which benchmarks to run.

jorisvandenbossche (Member) commented
See the contributing docs on performance testing: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite (if anything is not clear, please indicate so we can improve the explanation)

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : Thank you! That link was very helpful. I was looking on the github.io page for vbench, which was not as detailed as this.

@everyone: I ran the benchmark for io_bench.csv_skiprows. master currently runs it in 18.62 ms, while my PR runs it in 26.76 ms.

jreback (Contributor) commented Apr 17, 2016

Please run all of the CSV benchmarks.

gfyoung (Member, Author) commented Apr 17, 2016

Benchmark                                     master (ms)   PR (ms)
time_frame_to_csv                                  220.30    227.45
time_frame_to_csv2                                 377.33    364.85
time_frame_to_csv_date_formatting                   17.01     15.51
time_frame_to_csv_mixed                            269.36    257.73
time_read_csv_infer_datetime_format_custom          18.10     18.30
time_read_csv_infer_datetime_format_iso8601          2.97      3.10
time_read_csv_infer_datetime_format_ymd              3.30      3.21
time_read_csv_skiprows                              18.23     25.74
time_read_csv_standard                              14.95     15.65
time_read_parse_dates_iso8601                        2.05      2.14
time_write_csv_standard                             58.79     59.01

According to asv, the performance has not been significantly impacted by my changes, AFAICT.

gfyoung force-pushed the skiprows-newlines-patch branch from 159aff6 to 9fd5ee3 on April 17, 2016 13:55
gfyoung (Member, Author) commented Apr 17, 2016

If there are no complaints about the performance or the changes (they currently pass on Travis; I added [no skip] when changing the commit message), this PR is ready to merge.

break;
}
else if (c == '\n') {
if (c == '\n') {
Contributor:

why is this not lineterminator?

gfyoung (Member, Author) Apr 17, 2016:

Because the lineterminator is not custom in this case? It never was lineterminator AFAICT, as you can see here.

Contributor:

I get that you changed the code; I am looking at why.

gfyoung (Member, Author) Apr 17, 2016:

There is no change here. This function was always checking for \n or \r.

Contributor:

And why is that (and not lineterminator)?

gfyoung (Member, Author):

Because of this logic here

Contributor:

I see; can you add some comments to that effect then?

gfyoung (Member, Author):

Done.
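For context, a hypothetical usage illustration (not code from this PR) of the distinction discussed above: the default code path ends records at \n or \r, while passing a custom lineterminator to read_csv sends the C engine down a separate code path.

```python
from io import StringIO
import pandas as pd

# Default behaviour: records end at \n (or \r\n / \r).
print(pd.read_csv(StringIO("a,b\n1,2\n3,4\n")))

# Custom single-character terminator: handled by a different code path
# in the C engine.
print(pd.read_csv(StringIO("a,b~1,2~3,4~"), lineterminator="~"))
```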

jreback (Contributor) commented Apr 17, 2016

Please have a look through any open CSV parsing issues. You changed some things which may allow some of them to be fixed, e.g. things related to lineterminator and such.

gfyoung force-pushed the skiprows-newlines-patch branch from 9fd5ee3 to 0f8bea3 on April 17, 2016 14:42
gfyoung (Member, Author) commented Apr 17, 2016

I took a look through all of the issues, and I think the two that I mentioned so far are the only ones that can be closed with this PR.

jreback (Contributor) commented Apr 17, 2016

https://github.com/pydata/pandas/search?q=csv+lineterminator&type=Issues&utf8=%E2%9C%93

Please have a look at these and see if any are relevant.

gfyoung (Member, Author) commented Apr 17, 2016

@jreback : I looked through all of the open issues tagged with CSV before I made that comment above.

FYI, you also should have received emails about several of the older CSV issues that should be closed, except for one that I think can be patched.

gfyoung force-pushed the skiprows-newlines-patch branch 2 times, most recently from 6e19a8e to 8e198ac, on April 17, 2016 16:08
gfyoung (Member, Author) commented Apr 17, 2016

@jreback : if there is nothing else, I think this is ready to merge.

jorisvandenbossche (Member) commented
For the skiprows benchmark, there is a considerable slowdown, which is of course logical since there is now more parsing, and the exact slowdown of course depends on the number of skipped rows (this is just the consequence of wanting to solve the bug).
But the question is a bit whether we are OK with this performance hit for such a corner case. I can imagine people use skiprows exactly to get faster parsing when they need only part of the file?
However, I also can't think of a nice way around this (apart from yet another keyword ...).

gfyoung (Member, Author) commented Apr 17, 2016

@jorisvandenbossche : Yes, that slowdown seems to be an unfortunate necessity at this point for the sake of correctness. As I mentioned earlier, the other option was to add more logic under the SKIP_LINE block, but attempts to do so were not robust against additional testing and only appeared to duplicate existing code.

gfyoung force-pushed the skiprows-newlines-patch branch 3 times, most recently from a51d59e to 5a936d6, on April 19, 2016 18:47
gfyoung (Member, Author) commented Apr 21, 2016

@jorisvandenbossche , @jreback : I'm going to table this one for now until #12939 gets merged in. It will be a lot easier to merge a fix for this with only one (hopefully) tokenizing function to work with.

I also realized that an issue with my current fix is that we will be penalizing people if the fields in the skipped row are malformed (since we are parsing them as if they were rows to be included), which is not what we want IINM. Thus, I might need to go back to the drawing board (even if the correctness > the performance as the PR currently stands) to see whether I can build in such safety mechanisms or go back to my other strategy of introducing some additional states for skipped rows.
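To make that concern concrete, a hypothetical example (the data is made up): a skipped row with the wrong number of fields should not cause a tokenization error merely because the skipping path fully parses it.

```python
from io import StringIO
import pandas as pd

# The second line is malformed (3 fields instead of 2), but the caller asked
# for it to be skipped -- skipping it should not raise a parser error.
data = "a,b\n1,2,EXTRA\n3,4\n"

df = pd.read_csv(StringIO(data), skiprows=[1])
print(df)  # expected: a single row (3, 4)
```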

gfyoung force-pushed the skiprows-newlines-patch branch from 5a936d6 to 293d525 on April 22, 2016 00:11
gfyoung (Member, Author) commented Apr 22, 2016

I knew there had to be a better way. Just needed some fresh eyes on the issue; I managed to get the addition of a couple of new states to work, so there won't be any performance hit now like before. Let's see what Travis thinks.
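Not the actual tokenizer.c change, but a rough Python sketch of the idea behind the extra skip states: while skipping a record, track whether we are inside a quoted field so that line terminators inside quotes do not end the skipped record (all names here are illustrative).

```python
def skip_one_record(text, start=0, quotechar='"'):
    """Return the index just past the end of one CSV record starting at
    `start`, treating newlines inside quoted fields as part of the record.
    """
    in_quotes = False
    i, n = start, len(text)
    while i < n:
        c = text[i]
        if c == quotechar:
            if in_quotes and i + 1 < n and text[i + 1] == quotechar:
                i += 2              # doubled quote inside a quoted field
                continue
            in_quotes = not in_quotes
        elif c == "\n" and not in_quotes:
            return i + 1            # unquoted newline ends the record
        i += 1
    return n


data = '"skip me,\nwith an embedded newline",1\nkeep,2\n'
print(data[skip_one_record(data):])   # -> 'keep,2\n'
```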

jreback (Contributor) commented Apr 22, 2016

OK, please rerun the perf tests as well (and maybe run vs. 0.18.0 as a comparison).

gfyoung (Member, Author) commented Apr 22, 2016

PR Branch

io_bench.frame_to_csv.time_frame_to_csv           212.25ms
io_bench.frame_to_csv2.time_frame_to_csv2         356.49ms
...formatting.time_frame_to_csv_date_formatting    15.38ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   253.26ms
...m.time_read_csv_infer_datetime_format_custom    18.20ms
...time_read_csv_infer_datetime_format_iso8601     3.00ms
..._ymd.time_read_csv_infer_datetime_format_ymd     3.21ms
...nch.read_csv_skiprows.time_read_csv_skiprows    20.18ms
...nch.read_csv_standard.time_read_csv_standard    16.00ms
..._dates_iso8601.time_read_parse_dates_iso8601     2.07ms
...h.write_csv_standard.time_write_csv_standard    59.00ms

Master Branch

io_bench.frame_to_csv.time_frame_to_csv           216.67ms
io_bench.frame_to_csv2.time_frame_to_csv2         353.74ms
...formatting.time_frame_to_csv_date_formatting    15.28ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   254.94ms
...m.time_read_csv_infer_datetime_format_custom    17.82ms
....time_read_csv_infer_datetime_format_iso8601     3.03ms
..._ymd.time_read_csv_infer_datetime_format_ymd     3.16ms
...nch.read_csv_skiprows.time_read_csv_skiprows    19.74ms
...nch.read_csv_standard.time_read_csv_standard    16.08ms
..._dates_iso8601.time_read_parse_dates_iso8601     2.11ms
...h.write_csv_standard.time_write_csv_standard    59.25ms

gfyoung (Member, Author) commented Apr 22, 2016

FYI (tangent): I think you have an old build from your most recent PR that you just merged running on Travis right now.

Patches bug in C engine CSV parser in
which quotation marks were not being
respected in skipped rows.

Closes pandas-devgh-10911.
Closes pandas-devgh-12775.
gfyoung force-pushed the skiprows-newlines-patch branch from 293d525 to 858c673 on April 22, 2016 07:48
gfyoung (Member, Author) commented Apr 22, 2016

@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else.

jreback added this to the 0.18.1 milestone on Apr 22, 2016
@@ -124,6 +124,8 @@ typedef enum {
EAT_LINE_COMMENT,
WHITESPACE_LINE,
SKIP_LINE,
QUOTE_IN_SKIP_LINE,
Contributor:

At some point maybe document what these states actually mean a bit more (I know it's mostly self-explanatory, but for future readers).

jreback closed this in 5688d27 on Apr 22, 2016
gfyoung deleted the skiprows-newlines-patch branch on April 22, 2016 15:22
jreback (Contributor) commented Apr 22, 2016

thanks!
