-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Allow parsing in skipped row for C engine #12900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The issue with allowing parsing on skipped rows is that there is a performance hit since we're spending more time processing rows that don't really matter. How significant is it though do you think? |
this for sure needs performance testing. You are allowing a special case (that of a terminator character IN the skiprows) to influence the workhorse parsing. I am not sure this is the correct soln. |
@jreback : it's actually a little more general than that. I am addressing all of the cases that occur in |
well, you can run the benchmarks in the test suite and see what shows up. prob need some addtl ones that hae embeeded terminators and such. I would special case thinks for a situation like this. |
"special case thinks" ? |
I think |
081daa9
to
159aff6
Compare
Added some additional unit tests that my current fixes should be robust against. Let me know if there are any additional unit tests that should be included/written. Also, which benchmarks should I run my changes against? I had some difficulty finding relevant benchmarks in the code base. |
@gfyoung There is one specific asv bench for skiprows (io_bench/read_csv_skiprows), but you can eg run all tests that contain |
@jorisvandenbossche : how do I exactly run only one of the benchmark tests (i.e. just |
See the contributing docs on performance testing: http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite (if anything is not clear, please indicate so we can improve the explanation) |
@jorisvandenbossche : Thank you! That link was very helpful. I was looking on the @everyone: I ran the benchmark for |
pls run all of the csv benchmarks |
As can be observed according to |
159aff6
to
9fd5ee3
Compare
If there are no complaints about the performance or the changes (they pass currently against Travis, I added |
break; | ||
} | ||
else if (c == '\n') { | ||
if (c == '\n') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this not lineterminator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the lineterminator
is not custom in this case? It never was lineterminator
AFAICT, as you can see here here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that you changed the code, I am looking at why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no change here. This function was always checking for \n
or \r
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and why is that (and not lineterminator)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of this logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, can you add some comments to that effect then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pls have a look thru any open csv parsing issues. you changed some things which may allow some of them to be fixed e.g. things related to lineterminator and such. |
9fd5ee3
to
0f8bea3
Compare
I took a look through all of the issues, and I think the two that I mentioned so far are the only ones that can be closed with this PR. |
https://github.com/pydata/pandas/search?q=csv+lineterminator&type=Issues&utf8=%E2%9C%93 pls have a look at these and see if any are relevant |
@jreback : I looked through all of the open issues tagged with FYI, you also should have received emails about several of the older |
6e19a8e
to
8e198ac
Compare
@jreback : if there is nothing else, I think this is ready to merge. |
For the skiprows benchmark, there is a considerable slowdown, which is of course logical since there is now more parsing, and the exact slowdown of course depends on the number of skipped rows (this is just the consequence of wanting to solve the bug). |
@jorisvandenbossche : Yes, that slowdown seems to be an unfortunate necessity at this point for the sake of correctness. As I mentioned earlier, the other option was to add more logic under the |
a51d59e
to
5a936d6
Compare
@jorisvandenbossche , @jreback : I'm going to table this one for now until #12939 gets merged in. It will be a lot easier to merge a fix for this with only one (hopefully) tokenizing function to work with. I also realized that an issue with my current fix is that we will be penalizing people if the fields in the skipped row are malformed (since we are parsing them as if they were rows to be included), which is not what we want IINM. Thus, I might need to go back to the drawing board (even if the correctness > the performance as the PR currently stands) to see whether I can build in such safety mechanisms or go back to my other strategy of introducing some additional states for skipped rows. |
5a936d6
to
293d525
Compare
I knew there had to be a better way. Just needed some fresh eyes on the issue and managed to get the adding of a couple of new states to work --> there won't be any performance hit now like before. Let's see what Travis thinks. |
ok pls rerun the perf tests as well (and maybe run vs 0.18.0) as a comparison |
PR Branchio_bench.frame_to_csv.time_frame_to_csv 212.25ms io_bench.frame_to_csv2.time_frame_to_csv2 356.49ms ...formatting.time_frame_to_csv_date_formatting 15.38ms ...h.frame_to_csv_mixed.time_frame_to_csv_mixed 253.26ms ...m.time_read_csv_infer_datetime_format_custom 18.20ms ...time_read_csv_infer_datetime_format_iso8601 3.00ms ..._ymd.time_read_csv_infer_datetime_format_ymd 3.21ms ...nch.read_csv_skiprows.time_read_csv_skiprows 20.18ms ...nch.read_csv_standard.time_read_csv_standard 16.00ms ..._dates_iso8601.time_read_parse_dates_iso8601 2.07ms ...h.write_csv_standard.time_write_csv_standard 59.00ms Master Branchio_bench.frame_to_csv.time_frame_to_csv 216.67ms io_bench.frame_to_csv2.time_frame_to_csv2 353.74ms ...formatting.time_frame_to_csv_date_formatting 15.28ms ...h.frame_to_csv_mixed.time_frame_to_csv_mixed 254.94ms ...m.time_read_csv_infer_datetime_format_custom 17.82ms ....time_read_csv_infer_datetime_format_iso8601 3.03ms ..._ymd.time_read_csv_infer_datetime_format_ymd 3.16ms ...nch.read_csv_skiprows.time_read_csv_skiprows 19.74ms ...nch.read_csv_standard.time_read_csv_standard 16.08ms ..._dates_iso8601.time_read_parse_dates_iso8601 2.11ms ...h.write_csv_standard.time_write_csv_standard 59.25ms |
FYI (tangent): I think you have an old build from your most recent PR that you just merged running on Travis right now. |
Patches bug in C engine CSV parser in which quotation marks were not being respected in skipped rows. Closes pandas-devgh-10911. Closes pandas-devgh-12775.
293d525
to
858c673
Compare
@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else. |
@@ -124,6 +124,8 @@ typedef enum { | |||
EAT_LINE_COMMENT, | |||
WHITESPACE_LINE, | |||
SKIP_LINE, | |||
QUOTE_IN_SKIP_LINE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point maybe document what these states actually mean a bit more (I know its mostly self-explanatory, but for future readers)
thanks! |
Changes behaviour of C engine parser so that parsing is done on skipped rows so that they
are properly skipped.
Closes #10911.
Closes #12775.