Add ability to process bad lines for read_csv #5686
+1 Useful feature for building robust data import.
+1
+1
Similar suggestion here. Are any pandas developers able to comment on the difficulty and amount of work required to do this? E.g. would one need to implement this in both the C and Python engines?
This is why issues on this matter continue to arise even though PRs have been put up to address some of these cases.
It would also be nice if there were an attribute on the DataFrame, or something similar that could be inspected after the DataFrame is populated, that stored the count of lines which were skipped, or a list of the skipped line numbers. I've seen suggestions to capture the warning messages and then inspect those, but that seems too cumbersome.
I need to create a report containing details about every line with bad data (including those with too many fields). How am I supposed to do that if pandas won't let me act upon each occurrence?
Could have a parameter to pass a function to process (fix or skip) the bad lines.
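Such a parameter did eventually appear: in recent pandas versions (1.4+), `on_bad_lines` accepts a callable when `engine="python"`. A minimal sketch of the "fix or skip" idea — the CSV data and handler name here are made up for illustration:

```python
import io
import pandas as pd

csv_data = "a,b\n1,2\n3,4,5\n6,7\n"  # the second data row has an extra field

bad = []  # collected bad rows, e.g. for the report use case discussed above

def handle_bad_line(fields):
    # `fields` is the offending row split into strings; returning None skips it,
    # returning a list of the right length keeps a repaired row instead.
    bad.append(fields)
    return None

df = pd.read_csv(io.StringIO(csv_data), engine="python",
                 on_bad_lines=handle_bad_line)
print(len(df))  # 2 -- the bad row was skipped
print(bad)      # [['3', '4', '5']]
```

Returning a repaired list instead of `None` would keep the row, which is exactly the fix-or-skip behaviour asked for here.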
+1 |
+1 Would be great to have this. Working right now to resolve this exact issue. This functionality would be amazing. :) |
@udaychadha : Awesome to hear! Look forward to seeing what you come up with. |
I'm just outputting every line where there's an error into a txt file, parsing all the line numbers from that file into an array, and opening the csv in Python for only those error lines. Probably not the most ideal way, but I think it should work.
+1 |
+1 |
I am using the pandas 1.3.0 `read_csv` method with `on_bad_lines="warn"`, but it is throwing an error instead of warning and skipping. I am using this `"warn"` argument to get the line number of the incorrect line, and I want to surface that error along with the line number. Any update on this issue?
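For reference, the skipped-line messages from `on_bad_lines="warn"` can be captured and inspected afterwards via the `warnings` module. A small sketch — the sample CSV is made up, and the exact warning class/message can vary between pandas versions:

```python
import io
import warnings
import pandas as pd

csv_data = "a,b\n1,2\n3,4,5\n6,7\n"  # second data row has an extra field

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")

# The bad row is skipped; recent pandas reports it as a ParserWarning
# whose message includes the offending line number.
print(len(df))
print([str(w.message) for w in caught])
```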
What PRs exist that address this? I would like to take a look and see if I can do anything to push this long-requested feature forward. |
there are quite a lot of PRs about parsing csvs that have not been merged, and a lot that have - you can take a look thru the history, but i think a solution to either (or both)
would be useful
I think the ideal solution is to allow passing a callable to the `on_bad_lines` parameter. I did a search of PRs on GitHub and did not find any offering an implementation. I read through the search results for "read_csv bad" and for "on_bad_lines" and found nothing relevant. If you know of some, please help me find them. I don't want to write a PR for this if a starting point already exists. I recognize this is not a trivial task, but to move it forward we at least need a candidate PR. If some exist, I need to see what is wrong with them before creating my own.
Is there any way to count the number of bad lines? Say you have 100 bad lines out of 200 - that's bad... but 100 out of 1M might not be a big deal... so maybe let the user set a threshold (fraction of `len(df)`)?
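A count (and a user-chosen threshold) can be built today on the callable form of `on_bad_lines` (pandas 1.4+, Python engine). A sketch with made-up data and a hypothetical 50% threshold:

```python
import io
import pandas as pd

# 6 data rows, 2 of them bad (extra field) -- all values made up
csv_data = "a,b\n1,2\n3,4,5\n6,7\n8,9,0\n10,11\n12,13\n"

bad_lines = []

def record_bad(fields):
    bad_lines.append(fields)
    return None  # skip the offending row

df = pd.read_csv(io.StringIO(csv_data), engine="python", on_bad_lines=record_bad)

# Hypothetical user-chosen threshold: fail only if more than half the rows are bad.
bad_fraction = len(bad_lines) / (len(bad_lines) + len(df))
if bad_fraction > 0.5:
    raise ValueError(f"too many bad lines: {bad_fraction:.0%}")
print(bad_fraction)  # fraction of input rows that were bad
```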
CSV files can contain various errors. To skip really bad lines, the `error_bad_lines=False` parameter already exists. Another kind of error is a field that contains the delimiter but is not quoted: such a line splits into too many fields, whereas the properly quoted version parses correctly. It is easy to fix such a line if you know that the first field cannot contain extra separators. There is also the extra trailing delimiters issue: #2886.
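The repair described above can be sketched as a plain function (the names, column layout, and data here are illustrative, not part of any pandas API): merge the surplus fields back into the one column that is allowed to contain separators.

```python
def fix_line(fields, expected=3, text_col=1):
    """Merge surplus fields into the text column at index `text_col`,
    assuming all other columns cannot contain the separator."""
    extra = len(fields) - expected
    if extra <= 0:
        return fields  # nothing to repair
    merged = ",".join(fields[text_col:text_col + extra + 1])
    return fields[:text_col] + [merged] + fields[text_col + extra + 1:]

# An unquoted "FDP, employee" splits into two fields; fix_line rejoins them.
print(fix_line(["01.01.2013", "FDP", " employee", "100"]))
# ['01.01.2013', 'FDP, employee', '100']
```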
A more real-life example: there is a difference between `FDP employee` and `FDP, employee`. So it would be great to have the ability to process these bad lines with your own handler.

My proposal is to add an additional `process_bad_lines` parameter to `read_csv`: a callable that gets the chance to fix each bad line. `error_bad_lines` and `warn_bad_lines` can work as before, but the parser would first try to replace the bad line with the result of the `process_bad_lines` handler. The handler would return `None` when it is better to just skip the line without an exception (that is probably more flexible); to preserve compatibility, returning the `items` parameter unchanged would keep the old behaviour. Alternatively, `None` could be treated as "still a bad line", in which case it would be better to raise an exception from inside the `process_bad_lines` handler.

Some additions:
For example, when I have only a few string fields, I can assume that one particular string field contains the separator and merge the surplus fields into it. But this can work badly when many fields are strings. It would also be great to have default methods for fixing such lines: concatenating the surplus fields to the left, concatenating them to the right, and removing extra trailing delimiters.
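Those default repair methods might look like the following sketch (the function names are invented here, not pandas API):

```python
def merge_left(fields, expected):
    # Concatenate the surplus fields into the leftmost field.
    extra = len(fields) - expected
    if extra <= 0:
        return fields
    return [",".join(fields[:extra + 1])] + fields[extra + 1:]

def merge_right(fields, expected):
    # Concatenate the surplus fields into the rightmost field.
    extra = len(fields) - expected
    if extra <= 0:
        return fields
    return fields[:-extra - 1] + [",".join(fields[-extra - 1:])]

def strip_trailing(fields, expected):
    # Drop empty trailing fields left by extra delimiters (see GH #2886).
    while len(fields) > expected and fields[-1] == "":
        fields = fields[:-1]
    return fields

print(merge_left(["a", "b", "c", "d"], 3))     # ['a,b', 'c', 'd']
print(merge_right(["a", "b", "c", "d"], 3))    # ['a', 'b', 'c,d']
print(strip_trailing(["a", "b", "c", ""], 3))  # ['a', 'b', 'c']
```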