
Add ability to process bad lines for read_csv #5686


Closed
tbicr opened this issue Dec 12, 2013 · 21 comments · Fixed by #45146
Labels
Enhancement IO CSV read_csv, to_csv
Comments

@tbicr

tbicr commented Dec 12, 2013

CSV files can contain errors, for example:

NAME,PAT
Peter,cat
really bad line
Fedor,cat

To skip such really bad lines there is the error_bad_lines=False parameter.
Another example, where a field contains the delimiter but is not quoted:

NAME,PAT
Peter,cat
Ira,cat,dog
Fedor,cat

which, with quotes, would look like this:

NAME,PAT
Peter,cat
Ira,"cat,dog"
Fedor,cat

So it is easy to fix this line if we know that the first field cannot contain extra separators.
There is also the extra trailing delimiters issue: #2886

A more real-life example:

82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0
55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0

There is a difference between FDP employee and FDP, employee, so it would be great to be able to process these bad lines with a custom handler.

My proposal is to add an additional process_bad_lines parameter to read_csv.
For example, if I want to fix the line:

def bad_line_handler(items):
    '''Probably an ugly example, but let's
       imagine that `FDP, employee` is half of our data.'''
    fdp_index = items.index('FDP')
    return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:]

pd.read_csv(file, process_bad_lines=bad_line_handler)
  1. error_bad_lines and warn_bad_lines can work as before, but the parser first tries to repair the bad line with the process_bad_lines handler.
  2. If process_bad_lines returns None, it is probably better to just skip that line without raising an exception (more flexible); to keep the old behavior, the handler can return the items parameter unchanged. Alternatively, None could stand for "still a bad line", in which case the handler should raise the exception itself.
  3. I cannot always create the CSV file with proper quoting in the first place; somebody already sent me the bad CSV.
  4. I could pre-process the file, but that takes more time and work, see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas
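For illustration, a minimal sketch of this handler idea using the callable form of on_bad_lines that pandas eventually shipped (pandas >= 1.4, python engine only; see the linked fix #45146), applied to the quoting example above:

```python
import io

import pandas as pd

# The unquoted example from above: "cat,dog" spills into an extra field.
data = "NAME,PAT\nPeter,cat\nIra,cat,dog\nFedor,cat\n"

def bad_line_handler(fields):
    # Assumption: only the last column can contain an unquoted separator,
    # so merge all surplus trailing fields back into one.
    return fields[:1] + [",".join(fields[1:])]

# The callable form of on_bad_lines requires pandas >= 1.4 and engine="python".
df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=bad_line_handler)
print(df)  # Ira's row is repaired to "cat,dog"
```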

Some additions:
For example, if I have few string fields, I can assume that only one of the strings contains the separator:

int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int
int,int,int,int,int,datetime,,,str,,str    ,int,int,str,,,int

But this can work badly when there are many string fields:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  str,str,str,str,str  ,int,int,str,,,int

However, it would also be great to have default methods that fix such lines by concatenating the left strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  STR  ,str,str,str,str,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_left')

or right strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,str,str,str,str,  STR  ,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_right')

and also removing extra trailing delimiters:

pd.read_csv(file, process_bad_lines='skip_right_delimiters')
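A user-side sketch of what such a 'try_concat_left'-style default could look like: a handler factory that merges surplus fields into a chosen column until the expected width is reached. The names and semantics here are hypothetical, not an existing pandas API:

```python
def try_concat_left(expected, join_at):
    """Build a bad-line handler that merges surplus fields into the
    column at index ``join_at`` until the row has ``expected`` fields.
    Hypothetical helper illustrating the 'try_concat_left' idea."""
    def handler(fields):
        surplus = len(fields) - expected
        if surplus <= 0:
            return fields  # nothing to repair
        merged = ",".join(fields[join_at:join_at + surplus + 1])
        return fields[:join_at] + [merged] + fields[join_at + surplus + 1:]
    return handler

fix = try_concat_left(expected=3, join_at=1)
print(fix(["a", "x", "y", "b"]))  # ['a', 'x,y', 'b']
```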
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014
@bobcolner

+1

A useful feature for building robust data import.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@William-Sang

+1

@FroxieYe

+1

@gfyoung
Member

gfyoung commented Jan 12, 2017

This request / discussion has been brought up on several occasions now, so I'm going to start linking issues to avoid duplicate discussions:

xref #9549
xref #9729

@kodonnell

Similar suggestion here. Are any pandas developers able to comment on the difficulty and amount of work required to do this? E.g. would one need to implement this in both the C and Python engine?

@gfyoung
Member

gfyoung commented Jan 31, 2017

  1. I don't believe it will be super difficult to implement, but it won't be trivial. There are many different cases you might have to consider when writing tests, while ensuring that the behavior is consistent and "logical" (what this means is somewhat arbitrary when it comes to defining your API).

This is why issues on this matter continue to arise even though PRs have been put up to address some of these cases.

  2. Yes, you would need to implement it for both the C and Python engines.

@kinzleb

kinzleb commented Apr 6, 2018

It would also be nice if there were an attribute on the dataframe (or similar) that could be inspected after the dataframe is populated, storing the count of skipped lines or a list of the skipped line numbers.

I've seen suggestions to capture the warning messages and then inspect those, but this seems too cumbersome.
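With the callable that on_bad_lines eventually accepted (pandas >= 1.4, engine="python"), this bookkeeping can be done as a side effect of the handler; a sketch with made-up data:

```python
import io

import pandas as pd

data = "a,b\n1,2\n3,4,5\n6,7\n8,9,10\n"  # made-up data with two bad lines
bad_rows = []  # collects the raw fields of every malformed line

def record_and_skip(fields):
    bad_rows.append(fields)
    return None  # returning None skips the line

df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=record_and_skip)
print(len(bad_rows))  # 2 lines were skipped
```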

@reinaldo-gomes

reinaldo-gomes commented May 22, 2019

I need to create a report containing details about every line with bad data (including those with too many fields). How am I supposed to do that if pandas won't let me act upon each occurrence?

@daniel-leonel

There could be a parameter that takes a function to process (fix or skip) the bad lines.

@jbrockmendel jbrockmendel removed the Ideas Long-Term Enhancement Discussions label Dec 18, 2019
@dpb-bah

dpb-bah commented Jan 7, 2020

+1

@udaychadha

+1 Would be great to have this. Working right now to resolve this exact issue. This functionality would be amazing. :)

@gfyoung
Member

gfyoung commented Jan 10, 2020

Working right now to resolve this exact issue.

@udaychadha : Awesome to hear! Look forward to seeing what you come up with.

@udaychadha

I'm just outputting every line where there's an error into a txt file, parsing all the line numbers from that file into an array, and opening the csv in Python for only those error lines. Probably not the most ideal way, but I think it should work.
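The line-number bookkeeping described here can also be sketched with the stdlib csv module, assuming the expected column count is known (made-up data):

```python
import csv
import io

data = "a,b\n1,2\n3,4,5\n6,7\n"  # made-up data; line 3 has an extra field
expected = 2

# Collect the 1-based line numbers of every row with the wrong field count.
bad_line_numbers = [
    lineno
    for lineno, row in enumerate(csv.reader(io.StringIO(data)), start=1)
    if len(row) != expected
]
print(bad_line_numbers)  # [3]
```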

@valinolucas407

Working right now to resolve this exact issue.

@udaychadha : Awesome to hear! Look forward to seeing what you come up with.

+1

@tddouglas

+1

@pranshuag9

I am using the pandas 1.3.0 read_csv method with on_bad_lines="warn", but it is throwing an error instead of warning and skipping.

I am using the "warn" argument to get the line number of the incorrect line, and I want to report that error along with the line number. Any update on this issue?

@7yl4r

7yl4r commented Oct 16, 2021

This is why issues for this matter continue to arise even though PR's have been put up to address some of these cases.

What PRs exist that address this? I would like to take a look and see if I can do anything to push this long-requested feature forward.

@jreback
Contributor

jreback commented Oct 16, 2021

there are quite a lot of PRs about parsing csvs that have not been merged and a lot that have - you can take a look thru the history

but i think a solution to either (or both)

  • allow a callable for this arg
  • return the unprocessed text of failed to parse lines

would be useful

@7yl4r

7yl4r commented Oct 16, 2021

I think the ideal solution is to allow passing a callable to the on_bad_lines arg. When I saw on_bad_lines I assumed I could pass a callable so that usage is certainly intuitive to me.

I did a search of PRs on github and did not find any offering an implementation. I read through the search results for "read_csv bad" and for "on_bad_lines" and found nothing relevant. If you know of some please help me find them. I don't want to write a PR for this if a starting point already exists.

I recognize this is not a trivial task but to move it forward we at least need a candidate PR. If some exist I need to see what is wrong with them before creating my own.

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Jan 6, 2022
@stevenlis

Is there any way to count the number of bad lines? Assume you have 100 bad lines out of 200; that's bad... but 100 out of 1M might not be a big deal... so maybe let the user set a threshold (a fraction of len(df))?
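Such a threshold can already be enforced on the caller's side with the callable form of on_bad_lines (pandas >= 1.4, engine="python"); a sketch with made-up data:

```python
import io

import pandas as pd

# Made-up data: 2 malformed lines out of 10 data lines.
data = "a,b\n" + "1,2\n" * 8 + "1,2,3\n" * 2

bad = []  # one entry per malformed line

df = pd.read_csv(
    io.StringIO(data),
    engine="python",  # the callable form needs the python engine
    on_bad_lines=lambda fields: bad.append(fields) or None,  # record, then skip
)

bad_fraction = len(bad) / (len(df) + len(bad))
print(bad_fraction)  # 0.2 here; the caller can raise if it exceeds a threshold
```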
