
Add ability to process bad lines for read_csv #5686


Closed
tbicr opened this issue Dec 12, 2013 · 21 comments · Fixed by #45146
Labels
Enhancement IO CSV read_csv, to_csv
Comments

@tbicr

tbicr commented Dec 12, 2013

CSV files can contain errors, for example:

NAME,PAT
Peter,cat
really bad line
Fedor,cat

To skip such really bad lines there is the error_bad_lines=False parameter.
Another example, where a field contains the delimiter but is not quoted:

NAME,PAT
Peter,cat
Ira,cat,dog
Fedor,cat

which, with quotes, would look like this:

NAME,PAT
Peter,cat
Ira,"cat,dog"
Fedor,cat

So it is easy to fix this line if we know that the first field cannot contain extra separators.
There is also the extra trailing delimiters issue: #2886

A more real-life example:

82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0
55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0

There is a difference between FDP employee and FDP, employee, so it would be great to be able to process these bad lines with a custom handler.

My proposal is to add an additional process_bad_lines parameter to read_csv.
For example, if I want to fix the line:

def bad_line_handler(items):
    '''Probably an ugly example, but let's
       imagine that `FDP, employee` is half of our data.'''
    fdp_index = items.index('FDP')
    return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:]

pd.read_csv(file, process_bad_lines=bad_line_handler)
  1. error_bad_lines and warn_bad_lines can work as before, but the parser first tries to repair the bad line with the process_bad_lines handler.
  2. If process_bad_lines returns None, it is probably better to just skip that line without raising an exception (more flexible); to keep the old behavior, the handler can return the items parameter unchanged. Alternatively, None could stand for "still a bad line", in which case the handler should raise the exception itself.
  3. I cannot always create the CSV file with proper quoting in the first place; somebody already sent me the bad CSV.
  4. I could pre-process the file, but that takes more time and work, see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas
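For illustration, a minimal sketch of this handler idea using the callable form of on_bad_lines that pandas eventually shipped (pandas >= 1.4, python engine only; see the linked fix #45146), applied to the quoting example above:

```python
import io

import pandas as pd

# The unquoted example from above: "cat,dog" spills into an extra field.
data = "NAME,PAT\nPeter,cat\nIra,cat,dog\nFedor,cat\n"

def bad_line_handler(fields):
    # Assumption: only the last column can contain an unquoted separator,
    # so merge all surplus trailing fields back into one.
    return fields[:1] + [",".join(fields[1:])]

# The callable form of on_bad_lines requires pandas >= 1.4 and engine="python".
df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=bad_line_handler)
print(df)  # Ira's row is repaired to "cat,dog"
```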

Some additions:
For example, if I have few string fields, I can assume that only one of the strings contains the separator:

int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int
int,int,int,int,int,datetime,,,str,,str    ,int,int,str,,,int

But this can work badly when there are many string fields:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  str,str,str,str,str  ,int,int,str,,,int

However, it would also be great to have default methods that fix such lines by concatenating the left strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  STR  ,str,str,str,str,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_left')

or right strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,str,str,str,str,  STR  ,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_right')

and also removing extra trailing delimiters:

pd.read_csv(file, process_bad_lines='skip_right_delimiters')
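A user-side sketch of what such a 'try_concat_left'-style default could look like: a handler factory that merges surplus fields into a chosen column until the expected width is reached. The names and semantics here are hypothetical, not an existing pandas API:

```python
def try_concat_left(expected, join_at):
    """Build a bad-line handler that merges surplus fields into the
    column at index ``join_at`` until the row has ``expected`` fields.
    Hypothetical helper illustrating the 'try_concat_left' idea."""
    def handler(fields):
        surplus = len(fields) - expected
        if surplus <= 0:
            return fields  # nothing to repair
        merged = ",".join(fields[join_at:join_at + surplus + 1])
        return fields[:join_at] + [merged] + fields[join_at + surplus + 1:]
    return handler

fix = try_concat_left(expected=3, join_at=1)
print(fix(["a", "x", "y", "b"]))  # ['a', 'x,y', 'b']
```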
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014
@bobcolner

+1

A useful feature for building robust data import.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@William-Sang

+1

@FroxieYe

+1

@gfyoung
Member

gfyoung commented Jan 12, 2017

This request / discussion has been brought up on several occasions now, so I'm going to start linking issues to avoid duplicate discussions:

xref #9549
xref #9729

@kodonnell

Similar suggestion here. Are any pandas developers able to comment on the difficulty and amount of work required to do this? E.g. would one need to implement this in both the C and Python engine?

@gfyoung
Member

gfyoung commented Jan 31, 2017

  1. I don't believe it will be super difficult to implement, but it won't be trivial. There are many different cases you might have to consider when writing tests, while ensuring that the behavior is consistent and "logical" (what this means is somewhat arbitrary when it comes to defining your API).

This is why issues on this matter continue to arise even though PRs have been put up to address some of these cases.

  2. Yes, you would need to implement it for both the C and Python engines.

@kinzleb

kinzleb commented Apr 6, 2018

It would also be nice if there were an attribute on the dataframe (or similar) that could be inspected after the dataframe is populated, storing the count of skipped lines or a list of the skipped line numbers.

I've seen suggestions to capture the warning messages and then inspect those, but this seems too cumbersome.
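With the callable that on_bad_lines eventually accepted (pandas >= 1.4, engine="python"), this bookkeeping can be done as a side effect of the handler; a sketch with made-up data:

```python
import io

import pandas as pd

data = "a,b\n1,2\n3,4,5\n6,7\n8,9,10\n"  # made-up data with two bad lines
bad_rows = []  # collects the raw fields of every malformed line

def record_and_skip(fields):
    bad_rows.append(fields)
    return None  # returning None skips the line

df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=record_and_skip)
print(len(bad_rows))  # 2 lines were skipped
```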

@reinaldo-gomes

reinaldo-gomes commented May 22, 2019

I need to create a report containing details about every line with bad data (including those with too many fields). How am I supposed to do that if pandas won't let me act upon each occurrence?

@daniel-leonel

There could be a parameter that takes a function to process (fix or skip) the bad lines.

@jbrockmendel jbrockmendel removed the Ideas Long-Term Enhancement Discussions label Dec 18, 2019
@dpb-bah

dpb-bah commented Jan 7, 2020

+1

@udaychadha

+1 Would be great to have this. Working right now to resolve this exact issue. This functionality would be amazing. :)

@gfyoung
Member

gfyoung commented Jan 10, 2020

Working right now to resolve this exact issue.

@udaychadha : Awesome to hear! Look forward to seeing what you come up with.

@udaychadha

I'm just outputting every line where there's an error into a txt file, parsing all the line numbers from that file into an array, and opening the csv in Python for only those error lines. Probably not the most ideal way, but I think it should work.
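The line-number bookkeeping described here can also be sketched with the stdlib csv module, assuming the expected column count is known (made-up data):

```python
import csv
import io

data = "a,b\n1,2\n3,4,5\n6,7\n"  # made-up data; line 3 has an extra field
expected = 2

# Collect the 1-based line numbers of every row with the wrong field count.
bad_line_numbers = [
    lineno
    for lineno, row in enumerate(csv.reader(io.StringIO(data)), start=1)
    if len(row) != expected
]
print(bad_line_numbers)  # [3]
```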

@valinolucas407

Working right now to resolve this exact issue.

@udaychadha : Awesome to hear! Look forward to seeing what you come up with.

+1

@tddouglas

+1

@pranshuag9

I am using the pandas 1.3.0 read_csv method with on_bad_lines="warn", but it is throwing an error instead of warning and skipping.

I am using the "warn" argument to get the line number of the incorrect line, and I want to report that error along with the line number. Any update on this issue?

@7yl4r

7yl4r commented Oct 16, 2021

This is why issues for this matter continue to arise even though PR's have been put up to address some of these cases.

What PRs exist that address this? I would like to take a look and see if I can do anything to push this long-requested feature forward.

@jreback
Contributor

jreback commented Oct 16, 2021

there are quite a lot of PRs about parsing csvs that have not been merged and a lot that have - you can take a look thru the history

but i think a solution to either (or both)

  • allow a callable for this arg
  • return the unprocessed text of failed to parse lines

would be useful

@7yl4r

7yl4r commented Oct 16, 2021

I think the ideal solution is to allow passing a callable to the on_bad_lines arg. When I saw on_bad_lines I assumed I could pass a callable so that usage is certainly intuitive to me.

I did a search of PRs on github and did not find any offering an implementation. I read through the search results for "read_csv bad" and for "on_bad_lines" and found nothing relevant. If you know of some please help me find them. I don't want to write a PR for this if a starting point already exists.

I recognize this is not a trivial task but to move it forward we at least need a candidate PR. If some exist I need to see what is wrong with them before creating my own.

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Jan 6, 2022
@stevenlis

Is there any way to count the number of bad lines? Assume you have 100 bad lines out of 200; that's bad... but 100 out of 1M might not be a big deal... so maybe let the user set a threshold (a fraction of len(df))?
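Such a threshold can already be enforced on the caller's side with the callable form of on_bad_lines (pandas >= 1.4, engine="python"); a sketch with made-up data:

```python
import io

import pandas as pd

# Made-up data: 2 malformed lines out of 10 data lines.
data = "a,b\n" + "1,2\n" * 8 + "1,2,3\n" * 2

bad = []  # one entry per malformed line

df = pd.read_csv(
    io.StringIO(data),
    engine="python",  # the callable form needs the python engine
    on_bad_lines=lambda fields: bad.append(fields) or None,  # record, then skip
)

bad_fraction = len(bad) / (len(df) + len(bad))
print(bad_fraction)  # 0.2 here; the caller can raise if it exceeds a threshold
```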
