[Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN) #17319


Closed
joshjacobson opened this issue Aug 23, 2017 · 8 comments · Fixed by #44652
Labels
Docs · IO CSV · Missing-data

Comments

@joshjacobson

joshjacobson commented Aug 23, 2017

Given data such as the following, in CSV format:

header 1, header 2
data 1, data 2
data 3, data 4, data 5

On import, Pandas will fail with the error:

ParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3

There are many ways around this error:

  1. Loading the file in a spreadsheet program (e.g., Excel) and saving it as a new file
  2. Specifying usecols
  3. Setting error_bad_lines to False

But there are instances where these workarounds are impractical:

  1. When you want to preserve all of your data (usecols and error_bad_lines will cause data loss)
  2. When using a spreadsheet program is not feasible (e.g., encodings not supported by Excel, a large number of files, or especially the combination of the two)

Desired feature

In read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
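
For illustration only, a call might look like the sketch below (expand_header is the proposed option and does not exist in pandas; 'ragged.csv' is a placeholder filename):

import pandas as pd

# Hypothetical usage of the proposed option -- expand_header is not part of
# the current read_csv signature. With expand_header=True, the frame would be
# as wide as the widest row, and shorter rows would be padded with NaN.
df = pd.read_csv('ragged.csv', expand_header=True)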

@joshjacobson joshjacobson changed the title [Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN), or 'use_cols' option to avoid error [Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN) Aug 23, 2017
@gfyoung gfyoung added Enhancement IO CSV read_csv, to_csv labels Aug 23, 2017
@gfyoung
Member

gfyoung commented Aug 23, 2017

@joshjacobson : Thanks for your report! This is an interesting proposal, but I'm wary about adding this because I have very rarely seen such a use case, and the signature for read_csv is pretty bloated as it is.

I should add that you can read the CSV in yourself with Python's open command and patch each incorrect line by adding additional delimiters to indicate missing fields.
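
A minimal sketch of that approach, assuming a plain comma-delimited file ('some.csv' is a placeholder name): pad each short row with extra fields in memory, then hand the result to read_csv.

import csv
import io

import pandas as pd

# Read the raw lines, then pad each short row with empty fields so that
# every row has as many delimiters as the widest one.
with open('some.csv', newline='') as f:
    rows = list(csv.reader(f))

max_width = max(len(row) for row in rows)

# Re-serialize the padded rows into an in-memory buffer and hand it to read_csv.
buf = io.StringIO()
csv.writer(buf).writerows(row + [''] * (max_width - len(row)) for row in rows)
buf.seek(0)

df = pd.read_csv(buf)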

@gfyoung gfyoung added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 23, 2017
@jorisvandenbossche
Member

There is another workaround that does preserve all data: specify header names manually that are long enough:

In [90]: data = """a,b
    ...: 1,2
    ...: 1,2,3"""

In [92]: pd.read_csv(StringIO(data))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

In [93]: pd.read_csv(StringIO(data), names=['a', 'b', 'c'])
Out[93]: 
   a  b    c
0  a  b  NaN
1  1  2  NaN
2  1  2  3.0

@gfyoung
Member

gfyoung commented Aug 23, 2017

@jorisvandenbossche : Ah, yes, that's a good point.

@jreback
Contributor

jreback commented Aug 23, 2017

I suppose we could add this to the cookbook, e.g. how to handle CSV special cases. I agree no new options are needed; there are already way too many.

@jreback jreback closed this as completed Aug 23, 2017
@jreback jreback added this to the No action milestone Aug 23, 2017
@jorisvandenbossche
Member

See also #15122 for a general issue on improving the handling of 'bad' lines, although this exact case is not included in that list. Part of the reason is that this feature would be very expensive: it first has to check the length of all rows, and only then parse again with the appropriate number of columns.

But I think it would be a nice addition to this section of the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines (it already shows how to drop the extra value of a too-long row; we could re-use the same example to show how to keep that value instead).

@jorisvandenbossche jorisvandenbossche removed this from the No action milestone Aug 24, 2017
@jorisvandenbossche
Member

Would you like to do a PR for that?

@joshjacobson
Author

joshjacobson commented Aug 30, 2017

The comments here (by @gfyoung, and the workaround by @jorisvandenbossche) seem to rely on this being a rare case. I'd like to push back on that belief, since this is an issue I run into quite frequently in my work (under circumstances that aren't rare and that make this issue especially likely). I estimate that over the past year I would have used the proposed expand_header about once every two weeks, on about a third of my data imports to pandas.

I believe there is a set of conditions that makes this especially difficult to resolve, and these conditions can co-occur with some frequency:

  • Data with 'bad lines' of unpredictable widths
  • Desire to keep all data
  • Many source data files
  • Foreign characters (such that encoding needs to be specified)

Candidate source data in which I experience 'bad lines' of unpredictable width:

  • Data without column headers that also has missing values
  • Data collected from a variety of non-standardized sources and collated together
  • Data parsed from PDFs

I run into these types of files in the following situations:

  • Data collected within developing countries
    -- This can frequently involve little to no auditing or professional oversight, so if someone is collecting 10 fields of data in one geography and someone else is collecting 25 in another, the results may simply be combined as lines within a single data source (without headers). The next week's data may have quite a different number of fields.
    -- I've had workflows that involve handling a large number of these at once, for example 8 years of weekly reports across ~1000 geographic administrations (approx. 400,000 data files).

  • Machine-generated datasets without headers
    -- For example, I worked with a dataset of weather stations (one station per row) that reported data (without headers) in which each rainfall observation was recorded as a new column. The rows were of unpredictable width, and there were a number of different files to deal with (various geographies, various time periods).

Updated status:

In the initial post, I specified why I find workarounds involving Excel, usecols, and error_bad_lines to be inadequate (encoding issues, the desire to preserve all data).

@gfyoung suggested editing the data after reading it in with Python's open command.

Under the circumstances I initially described (wanting to preserve all data, many data sets of uncertain width, and non-standard encoding), I believe this workflow would look something like this:

  1. Read in each line
  2. Detect the width of the longest line
  3. Read in each line again, appending commas to match the width of the longest line
  4. Write file with amended lines
  5. Import to pandas
import csv
import pandas as pd

# Steps 1-2: read in each line and detect the width of the longest line
# (count fields rather than commas; each csv.reader row is a list of fields).
with open('some.csv', newline='', encoding='latin-1') as f:
    max_width = max(len(row) for row in csv.reader(f))

# Step 3: read in each line again, padding short rows with empty fields
# so every row matches the width of the longest line.
with open('some.csv', newline='', encoding='latin-1') as f:
    amended_rows = [row + [''] * (max_width - len(row)) for row in csv.reader(f)]

# Step 4: write the file back out with the amended lines.
with open('some.csv', 'w', newline='', encoding='latin-1') as f:
    csv.writer(f).writerows(amended_rows)

# Step 5: import to pandas.
pd.read_csv('some.csv', encoding='latin-1')

@jorisvandenbossche identified an additional workaround: specifying header names manually that are long enough. This workaround seems best suited for circumstances involving a single dataset whose maximum width is known.

For the circumstances I describe, I believe the workflow would be:

  1. Read in each line using Python's open command
  2. Detect the width of the longest line
  3. Generate a list of labels of length matching the width of the longest line
  4. Import to pandas, specifying the label list
import csv
import pandas as pd

# Steps 1-2: read in each line and detect the width of the longest line
# (count fields; each csv.reader row is a list of fields).
with open('some.csv', newline='', encoding='latin-1') as f:
    max_width = max(len(row) for row in csv.reader(f))

# Step 3: generate one label per column (column_0, column_1, ...),
# building each label from scratch so the names don't accumulate.
col_labels = ['column_' + str(col_num) for col_num in range(max_width)]

# Step 4: import to pandas, specifying the label list.
pd.read_csv('some.csv', names=col_labels, encoding='latin-1')

Given that this problem may be more frequent than initially communicated, and given the workflows above that provide solutions, it may be worth considering a few ways forward:

  1. Accept that, while maybe not incredibly rare, this is still rare enough that no changes to pandas are warranted, and move forward as previously determined by adding this case to the docs.

  2. Include in pandas some of the necessary functionality to make the workarounds successful. For example, adding a function that detects the maximum width of a row in a data file (without importing it as a DataFrame); a rough sketch of such a helper follows this list.

  3. Eliminate the need for some of the activities in the workarounds. For example, allow the specification of the number of columns on import without providing names (avoiding the label generation for names).

  4. [Initial proposal] In read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
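
To make option 2 concrete, here is a rough sketch of what such a width-detection helper could look like as a standalone function (get_max_width is a hypothetical name, not an existing pandas function; 'some.csv' and the latin-1 encoding are just placeholders):

import csv

import pandas as pd

def get_max_width(path, encoding=None, delimiter=','):
    """Return the number of fields in the widest row of a delimited text file."""
    with open(path, newline='', encoding=encoding) as f:
        return max((len(row) for row in csv.reader(f, delimiter=delimiter)), default=0)

# Combined with option 3's idea, the detected width could drive the import directly:
width = get_max_width('some.csv', encoding='latin-1')
df = pd.read_csv('some.csv', encoding='latin-1',
                 names=['column_' + str(i) for i in range(width)])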

@gfyoung
Member

gfyoung commented Aug 30, 2017

@joshjacobson : Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward.

Good to know that you have encountered this issue with your datasets, but it's important to distinguish between data issues and read_csv being too strict. In this case, I think it falls more on the former than the latter, given your description of the data sources.

Also, I understand that you believe that this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up this issue.

Option 2 is not generalizable because we don't want to loosen our requirements on files being correctly formed. The responsibility should be on the user to make sure that their file is formatted correctly to be read into pandas.

Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.

Again, if more people bring up this issue, we can always revisit, but at this point, I would stick with option 1 as your way to go.
