[Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN) #17319
@joshjacobson : Thanks for your report! This is an interesting proposal, but I'm wary about adding it because I have very rarely seen such a use case, and the signature of `read_csv` is already crowded. I should add that you can read the CSV in yourself with Python's standard library and construct the DataFrame from that.
There is another workaround that does preserve all data: manually specify header names that are long enough.
@jorisvandenbossche : Ah, yes, that's a good point.
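That workaround can be sketched as follows, assuming the maximum width is known ahead of time (the three column names here are illustrative, not from the thread):

```python
import io

import pandas as pd

# The example data from the issue: the last row is one field wider
# than the header.
data = """header 1,header 2
data 1,data 2
data 3,data 4,data 5
"""

# Supply enough column names to cover the widest row and skip the
# original (too-short) header line; short rows are padded with NaN.
df = pd.read_csv(io.StringIO(data), names=["a", "b", "c"], skiprows=1)
print(df.shape)  # (2, 3)
```

All data is preserved: `data 5` lands in column `c`, and the shorter first row gets `NaN` there.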
I suppose we could add this to the cookbook, e.g. how to handle CSV special cases. I agree no new options are needed; there are already way too many.
See also #15122 for a general issue on improving the handling of 'bad' lines. But I think it would be a nice addition to this section of the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines (it already shows how to drop the extra value of a too-long row; we could reuse the same example to show how to keep that value instead).
Would you like to do a PR for that?
The comments here (by @gfyoung, and the workaround by @jorisvandenbossche) seem to rely on this being a rare case. I'd like to push back on that belief, since this is an issue I run into quite frequently in my work (the circumstances of which aren't rare and make this issue especially likely). I estimate that over the past year I would have made frequent use of the proposed `expand_header` option. I believe there is a set of conditions that makes this especially difficult to resolve, and that can co-occur with some frequency:
Candidate source data in which I experience 'bad lines' of unpredictable width:
I run into these types of files from the following:
Updated status: In the initial post, I specified why I find workarounds involving Excel impractical. @gfyoung suggested editing the data after reading it in with Python's standard library. Under the circumstances I initially described (wanting to preserve all data, many datasets of uncertain width, and non-standard encoding), I believe this workflow would look something like this:
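A sketch of that pure-Python workflow, under the assumptions above (the in-memory string stands in for a file; the `extra_` placeholder names are an illustrative choice, not from the thread):

```python
import csv
import io

import pandas as pd

raw = """header 1,header 2
data 1,data 2
data 3,data 4,data 5
"""

# Read every row with the stdlib csv module and find the widest row.
rows = list(csv.reader(io.StringIO(raw)))
width = max(len(row) for row in rows)

# Pad short rows with None so all rows have equal length.
padded = [row + [None] * (width - len(row)) for row in rows]

# The header is too short as well; give extra columns placeholder names.
header = rows[0] + [f"extra_{i}" for i in range(width - len(rows[0]))]

df = pd.DataFrame(padded[1:], columns=header)
print(df.shape)  # (2, 3)
```

This preserves every value, at the cost of reimplementing the type inference, encoding handling, and performance that `read_csv` provides.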
@jorisvandenbossche identified an additional workaround: specifying header names manually that are long enough. This workaround seems best suited to circumstances involving a single dataset whose maximum width is known. For the circumstances I describe, I believe the workflow would be:
Given the perhaps greater frequency than initially communicated, and/or the workflows that provide solutions, it may be worth considering a few ways forward:
@joshjacobson : Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward. Good to know that you have encountered this issue with your datasets, but it's important to separate data issues from `read_csv` issues.

Also, I understand that you believe this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up the issue.

Option 2 is not generalizable because we don't want to loosen our requirements on files being correctly formed. The responsibility should be on the user to make sure that their file is formatted correctly before it is read into pandas.

Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.

Again, if more people bring up this issue, we can always revisit, but at this point I would stick with Option 1 as your way to go.
Given data such as the following, in CSV format:

```
header 1, header 2
data 1, data 2
data 3, data 4, data 5
```
On import, pandas will fail with a tokenizing error.
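A minimal reproduction of that failure (the exact error message varies by pandas version and parser engine):

```python
import io

import pandas as pd
from pandas.errors import ParserError

data = """header 1,header 2
data 1,data 2
data 3,data 4,data 5
"""

try:
    pd.read_csv(io.StringIO(data))
except ParserError as exc:
    # Typically something like: "Error tokenizing data. C error:
    # Expected 2 fields in line 3, saw 3"
    error = exc
print(type(error).__name__)  # ParserError
```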
There are many ways around this error, for example:

- passing `usecols`
- setting `error_bad_lines` to `False`

But there are instances where these workarounds are impractical:
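For completeness, the bad-lines workaround can be sketched as below. Note this is where the data loss discussed in the thread occurs: the entire over-long row is dropped. (In pandas ≥ 1.3 the option is spelled `on_bad_lines`; `error_bad_lines=False` was the older spelling and has since been removed.)

```python
import io

import pandas as pd

data = """header 1,header 2
data 1,data 2
data 3,data 4,data 5
"""

# Skip rows with too many fields entirely -- the whole third row,
# not just "data 5", is discarded.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df.shape)  # (1, 2)
```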
Desired feature

In `read_csv` and `read_table`, add an `expand_header` option which, when set to `True`, sets the width of the DataFrame to that of the widest row. Cells left without data as a result would be blank or `NaN`.