DOC: Adding methods for "bad" lines that preserve all data #17385
joshjacobson commented Aug 30, 2017
- closes [Feature Request] On import, allow option for number of fields to match widest row of data (filling missing with NaN) #17319
@@ -1175,6 +1175,80 @@ data that appear in some lines but not others:
0 1 2 3
1 4 5 6
2 8 9 10

Handling "bad" lines - preserving the data
Not sure if you need this. I think one giant section will work.
If we eliminate the `open` workaround, I agree. Otherwise it felt like the `open` workaround might be difficult to understand in context.
Even so, you can instead add a transition sentence between your code blocks, noting that this `open` workaround is also for preserving data.
doc/source/io.rst
Outdated
Handling "bad" lines - preserving the data
''''''''''''''''''''

To preserve all data, you can specify header ``names`` that are long enough:
"that are long enough" might come across as meaning that the actual string name elements are long length-wise (also, `names` is singular, not plural, in this context). I think you can reword to make it clearer that we're talking about the array itself being long.
1 4 5 6 7
2 8 9 10 NaN

or you can use Python's ``open`` command to detect the length of the widest row:
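For reference, a rough sketch of what such an `open`-based workaround could look like. This is not the code from the PR diff, which is not reproduced in this thread; the helper name `count_widest_row`, the sample data, and the temporary file are illustrative assumptions.

```python
import csv
import os
import tempfile

import pandas as pd


def count_widest_row(path, delimiter=","):
    """Return the largest number of fields found on any line of the file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return max((len(row) for row in reader), default=0)


# Made-up ragged data, written to a temporary file only so the sketch runs.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("1,2,3\n4,5,6,7\n8,9,10\n")
    path = tmp.name

try:
    # First pass: find the widest row. Second pass: parse with that many columns.
    width = count_widest_row(path)
    df = pd.read_csv(path, header=None, names=list(range(width)))
finally:
    os.remove(path)
```

The file is read twice here, which may matter for very large inputs.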
I'm not sure you need to include this workaround. @jorisvandenbossche's suggestion, which you added above, should suffice.
Note that you can always over-specify the length and cut down later, which is a lot less painful than this, even though I proposed it 😄
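A minimal sketch of the over-specify-then-trim idea, using made-up data and an arbitrary guess of 10 columns (both are assumptions for illustration, not part of the PR):

```python
from io import StringIO

import pandas as pd

# Made-up ragged data: the second row has one more field than the others.
data = "1,2,3\n4,5,6,7\n8,9,10\n"

# Deliberately pass more names than any row could need (10 is an arbitrary guess).
wide = pd.read_csv(StringIO(data), header=None, names=list(range(10)))

# Cut down afterwards by dropping the columns that ended up entirely empty.
trimmed = wide.dropna(axis="columns", how="all")
```

One caveat: `dropna(how="all")` would also drop a genuine column that happens to be empty on every row.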
I think the generation of arbitrary labels is generally an approach that should be avoided:
- To others reviewing the code, the DataFrame, or a potentially exported file, it implies some level of purpose or customization in label specification. They might seek to understand why these labels were used, and whether there's a matching codebook or other DataFrames.
- It's messy. Using a `for` loop to create strings for the number of variables necessary doesn't feel like something that should be required.
So I like including at least the first `open` example, because it's the only way to preserve all data without customizing header labels.
Not really, in fact. Just use `names = ['dummy'] * n`, and we'll take care of the duplicates for you.
@gfyoung Can you clarify? Where's the value for `n` coming from?
@joshjacobson: Replace `n` with whatever number of columns you want. This is a placeholder. My point is that it isn't very hard to specify a list of dummy names.
Codecov Report
@@ Coverage Diff @@
## master #17385 +/- ##
==========================================
- Coverage 91.03% 91.02% -0.02%
==========================================
Files 163 163
Lines 49580 49580
==========================================
- Hits 45137 45128 -9
- Misses 4443 4452 +9
Continue to review full report at Codecov.
This should be a much simpler example; there is no need to show writing a new CSV. Just show how to use `names`.
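A simpler example along these lines might look like the following sketch. The data and column labels are made up for illustration, and `StringIO` stands in for a file so nothing is written to disk:

```python
from io import StringIO

import pandas as pd

# Made-up ragged data: the second row carries a fourth field.
data = "1,2,3\n4,5,6,7\n8,9,10\n"

# Supplying four names makes room for the widest row;
# the shorter rows are padded with NaN in the last column.
df = pd.read_csv(StringIO(data), header=None, names=["a", "b", "c", "d"])
```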
Closing as stale. It's a worthwhile addition if you can respond to the comments (ping and we can reopen).