
DOC: Adding methods for "bad" lines that preserve all data #17385


Closed
wants to merge 3 commits into from

Conversation

@gfyoung gfyoung added Docs IO CSV read_csv, to_csv labels Aug 30, 2017
@@ -1175,6 +1175,80 @@ data that appear in some lines but not others:
0 1 2 3
1 4 5 6
2 8 9 10

Handling "bad" lines - preserving the data
Member

Not sure if you need this. I think one giant section will work.

Author

If we eliminate the open workaround, I agree. Otherwise it felt like the open workaround might be difficult to understand in context.

Member

Even so, you can add a transition sentence between your code blocks instead, noting that this ``open`` workaround is also for preserving data.

Handling "bad" lines - preserving the data
''''''''''''''''''''''''''''''''''''''''''

To preserve all data, you can specify header ``names`` that are long enough:
Member

"that are long enough" might come across as meaning that the actual string name elements are long length-wise (also, ``names`` is singular, not plural, in this context). I think you can reword to make it clearer that we're talking about the array itself being long.

1 4 5 6 7
2 8 9 10 NaN
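The ``names``-based approach quoted above can be sketched as follows (the data and column labels here are illustrative, not taken from the PR):

```python
import io

import pandas as pd

# Illustrative ragged data: the second data row has an extra field.
data = "1,2,3\n4,5,6,7\n8,9,10"

# Pass as many ``names`` as the widest row has fields; shorter rows
# are padded with NaN on the right instead of being skipped.
df = pd.read_csv(io.StringIO(data), names=["a", "b", "c", "d"])
print(df)
```

Note that column ``d`` ends up as float because of the NaN padding, even though the one value it holds is an integer in the source.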

or you can use Python's built-in ``open`` function to detect the length of the widest row:
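A minimal sketch of that two-pass workaround, assuming the data lives in a string rather than a file (a real file would be scanned with ``open("data.csv")``; ``csv.reader`` handles the field splitting):

```python
import csv
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

# First pass: find the width of the widest row.
with io.StringIO(data) as f:
    width = max(len(row) for row in csv.reader(f))

# Second pass: read with enough column labels to hold every field.
df = pd.read_csv(io.StringIO(data), names=range(width))
print(df.shape)  # (3, 4)
```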
Member

I'm not sure you need to include this workaround. @jorisvandenbossche suggestion, which you added above, should suffice.

Member

Note that you can always over-specify the length and cut down later, which is a lot less painful than even this, even though I proposed it 😄
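A sketch of that over-specify-then-trim idea (the column labels and widths are illustrative):

```python
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

# Over-specify: six labels for rows that are at most four fields wide.
df = pd.read_csv(io.StringIO(data), names=[f"c{i}" for i in range(6)])

# Cut down later: the overflow columns are entirely NaN, so drop them.
df = df.dropna(axis=1, how="all")
print(df.shape)  # (3, 4)
```

``how="all"`` keeps column ``c3``, which is NaN in two rows but holds a value in the widest row.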

Author

I think the generation of arbitrary labels is generally an approach that should be avoided:

  • To others reviewing the code, the DataFrame, or a potentially exported file, it implies some level of purpose or customization in the label specification. They might seek to understand why these labels were used and whether there's a matching codebook or other DataFrames.
  • It's messy. Using a for loop to create strings for the number of variables necessary doesn't feel like something that should be required.

So I like including at least the first open example, because it's the only way to preserve all data without customizing header labels.

Member

Not really, in fact. Just use names = ['dummy'] * n, and we'll take care of the duplicates for you.

Author

@gfyoung Can you clarify? Where's the value for n coming from?

Member

@gfyoung gfyoung Sep 13, 2017

@joshjacobson : Replace n with whatever number of columns you want. This is a placeholder. My point is that it isn't very hard to specify a list of dummy names.
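A small sketch of the placeholder-names idea. One caveat: recent pandas versions raise on duplicate entries in ``names``, so this sketch uses distinct generated labels rather than ``['dummy'] * n``:

```python
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

n = 4  # placeholder: however many columns you expect at most

# Distinct placeholder labels; ["dummy"] * n only works on pandas
# versions that mangle duplicate names for you.
df = pd.read_csv(io.StringIO(data), names=[f"dummy{i}" for i in range(n)])
print(list(df.columns))  # ['dummy0', 'dummy1', 'dummy2', 'dummy3']
```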

@codecov

codecov bot commented Aug 30, 2017

Codecov Report

Merging #17385 into master will decrease coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #17385      +/-   ##
==========================================
- Coverage   91.03%   91.02%   -0.02%     
==========================================
  Files         163      163              
  Lines       49580    49580              
==========================================
- Hits        45137    45128       -9     
- Misses       4443     4452       +9
Flag Coverage Δ
#multiple 88.8% <ø> (ø) ⬆️
#single 40.26% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64c8a8d...2481db9.

Contributor

@jreback jreback left a comment

This should be a much simpler example; no need to show writing a new CSV. Just show how to use ``names``.

@jreback
Contributor

jreback commented Oct 28, 2017

Closing as stale. It's a worthwhile addition if you can respond to the comments (ping and we can reopen).

@jreback jreback closed this Oct 28, 2017
@jreback jreback added this to the No action milestone Oct 28, 2017
Labels
Docs IO CSV read_csv, to_csv
3 participants