
DOC: Adding methods for "bad" lines that preserve all data #17385


Closed
wants to merge 3 commits into from

Conversation

@gfyoung gfyoung added Docs IO CSV read_csv, to_csv labels Aug 30, 2017
@@ -1175,6 +1175,80 @@ data that appear in some lines but not others:
0 1 2 3
1 4 5 6
2 8 9 10

Handling "bad" lines - preserving the data
Member

Not sure if you need this. I think one giant section will work.

Author

If we eliminate the open workaround, I agree. Otherwise it felt like the open workaround might be difficult to understand in context.

Member

Even so, you can add a transition sentence between your code blocks instead, noting that this ``open`` workaround is also for preserving data.

Handling "bad" lines - preserving the data
''''''''''''''''''''''''''''''''''''''''''

To preserve all data, you can specify header ``names`` that are long enough:
Member

"that are long enough" might come across as meaning that the actual string name elements are long length-wise (also, ``names`` is singular, not plural, in this context). I think you can reword to make it clearer that we're talking about the array itself being long.

1 4 5 6 7
2 8 9 10 NaN
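The ``names``-based approach quoted above can be sketched as follows (the data and column labels here are illustrative, not taken from the PR):

```python
import io

import pandas as pd

# Illustrative ragged data: the second data row has an extra field.
data = "1,2,3\n4,5,6,7\n8,9,10"

# Pass as many ``names`` as the widest row has fields; shorter rows
# are padded with NaN on the right instead of being skipped.
df = pd.read_csv(io.StringIO(data), names=["a", "b", "c", "d"])
print(df)
```

Note that column ``d`` ends up as float because of the NaN padding, even though the one value it holds is an integer in the source.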

or you can use Python's built-in ``open`` function to detect the length of the widest row:
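A minimal sketch of that two-pass workaround, assuming the data lives in a string rather than a file (a real file would be scanned with ``open("data.csv")``; ``csv.reader`` handles the field splitting):

```python
import csv
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

# First pass: find the width of the widest row.
with io.StringIO(data) as f:
    width = max(len(row) for row in csv.reader(f))

# Second pass: read with enough column labels to hold every field.
df = pd.read_csv(io.StringIO(data), names=range(width))
print(df.shape)  # (3, 4)
```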
Member

I'm not sure you need to include this workaround. @jorisvandenbossche suggestion, which you added above, should suffice.

Member

Note that you can always over-specify the length and cut down later, which is a lot less painful than even this, even though I proposed it 😄
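A sketch of that over-specify-then-trim idea (the column labels and widths are illustrative):

```python
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

# Over-specify: six labels for rows that are at most four fields wide.
df = pd.read_csv(io.StringIO(data), names=[f"c{i}" for i in range(6)])

# Cut down later: the overflow columns are entirely NaN, so drop them.
df = df.dropna(axis=1, how="all")
print(df.shape)  # (3, 4)
```

``how="all"`` keeps column ``c3``, which is NaN in two rows but holds a value in the widest row.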

Author

I think the generation of arbitrary labels is generally an approach that should be avoided:

  • To others reviewing the code, the DataFrame, or a potentially exported file, it implies some level of purpose or customization in the label specification. They might seek to understand why these labels were used and whether there's a matching codebook or other DataFrames.
  • It's messy. Using a for loop to create strings for the number of variables necessary doesn't feel like something that should be required.

So I like including at least the first open example, because it's the only way to preserve all data without customizing header labels.

Member

Not really, in fact. Just use names = ['dummy'] * n, and we'll take care of the duplicates for you.

Author

@gfyoung Can you clarify? Where's the value for n coming from?

Member

@gfyoung gfyoung Sep 13, 2017

@joshjacobson : Replace n with whatever number of columns you want. This is a placeholder. My point is that it isn't very hard to specify a list of dummy names.
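A small sketch of the placeholder-names idea. One caveat: recent pandas versions raise on duplicate entries in ``names``, so this sketch uses distinct generated labels rather than ``['dummy'] * n``:

```python
import io

import pandas as pd

data = "1,2,3\n4,5,6,7\n8,9,10"

n = 4  # placeholder: however many columns you expect at most

# Distinct placeholder labels; ["dummy"] * n only works on pandas
# versions that mangle duplicate names for you.
df = pd.read_csv(io.StringIO(data), names=[f"dummy{i}" for i in range(n)])
print(list(df.columns))  # ['dummy0', 'dummy1', 'dummy2', 'dummy3']
```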

@codecov

codecov bot commented Aug 30, 2017

Codecov Report

Merging #17385 into master will decrease coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #17385      +/-   ##
==========================================
- Coverage   91.03%   91.02%   -0.02%     
==========================================
  Files         163      163              
  Lines       49580    49580              
==========================================
- Hits        45137    45128       -9     
- Misses       4443     4452       +9
Flag Coverage Δ
#multiple 88.8% <ø> (ø) ⬆️
#single 40.26% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64c8a8d...2481db9.

Contributor

@jreback jreback left a comment

This should be a much simpler example; no need to show writing a new CSV. Just show how to use ``names``.

@jreback
Contributor

jreback commented Oct 28, 2017

Closing as stale. It's a worthwhile addition if you can respond to the comments (ping and we can reopen).

@jreback jreback closed this Oct 28, 2017
@jreback jreback added this to the No action milestone Oct 28, 2017
Labels
Docs IO CSV read_csv, to_csv
3 participants