DOC: Adding methods for "bad" lines that preserve all data #17385

Closed · wants to merge 3 commits
doc/source/io.rst: 76 changes (75 additions, 1 deletion)

@@ -1130,7 +1130,7 @@ options:

.. _io.bad_lines:

Handling "bad" lines
Handling "bad" lines - excluding the data
''''''''''''''''''''

Some files may have malformed lines with too few fields or too many. Lines with
@@ -1175,6 +1175,80 @@ data that appear in some lines but not others:
0 1 2 3
1 4 5 6
2 8 9 10

Handling "bad" lines - preserving the data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you need this. I think one giant section will work.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we eliminate the open workaround, I agree. Otherwise it felt like the open workaround might be difficult to understand in context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even still, you can add a transition sentence instead between your code-blocks, adding this open workaround is also for preserving data.

''''''''''''''''''''

To preserve all data, you can specify a sufficient number of header ``names``:

.. code-block:: ipython

   In [31]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])

   Out[31]:
      a  b   c    d
   0  1  2   3  NaN
   1  4  5   6    7
   2  8  9  10  NaN

or you can use Python's built-in ``open`` function to detect the length of the widest row:
Member: I'm not sure you need to include this workaround. @jorisvandenbossche's suggestion, which you added above, should suffice.

Member: Note that you can always over-specify the number of ``names`` and cut down later, which is a lot less suffering than even this, even though I proposed it 😄
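A minimal sketch of this over-specify-then-trim idea (the headerless ``data`` string, the placeholder names, and the ``dropna`` trim are illustrative assumptions, not part of the PR):

.. code-block:: python

   import pandas as pd
   from io import StringIO

   # hypothetical headerless input whose second line is one field too wide
   data = '1,2,3\n4,5,6,7\n8,9,10'

   # over-specify: pass more names than any row could need;
   # missing trailing fields are filled with NaN
   df = pd.read_csv(StringIO(data), names=['dummy%d' % i for i in range(10)])

   # cut down: drop the columns that came back entirely empty
   df = df.dropna(axis=1, how='all')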

Author: I think the generation of arbitrary labels is generally an approach that should be avoided:

  • To others reviewing the code, the DataFrame, or a potentially exported file, it implies some level of purpose or customization in the label specification. They might seek to understand why these labels were used, and whether there's a matching codebook or other DataFrames.
  • It's messy. Using a for loop to create strings for however many variables are needed doesn't feel like something that should be required.

So I like including at least the first ``open`` example, because it's the only way to preserve all data without customizing header labels.

Member: Not really, in fact. Just use ``names = ['dummy'] * n``, and we'll take care of the duplicates for you.

Author: @gfyoung Can you clarify? Where's the value for ``n`` coming from?

gfyoung (Member) · Sep 13, 2017: @joshjacobson: Replace ``n`` with whatever number of columns you want. It's a placeholder. My point is that it isn't very hard to specify a list of dummy names.
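To illustrate the point with a short sketch (``n`` is the comment's placeholder; note that duplicate-``names`` handling is version-dependent, and recent pandas versions reject duplicate ``names`` outright, so distinct placeholders are the safer spelling):

.. code-block:: python

   n = 4  # placeholder: at least as many columns as the widest row

   # as suggested: duplicate names, which the parser deduplicated at the time
   # (newer pandas raises ValueError on duplicate names)
   names = ['dummy'] * n

   # equivalent one-liner with distinct labels, safe on any version
   names = ['dummy%d' % i for i in range(n)]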


.. code-block:: ipython

   In [32]:
   import csv

   with open('data.csv', newline='') as f:
       reader = csv.reader(f)
       max_width = 0
       for row in reader:
           # csv.reader yields each row as a list of fields
           if len(row) > max_width:
               max_width = len(row)

and then choose to edit the csv itself:

.. code-block:: ipython

   In [32] (cont'd):

   # re-read the file, padding short rows with empty fields
   with open('data.csv', newline='') as f:
       reader = csv.reader(f)
       amended_rows = []
       for row in reader:
           row.extend([''] * (max_width - len(row)))
           amended_rows.append(row)

   # write the padded rows back out
   with open('data.csv', 'w', newline='') as f:
       writer = csv.writer(f)
       writer.writerows(amended_rows)

   pd.read_csv('data.csv')

   Out[32]:
      a  b   c    d
   0  1  2   3  NaN
   1  4  5   6    7
   2  8  9  10  NaN

or to specify ``names`` based on the length of the widest row:

.. code-block:: ipython

   In [32] (cont'd):

   # build one label per column: 'c1', 'c2', ...
   col_labels = []
   for col_num in range(1, max_width + 1):
       col_labels.append('c' + str(col_num))

   pd.read_csv('data.csv', names=col_labels)

   Out[32]:
      c1  c2  c3   c4
   0   1   2   3  NaN
   1   4   5   6    7
   2   8   9  10  NaN

.. _io.dialect:
