DOC: Adding methods for "bad" lines that preserve all data #17385
|
@@ -1130,7 +1130,7 @@ options: | |
|
||
.. _io.bad_lines: | ||
|
||
Handling "bad" lines | ||
Handling "bad" lines - excluding the data | ||
''''''''''''''''''''''''''''''''''''''''' | ||
|
||
Some files may have malformed lines with too few fields or too many. Lines with | ||
|
@@ -1175,6 +1175,80 @@ data that appear in some lines but not others: | |
0 1 2 3 | ||
1 4 5 6 | ||
2 8 9 10 | ||
|
||
Handling "bad" lines - preserving the data | ||
'''''''''''''''''''''''''''''''''''''''''' | ||
|
||
To preserve all data, you can specify a sufficient number of header ``names``: | ||
|
||
.. code-block:: ipython | ||
|
||
In [31]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) | ||
|
||
Out[31]: | ||
a b c d | ||
0 1 2 3 NaN | ||
1 4 5 6 7 | ||
2 8 9 10 NaN | ||
|
||
or you can use Python's built-in ``open`` function with the ``csv`` module to find the width of the widest row: | ||
I'm not sure you need to include this workaround. @jorisvandenbossche suggestion, which you added above, should suffice.

Note, that you can always over-specify the length and cut down later, which is a lot less suffering even than this, even though I proposed it 😄

I think the generation of arbitrary labels is generally an approach that should be avoided:

So I like including at least the first

Not really in fact. Just use

@gfyoung Can you clarify? Where's the value for

@joshjacobson : Replace
||
|
||
.. code-block:: ipython | ||
|
||
In [32]: | ||
import csv

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    max_width = 0
    for row in reader:
        # Each row is a list of fields, so its length is the field count.
        length = len(row)
        if length > max_width:
            max_width = length
|
||
and then choose to edit the CSV file itself, padding short rows with empty fields: | ||
|
||
.. code-block:: ipython | ||
|
||
In [32] (cont'd): | ||
|
||
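# Re-read the rows: the earlier reader is exhausted and its file is closed.
with open('data.csv', newline='') as f:
    rows = list(csv.reader(f))

# Pad short rows with empty fields up to the width of the widest row.
amended_rows = [row + [''] * (max_width - len(row)) for row in rows]

# Write the padded rows back to the same file.
with open('data.csv', 'w', newline='') as f:
    csv.writer(f).writerows(amended_rows)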
|
||
pd.read_csv('data.csv') | ||
|
||
Out[32]: | ||
a b c d | ||
0 1 2 3 NaN | ||
1 4 5 6 7 | ||
2 8 9 10 NaN | ||
|
||
or to specify ``names`` based on the length of the widest row: | ||
|
||
.. code-block:: ipython | ||
|
||
In [32] (cont'd): | ||
|
||
# Generate labels c1..cN, one per field of the widest row.
col_labels = ['c' + str(col_num) for col_num in range(1, max_width + 1)]
|
||
pd.read_csv('data.csv', names=col_labels) | ||
|
||
Out[32]: | ||
c1 c2 c3 c4 | ||
0 1 2 3 NaN | ||
1 4 5 6 7 | ||
2 8 9 10 NaN | ||
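For reference, the widest-row approach can be tried without a file on disk. The following is only a minimal sketch: the in-memory sample (three rows, no header line) and the ``c1``..``cN`` labels are assumptions made for illustration, not part of the proposed docs.

```python
import csv
from io import StringIO

import pandas as pd

# Assumed sample: no header line, and the second row has one extra field.
data = "1,2,3\n4,5,6,7\n8,9,10"

# csv.reader yields each row as a list of fields, so len(row) is its width.
max_width = max(len(row) for row in csv.reader(StringIO(data)))

# One generated label per field of the widest row.
col_labels = ["c{}".format(i) for i in range(1, max_width + 1)]

# header=None states explicitly that the first line is data, not a header.
df = pd.read_csv(StringIO(data), names=col_labels, header=None)
print(df)
```

Rows shorter than the widest one are filled with ``NaN`` in the trailing columns, so no field is dropped.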
|
||
.. _io.dialect: | ||
|
||
|
Not sure if you need this. I think one giant section will work.

If we eliminate the `open` workaround, I agree. Otherwise it felt like the `open` workaround might be difficult to understand in context.

Even so, you can instead add a transition sentence between your code blocks, noting that this `open` workaround is also for preserving data.
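The over-specification idea raised earlier in this review (pass more `names` than any row could need, then cut down) might look roughly like the sketch below. The sample data, the upper bound of 10 columns, and the `c1`..`c10` labels are all assumptions chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Assumed sample: no header line, and the second row has one extra field.
data = "1,2,3\n4,5,6,7\n8,9,10"

# Deliberately request more columns than any row should have (10 is a guess).
names = ["c{}".format(i) for i in range(1, 11)]
wide = pd.read_csv(StringIO(data), names=names, header=None)

# Cut down afterwards: drop columns that no row ever filled (all NaN).
trimmed = wide.dropna(axis=1, how="all")
print(trimmed)
```

One caveat with this variant: a real column that happens to be entirely empty would be dropped as well, so it suits cases where an upper bound on the row width is known and empty columns carry no meaning.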