Commit 40ad7db

DOC: Document behaviour for keep_whitespace and whitespace_chars options
to `read_fwf`. (pandas-dev#51659) Signed-off-by: Ronald Barnes <[email protected]>
1 parent 25bf583 commit 40ad7db

1 file changed

+63
-25
lines changed

doc/source/user_guide/io.rst

Lines changed: 63 additions & 25 deletions
@@ -1006,7 +1006,7 @@ first read it in as an object dtype and then apply :func:`to_datetime` to each e
10061006

10071007
.. ipython:: python
10081008
1009-
data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
1009+
data = StringIO("date\n12 Jan 2000\n2000-01-13\n")
10101010
df = pd.read_csv(data)
10111011
df['date'] = df['date'].apply(pd.to_datetime)
10121012
df
@@ -1373,8 +1373,7 @@ Files with fixed width columns
13731373

13741374
While :func:`read_csv` reads delimited data, the :func:`read_fwf` function works
13751375
with data files that have known and fixed column widths. The function parameters
1376-
to ``read_fwf`` are largely the same as ``read_csv`` with two extra parameters, and
1377-
a different usage of the ``delimiter`` parameter:
1376+
to ``read_fwf`` are largely the same as ``read_csv`` with five extra parameters:
13781377

13791378
* ``colspecs``: A list of pairs (tuples) giving the extents of the
13801379
fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
@@ -1383,12 +1382,46 @@ a different usage of the ``delimiter`` parameter:
13831382
behavior, if not specified, is to infer.
13841383
* ``widths``: A list of field widths which can be used instead of 'colspecs'
13851384
if the intervals are contiguous.
1386-
* ``delimiter``: Characters to consider as filler characters in the fixed-width file.
1387-
Can be used to specify the filler character of the fields
1388-
if it is not spaces (e.g., '~').
1385+
* ``keep_whitespace``: A boolean or a tuple(bool,bool) indicating how whitespace
1386+
at the (start,end) of each field / column should be handled.
1387+
* ``whitespace_chars``: A string of characters to strip from the start and/or end
1388+
of fields / columns when 'keep_whitespace' contains a False value.
1389+
* ``delimiter``: Character(s) separating columns when inferring 'colspecs'.
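
As a minimal sketch of how ``colspecs`` and ``widths`` relate, the example below reads the same text both ways. It uses only parameters that exist in released pandas and deliberately avoids ``keep_whitespace`` and ``whitespace_chars``, which this commit proposes and which may not be available in an installed version. The sample data and variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Illustrative data: a 6-character id, 2 spaces of padding, a 10-character number.
data = (
    "id8141  360.242940\n"
    "id1594  444.953632\n"
)

# Explicit half-open intervals; the 2-character gap is simply not read.
by_colspecs = pd.read_fwf(StringIO(data), colspecs=[(0, 6), (8, 18)], header=None)

# Contiguous widths: the padding is folded into the first field (6 + 2 = 8).
by_widths = pd.read_fwf(StringIO(data), widths=[8, 10], header=None)

print(by_colspecs)
print(by_colspecs.equals(by_widths))
```

Under the default whitespace handling of released pandas, the padding absorbed by the wider first field is stripped, so the two frames should compare equal.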
13891390

13901391
Consider a typical fixed-width data file:
13911392

1393+
.. ipython:: python
1394+
1395+
data = (
1396+
"name1 VANBCCAN 107.51 46 B 8 E \n"
1397+
"name2 BBYBCCAN* 20.00 5 1 5 7 F E\n"
1398+
"fullname 3VICBCCAN 22.50 3 1 C 5\n"
1399+
)
1400+
df = pd.read_fwf(StringIO(data),
1401+
header=None,
1402+
widths=[10,3,2,3,1,6,3,12],
1403+
keep_whitespace=(True,False),
1404+
names=["Name", "City", "Prov", "Country", "Deleted",
1405+
"TransAvg", "TransCount", "CreditScores"],
1406+
# Do not convert field data to NaN:
1407+
na_filter=False,
1408+
)
1409+
df
1410+
df.values
1411+
1412+
Note that the name field had trailing whitespace removed, as
1413+
did the other text fields. However, the *leading* whitespace in CreditScores was
1414+
preserved.
1415+
1416+
This is due to the ``keep_whitespace`` setting of ``(True, False)``, representing (start, end), and the
1417+
``whitespace_chars`` default of ``' '`` and ``'\t'`` (space and tab).
1418+
1419+
The TransAvg and TransCount fields had automatic dtype conversion to
1420+
float64 and int64 respectively.
1421+
1422+
1423+
Parsing a table is possible (see also ``read_table``):
1424+
13921425
.. ipython:: python
13931426
13941427
data1 = (
@@ -1398,52 +1431,57 @@ Consider a typical fixed-width data file:
13981431
"id1230 413.836124 184.375703 11916.8\n"
13991432
"id1948 502.953953 173.237159 12468.3"
14001433
)
1401-
with open("bar.csv", "w") as f:
1402-
f.write(data1)
14031434
1404-
In order to parse this file into a ``DataFrame``, we simply need to supply the
1405-
column specifications to the ``read_fwf`` function along with the file name:
1435+
In order to parse this data set into a ``DataFrame``, we simply need to supply the
1436+
column specifications to the ``read_fwf`` function:
14061437

14071438
.. ipython:: python
14081439
14091440
# Column specifications are a list of half-intervals
14101441
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
1411-
df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0)
1442+
df = pd.read_fwf(StringIO(data1),
1443+
colspecs=colspecs,
1444+
header=None,
1445+
index_col=0
1446+
)
14121447
df
14131448
14141449
Note how the parser automatically picks column names X.<column number> when
1415-
``header=None`` argument is specified. Alternatively, you can supply just the
1416-
column widths for contiguous columns:
1417-
1418-
.. ipython:: python
1419-
1420-
# Widths are a list of integers
1421-
widths = [6, 14, 13, 10]
1422-
df = pd.read_fwf("bar.csv", widths=widths, header=None)
1423-
df
1450+
``header=None`` argument is specified.
14241451

1425-
The parser will take care of extra white spaces around the columns
1426-
so it's ok to have extra separation between the columns in the file.
1452+
The parser will take care of extra whitespace around the numeric data columns, and
1453+
trailing spaces on string data, so it's ok to have extra separation between the columns
1454+
in the file.
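
The sketch below illustrates this padding tolerance. It assumes the default behaviour of released pandas versions, where padding around each field is stripped before type inference. The data is hypothetical, and no options introduced by this commit are used.

```python
from io import StringIO

import pandas as pd

# Illustrative data: a left-aligned text field (6 wide) followed by a
# right-aligned numeric field (8 wide), both padded with spaces.
data = (
    "abc       1.50\n"
    "de       22.25\n"
)

df = pd.read_fwf(StringIO(data), widths=[6, 8], header=None)

# Text values come back without their trailing padding, and the numeric
# column is parsed as float despite its leading padding.
print(df[0].tolist())
print(df[1].tolist())
print(df.dtypes)
```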
14271455

14281456
By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using the
14291457
first 100 rows of the file. It can do it only in cases when the columns are
14301458
aligned and correctly separated by the provided ``delimiter`` (default delimiter
14311459
is whitespace).
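
A small, hedged illustration of inference on whitespace-aligned columns follows. The data is invented for the example, and ``infer_nrows`` is passed explicitly only to make the 100-row default visible.

```python
from io import StringIO

import pandas as pd

# Illustrative data: three aligned columns separated by runs of spaces.
data = (
    "id8141  360.242940  149.910199\n"
    "id1594  444.953632  166.985655\n"
)

# No colspecs or widths are given, so read_fwf infers the column
# boundaries from the first `infer_nrows` rows.
inferred = pd.read_fwf(StringIO(data), header=None, infer_nrows=100)

print(inferred)
print(inferred.dtypes)
```

If the columns were not aligned across rows, the inferred boundaries could merge or split fields, in which case explicit ``colspecs`` or ``widths`` are the safer choice.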
14321460

1461+
14331462
.. ipython:: python
14341463
1435-
df = pd.read_fwf("bar.csv", header=None, index_col=0)
1464+
df = pd.read_fwf(StringIO(data1),
1465+
header=None,
1466+
index_col=0
1467+
)
14361468
df
14371469
14381470
``read_fwf`` supports the ``dtype`` parameter for specifying the types of
14391471
parsed columns to be different from the inferred type.
14401472

14411473
.. ipython:: python
14421474
1443-
pd.read_fwf("bar.csv", header=None, index_col=0).dtypes
1444-
pd.read_fwf("bar.csv", header=None, dtype={2: "object"}).dtypes
1475+
pd.read_fwf(StringIO(data1),
1476+
header=None,
1477+
index_col=0).dtypes
1478+
1479+
pd.read_fwf(StringIO(data1),
1480+
header=None,
1481+
dtype={2: "object"}).dtypes
14451482
14461483
.. ipython:: python
1484+
:okexcept:
14471485
:suppress:
14481486
14491487
os.remove("bar.csv")
