
read_csv() option to parse numeric columns to np.float32 or np.float16 #2511


Closed
dragoljub opened this issue Dec 13, 2012 · 1 comment
Labels: Enhancement, IO Data

Comments

@dragoljub

I noticed in the documentation:

"
Specifying column data types
Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"

This is indeed a great feature, especially for numeric codes that would otherwise be parsed as integers with their leading zeros stripped.
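
For example, a code like "007" only survives intact when its column is read as object (a minimal sketch using Python 3's io.StringIO, unlike the Python 2 sessions below; the column name is made up):

import io
import pandas as pd

data = 'code\n007\n042'

# Default inference parses the column as int64 and strips the leading zeros.
print(pd.read_csv(io.StringIO(data))['code'].tolist())  # [7, 42]

# dtype=object keeps the raw strings intact.
print(pd.read_csv(io.StringIO(data), dtype={'code': object})['code'].tolist())  # ['007', '042']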

From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except....

In [1]:

import numpy as np
import pandas as pd
import StringIO  # Python 2 session; io.StringIO is the Python 3 equivalent

data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})

df.dtypes

Out[1]: <-- upcast to 64-bit

a object
b float64
c int64

The upcasting occurs even if I explicitly try to cast a column to a 32-bit float/int after reading.

In [2]:

df['b'] = df['b'].astype(np.float32)
type(df['b'][0])

Out[2]: <-- upcast to 64-bit even after an explicit astype and assignment.

numpy.float64

However, if I start with the object type and then explicitly cast the column to float16/float32, everything seems to work.

In [3]:

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})

df.dtypes

Out[3]:

a object
b object
c int64 <-- upcast during read_csv despite the int16 request

In [4]:

print type(df['b'][0])

df['b'][0]

<type 'str'> <-- the value is still an unparsed string

Out[4]:
'2.3456789'

In [5]:

df['b'] = df['b'].astype(np.float16) <-- explicitly cast the object column to float16

In [6]:

print type(df['b'][0])

print df['b'][0]

print df.dtypes

<type 'numpy.float16'>
2.3457
a object
b float16 <-- Yay 16 bit!
c int64

<-- The object column is correctly cast to float16, with the expected truncation of the value.
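
Putting the workaround together in one self-contained snippet (a sketch assuming Python 3's io.StringIO in place of the Python 2 session above; in much later pandas versions, pd.to_numeric(..., downcast='float') offers a similar shrink-after-read path):

import io
import numpy as np
import pandas as pd

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

# Read the numeric column as object so read_csv cannot upcast it...
df = pd.read_csv(io.StringIO(data), dtype={'a': object, 'b': object})

# ...then cast it down explicitly.
df['b'] = df['b'].astype(np.float16)
df['c'] = df['c'].astype(np.int16)

print(df.dtypes)  # a: object, b: float16, c: int16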

Now my next question is: does this have any bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed from memory. This could be a good workaround for me, since I often have several million rows of low-resolution real and integer values that add extra overhead to store, read, parse, and write in 64-bit-wide format.
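
Since astype returns a brand-new array, the old object column should indeed be freed once the assignment drops the last reference to it. One way to see the width savings directly (a sketch; Series.memory_usage(deep=True) only exists in later pandas versions, and the object-column figure is as reported by sys.getsizeof, so exact numbers vary):

import numpy as np
import pandas as pd

s_obj = pd.Series(['2.3456789'] * 1000000, dtype=object)
s_f16 = s_obj.astype(np.float16)

# Object columns store 8-byte pointers plus the string objects themselves;
# float16 stores 2 bytes per value.
print(s_obj.memory_usage(deep=True))  # tens of MB
print(s_f16.memory_usage())           # ~2 MB of float16 data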

Thanks for any input,
-Gagi

@paulproteus

To anyone who wishes to fix this:

One reasonable next step here would be to write a test case, and see if the issue can be reproduced within the test suite.
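
A minimal sketch of such a test (the name and pytest-style layout are illustrative, not the pandas test-suite conventions of the time; it asserts that dtypes passed to read_csv survive the parse, which is exactly what fails above):

import io

import numpy as np
import pandas as pd

def test_read_csv_preserves_narrow_dtypes():
    data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
    df = pd.read_csv(io.StringIO(data),
                     dtype={'a': object, 'b': np.float32, 'c': np.int16})
    # Requested dtypes should come back unchanged, not upcast to 64-bit.
    assert df['a'].dtype == object
    assert df['b'].dtype == np.float32
    assert df['c'].dtype == np.int16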
