
read_csv() option to parse numeric columns to np.float32 or np.float16 #2511


Closed
dragoljub opened this issue Dec 13, 2012 · 1 comment
Labels: Enhancement, IO Data

Comments

@dragoljub

I noticed in the documentation:

"
Specifying column data types
Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"

This is indeed a great feature, especially for numeric codes that would otherwise be parsed as integers with their leading zeros stripped.
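
For example, a code like "007" only survives intact when its column is read as object (a minimal sketch using Python 3's io.StringIO, unlike the Python 2 sessions below; the column name is made up):

import io
import pandas as pd

data = 'code\n007\n042'

# Default inference parses the column as int64 and strips the leading zeros.
print(pd.read_csv(io.StringIO(data))['code'].tolist())  # [7, 42]

# dtype=object keeps the raw strings intact.
print(pd.read_csv(io.StringIO(data), dtype={'code': object})['code'].tolist())  # ['007', '042']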

From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except....

In [1]:

import numpy as np
import pandas as pd
import StringIO  # Python 2 session; io.StringIO is the Python 3 equivalent

data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})

df.dtypes

Out[1]: <-- upcast to 64-bit

a object
b float64
c int64

The upcasting occurs even if I explicitly try to cast a column to a 32-bit float/int after reading.

In [2]:

df['b'] = df['b'].astype(np.float32)
type(df['b'][0])

Out[2]: <-- upcast to 64-bit even after an explicit astype and assignment.

numpy.float64

However, if I start with the object type and then explicitly cast the column to float16/float32, everything seems to work.

In [3]:

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})

df.dtypes

Out[3]:

a object
b object
c int64 <-- upcast during read_csv despite the int16 request

In [4]:

print type(df['b'][0])

df['b'][0]

<type 'str'> <-- the value is still an unparsed string

Out[4]:
'2.3456789'

In [5]:

df['b'] = df['b'].astype(np.float16) <-- explicitly cast the object column to float16

In [6]:

print type(df['b'][0])

print df['b'][0]

print df.dtypes

<type 'numpy.float16'>
2.3457
a object
b float16 <-- Yay 16 bit!
c int64

<-- The object column is correctly cast to float16, with the expected truncation of the value.
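
Putting the workaround together in one self-contained snippet (a sketch assuming Python 3's io.StringIO in place of the Python 2 session above; in much later pandas versions, pd.to_numeric(..., downcast='float') offers a similar shrink-after-read path):

import io
import numpy as np
import pandas as pd

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'

# Read the numeric column as object so read_csv cannot upcast it...
df = pd.read_csv(io.StringIO(data), dtype={'a': object, 'b': object})

# ...then cast it down explicitly.
df['b'] = df['b'].astype(np.float16)
df['c'] = df['c'].astype(np.int16)

print(df.dtypes)  # a: object, b: float16, c: int16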

Now my next question is: does this have any bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed from memory. This could be a good workaround for me, since I often have several million rows of low-resolution real and integer values that add extra overhead to store, read, parse, and write in 64-bit-wide format.
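
Since astype returns a brand-new array, the old object column should indeed be freed once the assignment drops the last reference to it. One way to see the width savings directly (a sketch; Series.memory_usage(deep=True) only exists in later pandas versions, and the object-column figure is as reported by sys.getsizeof, so exact numbers vary):

import numpy as np
import pandas as pd

s_obj = pd.Series(['2.3456789'] * 1000000, dtype=object)
s_f16 = s_obj.astype(np.float16)

# Object columns store 8-byte pointers plus the string objects themselves;
# float16 stores 2 bytes per value.
print(s_obj.memory_usage(deep=True))  # tens of MB
print(s_f16.memory_usage())           # ~2 MB of float16 data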

Thanks for any input,
-Gagi

@paulproteus

To anyone who wishes to fix this:

One reasonable next step here would be to write a test case, and see if the issue can be reproduced within the test suite.
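
A minimal sketch of such a test (the name and pytest-style layout are illustrative, not the pandas test-suite conventions of the time; it asserts that dtypes passed to read_csv survive the parse, which is exactly what fails above):

import io

import numpy as np
import pandas as pd

def test_read_csv_preserves_narrow_dtypes():
    data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
    df = pd.read_csv(io.StringIO(data),
                     dtype={'a': object, 'b': np.float32, 'c': np.int16})
    # Requested dtypes should come back unchanged, not upcast to 64-bit.
    assert df['a'].dtype == object
    assert df['b'].dtype == np.float32
    assert df['c'].dtype == np.int16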
