I noticed in the documentation:
"
Specifying column data types
Starting with v0.10, you can indicate the data type for the whole DataFrame or individual columns:
"
This is indeed a great feature, especially for numeric code columns (e.g., identifiers with leading zeros) that would otherwise be parsed as integers and have their leading zeros stripped.
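(For context, a minimal sketch of the leading-zeros case; the 'code' column and data here are made up for illustration:)

import pandas as pd
import StringIO

# 'code' holds identifiers whose leading zeros must survive the parse
data = 'code,value\n00123,1\n00456,2'
codes = pd.read_csv(StringIO.StringIO(data), dtype={'code': object})
print codes['code'][0]   # '00123' -- kept as a string, zeros intact
print codes.dtypes       # code is object, value is inferred as int64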
From the example I was hoping I could specify a reduced-precision int or float for each column, but alas they are upcast to the 64-bit versions. :/ Except....
In [1]:
import numpy as np
import pandas as pd
import StringIO
data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': np.float32, 'c': np.int16})
df.dtypes
Out[1]: <-- upcast to 64-bit despite the requested dtypes
a object
b float64
c int64
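(As an aside, a quick way I check the actual element width; a minimal sketch assuming the df from In [1] above:)

# confirm the per-element storage width of each column
for col in df.columns:
    print col, df[col].dtype, df[col].values.dtype.itemsize, 'bytes/element'
# note: for an object column the itemsize is just the pointer width
# (8 bytes on a 64-bit build), not the size of the Python objects it points to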
The upcasting occurs even if I explicitly try to cast a column to a 32-bit float/int afterwards.
In [2]:
df['b'] = df['b'].astype(np.float32)
type(df['b'][0])
Out[2]: <-- upcast to 64-bit even with an explicit cast and assignment back into the DataFrame
numpy.float64
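(A minimal sketch to separate the two steps, in case the cast itself is fine and it is the assignment back into the DataFrame that upcasts; again assuming the df from In [1]:)

casted = df['b'].astype(np.float32)
print casted.dtype        # dtype of the standalone Series right after the cast
df['b'] = casted
print df['b'].dtype       # dtype the DataFrame reports after the column assignment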
However, if I start with the object dtype and then explicitly cast the column to float16/float32, everything seems to work.
In [3]:
data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': object, 'b': object, 'c': np.int16})
df.dtypes
Out[3]:
a object
b object
c int64 <-- upcast during read_csv even though int16 was requested
In [4]:
print type(df['b'][0])
df['b'][0]
<type 'str'> <-- the value is still an unparsed string object
Out[4]:
'2.3456789'
In [5]:
df['b'] = df['b'].astype(np.float16) <-- explicitly cast the object column to float16 and assign it back
In [6]:
print type(df['b'][0])
print df['b'][0]
print df.dtypes
<type 'numpy.float16'>
2.3457
a object
b float16 <-- Yay 16 bit!
c int64
<-- The object column was correctly cast to float16, with the expected rounding of the value (2.3456789 -> 2.3457).
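Generalizing the workaround, something like this is what I have in mind (a rough sketch; read_then_downcast and narrow_dtypes are my own hypothetical names, not pandas API):

import numpy as np
import pandas as pd
import StringIO

def read_then_downcast(buf, narrow_dtypes):
    # Read the columns we want to shrink as object so read_csv cannot upcast
    # them, then cast each one to the narrow dtype afterwards.
    df = pd.read_csv(buf, dtype=dict((col, object) for col in narrow_dtypes))
    for col, dt in narrow_dtypes.items():
        # assumes a string -> int cast behaves like the string -> float cast shown above
        df[col] = df[col].astype(dt)
    return df

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'
df2 = read_then_downcast(StringIO.StringIO(data), {'b': np.float16, 'c': np.int16})
print df2.dtypes   # hoping for a: object, b: float16, c: int16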
Now my next question: does this have any potentially bad memory implications? When converting many object columns to np.float32 or np.float16, does pandas properly allocate memory for the narrower values? I'm assuming an entirely new column is created and the old object column is simply freed. This could be a good workaround for me, since I often have several million rows of low-resolution real and integer columns that add extra overhead when stored/read/parsed/written in 64-bit-wide format.
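I suppose I can sanity-check the raw array sizes myself with something like this (a minimal sketch; nbytes only counts the array buffer, so for an object column it counts the 8-byte pointers on a 64-bit build, not the Python strings they point to):

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\nCat,2.3456789,3\nDog,5,6\nHat,8,9'
df = pd.read_csv(StringIO.StringIO(data), dtype={'b': object})

before = df['b'].values.nbytes        # object column: pointer-sized slots
df['b'] = df['b'].astype(np.float16)
after = df['b'].values.nbytes         # float16 column: 2 bytes per row

print 'column b: %d bytes -> %d bytes (array buffer only)' % (before, after)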
Thanks for any input,
-Gagi