Description
When storing a DataFrame in an HDF5 table, an error is raised if the frame has multiple object columns that contain different types of data.
import pandas as pd
index = [0, 1, 2]  # assumed placeholder: `index` was not defined in the original snippet
timestamps = [pd.Timestamp('2014-1-%d 12:00' % day, tz='UTC') for day in (1, 2, 3)]
data = {'ints': pd.Series([1, 2, 3], index=index), 'Timestamps': pd.Series(timestamps, index=index), 'strings': pd.Series(['r', 'g', 'b'], index=index)}
df = pd.DataFrame(data)
df.to_hdf('test.h5', key='data', format='table')  # raises the TypeError shown below
This leads to an exception:
TypeError: Cannot serialize the column [Timestamps] because
its data contents are [datetime] object dtype
However, if I remove the string column:
del df['strings']
df.to_hdf('test.h5', key='data', format='table')
Now it works fine, so the pd.Timestamp type itself is not the problem.
Digging a little deeper, it appears the problem is that pandas.io.pytables.Table.create_axes groups the columns by data type, with all columns of type object being grouped into one set of data. Then when set_atom is called, it does this:
rvalues = block.values.ravel()
inferred_type = lib.infer_dtype(rvalues)
Because the block holds more than one kind of object, the inferred type is 'mixed'. That case isn't handled, so the exception is raised.
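The inference step can be reproduced in isolation. This is a minimal sketch, assuming the public pandas.api.types.infer_dtype behaves like the internal lib.infer_dtype quoted above. The object arrays are built by hand because recent pandas versions may not store tz-aware timestamps as object dtype, and the names timestamps, strings, and block_values are illustrative only:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Two object "columns", built by hand (illustrative names, not pandas internals).
timestamps = np.array(
    [pd.Timestamp('2014-1-%d 12:00' % day, tz='UTC') for day in (1, 2, 3)],
    dtype=object,
)
strings = np.array(['r', 'g', 'b'], dtype=object)

# Stack them into one 2-D object array, mimicking a single object block,
# then flatten it the same way the quoted code does.
block_values = np.vstack([timestamps, strings])
rvalues = block_values.ravel()

timestamps_type = infer_dtype(timestamps)  # inferred from the timestamps alone
strings_type = infer_dtype(strings)        # inferred from the strings alone
block_type = infer_dtype(rvalues)          # inferred from both together

print(timestamps_type, strings_type, block_type)
```

Each column on its own gets a specific inferred type, while the flattened block is reported as 'mixed'.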
As a fix, it seems that each object column should be handled separately, or at least grouped by the inferred type. I haven't contributed to pandas before or dug this deeply into this section of code, so I'm not sure of the best way to fix this or what other implications there may be, but I'd be happy to help however I can.
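To illustrate the grouping idea, here is a rough sketch. It is not how pandas implements anything: group_object_columns is a hypothetical helper, it uses the public pandas.api.types.infer_dtype, and the object dtype is forced explicitly so the example behaves the same across pandas versions:

```python
from collections import defaultdict

import pandas as pd
from pandas.api.types import infer_dtype


def group_object_columns(df):
    """Map each inferred type to the object-dtype columns that share it.

    Hypothetical helper for illustration; not part of pandas.
    """
    groups = defaultdict(list)
    for name in df.columns:
        column = df[name]
        if column.dtype != object:
            continue  # only object columns need type inference
        groups[infer_dtype(column.to_numpy())].append(name)
    return dict(groups)


index = [0, 1, 2]  # placeholder index
timestamps = [pd.Timestamp('2014-1-%d 12:00' % day, tz='UTC') for day in (1, 2, 3)]
df = pd.DataFrame({
    'ints': pd.Series([1, 2, 3], index=index),
    'Timestamps': pd.Series(timestamps, index=index, dtype=object),
    'strings': pd.Series(['r', 'g', 'b'], index=index, dtype=object),
})

groups = group_object_columns(df)
print(groups)
```

With this grouping, the timestamp column and the string column land in separate groups, so neither group would be inferred as 'mixed'.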