read_csv() result column holds same item with different types from 'c' engine only #4681

floux · 2013-08-26T22:33:32Z

dupe of #3866

After read_csv(), a single column of dtype object holds the same value under different types (for example, both the integer 135 and the string '135'). This breaks later operations that rely on a value having one type across all its occurrences, such as groupby.

In the example below, nothing in the file data.csv is quoted. I have tried to reproduce this with a simulated dataset but have not managed to so far. The problem also disappears if I read only a subset of the 3.5 million rows of data.csv.

The problem also doesn't exist when the 'python' engine is used instead of the 'c' engine, see below.

In [204]: df = pd.read_csv('data.csv', index_col=0)

In [205]: df
Out[205]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774283 entries, 0 to 3774282
Columns: 4 entries, sym to name
dtypes: int64(1), object(3)

In [206]: df.dtypes
Out[206]:
sym     object
id       int64
pid     object
name    object
dtype: object

In [207]: df[df.pid == 135].shape # <-- number 135
Out[207]: (311, 4)

In [208]: df[df.pid == '135'].shape # <-- string '135'
Out[208]: (74, 4)

# note: `grouped` is not defined in the session as pasted; presumably grouped = df.groupby('pid')
In [209]: for n, g in grouped:
    if n in ('135', 135):
        print(n, g.shape)
   .....:
(135, (311, 4))
('135', (74, 4))

In [210]: min(df[df.pid.isin([135, '135'])].index)
Out[210]: 1966006

# get 4000 lines around the area of the 385 entries with pid=135
In [211]: !head -n 1968006 data.csv | tail -n4000 > small.csv

In [212]: df = pd.read_csv('small.csv', index_col=0, names=['sym', 'id', 'pid', 'name'])

In [213]: df[df.pid == 135].shape
Out[213]: (385, 4)

In [214]: df[df.pid == '135'].shape
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-070d5748621c> in <module>()
----> 1 df[df.pid == '135'].shape

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self, other)
    238             if np.isscalar(res):
    239                 raise TypeError('Could not compare %s type with Series'
--> 240                                 % type(other))
    241             return Series(na_op(values, other),
    242                           index=self.index, name=self.name)

TypeError: Could not compare <type 'str'> type with Series

In [215]: df.dtypes
Out[215]:
sym     object
id       int64
pid      int64
name    object
dtype: object

In [216]: df = pd.read_csv('data.csv', index_col=0, engine='python')

In [217]: df[df.pid == 135].shape
Out[217]: (0, 4)

In [218]: df[df.pid == '135'].shape
Out[218]: (385, 4)
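
To check whether a column really mixes types, it can help to count the Python types of its values directly. The snippet below is only a minimal sketch: the small `pid` Series is a hypothetical stand-in for the real column (data.csv is not available here), and normalizing everything to `str` is just one possible choice, assuming the column is meant to be textual.

```python
import pandas as pd

# Hypothetical stand-in for the 'pid' column from the report:
# the same value appears both as an int and as a str.
pid = pd.Series([135, "135", 135, "abc"], dtype=object, name="pid")

# Count how many values of each Python type the column holds.
type_counts = pid.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Assumption: the column should be textual, so cast every value to str.
# After the cast, 135 and '135' compare equal to the string '135'.
pid_as_str = pid.astype(str)
matches = int((pid_as_str == "135").sum())
print(matches)
```

If more than one type name shows up in `type_counts`, comparisons and groupby on that column will split what looks like one value into several groups, as in the session above.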


jreback · 2013-08-26T22:42:08Z

@floux see #3866; this is a known issue that just hasn't been fixed yet, and that issue includes a way to reproduce it

thanks for the report

it's not that hard to fix, it just needs some TLC
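
Until a fix lands, one possible workaround (not something confirmed in this thread) is to tell read_csv which type the affected column should have, so that no per-value inference happens for it. The sketch below assumes `pid` is meant to be a string column and uses a tiny in-memory CSV as a hypothetical stand-in for data.csv.

```python
import io

import pandas as pd

# Hypothetical miniature of data.csv; the real file is not available.
csv_text = "sym,id,pid,name\nA,1,135,foo\nB,2,abc,bar\nC,3,135,baz\n"

# Assumption: 'pid' should be textual, so request str for that column
# instead of letting the parser infer a type.
df = pd.read_csv(io.StringIO(csv_text), dtype={"pid": str})

# Every value in 'pid' should now be a str, including the numeric-looking ones.
pid_types = set(df["pid"].map(lambda v: type(v).__name__))
n_matches = int((df["pid"] == "135").sum())
print(pid_types, n_matches)
```

Passing `low_memory=False` to read_csv is another option that is sometimes suggested for mixed-type columns; whether it helps in this particular case is an assumption and has not been checked against the file from the report.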

jreback · 2013-09-28T19:25:53Z

closing as a dupe of #3866

jreback mentioned this issue Aug 27, 2013

BUG: read_csv dtype inferrence is inconsistent for string columns with some integers #4691

Closed

jreback closed this as completed Sep 28, 2013
