
read_csv() result column holds same item with different types from 'c' engine only #4681


Closed
floux opened this issue Aug 26, 2013 · 2 comments
Labels: Duplicate Report (duplicate issue or pull request), IO CSV (read_csv, to_csv)


floux commented Aug 26, 2013

dupe of #3866

The same item occurs in a single column of dtype object with different types (int in some rows, str in others). This can cause problems in later operations, such as groupby, that rely on every occurrence of an item having the same type.
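To make the effect on groupby concrete, here is a minimal sketch with made-up data (the frame below is hypothetical and is not taken from data.csv). It assumes only that an object column holds the same identifier both as an int and as a str:

```python
import pandas as pd

# Hypothetical data: the identifier 135 is stored as int in some rows
# and as the string '135' in others, inside one object-dtype column.
df = pd.DataFrame({
    "pid": pd.Series([135, "135", 135, "135", 135], dtype=object),
    "name": ["a", "b", "c", "d", "e"],
})

# Grouping on the raw column treats 135 and '135' as different keys.
# sort=False avoids ordering keys of different types.
split_sizes = df.groupby("pid", sort=False).size()

# Normalizing the column to one type first merges the two groups.
merged_sizes = df.assign(pid=df["pid"].astype(str)).groupby("pid").size()

print(split_sizes)
print(merged_sizes)
```

Casting to str is only one possible normalization; whether str or int is the right target depends on what else the column contains.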

In the example below, nothing in the file data.csv is quoted. I have tried to reproduce this with a simulated dataset but have not managed to so far. The problem disappears if I read only a subset of the roughly 3.8 million rows of data.csv (3,774,283 entries).

The problem also does not occur when the 'python' engine is used instead of the 'c' engine: there, every pid is read as a string. See below.

In [204]: df = pd.read_csv('data.csv', index_col=0)

In [205]: df
Out[205]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774283 entries, 0 to 3774282
Columns: 4 entries, sym to name
dtypes: int64(1), object(3)

In [206]: df.dtypes
Out[206]:
sym     object
id       int64
pid     object
name    object
dtype: object

In [207]: df[df.pid == 135].shape # <-- number 135
Out[207]: (311, 4)

In [208]: df[df.pid == '135'].shape # <-- string '135'
Out[208]: (74, 4)

In [209]: for n, g in df.groupby('pid'):
    if n in ('135', 135):
        print(n, g.shape)
   .....:
(135, (311, 4))
('135', (74, 4))

In [210]: min(df[df.pid.isin([135, '135'])].index)
Out[210]: 1966006

# get 4000 lines around the area of the 385 entries with pid=135
In [211]: !head -n 1968006 data.csv | tail -n4000 > small.csv

In [212]: df = pd.read_csv('small.csv', index_col=0, names=['sym', 'id', 'pid', 'name'])

In [213]: df[df.pid == 135].shape
Out[213]: (385, 4)

In [214]: df[df.pid == '135'].shape
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-070d5748621c> in <module>()
----> 1 df[df.pid == '135'].shape

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self, other)
    238             if np.isscalar(res):
    239                 raise TypeError('Could not compare %s type with Series'
--> 240                                 % type(other))
    241             return Series(na_op(values, other),
    242                           index=self.index, name=self.name)

TypeError: Could not compare <type 'str'> type with Series

In [215]: df.dtypes
Out[215]:
sym     object
id       int64
pid      int64
name    object
dtype: object

In [216]: df = pd.read_csv('data.csv', index_col=0, engine='python')

In [217]: df[df.pid == 135].shape
Out[217]: (0, 4)

In [218]: df[df.pid == '135'].shape
Out[218]: (385, 4)
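
As a possible workaround until this is fixed, the column type can be stated explicitly instead of being inferred. The sketch below is only an illustration: the CSV text is a hypothetical miniature of data.csv (unnamed index column followed by sym, id, pid, name), and it assumes the `dtype` argument of `read_csv` is available in the pandas version in use. Current pandas documentation also describes `low_memory=False` as a way to avoid mixed type inference, at the cost of more memory while parsing.

```python
import io
import pandas as pd

# Hypothetical miniature of data.csv: an unnamed index column,
# then sym, id, pid, name. The real file is not reproduced here.
csv_text = (
    ",sym,id,pid,name\n"
    "0,AAA,1,135,alpha\n"
    "1,BBB,2,135,beta\n"
    "2,CCC,3,X17,gamma\n"
)

# Force pid to be read as strings so no per-column inference happens.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={"pid": str})

# Diagnostic: count which Python types actually occur in the column.
pid_types = df["pid"].map(lambda v: type(v).__name__).value_counts()

# With a single type, the string comparison finds every matching row.
matches = df[df["pid"] == "135"]

print(pid_types)
print(matches.shape)
```

The same type-count diagnostic can be run on a column read without `dtype` to check whether it contains more than one type.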

jreback commented Aug 26, 2013

@floux see #3866; this is a known issue that just hasn't been fixed yet. That issue includes a way to reproduce it.

thanks for the report

It's not that hard to fix; it just needs some TLC.
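
One plausible mechanism, consistent with the symptoms in the report but not confirmed in this thread, is that the 'c' engine processes a large file in internal chunks and infers a type for each chunk separately. The sketch below is not the repro from #3866; it only imitates that idea by parsing two hypothetical pieces of one column separately and concatenating them:

```python
import io
import pandas as pd

# Hypothetical stand-ins for two internal chunks of the same column.
chunk_a = pd.read_csv(io.StringIO("pid\n135\n136\n"))  # all numeric: read as integers
chunk_b = pd.read_csv(io.StringIO("pid\n135\nX17\n"))  # one non-numeric value: read as strings

# Joining the separately inferred pieces yields one column with mixed types.
combined = pd.concat([chunk_a, chunk_b], ignore_index=True)

n_str = sum(isinstance(v, str) for v in combined["pid"])
n_other = len(combined) - n_str

print(combined["pid"].tolist())
print(n_str, n_other)
```

Under this assumption, a small subset of the file would fit in one chunk and get a single inferred type, which would match the observation that small.csv does not show the problem.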


jreback commented Sep 28, 2013

closing as a dupe of #3866

jreback closed this as completed Sep 28, 2013