The same item occurs in one column of dtype object with different types. This can cause problems in later operations that rely on the same item having the same type for all its occurrences, for example, groupby.
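To make the symptom concrete, here is a minimal sketch on a hypothetical toy frame (not the real data.csv, which I could not reduce). It assumes a column where the value 135 is stored both as an int and as a str, and shows that groupby then treats the two as separate keys.

```python
import pandas as pd

# Hypothetical toy data: the same logical value appears as int and as str.
df = pd.DataFrame({"pid": [135, "135", 135, "135", "135"], "x": range(5)})

# Count which Python types are actually stored in the object column.
type_counts = df["pid"].map(type).value_counts()
print(type_counts)

# groupby keys on the stored objects, so 135 and '135' form two groups.
# sort=False avoids having to order int and str keys against each other.
group_sizes = df.groupby("pid", sort=False).size()
print(group_sizes)
```

Checking `df["pid"].map(type).value_counts()` on a freshly loaded frame is a quick way to see whether a column is affected.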
In the example below, nothing in the file data.csv is quoted. I have tried to reproduce this with a simulated dataset but have not managed to so far. The problem also disappears if I read only a subset of the 3.5 million rows of data.csv.
The problem does not occur either when the 'python' engine is used instead of the 'c' engine (see the end of the session below).
In [204]: df = pd.read_csv('data.csv', index_col=0)
In [205]: df
Out[205]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774283 entries, 0 to 3774282
Columns: 4 entries, sym to name
dtypes: int64(1), object(3)
In [206]: df.dtypes
Out[206]:
sym object
id int64
pid object
name object
dtype: object
In [207]: df[df.pid == 135].shape # <-- number 135
Out[207]: (311, 4)
In [208]: df[df.pid == '135'].shape # <-- string '135'
Out[208]: (74, 4)
In [209]: for n, g in grouped:
   .....:     if n in ('135', 135):
   .....:         print(n, g.shape)
   .....:
(135, (311, 4))
('135', (74, 4))
In [210]: min(df[df.pid.isin([135, '135'])].index)
Out[210]: 1966006
# get 4000 lines around the area of the 385 entries with pid=135
In [211]: !head -n 1968006 data.csv | tail -n4000 > small.csv
In [212]: df = pd.read_csv('small.csv', index_col=0, names=['sym', 'id', 'pid', 'name'])
In [213]: df[df.pid == 135].shape
Out[213]: (385, 4)
In [214]: df[df.pid == '135'].shape
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-070d5748621c> in <module>()
----> 1 df[df.pid == '135'].shape
/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self, other)
238 if np.isscalar(res):
239 raise TypeError('Could not compare %s type with Series'
--> 240 % type(other))
241 return Series(na_op(values, other),
242 index=self.index, name=self.name)
TypeError: Could not compare <type 'str'> type with Series
In [215]: df.dtypes
Out[215]:
sym object
id int64
pid int64
name object
dtype: object
In [216]: df = pd.read_csv('data.csv', index_col=0, engine='python')
In [217]: df[df.pid == 135].shape
Out[217]: (0, 4)
In [218]: df[df.pid == '135'].shape
Out[218]: (385, 4)
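A possible workaround, sketched on a small in-memory CSV rather than the real file: pass an explicit `dtype` for the affected column so the parser does not have to infer it. The sample rows and the `X17` value are made up for illustration; only the column names come from the session above.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as data.csv.
csv_text = """sym,id,pid,name
A,1,135,foo
B,2,X17,bar
C,3,135,baz
"""

# Force pid to be read as strings, whatever the values look like.
df = pd.read_csv(io.StringIO(csv_text), dtype={"pid": str})

# Every pid is now a str, so only the string comparison matches.
n_str_matches = int((df["pid"] == "135").sum())
n_int_matches = int((df["pid"] == 135).sum())
print(n_str_matches, n_int_matches)
```

For a frame that is already loaded, `df["pid"] = df["pid"].astype(str)` should normalize the column in the same way. If the mixed types come from the C parser inferring types chunk by chunk (an assumption; nothing in this report establishes the cause), `low_memory=False` may also be worth trying.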
dupe of #3866