When parsing a CSV file without an index, if the list with columns names is too short or too long, one gets a "Index contains duplicate entries" exception. #835

mcobzarenco · 2012-02-27T13:20:57Z

Hi Wes,

Just a minor bug submission:
When parsing a CSV file without an index, if the list of column names is too short or too long, one gets an "Index contains duplicate entries" exception.

E.g.:

The file test.csv contains:
abc,1,a
abc,2,b
def,3,c

Running:
table = pandas.read_csv('reports/9/test.csv', header=None, names=['a', 'b'],index_col=None)

Results in:

Exception                                 Traceback (most recent call last)
/home/marius/Code/otsquant/analysis/ccapital/<ipython-input-70-7f388b2776b2> in <module>()
----> 1 table = pandas.read_csv('reports/9/test.csv', header=None, names=['a', 'b'],index_col=None)

/usr/lib64/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, header, index_col, names, skiprows, na_values, parse_dates, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding)
    125         return parser
    126 
--> 127     return parser.get_chunk()
    128 
    129 @Appender(_read_table_doc)

/usr/lib64/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
    467         if not index._verify_integrity():
    468             dups = index.get_duplicates()
--> 469             raise Exception('Index has duplicates: %s' % str(dups))
    470 
    471         if len(self.columns) != len(zipped_content):

Exception: Index has duplicates: ['abc']

Many thanks,
Marius


adamklein · 2012-02-28T18:52:20Z

Under the maxim that explicit is better than implicit, I added an "infer_index" option (default True) to control index inference; setting it to False suppresses the inference, so the right error will occur.

wesm · 2012-02-28T19:09:56Z

Hmm, maybe we should just raise a better error, like "Tried to use columns 1-X as index but had duplicates".

mcobzarenco · 2012-02-28T19:14:43Z

I'm not sure. My thinking is that if you set index_col=None, you definitely don't want an index from the file. Maybe just raise an exception such as "not enough column labels"?

adamklein · 2012-02-28T19:15:00Z

@wesm, fine by me, easy change.

wesm · 2012-02-28T19:17:50Z

@aristotle137 the trouble is that there are cases where you have N - 1 values in the first row and N values in the subsequent rows (with indices) and you want it to "just work". I would rather support this fairly common use case than make those people's lives difficult just to raise a better error message.
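
To make the trade-off discussed above concrete, here is a minimal standard-library sketch of the inference rule. It is not the pandas implementation: `read_rows` is a hypothetical helper, its `infer_index` flag only mirrors the option mentioned earlier in the thread, and the error messages are the wordings proposed in the comments. It assumes every row has the same width as the first one.

```python
import csv
import io


def read_rows(text, names, infer_index=True):
    """Parse headerless CSV text into (index, records).

    Hypothetical helper, NOT pandas code: it only mimics the behavior
    discussed in this thread. Assumes all rows have the same width.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return [], []

    width = len(rows[0])
    extra = width - len(names)  # columns left over once names are assigned

    if extra < 0:
        raise ValueError(
            "too many column labels: got %d names for %d columns"
            % (len(names), width)
        )
    if extra == 0:
        # Names match the data exactly: fall back to a positional index.
        return list(range(len(rows))), [dict(zip(names, row)) for row in rows]
    if not infer_index:
        # The stricter behavior suggested in the thread.
        raise ValueError(
            "not enough column labels: got %d names for %d columns"
            % (len(names), width)
        )

    # Index inference: the leading unnamed columns become the index.
    index = [tuple(row[:extra]) for row in rows]
    seen, dups = set(), []
    for key in index:
        if key in seen and key not in dups:
            dups.append(key)
        seen.add(key)
    if dups:
        raise ValueError(
            "Tried to use columns 1-%d as index but had duplicates: %s"
            % (extra, dups)
        )
    return index, [dict(zip(names, row[extra:])) for row in rows]
```

With the three-line test.csv from the report and `names=['a', 'b']`, this sketch raises the "Tried to use columns 1-1 as index but had duplicates" error, while passing `infer_index=False` raises "not enough column labels" instead. A file whose first column is unique is still read with that column as the index, which is the "just work" case.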

* commit 'v0.7.1-1-ga2e86c2': (90 commits)
  BUG: Fix Series, DataFrame plot() for non numerical/datetime (Multi)Index (closes pandas-dev#741).
  RLS: Version 0.7.1
  DOC: release notes, what's new, change dev version to 0.7.1
  BUG: close pandas-dev#839, another case where nan may be assigned to int series
  ENH: raise NotImplementedError if user tries to iterate over .ix, GH pandas-dev#840
  BUG: fixed null-check per pandas-dev#839
  BUG: close pandas-dev#839, exception on assigning NA to bool or int64 series
  TST: more test coverage for release target
  TST: added core coverage
  TST: fix lingering line of code from pandas-dev#838
  DOC: added yet a bit more to release notes
  TST: unit test for pandas-dev#838
  DOC: added more release notes
  BUG: raise more helpful error msg for pandas-dev#835
  TST: added skip excel test for no xlrd installed
  BUG: close pandas-dev#835, add option to suppress index inference
  BUG: close pandas-dev#837, excelfile throws an exception for two-line file
  ENH: fill_value arg in DataFrame.reindex/reindex_axis, add fillna to sparse objects, GH pandas-dev#784
  ENH: add fill_value argument to Series.reindex, DataFrame next, pandas-dev#784
  ENH: concat Series with axis=1 for completeness, GH pandas-dev#787
  ...

anirbanspace · 2015-08-10T09:01:35Z

My code worked when I used the 'sep' parameter, as follows:

df = pd.read_csv(loc_train_labels,header=None,sep=";",names=['Class'])

ghost assigned adamklein Feb 28, 2012

adamklein closed this as completed in c2da129 Feb 28, 2012

adamklein added a commit that referenced this issue Feb 28, 2012

BUG: raise more helpful error msg for #835

ae5db23

dengemann mentioned this issue Sep 29, 2013

ENH/DOC: stability guide #5027

Closed
