When parsing a CSV file without an index, if the list of column names is too short or too long, one gets an "Index contains duplicate entries" exception. #835

Closed
mcobzarenco opened this issue Feb 27, 2012 · 6 comments
@mcobzarenco

Hi Wes,

Just a minor bug submission:
When parsing a CSV file without an index, if the list of column names is too short or too long, one gets an "Index contains duplicate entries" exception.

E.g.:

The file test.csv contains:
abc,1,a
abc,2,b
def,3,c

Running:
table = pandas.read_csv('reports/9/test.csv', header=None, names=['a', 'b'],index_col=None)

Results in:

Exception                                 Traceback (most recent call last)
/home/marius/Code/otsquant/analysis/ccapital/<ipython-input-70-7f388b2776b2> in <module>()
----> 1 table = pandas.read_csv('reports/9/test.csv', header=None, names=['a', 'b'],index_col=None)

/usr/lib64/python2.7/site-packages/pandas/io/parsers.pyc in read_csv(filepath_or_buffer, sep, header, index_col, names, skiprows, na_values, parse_dates, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding)
    125         return parser
    126 
--> 127     return parser.get_chunk()
    128 
    129 @Appender(_read_table_doc)

/usr/lib64/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
    467         if not index._verify_integrity():
    468             dups = index.get_duplicates()
--> 469             raise Exception('Index has duplicates: %s' % str(dups))
    470 
    471         if len(self.columns) != len(zipped_content):

Exception: Index has duplicates: ['abc']

Many thanks,
Marius

@ghost ghost assigned adamklein Feb 28, 2012
@adamklein
Contributor

Following the maxim that explicit is better than implicit, I added an option to suppress index inference, "infer_index", which defaults to True; setting it to False makes the right error occur.

@wesm
Member

wesm commented Feb 28, 2012

Hmm, maybe we should just raise a better error, like "Tried to use columns 1-X as index but had duplicates"
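
To make the suggestion concrete, here is a minimal standard-library sketch of what such a check could look like. It is not the pandas implementation: `split_implicit_index` is a hypothetical name, and the rule it applies (surplus leading columns become the index) is an assumption based on the behaviour described in this thread.

```python
from collections import Counter


def split_implicit_index(rows, names):
    """Hypothetical sketch: treat surplus leading columns as the index."""
    n_index = len(rows[0]) - len(names)
    if n_index <= 0:
        # Enough labels for every field, so there is no implicit index.
        return None, rows
    index = [tuple(row[:n_index]) for row in rows]
    dups = sorted(key for key, count in Counter(index).items() if count > 1)
    if dups:
        # Name the columns that were tried, so the cause is visible.
        raise ValueError(
            "Tried to use columns 1-%d as index but had duplicates: %s"
            % (n_index, dups)
        )
    return index, [row[n_index:] for row in rows]


# The rows from test.csv in the report, with only two column labels.
rows = [["abc", "1", "a"], ["abc", "2", "b"], ["def", "3", "c"]]
try:
    split_implicit_index(rows, ["a", "b"])
except ValueError as exc:
    message = str(exc)
```

With the data from the report, the error names the column that was used as the index instead of only listing the duplicated values.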

@mcobzarenco
Author

I'm not sure. I'm thinking that if you set index_col=None, you definitely don't want an index from the file. Maybe just raise an exception such as "not enough column labels"?
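
A rough sketch of the stricter check proposed here, written with the standard library only. `check_column_labels` is a hypothetical helper, not a pandas function; it assumes the first row is representative of the whole file.

```python
import csv
import io


def check_column_labels(text, names):
    """Hypothetical pre-check: compare the label count with the first row."""
    first_row = next(csv.reader(io.StringIO(text)))
    if len(names) < len(first_row):
        raise ValueError(
            "not enough column labels: got %d names for %d fields"
            % (len(names), len(first_row))
        )
    if len(names) > len(first_row):
        raise ValueError(
            "too many column labels: got %d names for %d fields"
            % (len(names), len(first_row))
        )
    return first_row


# Same contents as test.csv in the report.
sample = "abc,1,a\nabc,2,b\ndef,3,c\n"
try:
    check_column_labels(sample, ["a", "b"])
except ValueError as exc:
    message = str(exc)
```

The trade-off is that a check like this rejects every file whose rows carry more fields than there are labels, including the ones where the extra leading column is meant to be the index.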

@adamklein
Contributor

@wesm, fine by me, easy change.

@wesm
Member

wesm commented Feb 28, 2012

@aristotle137 the trouble is that there are cases where you have N - 1 values in the first row and N values in the subsequent rows (with indices), and you want it to "just work". I would rather support this fairly common use case than make those people's lives difficult in order to raise a better error message.
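
The use case described above can be illustrated with a small standard-library sketch. `read_with_implicit_index` and the sample text are hypothetical and only mimic the idea; this is not how pandas parses files.

```python
import csv
import io


def read_with_implicit_index(text):
    """Hypothetical reader: a header one field short means column 1 is the index."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    table = {}
    for row in reader:
        if len(row) != len(header) + 1:
            raise ValueError(
                "expected %d fields, got %d" % (len(header) + 1, len(row))
            )
        # The first value labels the row; the rest line up with the header.
        table[row[0]] = dict(zip(header, row[1:]))
    return table


# Header has N - 1 = 2 fields, each data row has N = 3.
text = "x,y\nr1,1,2\nr2,3,4\n"
table = read_with_implicit_index(text)
```

Here the file needs no explicit index argument: the missing header field alone signals that the first column holds row labels.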

yarikoptic added a commit to neurodebian/pandas that referenced this issue Mar 2, 2012
* commit 'v0.7.1-1-ga2e86c2': (90 commits)
  BUG: Fix Series, DataFrame plot() for non numerical/datetime (Multi)Index (closes pandas-dev#741).
  RLS: Version 0.7.1
  DOC: release notes, what's new, change dev version to 0.7.1
  BUG: close pandas-dev#839, another case where nan may be assigned to int series
  ENH: raise NotImplementedError if user tries to iterate over .ix, GH pandas-dev#840
  BUG: fixed null-check per pandas-dev#839
  BUG: close pandas-dev#839, exception on assigning NA to bool or int64 series
  TST: more test coverage for release target
  TST: added core coverage
  TST: fix lingering line of code from pandas-dev#838
  DOC: added yet a bit more to release notes
  TST: unit test for pandas-dev#838
  DOC: added more release notes
  BUG: raise more helpful error msg for pandas-dev#835
  TST: added skip excel test for no xlrd installed
  BUG: close pandas-dev#835, add option to suppress index inference
  BUG: close pandas-dev#837, excelfile throws an exception for two-line file
  ENH: fill_value arg in DataFrame.reindex/reindex_axis, add fillna to sparse objects, GH pandas-dev#784
  ENH: add fill_value argument to Series.reindex, DataFrame next, pandas-dev#784
  ENH: concat Series with axis=1 for completeness, GH pandas-dev#787
  ...
@anirbanspace

My code worked when I used the 'sep' parameter, as follows:

df = pd.read_csv(loc_train_labels,header=None,sep=";",names=['Class'])
