BUG: read_csv() raises IndexError when dict supplied as dtype uses integer keys of same sign #57944

EsmeMaxwell · 2024-03-21T01:21:12Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from pathlib import Path

# Create a simple df with a RangeIndex for the columns. N.B. The RangeIndex
# start and stop have the same sign.
range_idx = pd.RangeIndex(start=-10, stop=-5, step=1)
df = pd.DataFrame(np.random.randn(3, len(range_idx)), columns=range_idx)

csv_file = Path("./data/test_range_index.csv")
csv_file.parent.mkdir(exist_ok=True)

# Write df to a CSV along with row+col indices
df.to_csv(
    csv_file,
    header=True,
    index=True,
)

# Create a dictionary to specify the dtypes
dtype_spec = {}
for v in range_idx:
    dtype_spec[v] = "float64"
    # dtype_spec[f"{v}"] = "float64"   # * Alternatively, convert to str, which works

# Reload the CSV
df_reload = pd.read_csv(
    csv_file,
    header=0,
    index_col=0,
    float_precision='round_trip',
    dtype=dtype_spec,
)

Issue Description

This may partly stem from my misunderstanding of how dtypes should be specified and of the implicit conversion of the column labels to str.

Notwithstanding that, the problem arises when read_csv() reads a CSV whose header row consists of integers of the same sign and dtype is given as a dictionary with integer keys: the call fails with IndexError: list index out of range.

Traceback

IndexError                                Traceback (most recent call last)
Cell In[106], line 1
----> 1 df_reload = pd.read_csv(
      2     csv_file,
      3     header=0,
      4     index_col=0,
      5     float_precision='round_trip',
      6     dtype=dtype_spec,
      7 )

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/readers.py:626, in _read(filepath_or_buffer, kwds)
    623     return parser
    625 with parser:
--> 626     return parser.read(nrows)

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1923, in TextFileReader.read(self, nrows)
   1916 nrows = validate_integer("nrows", nrows)
   1917 try:
   1918     # error: "ParserBase" has no attribute "read"
   1919     (
   1920         index,
   1921         columns,
   1922         col_dict,
-> 1923     ) = self._engine.read(  # type: ignore[attr-defined]
   1924         nrows
   1925     )
   1926 except Exception:
   1927     self.close()

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:333, in CParserWrapper.read(self, nrows)
    330     data = {k: v for k, (i, v) in zip(names, data_tups)}
    332     names, date_data = self._do_date_conversions(names, data)
--> 333     index, column_names = self._make_index(date_data, alldata, names)
    335 return index, column_names, date_data

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py:372, in ParserBase._make_index(self, data, alldata, columns, indexnamerow)
    370 elif not self._has_complex_date_col:
    371     simple_index = self._get_simple_index(alldata, columns)
--> 372     index = self._agg_index(simple_index)
    373 elif self._has_complex_date_col:
    374     if not self._name_processed:

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py:489, in ParserBase._agg_index(self, index, try_parse_dates)
    484     if col_name is not None:
    485         col_na_values, col_na_fvalues = _get_na_values(
    486             col_name, self.na_values, self.na_fvalues, self.keep_default_na
    487         )
--> 489 clean_dtypes = self._clean_mapping(self.dtype)
    491 cast_type = None
    492 index_converter = False

File ~/miniconda3/envs/exio-datamanager/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py:455, in ParserBase._clean_mapping(self, mapping)
    453 for col, v in mapping.items():
    454     if isinstance(col, int) and col not in self.orig_names:
--> 455         col = self.orig_names[col]
    456     clean[col] = v
    457 if isinstance(mapping, defaultdict):

IndexError: list index out of range

Further discussion

The IndexError is not raised if the original RangeIndex has a start and stop of different signs.

Nor is it raised if dtype= is omitted from read_csv()'s arguments, or if the keys of the dtype dictionary are converted to str.
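One possible reading of the last traceback frame, which would explain both observations: an integer key that is not found among the (str) header names appears to be treated as a positional index into the list of names. The sketch below only mimics the three lines shown in that frame; it is not the pandas implementation, `clean_mapping_sketch` and `header_names` are hypothetical helpers, and the "Unnamed: 0" label for the index column is an assumption about what the parser stores.

```python
def clean_mapping_sketch(mapping, orig_names):
    """Mimic the loop from the last traceback frame (not the real pandas code)."""
    clean = {}
    for col, v in mapping.items():
        # An int key that is not a column name is used as a list position.
        if isinstance(col, int) and col not in orig_names:
            col = orig_names[col]
        clean[col] = v
    return clean


def header_names(start, stop):
    # Assumed layout: the index column's label followed by the str column labels.
    return ["Unnamed: 0"] + [str(v) for v in range(start, stop)]


# Same sign: six names, but the first key (-10) is used as position -10.
same_sign_keys = {v: "float64" for v in range(-10, -5)}
try:
    clean_mapping_sketch(same_sign_keys, header_names(-10, -5))
    same_sign_error = None
except IndexError as exc:
    same_sign_error = exc

# Different signs: every key is a valid position, so nothing is raised,
# but each key resolves to whichever name sits at that position.
mixed_sign_keys = {v: "float64" for v in range(-2, 3)}
mixed_result = clean_mapping_sketch(mixed_sign_keys, header_names(-2, 3))
```

If this reading is right, the different-sign case avoids the exception only because the keys happen to be valid list positions; in the sketch, the key -2 resolves to the name "1" rather than "-2", so the dtypes may be applied to columns other than the intended ones.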

In all successful cases, the columns are an Index containing the str representations of the integers. Does this mean the dtype keys should always be str for the purposes of read_csv(), and that any conversion back to integer has to happen later? If so, the docs might benefit from being a little clearer.
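For reference, a minimal sketch of the str-key workaround mentioned in the example, followed by a later conversion of the column labels back to integers. It uses an in-memory `io.StringIO` buffer instead of the file from the example purely to stay self-contained, and the final `astype(int)` step is one assumed way to restore the labels, not something the docs prescribe.

```python
import io

import numpy as np
import pandas as pd

range_idx = pd.RangeIndex(start=-10, stop=-5, step=1)
df = pd.DataFrame(np.random.randn(3, len(range_idx)), columns=range_idx)

# Write to an in-memory buffer rather than a file on disk.
buf = io.StringIO()
df.to_csv(buf, header=True, index=True)
buf.seek(0)

# Key the dtype dict by the str form of each column label.
dtype_spec = {str(v): "float64" for v in range_idx}

df_reload = pd.read_csv(
    buf,
    header=0,
    index_col=0,
    float_precision="round_trip",
    dtype=dtype_spec,
)

# The reloaded labels are str; convert them back to integers afterwards.
df_reload.columns = df_reload.columns.astype(int)
```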

Expected Behavior

read_csv() should succeed when dtype is specified as a dict with integer keys (of the same sign).

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-101-generic
Version : #111~20.04.1-Ubuntu SMP Mon Mar 11 15:44:43 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

gb0808 · 2024-04-06T19:46:01Z

take

EsmeMaxwell added the Bug and Needs Triage labels Mar 21, 2024

github-actions bot assigned gb0808 Apr 6, 2024

gb0808 pushed a commit to gb0808/pandas that referenced this issue Apr 21, 2024

fixed issue pandas-dev#57944

1d04d6f

gb0808 added a commit to gb0808/pandas that referenced this issue Apr 21, 2024

fixed issue pandas-dev#57944

3157458

gb0808 mentioned this issue Apr 21, 2024

Csv issue #58348 (Closed)
