Commit 37ccc41

topper-123 and Terji Petersen authored and committed
API: read_stata with index_col=None should return RangeIndex (pandas-dev#49745)
* API: read_stata with index_col=None return RangeIndex
* fix comments
* fix comments II

Co-authored-by: Terji Petersen <[email protected]>
Co-authored-by: Terji Petersen <[email protected]>
1 parent 6ac011b commit 37ccc41

3 files changed: +17 -2 lines changed

doc/source/whatsnew/v2.0.0.rst

+2
@@ -342,6 +342,7 @@ Other API changes
 - Passing strings that cannot be parsed as datetimes to :class:`Series` or :class:`DataFrame` with ``dtype="datetime64[ns]"`` will raise instead of silently ignoring the keyword and returning ``object`` dtype (:issue:`24435`)
 - Passing a sequence containing a type that cannot be converted to :class:`Timedelta` to :func:`to_timedelta` or to the :class:`Series` or :class:`DataFrame` constructor with ``dtype="timedelta64[ns]"`` or to :class:`TimedeltaIndex` now raises ``TypeError`` instead of ``ValueError`` (:issue:`49525`)
 - Changed behavior of :class:`Index` constructor with sequence containing at least one ``NaT`` and everything else either ``None`` or ``NaN`` to infer ``datetime64[ns]`` dtype instead of ``object``, matching :class:`Series` behavior (:issue:`49340`)
+- :func:`read_stata` with parameter ``index_col`` set to ``None`` (the default) will now set the index on the returned :class:`DataFrame` to a :class:`RangeIndex` instead of a :class:`Int64Index` (:issue:`49745`)
 - Changed behavior of :class:`Index` constructor with an object-dtype ``numpy.ndarray`` containing all-``bool`` values or all-complex values, this will now retain object dtype, consistent with the :class:`Series` behavior (:issue:`49594`)
 -
 

@@ -596,6 +597,7 @@ Performance improvements
 - Memory improvement in :meth:`RangeIndex.sort_values` (:issue:`48801`)
 - Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``sort=False`` (:issue:`48976`)
 - Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``observed=False`` (:issue:`49596`)
+- Performance improvement in :func:`read_stata` with parameter ``index_col`` set to ``None`` (the default). Now the index will be a :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49745`)
 - Performance improvement in :func:`merge` when not merging on the index - the new index will now be :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49478`)
 
 .. ---------------------------------------------------------------------------
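
The two changelog entries above describe the same switch from two angles (API and performance). As a rough illustration that is not part of the commit, the sketch below compares a lazy `RangeIndex` with an index materialized from `np.arange`. The size `n` is an arbitrary assumption, and exact byte counts depend on platform and pandas version, so only the relative difference matters here.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # arbitrary size, chosen only for illustration

# Lazy index: stores only start, stop and step.
lazy = pd.RangeIndex(n)

# Materialized index: backed by an array holding every integer.
materialized = pd.Index(np.arange(n))

# Reported memory footprint of each index, in bytes.
lazy_bytes = lazy.memory_usage()
materialized_bytes = materialized.memory_usage()

# Both indexes describe the same labels 0..n-1.
same_labels = lazy.equals(materialized)

print(type(lazy).__name__, lazy_bytes)
print(type(materialized).__name__, materialized_bytes)
print("same labels:", same_labels)
```

The labels are identical in both cases; only the storage differs, which is why the change is listed both as an API change and as a performance improvement.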

pandas/io/stata.py

+1 -1
@@ -1725,7 +1725,7 @@ def read(
         # If index is not specified, use actual row number rather than
         # restarting at 0 for each chunk.
         if index_col is None:
-            rng = np.arange(self._lines_read - read_lines, self._lines_read)
+            rng = range(self._lines_read - read_lines, self._lines_read)
             data.index = Index(rng)  # set attr instead of set_index to avoid copy
 
         if columns is not None:
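
The row-number bookkeeping in this hunk can be sketched without pandas. The following is a simplified, hypothetical model: the `ChunkedRows` class and its attribute names are illustrative assumptions, not pandas internals. It shows how each chunk's row labels continue from the previous chunk instead of restarting at 0, using a lazy `range` as the new line does.

```python
class ChunkedRows:
    """Hypothetical stand-in for a chunked reader; not pandas code."""

    def __init__(self, total_rows):
        self.total_rows = total_rows
        self.lines_read = 0  # rows consumed so far, across all chunks

    def read(self, nrows):
        # Never read past the end of the data.
        read_lines = min(nrows, self.total_rows - self.lines_read)
        self.lines_read += read_lines
        # Lazy row labels for this chunk, continuing where the last chunk ended.
        return range(self.lines_read - read_lines, self.lines_read)


reader = ChunkedRows(total_rows=10)
first = reader.read(4)
second = reader.read(4)
third = reader.read(4)  # only 2 rows remain

print(first, second, third)  # range(0, 4) range(4, 8) range(8, 10)
```

Passing such a `range` to `Index(...)`, as the changed line does, lets pandas build a `RangeIndex` rather than an array-backed integer index.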

pandas/tests/io/test_stata.py

+14 -1
@@ -73,6 +73,19 @@ def test_read_empty_dta(self, version):
             empty_ds2 = read_stata(path)
             tm.assert_frame_equal(empty_ds, empty_ds2)
 
+    @pytest.mark.parametrize("version", [114, 117, 118, 119, None])
+    def test_read_index_col_none(self, version):
+        df = DataFrame({"a": range(5), "b": ["b1", "b2", "b3", "b4", "b5"]})
+        # GH 7369, make sure can read a 0-obs dta file
+        with tm.ensure_clean() as path:
+            df.to_stata(path, write_index=False, version=version)
+            read_df = read_stata(path)
+
+        assert isinstance(read_df.index, pd.RangeIndex)
+        expected = df.copy()
+        expected["a"] = expected["a"].astype(np.int32)
+        tm.assert_frame_equal(read_df, expected, check_index_type=True)
+
     @pytest.mark.parametrize("file", ["stata1_114", "stata1_117"])
     def test_read_dta1(self, file, datapath):
 
@@ -1054,7 +1067,7 @@ def test_categorical_sorting(self, file, datapath):
         parsed = parsed.sort_values("srh", na_position="first")
 
         # Don't sort index
-        parsed.index = np.arange(parsed.shape[0])
+        parsed.index = pd.RangeIndex(len(parsed))
         codes = [-1, -1, 0, 1, 1, 1, 2, 2, 3, 4]
         categories = ["Poor", "Fair", "Good", "Very good", "Excellent"]
         cat = pd.Categorical.from_codes(
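
The round trip exercised by `test_read_index_col_none` can be approximated outside the pandas test suite. The sketch below is an assumption-laden stand-in: it uses the standard library's `tempfile` in place of the internal `tm.ensure_clean` helper, and the file name `example.dta` is made up. The index type printed depends on the installed pandas version, so it is printed rather than asserted.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(5), "b": ["b1", "b2", "b3", "b4", "b5"]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.dta")  # hypothetical file name
    df.to_stata(path, write_index=False)
    read_df = pd.read_stata(path)  # index_col defaults to None

# On pandas versions that include this commit, RangeIndex is expected here.
print(type(read_df.index).__name__)
print(list(read_df.index))
```

Whatever the index type, the labels themselves are the row numbers 0 through 4; the commit only changes how they are stored.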
