Commit b14746c

Refactor read_csv bug fix with PR comments (#18186)
1 parent dc21a9d commit b14746c

3 files changed, +6 -14 lines changed

doc/source/whatsnew/v0.21.1.txt

+1 -9
@@ -56,15 +56,6 @@ Documentation Changes
 
 Bug Fixes
 ~~~~~~~~~
-- Bug in ``DataFrame.resample(...).apply(...)`` when there is a callable that returns different columns (:issue:`15169`)
-- Bug in :class:`TimedeltaIndex` subtraction could incorrectly overflow when ``NaT`` is present (:issue:`17791`)
-- Bug in :class:`DatetimeIndex` subtracting datetimelike from DatetimeIndex could fail to overflow (:issue:`18020`)
-- Bug in ``pd.Series.rolling.skew()`` and ``rolling.kurt()`` with all equal values has floating issue (:issue:`18044`)
-- Bug in ``pd.DataFrameGroupBy.count()`` when counting over a datetimelike column (:issue:`13393`)
-- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
-- Bug in :class:`IntervalIndex` constructor when a list of intervals is passed with non-default ``closed`` (:issue:`18334`)
-- Bug in :meth:`IntervalIndex.copy` when copying and ``IntervalIndex`` with non-default ``closed`` (:issue:`18339`)
-- Bug in ``pd.read_csv`` when reading numeric category fields with high cardinality (:issue `18186`)
 
 Conversion
 ^^^^^^^^^^
@@ -91,6 +82,7 @@ I/O
 - Bug in class:`~pandas.io.stata.StataReader` not converting date/time columns with display formatting addressed (:issue:`17990`). Previously columns with display formatting were normally left as ordinal numbers and not converted to datetime objects.
 - Bug in :func:`read_csv` when reading a compressed UTF-16 encoded file (:issue:`18071`)
 - Bug in :func:`read_csv` for handling null values in index columns when specifying ``na_filter=False`` (:issue:`5239`)
+- Bug in :func:`read_csv` when reading numeric category fields with high cardinality (:issue:`18186`)
 - Bug in :meth:`DataFrame.to_csv` when the table had ``MultiIndex`` columns, and a list of strings was passed in for ``header`` (:issue:`5539`)
 - :func:`read_parquet` now allows to specify the columns to read from a parquet file (:issue:`18154`)
 - :func:`read_parquet` now allows to specify kwargs which are passed to the respective engine (:issue:`18216`)

pandas/_libs/parsers.pyx

+1 -1
@@ -2227,7 +2227,7 @@ def _concatenate_chunks(list chunks):
     for name in names:
         arrs = [chunk.pop(name) for chunk in chunks]
         # Check each arr for consistent types.
-        dtypes = set([a.dtype for a in arrs])
+        dtypes = {a.dtype for a in arrs}
         numpy_dtypes = {x for x in dtypes if not is_categorical_dtype(x)}
         if len(numpy_dtypes) > 1:
             common_type = np.find_common_type(numpy_dtypes, [])
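
Note on the hunk above: the change from `set([...])` to a set comprehension should be behavior-preserving, since both forms build the same set of dtypes. A minimal sketch of that equivalence follows. The sample arrays are made up for illustration, and `np.result_type` is used only as a stand-in for `np.find_common_type`, which, as far as I know, is no longer available in recent NumPy releases.

```python
import numpy as np

# Hypothetical per-chunk arrays; the real ones come from the parser's chunks.
arrs = [
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4], dtype=np.int64),
    np.array([1.5], dtype=np.float64),
]

# Old spelling: build a throwaway list, then convert it to a set.
old_style = set([a.dtype for a in arrs])

# New spelling: build the set directly with a set comprehension.
new_style = {a.dtype for a in arrs}

# Both spellings yield the same set of distinct dtypes.
assert old_style == new_style

# Stand-in (assumption) for the common-type lookup done in the real code.
common_type = np.result_type(*new_style)
```

The comprehension skips the intermediate list, which is a small readability gain rather than a functional change.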

pandas/tests/io/parser/dtypes.py

+4 -4
@@ -117,12 +117,12 @@ def test_categorical_dtype(self):
     @pytest.mark.slow
     def test_categorical_dtype_high_cardinality_numeric(self):
         # GH 18186
-        data = sorted([str(i) for i in range(10**6)])
-        expected = pd.DataFrame({'a': Categorical(data, ordered=True)})
+        data = np.sort([str(i) for i in range(524289)])
+        expected = DataFrame({'a': Categorical(data, ordered=True)})
         actual = self.read_csv(StringIO('a\n' + '\n'.join(data)),
                                dtype='category')
-        actual.a.cat.reorder_categories(sorted(actual.a.cat.categories),
-                                        ordered=True, inplace=True)
+        actual["a"] = actual["a"].cat.reorder_categories(
+            np.sort(actual.a.cat.categories), ordered=True)
         tm.assert_frame_equal(actual, expected)
 
     def test_categorical_dtype_encoding(self):
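
Note on the hunk above: the refactored test replaces the `inplace=True` call with an assignment of the returned Series, and sorts the numeric strings with `np.sort`. A scaled-down sketch of that pattern follows. It uses the public `pd.read_csv` instead of the test suite's `self.read_csv`, and 25 rows instead of 524289 (which equals 2**19 + 1); both substitutions are assumptions made to keep the example fast.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Small stand-in for the 524289-row input used by the real test.
# Sorting strings is lexicographic, so '10' sorts before '2'.
data = np.sort([str(i) for i in range(25)])

# Read the single column as a categorical.
actual = pd.read_csv(StringIO("a\n" + "\n".join(data)), dtype="category")

# reorder_categories returns a new Series; assign it back
# instead of relying on an inplace argument.
actual["a"] = actual["a"].cat.reorder_categories(
    np.sort(actual["a"].cat.categories), ordered=True)

# The categories now follow the same lexicographic order as the input.
assert list(actual["a"].cat.categories) == list(data)
assert actual["a"].cat.ordered
```

The assignment form makes the data flow explicit and does not depend on `inplace`, which, to my knowledge, later pandas releases no longer accept for `reorder_categories`.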
