Commit 21fa21e

Merge remote-tracking branch 'upstream/master' into to_html-to_string

* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN: Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)

2 parents 7186aaf + 011b79f commit 21fa21e

34 files changed: +711 -410 lines

ci/code_checks.sh
+4

@@ -122,6 +122,10 @@ if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
     ! grep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. (autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source
     RET=$(($RET + $?)) ; echo $MSG "DONE"

+    MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG
+    ! grep -R --exclude=*.pyc --exclude=testing.py --exclude=test_testing.py assert_raises_regex pandas
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+
     MSG='Check for modules that pandas should not import' ; echo $MSG
     python -c "
     import sys
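
As context for this new check, here is a minimal sketch of the pattern it enforces; the test and function names are hypothetical, not taken from the pandas test suite:

    import pytest

    def divide(a, b):
        return a / b

    def test_divide_by_zero():
        # Preferred: pytest.raises with match= instead of the deprecated
        # assert_raises_regex helper.
        with pytest.raises(ZeroDivisionError, match="division by zero"):
            divide(1, 0)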

doc/source/io.rst
+5

@@ -2854,6 +2854,11 @@ It is often the case that users will insert columns to do temporary computations
 in Excel and you may not want to read in those columns. ``read_excel`` takes
 a ``usecols`` keyword to allow you to specify a subset of columns to parse.

+.. deprecated:: 0.24.0
+
+   Passing in an integer for ``usecols`` has been deprecated. Please pass in a list
+   of ints from 0 to ``usecols`` inclusive instead.
+
 If ``usecols`` is an integer, then it is assumed to indicate the last column
 to be parsed.
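
A quick illustration of the replacement usage described above (the file name is hypothetical):

    import pandas as pd

    # Deprecated: an integer meant "parse columns 0 through usecols".
    # df = pd.read_excel("data.xlsx", usecols=2)

    # Preferred: pass the explicit list of column indices instead.
    df = pd.read_excel("data.xlsx", usecols=[0, 1, 2])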

doc/source/reshaping.rst
+104 -6

@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
 Reshaping by pivoting DataFrame objects
 ---------------------------------------

+.. image:: _static/reshaping_pivot.png
+
 .. ipython::
    :suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

    In [3]: df = unpivot(tm.makeTimeDataFrame())

-Data is often stored in CSV files or databases in so-called "stacked" or
-"record" format:
+Data is often stored in so-called "stacked" or "record" format:

 .. ipython:: python

@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:

    df[df['variable'] == 'A']

-.. image:: _static/reshaping_pivot.png
-
 But suppose we wish to do time series operations with the variables. A better
 representation would be where the ``columns`` are the unique variables and an
 ``index`` of dates identifies individual observations. To reshape the data into

@@ -87,7 +86,7 @@ column:
 .. ipython:: python

    df['value2'] = df['value'] * 2
-   pivoted = df.pivot('date', 'variable')
+   pivoted = df.pivot(index='date', columns='variable')
    pivoted

 You can then select subsets from the pivoted ``DataFrame``:

@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
 Note that this returns a view on the underlying data in the case where the data
 are homogeneously-typed.

+.. note::
+   :func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
+   entries, cannot reshape`` if the index/column pair is not unique. In this
+   case, consider using :func:`~pandas.pivot_table` which is a generalization
+   of pivot that can handle duplicate values for one index/column pair.
+
 .. _reshaping.stacking:

 Reshaping by stacking and unstacking
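
The note added above describes a common failure mode; a small sketch of it follows (the data is made up, not part of the commit):

    import pandas as pd

    df = pd.DataFrame({"date": ["2018-01-01", "2018-01-01"],
                       "variable": ["A", "A"],
                       "value": [1.0, 2.0]})

    # pivot raises "ValueError: Index contains duplicate entries, cannot reshape"
    # because the pair ("2018-01-01", "A") appears twice:
    # df.pivot(index="date", columns="variable", values="value")

    # pivot_table aggregates the duplicates instead (mean by default):
    df.pivot_table(index="date", columns="variable", values="value")
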
@@ -704,10 +709,103 @@ handling of NaN:
    In [3]: np.unique(x, return_inverse=True)[::-1]
    Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))

-
 .. note::
     If you just want to handle one column as a categorical variable (like R's factor),
     you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
     ``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
     see the :ref:`Categorical introduction <categorical>` and the
     :ref:`API documentation <api.categorical>`.
+
+Examples
+--------
+
+In this section, we will review frequently asked questions and examples. The
+column names and relevant column values are named to correspond with how this
+DataFrame will be pivoted in the answers below.
+
+.. ipython:: python
+
+   np.random.seed([3, 1415])
+   n = 20
+
+   cols = np.array(['key', 'row', 'item', 'col'])
+   df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
+   df.columns = cols
+   df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+
+   df
+
+Pivoting with Single Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose we wanted to pivot ``df`` such that the ``col`` values are columns,
+``row`` values are the index, and the mean of ``val0`` are the values? In
+particular, the resulting DataFrame should look like:
+
+.. code-block:: ipython
+
+   col   col0   col1   col2   col3  col4
+   row
+   row0  0.77  0.605    NaN  0.860  0.65
+   row2  0.13    NaN  0.395  0.500  0.25
+   row3   NaN  0.310    NaN  0.545   NaN
+   row4   NaN  0.100  0.395  0.760  0.24
+
+This solution uses :func:`~pandas.pivot_table`. Also note that
+``aggfunc='mean'`` is the default. It is included here to be explicit.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean')
+
+Note that we can also replace the missing values by using the ``fill_value``
+parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+
+Also note that we can pass in other aggregation functions as well. For example,
+we can also pass in ``sum``.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+
+Another aggregation we can do is calculate the frequency in which the columns
+and rows occur together a.k.a. "cross tabulation". To do this, we can pass
+``size`` to the ``aggfunc`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+
+Pivoting with Multiple Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We can also perform multiple aggregations. For example, to perform both a
+``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+
+Note to aggregate over multiple value columns, we can pass in a list to the
+``values`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+
+Note to subdivide over multiple columns we can pass in a list to the
+``columns`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])

doc/source/whatsnew/v0.24.0.txt
+2

@@ -972,6 +972,7 @@ Deprecations
 - The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
 - Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
   `use_threads` to reflect the changes in pyarrow 0.11.0. (:issue:`23053`)
+- :func:`pandas.read_excel` has deprecated accepting ``usecols`` as an integer. Please pass in a list of ints from 0 to ``usecols`` inclusive instead (:issue:`23527`)
 - Constructing a :class:`TimedeltaIndex` from data with ``datetime64``-dtyped data is deprecated, will raise ``TypeError`` in a future version (:issue:`23539`)

 .. _whatsnew_0240.deprecations.datetimelike_int_ops:

@@ -1300,6 +1301,7 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
 - :func:`read_excel()` will correctly show the deprecation warning for previously deprecated ``sheetname`` (:issue:`17994`)
 - :func:`read_csv()` and func:`read_table()` will throw ``UnicodeError`` and not coredump on badly encoded strings (:issue:`22748`)
 - :func:`read_csv()` will correctly parse timezone-aware datetimes (:issue:`22256`)
+- Bug in :func:`read_csv()` in which memory management was prematurely optimized for the C engine when the data was being read in chunks (:issue:`23509`)
 - :func:`read_sas()` will parse numbers in sas7bdat-files that have width less than 8 bytes correctly. (:issue:`21616`)
 - :func:`read_sas()` will correctly parse sas7bdat files with many columns (:issue:`22628`)
 - :func:`read_sas()` will correctly parse sas7bdat files with data page types having also bit 7 set (so page type is 128 + 256 = 384) (:issue:`16615`)

pandas/_libs/parsers.pyx
+1

@@ -132,6 +132,7 @@ cdef extern from "parser/tokenizer.h":
         int64_t *word_starts      # where we are in the stream
         int64_t words_len
         int64_t words_cap
+        int64_t max_words_cap     # maximum word cap encountered

         char *pword_start         # pointer to stream start of current field
         int64_t word_start        # position start of current field

pandas/_libs/src/parser/tokenizer.c
+31 -2

@@ -197,6 +197,7 @@ int parser_init(parser_t *self) {
     sz = sz ? sz : 1;
     self->words = (char **)malloc(sz * sizeof(char *));
     self->word_starts = (int64_t *)malloc(sz * sizeof(int64_t));
+    self->max_words_cap = sz;
     self->words_cap = sz;
     self->words_len = 0;

@@ -247,7 +248,7 @@ void parser_del(parser_t *self) {
 }

 static int make_stream_space(parser_t *self, size_t nbytes) {
-    int64_t i, cap;
+    int64_t i, cap, length;
     int status;
     void *orig_ptr, *newptr;

@@ -287,8 +288,23 @@ static int make_stream_space(parser_t *self, size_t nbytes) {
     */

     cap = self->words_cap;
+
+    /**
+     * If we are reading in chunks, we need to be aware of the maximum number
+     * of words we have seen in previous chunks (self->max_words_cap), so
+     * that way, we can properly allocate when reading subsequent ones.
+     *
+     * Otherwise, we risk a buffer overflow if we mistakenly under-allocate
+     * just because a recent chunk did not have as many words.
+     */
+    if (self->words_len + nbytes < self->max_words_cap) {
+        length = self->max_words_cap - nbytes;
+    } else {
+        length = self->words_len;
+    }
+
     self->words =
-        (char **)grow_buffer((void *)self->words, self->words_len,
+        (char **)grow_buffer((void *)self->words, length,
                              (int64_t*)&self->words_cap, nbytes,
                              sizeof(char *), &status);
     TRACE(

@@ -1241,6 +1257,19 @@ int parser_trim_buffers(parser_t *self) {

     int64_t i;

+    /**
+     * Before we free up space and trim, we should
+     * save how many words we saw when parsing, if
+     * it exceeds the maximum number we saw before.
+     *
+     * This is important for when we read in chunks,
+     * so that we can inform subsequent chunk parsing
+     * as to how many words we could possibly see.
+     */
+    if (self->words_cap > self->max_words_cap) {
+        self->max_words_cap = self->words_cap;
+    }
+
     /* trim words, word_starts */
     new_cap = _next_pow2(self->words_len) + 1;
     if (new_cap < self->words_cap) {
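
A user-level sketch of the scenario these comments describe: reading a CSV in chunks with the C engine, where each chunk reuses the tokenizer's word buffers that ``max_words_cap`` now tracks. The data here is made up, not the reproducer from the linked issue:

    import io
    import pandas as pd

    # Build a small CSV and parse it in chunks; each chunk goes through
    # make_stream_space() and parser_trim_buffers() in the C tokenizer.
    csv = "a,b,c\n" + "\n".join("%d,%d,%d" % (i, i + 1, i + 2) for i in range(100))

    for chunk in pd.read_csv(io.StringIO(csv), engine="c", chunksize=10):
        print(len(chunk))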

pandas/_libs/src/parser/tokenizer.h
+1

@@ -142,6 +142,7 @@ typedef struct parser_t {
     int64_t *word_starts;   // where we are in the stream
     int64_t words_len;
     int64_t words_cap;
+    int64_t max_words_cap;  // maximum word cap encountered

     char *pword_start;      // pointer to stream start of current field
     int64_t word_start;     // position start of current field

pandas/core/arrays/datetimelike.py
+6 -2

@@ -124,8 +124,12 @@ def asi8(self):
         # do not cache or you'll create a memory leak
         return self._data.view('i8')

-    # ------------------------------------------------------------------
-    # Array-like Methods
+    # ----------------------------------------------------------------
+    # Array-Like / EA-Interface Methods
+
+    @property
+    def nbytes(self):
+        return self._data.nbytes

     @property
     def shape(self):
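
The new ``nbytes`` property simply forwards to the backing ndarray. A small illustration from the user side, assuming the default datetime64[ns] storage of 8 bytes per element:

    import pandas as pd

    dti = pd.date_range("2018-01-01", periods=3, freq="D")
    # Three 8-byte timestamps -> 24 bytes of underlying data.
    print(dti.values.nbytes)  # 24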

pandas/core/arrays/datetimes.py
+1 -1

@@ -385,7 +385,7 @@ def _resolution(self):
         return libresolution.resolution(self.asi8, self.tz)

     # ----------------------------------------------------------------
-    # Array-like Methods
+    # Array-Like / EA-Interface Methods

     def __array__(self, dtype=None):
         if is_object_dtype(dtype):
