
Commit 4e6f3a0

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_fixed
* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
2 parents: 3f5fbcd + dcb8b6a


75 files changed: +1939 −1325 lines

ci/code_checks.sh (+15 −5)

@@ -9,16 +9,19 @@
 # In the future we may want to add the validation of docstrings and other checks here.
 #
 # Usage:
-#   $ ./ci/code_checks.sh            # run all checks
-#   $ ./ci/code_checks.sh lint       # run linting only
-#   $ ./ci/code_checks.sh patterns   # check for patterns that should not exist
-#   $ ./ci/code_checks.sh doctests   # run doctests
+#   $ ./ci/code_checks.sh               # run all checks
+#   $ ./ci/code_checks.sh lint          # run linting only
+#   $ ./ci/code_checks.sh patterns      # check for patterns that should not exist
+#   $ ./ci/code_checks.sh doctests      # run doctests
+#   $ ./ci/code_checks.sh dependencies  # check that dependencies are consistent

 echo "inside $0"
 [[ $LINT ]] || { echo "NOT Linting. To lint use: LINT=true $0 $1"; exit 0; }
-[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "doctests" ]] || { echo "Unknown command $1. Usage: $0 [lint|patterns|doctests]"; exit 9999; }
+[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "doctests" || "$1" == "dependencies" ]] \
+    || { echo "Unknown command $1. Usage: $0 [lint|patterns|doctests|dependencies]"; exit 9999; }

 source activate pandas
+BASE_DIR="$(dirname $0)/.."
 RET=0
 CHECK=$1

@@ -172,4 +175,11 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then

 fi

+### DEPENDENCIES ###
+if [[ -z "$CHECK" || "$CHECK" == "dependencies" ]]; then
+    MSG='Check that requirements-dev.txt has been generated from environment.yml' ; echo $MSG
+    $BASE_DIR/scripts/generate_pip_deps_from_conda.py --compare
+    RET=$(($RET + $?)) ; echo $MSG "DONE"
+fi
+
 exit $RET
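Per the new usage comment, this check can be run on its own with ./ci/code_checks.sh dependencies. The generate_pip_deps_from_conda.py script itself is not part of this commit; as a rough sketch of the kind of comparison such a check performs (names, file layout, and logic below are assumptions for illustration, not the script's actual implementation):

    # Hypothetical sketch of a conda -> pip dependency consistency check.
    # The real scripts/generate_pip_deps_from_conda.py is not shown in this
    # commit; everything here is illustrative only.
    import sys
    import yaml  # assumes PyYAML is installed

    def conda_deps(path='environment.yml'):
        # Keep only plain dependency strings, skipping nested pip sections.
        # A real check would also translate conda specifiers (foo=1.0)
        # into pip specifiers (foo==1.0) before comparing.
        with open(path) as f:
            env = yaml.safe_load(f)
        return sorted(d for d in env.get('dependencies', []) if isinstance(d, str))

    def pip_deps(path='requirements-dev.txt'):
        # Skip blank lines and comments.
        with open(path) as f:
            return sorted(line.strip() for line in f
                          if line.strip() and not line.startswith('#'))

    if __name__ == '__main__':
        # A non-zero exit feeds into RET in code_checks.sh, failing the CI job.
        sys.exit(0 if conda_deps() == pip_deps() else 1)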

ci/environment-dev.yaml (−20): this file was deleted.

ci/requirements-optional-conda.txt (−28): this file was deleted.

ci/requirements_dev.txt (−16): this file was deleted.

doc/source/contributing.rst (+3 −8)

@@ -170,7 +170,7 @@ We'll now kick off a three-step process:

 .. code-block:: none

    # Create and activate the build environment
-   conda env create -f ci/environment-dev.yaml
+   conda env create -f environment.yml
    conda activate pandas-dev

    # or with older versions of Anaconda:

@@ -180,9 +180,6 @@ We'll now kick off a three-step process:

    python setup.py build_ext --inplace -j 4
    python -m pip install -e .

-   # Install the rest of the optional dependencies
-   conda install -c defaults -c conda-forge --file=ci/requirements-optional-conda.txt
-
 At this point you should be able to import pandas from your locally built version::

    $ python  # start an interpreter

@@ -221,14 +218,12 @@ You'll need to have at least python3.5 installed on your system.

    . ~/virtualenvs/pandas-dev/bin/activate

    # Install the build dependencies
-   python -m pip install -r ci/requirements_dev.txt
+   python -m pip install -r requirements-dev.txt
+
    # Build and install pandas
    python setup.py build_ext --inplace -j 4
    python -m pip install -e .

-   # Install additional dependencies
-   python -m pip install -r ci/requirements-optional-pip.txt
-
 Creating a branch
 -----------------

doc/source/io.rst (+28 −1)

@@ -2861,7 +2861,13 @@ to be parsed.

    read_excel('path_to_file.xls', 'Sheet1', usecols=2)

-If `usecols` is a list of integers, then it is assumed to be the file column
+You can also specify a comma-delimited set of Excel columns and ranges as a string:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
+
+If ``usecols`` is a list of integers, then it is assumed to be the file column
 indices to be parsed.

 .. code-block:: python

@@ -2870,6 +2876,27 @@ indices to be parsed.

 Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.

+.. versionadded:: 0.24
+
+If ``usecols`` is a list of strings, it is assumed that each string corresponds
+to a column name provided either by the user in ``names`` or inferred from the
+document header row(s). Those strings define which columns will be parsed:
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.
+
+.. versionadded:: 0.24
+
+If ``usecols`` is callable, the callable function will be evaluated against
+the column names, returning names where the callable function evaluates to ``True``.
+
+.. code-block:: python
+
+   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
 Parsing Dates
 +++++++++++++
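Taken together, the three new usecols forms can be exercised end to end. A runnable sketch, assuming an Excel writer engine such as openpyxl is installed (file and column names here are illustrative, not from the docs):

    # Build a small workbook to read back with each usecols form.
    import pandas as pd

    df = pd.DataFrame({'foo': [1, 2], 'bar': [3, 4], 'baz2': [5, 6]})
    df.to_excel('tmp.xlsx', sheet_name='Sheet1', index=False)

    # Excel-style column letters and ranges, as a comma-delimited string
    pd.read_excel('tmp.xlsx', 'Sheet1', usecols='A,C')

    # A list of column names; element order is ignored
    pd.read_excel('tmp.xlsx', 'Sheet1', usecols=['bar', 'foo'])

    # A callable evaluated against each column name ('baz2' fails isalpha)
    pd.read_excel('tmp.xlsx', 'Sheet1', usecols=lambda name: name.isalpha())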

doc/source/reshaping.rst (+104 −6)

@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
 Reshaping by pivoting DataFrame objects
 ---------------------------------------

+.. image:: _static/reshaping_pivot.png
+
 .. ipython::
    :suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

    In [3]: df = unpivot(tm.makeTimeDataFrame())

-Data is often stored in CSV files or databases in so-called "stacked" or
-"record" format:
+Data is often stored in so-called "stacked" or "record" format:

 .. ipython:: python

@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:

    df[df['variable'] == 'A']

-.. image:: _static/reshaping_pivot.png
-
 But suppose we wish to do time series operations with the variables. A better
 representation would be where the ``columns`` are the unique variables and an
 ``index`` of dates identifies individual observations. To reshape the data into

@@ -87,7 +86,7 @@ column:

 .. ipython:: python

    df['value2'] = df['value'] * 2
-   pivoted = df.pivot('date', 'variable')
+   pivoted = df.pivot(index='date', columns='variable')
    pivoted

 You can then select subsets from the pivoted ``DataFrame``:

@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:

 Note that this returns a view on the underlying data in the case where the data
 are homogeneously-typed.

+.. note::
+   :func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
+   entries, cannot reshape`` if the index/column pair is not unique. In this
+   case, consider using :func:`~pandas.pivot_table` which is a generalization
+   of pivot that can handle duplicate values for one index/column pair.
+
 .. _reshaping.stacking:

 Reshaping by stacking and unstacking
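To make the new note concrete, a minimal sketch of the failure mode and the workaround (the data is illustrative):

    import pandas as pd

    df = pd.DataFrame({'date': ['2018-01-01', '2018-01-01'],
                       'variable': ['A', 'A'],
                       'value': [1.0, 2.0]})

    # Two rows share the ('2018-01-01', 'A') index/column pair, so this
    # raises ValueError: Index contains duplicate entries, cannot reshape
    # df.pivot(index='date', columns='variable', values='value')

    # pivot_table aggregates the duplicates instead (mean by default) -> 1.5
    df.pivot_table(index='date', columns='variable', values='value')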
@@ -704,10 +709,103 @@ handling of NaN:

    In [3]: np.unique(x, return_inverse=True)[::-1]
    Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))

-
 .. note::
    If you just want to handle one column as a categorical variable (like R's factor),
    you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
    ``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
    see the :ref:`Categorical introduction <categorical>` and the
    :ref:`API documentation <api.categorical>`.
+
+Examples
+--------
+
+In this section, we will review frequently asked questions and examples. The
+column names and relevant column values are named to correspond with how this
+DataFrame will be pivoted in the answers below.
+
+.. ipython:: python
+
+   np.random.seed([3, 1415])
+   n = 20
+
+   cols = np.array(['key', 'row', 'item', 'col'])
+   df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
+   df.columns = cols
+   df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+
+   df
+
+Pivoting with Single Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose we want to pivot ``df`` such that the ``col`` values are columns,
+``row`` values are the index, and the mean of ``val0`` are the values. In
+particular, the resulting DataFrame should look like:
+
+.. code-block:: ipython
+
+   col   col0   col1   col2   col3  col4
+   row
+   row0  0.77  0.605    NaN  0.860  0.65
+   row2  0.13    NaN  0.395  0.500  0.25
+   row3   NaN  0.310    NaN  0.545   NaN
+   row4   NaN  0.100  0.395  0.760  0.24
+
+This solution uses :func:`~pandas.pivot_table`. Also note that
+``aggfunc='mean'`` is the default. It is included here to be explicit.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean')
+
+Note that we can also replace the missing values by using the ``fill_value``
+parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+
+Also note that we can pass in other aggregation functions as well. For example,
+we can also pass in ``sum``.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+
+Another aggregation we can do is calculate the frequency with which the columns
+and rows occur together, a.k.a. "cross tabulation". To do this, we can pass
+``size`` to the ``aggfunc`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+
+Pivoting with Multiple Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We can also perform multiple aggregations. For example, to perform both a
+``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+
+Note that to aggregate over multiple value columns, we can pass in a list to the
+``values`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+
+Note that to subdivide over multiple columns we can pass in a list to the
+``columns`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
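For reference, the ``aggfunc='size'`` cross tabulation above can also be computed directly with :func:`~pandas.crosstab`; a short sketch reusing the ``df`` constructed at the start of the Examples section:

    import pandas as pd  # df as built in the Examples section above

    # Equivalent to df.pivot_table(index='row', columns='col',
    # aggfunc='size', fill_value=0): counts of row/col co-occurrences.
    pd.crosstab(df['row'], df['col'])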

doc/source/whatsnew/v0.24.0.txt (+10)

@@ -238,6 +238,7 @@ Other Enhancements

 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
 - :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
+- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)

 .. _whatsnew_0240.api_breaking:

@@ -246,6 +247,7 @@ Backwards incompatible API changes

 - A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
 - :meth:`Series.str.cat` will now raise if `others` is a `set` (:issue:`23009`)
+- Passing scalar values to :class:`DatetimeIndex` or :class:`TimedeltaIndex` will now raise ``TypeError`` instead of ``ValueError`` (:issue:`23539`)

 .. _whatsnew_0240.api_breaking.deps:
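Of the API changes above, the switch from ``ValueError`` to ``TypeError`` for scalar inputs is easy to see in practice. A minimal sketch based on the release note (the exact error message is an assumption):

    import pandas as pd

    try:
        pd.DatetimeIndex(2018)  # scalar input is no longer accepted
    except TypeError as err:    # pandas < 0.24 raised ValueError here
        print(type(err).__name__, err)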

@@ -562,6 +564,7 @@ changes were made:

 - The result of concatenating a mix of sparse and dense Series is a Series with sparse values, rather than a ``SparseSeries``.
 - ``SparseDataFrame.combine`` and ``DataFrame.combine_first`` no longer support combining a sparse column with a dense column while preserving the sparse subtype. The result will be an object-dtype SparseArray.
 - Setting :attr:`SparseArray.fill_value` to a fill value with a different dtype is now allowed.
+- ``DataFrame[column]`` is now a :class:`Series` with sparse values, rather than a :class:`SparseSeries`, when slicing a single column with sparse values (:issue:`23559`).

 Some new warnings are issued for operations that require or are likely to materialize a large dense array:

@@ -967,6 +970,7 @@ Deprecations

 - The class ``FrozenNDArray`` has been deprecated. When unpickling, ``FrozenNDArray`` will be unpickled to ``np.ndarray`` once this class is removed (:issue:`9031`)
 - Deprecated the `nthreads` keyword of :func:`pandas.read_feather` in favor of
   `use_threads` to reflect the changes in pyarrow 0.11.0. (:issue:`23053`)
+- Constructing a :class:`TimedeltaIndex` from ``datetime64``-dtyped data is deprecated and will raise ``TypeError`` in a future version (:issue:`23539`)

 .. _whatsnew_0240.deprecations.datetimelike_int_ops:

@@ -1126,6 +1130,9 @@ Datetimelike

 - Bug in :class:`PeriodIndex` with attribute ``freq.n`` greater than 1 where adding a :class:`DateOffset` object would return incorrect results (:issue:`23215`)
 - Bug in :class:`Series` that interpreted string indices as lists of characters when setting datetimelike values (:issue:`23451`)
 - Bug in :class:`Timestamp` constructor which would drop the frequency of an input :class:`Timestamp` (:issue:`22311`)
+- Bug in :class:`DatetimeIndex` where calling ``np.array(dtindex, dtype=object)`` would incorrectly return an array of ``long`` objects (:issue:`23524`)
+- Bug in :class:`Index` where passing a timezone-aware :class:`DatetimeIndex` and `dtype=object` would incorrectly raise a ``ValueError`` (:issue:`23524`)
+- Bug in :class:`Index` where calling ``np.array(dtindex, dtype=object)`` on a timezone-naive :class:`DatetimeIndex` would return an array of ``datetime`` objects instead of :class:`Timestamp` objects, potentially losing nanosecond portions of the timestamps (:issue:`23524`)

 Timedelta
 ^^^^^^^^^

@@ -1172,6 +1179,7 @@ Offsets

 - Bug in :class:`FY5253` where date offsets could incorrectly raise an ``AssertionError`` in arithmetic operations (:issue:`14774`)
 - Bug in :class:`DateOffset` where keyword arguments ``week`` and ``milliseconds`` were accepted and ignored. Passing these will now raise ``ValueError`` (:issue:`19398`)
 - Bug in adding :class:`DateOffset` with :class:`DataFrame` or :class:`PeriodIndex` incorrectly raising ``TypeError`` (:issue:`23215`)
+- Bug in comparing :class:`DateOffset` objects with non-DateOffset objects, particularly strings, raising ``ValueError`` instead of returning ``False`` for equality checks and ``True`` for not-equal checks (:issue:`23524`)

 Numeric
 ^^^^^^^

@@ -1299,6 +1307,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form

 - Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
 - Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
 - Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
+- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and index columns were being parsed anyway (:issue:`20480`)
+- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)

 Plotting
 ^^^^^^^^
