From f5e0af09dd98cbd14aca1675d215a9d26d847e47 Mon Sep 17 00:00:00 2001 From: sinhrks Date: Wed, 20 Jul 2016 14:44:57 +0900 Subject: [PATCH 1/6] DOC: Merge FAQ and gotcha --- doc/source/10min.rst | 2 +- doc/source/advanced.rst | 126 ++++++++++++ doc/source/basics.rst | 4 +- doc/source/ecosystem.rst | 7 +- doc/source/faq.rst | 115 ----------- doc/source/{gotchas.rst => gotcha.rst} | 259 ++++++++----------------- doc/source/install.rst | 10 +- doc/source/io.rst | 5 +- doc/source/whatsnew/v0.13.0.txt | 2 +- doc/source/whatsnew/v0.15.0.txt | 2 +- 10 files changed, 225 insertions(+), 307 deletions(-) delete mode 100644 doc/source/faq.rst rename doc/source/{gotchas.rst => gotcha.rst} (67%) diff --git a/doc/source/10min.rst b/doc/source/10min.rst index 0612e86134cf2..dc334b988e911 100644 --- a/doc/source/10min.rst +++ b/doc/source/10min.rst @@ -810,4 +810,4 @@ If you are trying an operation and you see an exception like: See :ref:`Comparisons` for an explanation and what to do. -See :ref:`Gotchas` as well. +See :ref:`FAQ` as well. diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst index 9247652388c5b..44ba3bab7ca3f 100644 --- a/doc/source/advanced.rst +++ b/doc/source/advanced.rst @@ -844,3 +844,129 @@ Of course if you need integer based selection, then use ``iloc`` .. ipython:: python dfir.iloc[0:5] + +Miscellaneous indexing FAQ +-------------------------- + +Integer indexing with ix +~~~~~~~~~~~~~~~~~~~~~~~~ + +Label-based indexing with integer axis labels is a thorny topic. It has been +discussed heavily on mailing lists and among various members of the scientific +Python community. In pandas, our general viewpoint is that labels matter more +than integer locations. Therefore, with an integer axis index *only* +label-based indexing is possible with the standard tools like ``.ix``. The +following code will generate exceptions: + +.. code-block:: python + + s = pd.Series(range(5)) + s[-1] + df = pd.DataFrame(np.random.randn(5, 4)) + df + df.ix[-2:] + +This deliberate decision was made to prevent ambiguities and subtle bugs (many +users reported finding bugs when the API change was made to stop "falling back" +on position-based indexing). + +Non-monotonic indexes require exact matches +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds +of a label-based slice can be outside the range of the index, much like slice indexing a +normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and +``is_monotonic_decreasing`` attributes. + +.. ipython:: python + + df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=range(5)) + df.index.is_monotonic_increasing + + # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: + df.loc[0:4, :] + + # slice is are outside the index, so empty DataFrame is returned + df.loc[13:15, :] + +On the other hand, if the index is not monotonic, then both slice bounds must be +*unique* members of the index. + +.. ipython:: python + + df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=range(6)) + df.index.is_monotonic_increasing + + # OK because 2 and 4 are in the index + df.loc[2:4, :] + +.. 
code-block:: python + + # 0 is not in the index + In [9]: df.loc[0:4, :] + KeyError: 0 + + # 3 is not a unique label + In [11]: df.loc[2:3, :] + KeyError: 'Cannot get right slice bound for non-unique label: 3' + + +Endpoints are inclusive +~~~~~~~~~~~~~~~~~~~~~~~ + +Compared with standard Python sequence slicing in which the slice endpoint is +not inclusive, label-based slicing in pandas **is inclusive**. The primary +reason for this is that it is often not possible to easily determine the +"successor" or next element after a particular label in an index. For example, +consider the following Series: + +.. ipython:: python + + s = pd.Series(np.random.randn(6), index=list('abcdef')) + s + +Suppose we wished to slice from ``c`` to ``e``, using integers this would be + +.. ipython:: python + + s[2:5] + +However, if you only had ``c`` and ``e``, determining the next element in the +index can be somewhat complicated. For example, the following does not work: + +:: + + s.loc['c':'e'+1] + +A very common use case is to limit a time series to start and end at two +specific dates. To enable this, we made the design design to make label-based +slicing include both endpoints: + +.. ipython:: python + + s.loc['c':'e'] + +This is most definitely a "practicality beats purity" sort of thing, but it is +something to watch out for if you expect label-based slicing to behave exactly +in the way that standard Python integer slicing works. + + +Indexing potentially changes underlying Series dtype +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The use of ``reindex_like`` can potentially change the dtype of a ``Series``. + +.. ipython:: python + + series = pd.Series([1, 2, 3]) + x = pd.Series([True]) + x.dtype + x = pd.Series([True]).reindex_like(series) + x.dtype + +This is because ``reindex_like`` silently inserts ``NaNs`` and the ``dtype`` +changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` +such as ``numpy.logical_and``. + +See the `this old issue `__ for a more +detailed discussion. diff --git a/doc/source/basics.rst b/doc/source/basics.rst index 6f711323695c8..c86b2dea593c9 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -313,7 +313,7 @@ To evaluate single-element pandas objects in a boolean context, use the method ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all(). -See :ref:`gotchas` for a more detailed discussion. +See :ref:`FAQ` for a more detailed discussion. .. _basics.equals: @@ -1889,7 +1889,7 @@ gotchas Performing selection operations on ``integer`` type data can easily upcast the data to ``floating``. The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0) -See also :ref:`integer na gotchas ` +See also :ref:`Support for integer ``NA`` ` .. ipython:: python diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index 087b265ee83f2..5a7d6a11d293d 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -93,12 +93,13 @@ targets the IPython Notebook environment. `Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. 
Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud `__, `offline `__, or `on-premise `__ accounts for private use. -`Pandas-Qt `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Visualizing Data in Qt applications +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Spun off from the main pandas library, the `Pandas-Qt `__ +Spun off from the main pandas library, the `qtpandas `__ library enables DataFrame visualization and manipulation in PyQt4 and PySide applications. + .. _ecosystem.ide: IDE diff --git a/doc/source/faq.rst b/doc/source/faq.rst deleted file mode 100644 index b96660be97d71..0000000000000 --- a/doc/source/faq.rst +++ /dev/null @@ -1,115 +0,0 @@ -.. currentmodule:: pandas -.. _faq: - -******************************** -Frequently Asked Questions (FAQ) -******************************** - -.. ipython:: python - :suppress: - - import numpy as np - np.random.seed(123456) - np.set_printoptions(precision=4, suppress=True) - import pandas as pd - pd.options.display.max_rows = 15 - import matplotlib - matplotlib.style.use('ggplot') - import matplotlib.pyplot as plt - plt.close('all') - -.. _df-memory-usage: - -DataFrame memory usage ----------------------- -As of pandas version 0.15.0, the memory usage of a dataframe (including -the index) is shown when accessing the ``info`` method of a dataframe. A -configuration option, ``display.memory_usage`` (see :ref:`options`), -specifies if the dataframe's memory usage will be displayed when -invoking the ``df.info()`` method. - -For example, the memory usage of the dataframe below is shown -when calling ``df.info()``: - -.. ipython:: python - - dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', - 'complex128', 'object', 'bool'] - n = 5000 - data = dict([ (t, np.random.randint(100, size=n).astype(t)) - for t in dtypes]) - df = pd.DataFrame(data) - df['categorical'] = df['object'].astype('category') - - df.info() - -The ``+`` symbol indicates that the true memory usage could be higher, because -pandas does not count the memory used by values in columns with -``dtype=object``. - -.. versionadded:: 0.17.1 - -Passing ``memory_usage='deep'`` will enable a more accurate memory usage report, -that accounts for the full usage of the contained objects. This is optional -as it can be expensive to do this deeper introspection. - -.. ipython:: python - - df.info(memory_usage='deep') - -By default the display option is set to ``True`` but can be explicitly -overridden by passing the ``memory_usage`` argument when invoking ``df.info()``. - -The memory usage of each column can be found by calling the ``memory_usage`` -method. This returns a Series with an index represented by column names -and memory usage of each column shown in bytes. For the dataframe above, -the memory usage of each column and the total memory usage of the -dataframe can be found with the memory_usage method: - -.. ipython:: python - - df.memory_usage() - - # total memory usage of dataframe - df.memory_usage().sum() - -By default the memory usage of the dataframe's index is shown in the -returned Series, the memory usage of the index can be suppressed by passing -the ``index=False`` argument: - -.. 
ipython:: python - - df.memory_usage(index=False) - -The memory usage displayed by the ``info`` method utilizes the -``memory_usage`` method to determine the memory usage of a dataframe -while also formatting the output in human-readable units (base-2 -representation; i.e., 1KB = 1024 bytes). - -See also :ref:`Categorical Memory Usage `. - -Byte-Ordering Issues --------------------- -Occasionally you may have to deal with data that were created on a machine with -a different byte order than the one on which you are running Python. To deal -with this issue you should convert the underlying NumPy array to the native -system byte order *before* passing it to Series/DataFrame/Panel constructors -using something similar to the following: - -.. ipython:: python - - x = np.array(list(range(10)), '>i4') # big endian - newx = x.byteswap().newbyteorder() # force native byteorder - s = pd.Series(newx) - -See `the NumPy documentation on byte order -`__ for more -details. - - -Visualizing Data in Qt applications ------------------------------------ - -There is no support for such visualization in pandas. However, the external -package `qtpandas `_ does -provide this functionality. diff --git a/doc/source/gotchas.rst b/doc/source/gotcha.rst similarity index 67% rename from doc/source/gotchas.rst rename to doc/source/gotcha.rst index e582d4f42894b..63b92afe2be78 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotcha.rst @@ -1,20 +1,94 @@ .. currentmodule:: pandas -.. _gotchas: +.. _faq: + +******************************** +Frequently Asked Questions (FAQ) +******************************** .. ipython:: python :suppress: import numpy as np + np.random.seed(123456) np.set_printoptions(precision=4, suppress=True) import pandas as pd - pd.options.display.max_rows=15 + pd.options.display.max_rows = 15 + import matplotlib + matplotlib.style.use('ggplot') + import matplotlib.pyplot as plt + plt.close('all') + +.. _df-memory-usage: + +DataFrame memory usage +---------------------- +As of pandas version 0.15.0, the memory usage of a dataframe (including +the index) is shown when accessing the ``info`` method of a dataframe. A +configuration option, ``display.memory_usage`` (see :ref:`options`), +specifies if the dataframe's memory usage will be displayed when +invoking the ``df.info()`` method. + +For example, the memory usage of the dataframe below is shown +when calling ``df.info()``: + +.. ipython:: python + + dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]', + 'complex128', 'object', 'bool'] + n = 5000 + data = dict([ (t, np.random.randint(100, size=n).astype(t)) + for t in dtypes]) + df = pd.DataFrame(data) + df['categorical'] = df['object'].astype('category') + + df.info() + +The ``+`` symbol indicates that the true memory usage could be higher, because +pandas does not count the memory used by values in columns with +``dtype=object``. + +.. versionadded:: 0.17.1 + +Passing ``memory_usage='deep'`` will enable a more accurate memory usage report, +that accounts for the full usage of the contained objects. This is optional +as it can be expensive to do this deeper introspection. + +.. ipython:: python + + df.info(memory_usage='deep') + +By default the display option is set to ``True`` but can be explicitly +overridden by passing the ``memory_usage`` argument when invoking ``df.info()``. + +The memory usage of each column can be found by calling the ``memory_usage`` +method. This returns a Series with an index represented by column names +and memory usage of each column shown in bytes. 
For the dataframe above, +the memory usage of each column and the total memory usage of the +dataframe can be found with the memory_usage method: + +.. ipython:: python + + df.memory_usage() + + # total memory usage of dataframe + df.memory_usage().sum() + +By default the memory usage of the dataframe's index is shown in the +returned Series, the memory usage of the index can be suppressed by passing +the ``index=False`` argument: + +.. ipython:: python + + df.memory_usage(index=False) +The memory usage displayed by the ``info`` method utilizes the +``memory_usage`` method to determine the memory usage of a dataframe +while also formatting the output in human-readable units (base-2 +representation; i.e., 1KB = 1024 bytes). -******************* -Caveats and Gotchas -******************* +See also :ref:`Categorical Memory Usage `. -.. _gotchas.truth: +.. _faq.truth: Using If/Truth Statements with pandas ------------------------------------- @@ -134,7 +208,7 @@ detect NA values. However, it comes with it a couple of trade-offs which I most certainly have not ignored. -.. _gotchas.intna: +.. _faq.intna: Support for integer ``NA`` ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -214,175 +288,6 @@ and traded integer ``NA`` capability for a much simpler approach of using a special value in float and object arrays to denote ``NA``, and promoting integer arrays to floating when NAs must be introduced. -Integer indexing ----------------- - -Label-based indexing with integer axis labels is a thorny topic. It has been -discussed heavily on mailing lists and among various members of the scientific -Python community. In pandas, our general viewpoint is that labels matter more -than integer locations. Therefore, with an integer axis index *only* -label-based indexing is possible with the standard tools like ``.loc``. The -following code will generate exceptions: - -.. code-block:: python - - s = pd.Series(range(5)) - s[-1] - df = pd.DataFrame(np.random.randn(5, 4)) - df - df.loc[-2:] - -This deliberate decision was made to prevent ambiguities and subtle bugs (many -users reported finding bugs when the API change was made to stop "falling back" -on position-based indexing). - -Label-based slicing conventions -------------------------------- - -Non-monotonic indexes require exact matches -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If the index of a ``Series`` or ``DataFrame`` is monotonically increasing or decreasing, then the bounds -of a label-based slice can be outside the range of the index, much like slice indexing a -normal Python ``list``. Monotonicity of an index can be tested with the ``is_monotonic_increasing`` and -``is_monotonic_decreasing`` attributes. - -.. ipython:: python - - df = pd.DataFrame(index=[2,3,3,4,5], columns=['data'], data=list(range(5))) - df.index.is_monotonic_increasing - - # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4: - df.loc[0:4, :] - - # slice is are outside the index, so empty DataFrame is returned - df.loc[13:15, :] - -On the other hand, if the index is not monotonic, then both slice bounds must be -*unique* members of the index. - -.. ipython:: python - - df = pd.DataFrame(index=[2,3,1,4,3,5], columns=['data'], data=list(range(6))) - df.index.is_monotonic_increasing - - # OK because 2 and 4 are in the index - df.loc[2:4, :] - -.. 
code-block:: python - - # 0 is not in the index - In [9]: df.loc[0:4, :] - KeyError: 0 - - # 3 is not a unique label - In [11]: df.loc[2:3, :] - KeyError: 'Cannot get right slice bound for non-unique label: 3' - - -Endpoints are inclusive -~~~~~~~~~~~~~~~~~~~~~~~ - -Compared with standard Python sequence slicing in which the slice endpoint is -not inclusive, label-based slicing in pandas **is inclusive**. The primary -reason for this is that it is often not possible to easily determine the -"successor" or next element after a particular label in an index. For example, -consider the following Series: - -.. ipython:: python - - s = pd.Series(np.random.randn(6), index=list('abcdef')) - s - -Suppose we wished to slice from ``c`` to ``e``, using integers this would be - -.. ipython:: python - - s[2:5] - -However, if you only had ``c`` and ``e``, determining the next element in the -index can be somewhat complicated. For example, the following does not work: - -:: - - s.loc['c':'e'+1] - -A very common use case is to limit a time series to start and end at two -specific dates. To enable this, we made the design design to make label-based -slicing include both endpoints: - -.. ipython:: python - - s.loc['c':'e'] - -This is most definitely a "practicality beats purity" sort of thing, but it is -something to watch out for if you expect label-based slicing to behave exactly -in the way that standard Python integer slicing works. - -Miscellaneous indexing gotchas ------------------------------- - -Reindex potentially changes underlying Series dtype -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The use of ``reindex_like`` can potentially change the dtype of a ``Series``. - -.. ipython:: python - - series = pd.Series([1, 2, 3]) - x = pd.Series([True]) - x.dtype - x = pd.Series([True]).reindex_like(series) - x.dtype - -This is because ``reindex_like`` silently inserts ``NaNs`` and the ``dtype`` -changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` -such as ``numpy.logical_and``. - -See the `this old issue `__ for a more -detailed discussion. - -Parsing Dates from Text Files ------------------------------ - -When parsing multiple text file columns into a single date column, the new date -column is prepended to the data and then `index_col` specification is indexed off -of the new set of columns rather than the original ones: - -.. ipython:: python - :suppress: - - data = ("KORD,19990127, 19:00:00, 18:56:00, 0.8100\n" - "KORD,19990127, 20:00:00, 19:56:00, 0.0100\n" - "KORD,19990127, 21:00:00, 20:56:00, -0.5900\n" - "KORD,19990127, 21:00:00, 21:18:00, -0.9900\n" - "KORD,19990127, 22:00:00, 21:56:00, -0.5900\n" - "KORD,19990127, 23:00:00, 22:56:00, -0.5900") - - with open('tmp.csv', 'w') as fh: - fh.write(data) - -.. ipython:: python - - print(open('tmp.csv').read()) - - date_spec = {'nominal': [1, 2], 'actual': [1, 3]} - df = pd.read_csv('tmp.csv', header=None, - parse_dates=date_spec, - keep_date_col=True, - index_col=0) - - # index_col=0 refers to the combined column "nominal" and not the original - # first column of 'KORD' strings - - df - -.. ipython:: python - :suppress: - - import os - os.remove('tmp.csv') - Differences with NumPy ---------------------- @@ -403,7 +308,7 @@ where the data copying occurs. See `this link `__ for more information. -.. _html-gotchas: +.. 
_faq.html: HTML Table Parsing ------------------ diff --git a/doc/source/install.rst b/doc/source/install.rst index 4787b3356ee9f..a5bf86ee9c555 100644 --- a/doc/source/install.rst +++ b/doc/source/install.rst @@ -285,7 +285,7 @@ Optional Dependencies okay.) * `BeautifulSoup4`_ and `lxml`_ * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ - * Only `lxml`_, although see :ref:`HTML reading gotchas ` + * Only `lxml`_, although see :ref:`HTML Table Parsing ` for reasons as to why you should probably **not** take this approach. .. warning:: @@ -294,9 +294,11 @@ Optional Dependencies `lxml`_ or `html5lib`_ or both. :func:`~pandas.read_html` will **not** work with *only* `BeautifulSoup4`_ installed. - * You are highly encouraged to read :ref:`HTML reading gotchas - `. It explains issues surrounding the installation and - usage of the above three libraries + * You are highly encouraged to read :ref:`HTML Table Parsing `. + It explains issues surrounding the installation and + usage of the above three libraries. + Additionally, if you're using `Anaconda`_ you should definitely + read. * You may need to install an older version of `BeautifulSoup4`_: Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64 and 32-bit Ubuntu/Debian diff --git a/doc/source/io.rst b/doc/source/io.rst index d4bbb3d114b52..39e00d4bd44e9 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -2043,9 +2043,8 @@ Reading HTML Content .. warning:: - We **highly encourage** you to read the :ref:`HTML parsing gotchas - ` regarding the issues surrounding the - BeautifulSoup4/html5lib/lxml parsers. + We **highly encourage** you to read the :ref:`HTML Table Parsing` + regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. .. versionadded:: 0.12.0 diff --git a/doc/source/whatsnew/v0.13.0.txt b/doc/source/whatsnew/v0.13.0.txt index 118632cc2c0ee..4a0589d80dd45 100644 --- a/doc/source/whatsnew/v0.13.0.txt +++ b/doc/source/whatsnew/v0.13.0.txt @@ -111,7 +111,7 @@ API changes - Infer and downcast dtype if ``downcast='infer'`` is passed to ``fillna/ffill/bfill`` (:issue:`4604`) - ``__nonzero__`` for all NDFrame objects, will now raise a ``ValueError``, this reverts back to (:issue:`1073`, :issue:`4633`) - behavior. See :ref:`gotchas` for a more detailed discussion. + behavior. See :ref:`FAQ` for a more detailed discussion. This prevents doing boolean comparison on *entire* pandas objects, which is inherently ambiguous. These all will raise a ``ValueError``. diff --git a/doc/source/whatsnew/v0.15.0.txt b/doc/source/whatsnew/v0.15.0.txt index aff8ec9092cdc..36e8f8da49d4e 100644 --- a/doc/source/whatsnew/v0.15.0.txt +++ b/doc/source/whatsnew/v0.15.0.txt @@ -864,7 +864,7 @@ a transparent change with only very limited API implications (:issue:`5080`, :is - you may need to unpickle pandas version < 0.15.0 pickles using ``pd.read_pickle`` rather than ``pickle.load``. See :ref:`pickle docs ` - when plotting with a ``PeriodIndex``, the matplotlib internal axes will now be arrays of ``Period`` rather than a ``PeriodIndex`` (this is similar to how a ``DatetimeIndex`` passes arrays of ``datetimes`` now) -- MultiIndexes will now raise similary to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`). +- MultiIndexes will now raise similary to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`). 
- When plotting a DatetimeIndex directly with matplotlib's `plot` function, the axis labels will no longer be formatted as dates but as integers (the internal representation of a ``datetime64``). **UPDATE** This is fixed From ab7fdf0e536cb5760fb5f3e257b5677d18a5e97f Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 25 Jan 2017 11:19:23 +0100 Subject: [PATCH 2/6] restore file name --- doc/source/{gotcha.rst => gotchas.rst} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename doc/source/{gotcha.rst => gotchas.rst} (100%) diff --git a/doc/source/gotcha.rst b/doc/source/gotchas.rst similarity index 100% rename from doc/source/gotcha.rst rename to doc/source/gotchas.rst From c9e41cc5cc30b55a276c6529a9d7acd8efca4587 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 25 Jan 2017 11:22:44 +0100 Subject: [PATCH 3/6] Redo updates after ix deprecation --- doc/source/advanced.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst index 44ba3bab7ca3f..2b33a78a490cd 100644 --- a/doc/source/advanced.rst +++ b/doc/source/advanced.rst @@ -848,14 +848,14 @@ Of course if you need integer based selection, then use ``iloc`` Miscellaneous indexing FAQ -------------------------- -Integer indexing with ix -~~~~~~~~~~~~~~~~~~~~~~~~ +Integer indexing +~~~~~~~~~~~~~~~~ Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index *only* -label-based indexing is possible with the standard tools like ``.ix``. The +label-based indexing is possible with the standard tools like ``.loc``. The following code will generate exceptions: .. code-block:: python @@ -864,7 +864,7 @@ following code will generate exceptions: s[-1] df = pd.DataFrame(np.random.randn(5, 4)) df - df.ix[-2:] + df.loc[-2:] This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" From 7185dd4c57b397b6f9b2b8cd73b4c25cf1194831 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 25 Jan 2017 11:32:31 +0100 Subject: [PATCH 4/6] Keep original gotchas label for references --- doc/source/10min.rst | 2 +- doc/source/basics.rst | 4 ++-- doc/source/gotchas.rst | 8 ++++---- doc/source/install.rst | 7 +------ doc/source/io.rst | 2 +- doc/source/whatsnew/v0.13.0.txt | 2 +- doc/source/whatsnew/v0.15.0.txt | 2 +- 7 files changed, 11 insertions(+), 16 deletions(-) diff --git a/doc/source/10min.rst b/doc/source/10min.rst index dc334b988e911..0612e86134cf2 100644 --- a/doc/source/10min.rst +++ b/doc/source/10min.rst @@ -810,4 +810,4 @@ If you are trying an operation and you see an exception like: See :ref:`Comparisons` for an explanation and what to do. -See :ref:`FAQ` as well. +See :ref:`Gotchas` as well. diff --git a/doc/source/basics.rst b/doc/source/basics.rst index c86b2dea593c9..f5f7c73223595 100644 --- a/doc/source/basics.rst +++ b/doc/source/basics.rst @@ -313,7 +313,7 @@ To evaluate single-element pandas objects in a boolean context, use the method ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all(). -See :ref:`FAQ` for a more detailed discussion. +See :ref:`gotchas` for a more detailed discussion. .. 
_basics.equals: @@ -1889,7 +1889,7 @@ gotchas Performing selection operations on ``integer`` type data can easily upcast the data to ``floating``. The dtype of the input data will be preserved in cases where ``nans`` are not introduced (starting in 0.11.0) -See also :ref:`Support for integer ``NA`` ` +See also :ref:`Support for integer ``NA`` ` .. ipython:: python diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index 63b92afe2be78..10b08a81fec75 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -1,5 +1,5 @@ .. currentmodule:: pandas -.. _faq: +.. _gotchas: ******************************** Frequently Asked Questions (FAQ) @@ -88,7 +88,7 @@ representation; i.e., 1KB = 1024 bytes). See also :ref:`Categorical Memory Usage `. -.. _faq.truth: +.. _gotchas.truth: Using If/Truth Statements with pandas ------------------------------------- @@ -208,7 +208,7 @@ detect NA values. However, it comes with it a couple of trade-offs which I most certainly have not ignored. -.. _faq.intna: +.. _gotchas.intna: Support for integer ``NA`` ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -308,7 +308,7 @@ where the data copying occurs. See `this link `__ for more information. -.. _faq.html: +.. _gotchas.html: HTML Table Parsing ------------------ diff --git a/doc/source/install.rst b/doc/source/install.rst index a5bf86ee9c555..711a682dff442 100644 --- a/doc/source/install.rst +++ b/doc/source/install.rst @@ -294,16 +294,12 @@ Optional Dependencies `lxml`_ or `html5lib`_ or both. :func:`~pandas.read_html` will **not** work with *only* `BeautifulSoup4`_ installed. - * You are highly encouraged to read :ref:`HTML Table Parsing `. + * You are highly encouraged to read :ref:`HTML Table Parsing gotchas `. It explains issues surrounding the installation and usage of the above three libraries. - Additionally, if you're using `Anaconda`_ you should definitely - read. * You may need to install an older version of `BeautifulSoup4`_: Versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64 and 32-bit Ubuntu/Debian - * Additionally, if you're using `Anaconda`_ you should definitely - read :ref:`the gotchas about HTML parsing libraries ` .. note:: @@ -320,7 +316,6 @@ Optional Dependencies .. _html5lib: https://github.com/html5lib/html5lib-python .. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup .. _lxml: http://lxml.de -.. _Anaconda: https://store.continuum.io/cshop/anaconda .. note:: diff --git a/doc/source/io.rst b/doc/source/io.rst index 39e00d4bd44e9..8e59534c391bb 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -2043,7 +2043,7 @@ Reading HTML Content .. warning:: - We **highly encourage** you to read the :ref:`HTML Table Parsing` + We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas` regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. .. versionadded:: 0.12.0 diff --git a/doc/source/whatsnew/v0.13.0.txt b/doc/source/whatsnew/v0.13.0.txt index 4a0589d80dd45..118632cc2c0ee 100644 --- a/doc/source/whatsnew/v0.13.0.txt +++ b/doc/source/whatsnew/v0.13.0.txt @@ -111,7 +111,7 @@ API changes - Infer and downcast dtype if ``downcast='infer'`` is passed to ``fillna/ffill/bfill`` (:issue:`4604`) - ``__nonzero__`` for all NDFrame objects, will now raise a ``ValueError``, this reverts back to (:issue:`1073`, :issue:`4633`) - behavior. See :ref:`FAQ` for a more detailed discussion. + behavior. See :ref:`gotchas` for a more detailed discussion. This prevents doing boolean comparison on *entire* pandas objects, which is inherently ambiguous. 
These all will raise a ``ValueError``. diff --git a/doc/source/whatsnew/v0.15.0.txt b/doc/source/whatsnew/v0.15.0.txt index 36e8f8da49d4e..aff8ec9092cdc 100644 --- a/doc/source/whatsnew/v0.15.0.txt +++ b/doc/source/whatsnew/v0.15.0.txt @@ -864,7 +864,7 @@ a transparent change with only very limited API implications (:issue:`5080`, :is - you may need to unpickle pandas version < 0.15.0 pickles using ``pd.read_pickle`` rather than ``pickle.load``. See :ref:`pickle docs ` - when plotting with a ``PeriodIndex``, the matplotlib internal axes will now be arrays of ``Period`` rather than a ``PeriodIndex`` (this is similar to how a ``DatetimeIndex`` passes arrays of ``datetimes`` now) -- MultiIndexes will now raise similary to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`). +- MultiIndexes will now raise similary to other pandas objects w.r.t. truth testing, see :ref:`here ` (:issue:`7897`). - When plotting a DatetimeIndex directly with matplotlib's `plot` function, the axis labels will no longer be formatted as dates but as integers (the internal representation of a ``datetime64``). **UPDATE** This is fixed From 53a69707717dc37221c95dc213e9ed254ebf7965 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 25 Jan 2017 11:41:31 +0100 Subject: [PATCH 5/6] Move HTML libraries gotchas to html io docs --- doc/source/gotchas.rst | 72 ------------------------------------- doc/source/install.rst | 4 +-- doc/source/io.rst | 81 ++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 81 insertions(+), 76 deletions(-) diff --git a/doc/source/gotchas.rst b/doc/source/gotchas.rst index 10b08a81fec75..11827fe2776cf 100644 --- a/doc/source/gotchas.rst +++ b/doc/source/gotchas.rst @@ -308,78 +308,6 @@ where the data copying occurs. See `this link `__ for more information. -.. _gotchas.html: - -HTML Table Parsing ------------------- -There are some versioning issues surrounding the libraries that are used to -parse HTML tables in the top-level pandas io function ``read_html``. - -**Issues with** |lxml|_ - - * Benefits - - * |lxml|_ is very fast - - * |lxml|_ requires Cython to install correctly. - - * Drawbacks - - * |lxml|_ does *not* make any guarantees about the results of its parse - *unless* it is given |svm|_. - - * In light of the above, we have chosen to allow you, the user, to use the - |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ - fails to parse - - * It is therefore *highly recommended* that you install both - |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid - result (provided everything else is valid) even if |lxml|_ fails. - -**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** - - * The above issues hold here as well since |BeautifulSoup4|_ is essentially - just a wrapper around a parser backend. - -**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** - - * Benefits - - * |html5lib|_ is far more lenient than |lxml|_ and consequently deals - with *real-life markup* in a much saner way rather than just, e.g., - dropping an element without notifying you. - - * |html5lib|_ *generates valid HTML5 markup from invalid markup - automatically*. This is extremely important for parsing HTML tables, - since it guarantees a valid document. However, that does NOT mean that - it is "correct", since the process of fixing markup does not have a - single definition. - - * |html5lib|_ is pure Python and requires no additional build steps beyond - its own installation. 
- - * Drawbacks - - * The biggest drawback to using |html5lib|_ is that it is slow as - molasses. However consider the fact that many tables on the web are not - big enough for the parsing algorithm runtime to matter. It is more - likely that the bottleneck will be in the process of reading the raw - text from the URL over the web, i.e., IO (input-output). For very large - tables, this might not be true. - - -.. |svm| replace:: **strictly valid markup** -.. _svm: http://validator.w3.org/docs/help.html#validation_basics - -.. |html5lib| replace:: **html5lib** -.. _html5lib: https://github.com/html5lib/html5lib-python - -.. |BeautifulSoup4| replace:: **BeautifulSoup4** -.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup - -.. |lxml| replace:: **lxml** -.. _lxml: http://lxml.de - Byte-Ordering Issues -------------------- diff --git a/doc/source/install.rst b/doc/source/install.rst index 711a682dff442..158a6e5562b7a 100644 --- a/doc/source/install.rst +++ b/doc/source/install.rst @@ -285,7 +285,7 @@ Optional Dependencies okay.) * `BeautifulSoup4`_ and `lxml`_ * `BeautifulSoup4`_ and `html5lib`_ and `lxml`_ - * Only `lxml`_, although see :ref:`HTML Table Parsing ` + * Only `lxml`_, although see :ref:`HTML Table Parsing ` for reasons as to why you should probably **not** take this approach. .. warning:: @@ -294,7 +294,7 @@ Optional Dependencies `lxml`_ or `html5lib`_ or both. :func:`~pandas.read_html` will **not** work with *only* `BeautifulSoup4`_ installed. - * You are highly encouraged to read :ref:`HTML Table Parsing gotchas `. + * You are highly encouraged to read :ref:`HTML Table Parsing gotchas `. It explains issues surrounding the installation and usage of the above three libraries. * You may need to install an older version of `BeautifulSoup4`_: diff --git a/doc/source/io.rst b/doc/source/io.rst index 8e59534c391bb..4c78758a0e2d2 100644 --- a/doc/source/io.rst +++ b/doc/source/io.rst @@ -2043,8 +2043,8 @@ Reading HTML Content .. warning:: - We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas` - regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. + We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas` + below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers. .. versionadded:: 0.12.0 @@ -2346,6 +2346,83 @@ Not escaped: Some browsers may not show a difference in the rendering of the previous two HTML tables. + +.. _io.html.gotchas: + +HTML Table Parsing Gotchas +'''''''''''''''''''''''''' + +There are some versioning issues surrounding the libraries that are used to +parse HTML tables in the top-level pandas io function ``read_html``. + +**Issues with** |lxml|_ + + * Benefits + + * |lxml|_ is very fast + + * |lxml|_ requires Cython to install correctly. + + * Drawbacks + + * |lxml|_ does *not* make any guarantees about the results of its parse + *unless* it is given |svm|_. + + * In light of the above, we have chosen to allow you, the user, to use the + |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_ + fails to parse + + * It is therefore *highly recommended* that you install both + |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid + result (provided everything else is valid) even if |lxml|_ fails. + +**Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend** + + * The above issues hold here as well since |BeautifulSoup4|_ is essentially + just a wrapper around a parser backend. 
+ +**Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend** + + * Benefits + + * |html5lib|_ is far more lenient than |lxml|_ and consequently deals + with *real-life markup* in a much saner way rather than just, e.g., + dropping an element without notifying you. + + * |html5lib|_ *generates valid HTML5 markup from invalid markup + automatically*. This is extremely important for parsing HTML tables, + since it guarantees a valid document. However, that does NOT mean that + it is "correct", since the process of fixing markup does not have a + single definition. + + * |html5lib|_ is pure Python and requires no additional build steps beyond + its own installation. + + * Drawbacks + + * The biggest drawback to using |html5lib|_ is that it is slow as + molasses. However consider the fact that many tables on the web are not + big enough for the parsing algorithm runtime to matter. It is more + likely that the bottleneck will be in the process of reading the raw + text from the URL over the web, i.e., IO (input-output). For very large + tables, this might not be true. + + +.. |svm| replace:: **strictly valid markup** +.. _svm: http://validator.w3.org/docs/help.html#validation_basics + +.. |html5lib| replace:: **html5lib** +.. _html5lib: https://github.com/html5lib/html5lib-python + +.. |BeautifulSoup4| replace:: **BeautifulSoup4** +.. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup + +.. |lxml| replace:: **lxml** +.. _lxml: http://lxml.de + + + + .. _io.excel: Excel files From 7abb65bcf56d5924feceb6a91cc9d77cd9408074 Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Wed, 25 Jan 2017 11:46:24 +0100 Subject: [PATCH 6/6] Make 'indexing may change dtype' more general --- doc/source/advanced.rst | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst index 2b33a78a490cd..21ae9f1eb8409 100644 --- a/doc/source/advanced.rst +++ b/doc/source/advanced.rst @@ -954,17 +954,24 @@ in the way that standard Python integer slicing works. Indexing potentially changes underlying Series dtype ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The use of ``reindex_like`` can potentially change the dtype of a ``Series``. +The different indexing operation can potentially change the dtype of a ``Series``. .. ipython:: python - series = pd.Series([1, 2, 3]) - x = pd.Series([True]) - x.dtype - x = pd.Series([True]).reindex_like(series) - x.dtype + series1 = pd.Series([1, 2, 3]) + series1.dtype + res = series1[[0,4]] + res.dtype + res -This is because ``reindex_like`` silently inserts ``NaNs`` and the ``dtype`` +.. ipython:: python + series2 = pd.Series([True]) + series2.dtype + res = series2.reindex_like(series1) + res.dtype + res + +This is because the (re)indexing operations above silently inserts ``NaNs`` and the ``dtype`` changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` such as ``numpy.logical_and``.
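
For reference, a minimal standalone sketch of the dtype-change behaviour that the final hunk above documents, assuming a reasonably recent pandas. It uses ``reindex``/``reindex_like`` rather than list-indexing with a missing label, so the ``NaN`` insertion is explicit; the variable names mirror the example in the hunk and are otherwise illustrative only.

.. code-block:: python

    import pandas as pd

    series1 = pd.Series([1, 2, 3])
    series1.dtype                        # int64

    # Label 4 is not in the index, so reindexing inserts NaN and the
    # Series is silently upcast from int64 to float64 to hold it.
    res = series1.reindex([0, 4])
    res.dtype                            # float64

    # The same applies to a boolean Series: reindex_like inserts NaN for
    # the missing labels, so the result is no longer of bool dtype, which
    # can give surprising results with numpy ufuncs such as np.logical_and.
    series2 = pd.Series([True])
    res2 = series2.reindex_like(series1)
    res2.dtype                           # object

The upcast happens because ``NaN`` is a float (and forces ``object`` dtype for boolean data), so pandas has to change the dtype of the whole ``Series`` to accommodate the inserted missing values.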