From 80b362775366259b353fe67281cc61e7b3ec64f6 Mon Sep 17 00:00:00 2001 From: Simon Hawkins Date: Wed, 15 Jun 2022 20:16:03 +0100 Subject: [PATCH 1/2] DOC: move enhancements in 1.5 release notes --- doc/source/whatsnew/v1.5.0.rst | 158 +++++++++++++++++---------------- 1 file changed, 82 insertions(+), 76 deletions(-) diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst index 681139fb51272..419e2d5323e44 100644 --- a/doc/source/whatsnew/v1.5.0.rst +++ b/doc/source/whatsnew/v1.5.0.rst @@ -147,6 +147,85 @@ If the compression method cannot be inferred, use the ``compression`` argument: (``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open) +.. _whatsnew_150.enhancements.read_xml_dtypes: + +read_xml now supports ``dtype``, ``converters``, and ``parse_dates`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, +apply converter methods, and parse dates (:issue:`43567`). + +.. ipython:: python + + xml_dates = """ + + + square + 00360 + 4.0 + 2020-01-01 + + + circle + 00360 + + 2021-01-01 + + + triangle + 00180 + 3.0 + 2022-01-01 + + """ + + df = pd.read_xml( + xml_dates, + dtype={'sides': 'Int64'}, + converters={'degrees': str}, + parse_dates=['date'] + ) + df + df.dtypes + + +.. _whatsnew_150.enhancements.read_xml_iterparse: + +read_xml now supports large XML using ``iterparse`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` +now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ +which are memory-efficient methods to iterate through XML trees and extract specific elements +and attributes without holding entire tree in memory (:issue:`#45442`). + +.. code-block:: ipython + + In [1]: df = pd.read_xml( + ... 
"/path/to/downloaded/enwikisource-latest-pages-articles.xml", + ... iterparse = {"page": ["title", "ns", "id"]}) + ... ) + df + Out[2]: + title ns id + 0 Gettysburg Address 0 21450 + 1 Main Page 0 42950 + 2 Declaration by United Nations 0 8435 + 3 Constitution of the United States of America 0 8435 + 4 Declaration of Independence (Israel) 0 17858 + ... ... ... ... + 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 + 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 + 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 + 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 + 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 + + [3578765 rows x 3 columns] + + +.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk +.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse + .. _whatsnew_150.enhancements.other: Other enhancements @@ -293,83 +372,10 @@ upon serialization. (Related issue :issue:`12997`) Backwards incompatible API changes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. _whatsnew_150.api_breaking.read_xml_dtypes: - -read_xml now supports ``dtype``, ``converters``, and ``parse_dates`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, -apply converter methods, and parse dates (:issue:`43567`). - -.. ipython:: python - - xml_dates = """ - - - square - 00360 - 4.0 - 2020-01-01 - - - circle - 00360 - - 2021-01-01 - - - triangle - 00180 - 3.0 - 2022-01-01 - - """ - - df = pd.read_xml( - xml_dates, - dtype={'sides': 'Int64'}, - converters={'degrees': str}, - parse_dates=['date'] - ) - df - df.dtypes - -.. 
_whatsnew_150.read_xml_iterparse: - -read_xml now supports large XML using ``iterparse`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml` -now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_ -which are memory-efficient methods to iterate through XML trees and extract specific elements -and attributes without holding entire tree in memory (:issue:`#45442`). - -.. code-block:: ipython - - In [1]: df = pd.read_xml( - ... "/path/to/downloaded/enwikisource-latest-pages-articles.xml", - ... iterparse = {"page": ["title", "ns", "id"]}) - ... ) - df - Out[2]: - title ns id - 0 Gettysburg Address 0 21450 - 1 Main Page 0 42950 - 2 Declaration by United Nations 0 8435 - 3 Constitution of the United States of America 0 8435 - 4 Declaration of Independence (Israel) 0 17858 - ... ... ... ... - 3578760 Page:Black cat 1897 07 v2 n10.pdf/17 104 219649 - 3578761 Page:Black cat 1897 07 v2 n10.pdf/43 104 219649 - 3578762 Page:Black cat 1897 07 v2 n10.pdf/44 104 219649 - 3578763 The History of Tom Jones, a Foundling/Book IX 0 12084291 - 3578764 Page:Shakespeare of Stratford (1926) Yale.djvu/91 104 21450 - - [3578765 rows x 3 columns] +.. _whatsnew_150.api_breaking.api_breaking1: - -.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk -.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse +api_breaking_change1 +^^^^^^^^^^^^^^^^^^^^ .. 
_whatsnew_150.api_breaking.api_breaking2:

From 1f8b4404dc4679eba43a61107c9f39133a2221eb Mon Sep 17 00:00:00 2001
From: Simon Hawkins
Date: Wed, 15 Jun 2022 22:01:26 +0100
Subject: [PATCH 2/2] remove rogue hash

---
 doc/source/whatsnew/v1.5.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst
index 419e2d5323e44..bb7c7a6ed8be5 100644
--- a/doc/source/whatsnew/v1.5.0.rst
+++ b/doc/source/whatsnew/v1.5.0.rst
@@ -197,7 +197,7 @@ read_xml now supports large XML using ``iterparse``
 For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
 now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
 which are memory-efficient methods to iterate through XML trees and extract specific elements
-and attributes without holding entire tree in memory (:issue:`#45442`).
+and attributes without holding entire tree in memory (:issue:`45442`).
 
 .. code-block:: ipython
 
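
Note on the ``iterparse`` entry moved by this series: the release note says the XML is read without holding the entire tree in memory. The sketch below illustrates that general idea using only the standard library's ``xml.etree.ElementTree.iterparse``. It is not pandas' implementation; the helper name ``iterparse_rows`` and the tiny in-memory document are made up for illustration, and the sample has no XML namespaces, which a real dump would likely need handled.

```python
import io
import xml.etree.ElementTree as ET


def iterparse_rows(source, row_tag, fields):
    """Yield one dict per <row_tag> element, reading the XML incrementally.

    Hypothetical helper for illustration only; it mimics the shape of the
    mapping {"page": ["title", "ns", "id"]} shown in the release note.
    """
    row = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == row_tag:
            # A new row element begins: start collecting its fields.
            row = dict.fromkeys(fields)
        elif event == "end" and row is not None:
            if elem.tag == row_tag:
                yield row
                row = None
                # Drop the finished element's children and text so they can
                # be garbage collected instead of accumulating.
                elem.clear()
            elif elem.tag in fields and row[elem.tag] is None:
                # Keep the first value seen for each field, so a nested
                # element with the same tag name does not overwrite it.
                row[elem.tag] = elem.text


# Made-up fragment; titles and ids are taken from the output in the note.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Gettysburg Address</title><ns>0</ns><id>21450</id>"
    b"<revision><id>1</id></revision></page>"
    b"<page><title>Main Page</title><ns>0</ns><id>42950</id></page>"
    b"</mediawiki>"
)

rows = list(iterparse_rows(sample, "page", ["title", "ns", "id"]))
print(rows)
```

Each row is emitted as soon as its closing tag is seen, so memory use is driven by the size of one row rather than the whole file. The emptied row elements still stay attached to the root in this sketch; a production reader would also detach them.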