diff --git a/doc/source/whatsnew/v1.5.0.rst b/doc/source/whatsnew/v1.5.0.rst
index 76f6e864a174f..6cd44b39a18ad 100644
--- a/doc/source/whatsnew/v1.5.0.rst
+++ b/doc/source/whatsnew/v1.5.0.rst
@@ -147,6 +147,85 @@ If the compression method cannot be inferred, use the ``compression`` argument:
 
 (``mode`` being one of ``tarfile.open``'s modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)
 
+.. _whatsnew_150.enhancements.read_xml_dtypes:
+
+read_xml now supports ``dtype``, ``converters``, and ``parse_dates``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns,
+applying converter methods, and parsing dates (:issue:`43567`).
+
+.. ipython:: python
+
+    xml_dates = """
+    <data>
+      <row>
+        <shape>square</shape>
+        <degrees>00360</degrees>
+        <sides>4.0</sides>
+        <date>2020-01-01</date>
+      </row>
+      <row>
+        <shape>circle</shape>
+        <degrees>00360</degrees>
+        <sides/>
+        <date>2021-01-01</date>
+      </row>
+      <row>
+        <shape>triangle</shape>
+        <degrees>00180</degrees>
+        <sides>3.0</sides>
+        <date>2022-01-01</date>
+      </row>
+    </data>"""
+
+    df = pd.read_xml(
+        xml_dates,
+        dtype={'sides': 'Int64'},
+        converters={'degrees': str},
+        parse_dates=['date']
+    )
+    df
+    df.dtypes
+
+
+.. _whatsnew_150.enhancements.read_xml_iterparse:
+
+read_xml now supports large XML using ``iterparse``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For very large XML files that can range from hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
+now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_,
+which are memory-efficient methods to iterate through XML trees and extract specific elements
+and attributes without holding the entire tree in memory (:issue:`45442`).
+
+.. code-block:: ipython
+
+    In [1]: df = pd.read_xml(
+    ...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
+    ...      iterparse = {"page": ["title", "ns", "id"]}
+    ...  )
+    In [2]: df
+    Out[2]:
+                                                         title   ns        id
+    0                                       Gettysburg Address    0     21450
+    1                                                Main Page    0     42950
+    2                            Declaration by United Nations    0      8435
+    3             Constitution of the United States of America    0      8435
+    4                     Declaration of Independence (Israel)    0     17858
+    ...                                                    ...  ...       ...
+    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
+    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
+    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
+    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
+    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
+
+    [3578765 rows x 3 columns]
+
+
+.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
+.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+
 .. _whatsnew_150.enhancements.other:
 
 Other enhancements
@@ -294,83 +373,10 @@ upon serialization. (Related issue :issue:`12997`)
 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. _whatsnew_150.api_breaking.read_xml_dtypes:
-
-read_xml now supports ``dtype``, ``converters``, and ``parse_dates``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns,
-apply converter methods, and parse dates (:issue:`43567`).
-
-.. ipython:: python
-
-    xml_dates = """
-    <data>
-      <row>
-        <shape>square</shape>
-        <degrees>00360</degrees>
-        <sides>4.0</sides>
-        <date>2020-01-01</date>
-      </row>
-      <row>
-        <shape>circle</shape>
-        <degrees>00360</degrees>
-        <sides/>
-        <date>2021-01-01</date>
-      </row>
-      <row>
-        <shape>triangle</shape>
-        <degrees>00180</degrees>
-        <sides>3.0</sides>
-        <date>2022-01-01</date>
-      </row>
-    </data>"""
+.. _whatsnew_150.api_breaking.api_breaking1:
 
-    df = pd.read_xml(
-        xml_dates,
-        dtype={'sides': 'Int64'},
-        converters={'degrees': str},
-        parse_dates=['date']
-    )
-    df
-    df.dtypes
-
-.. _whatsnew_150.read_xml_iterparse:
-
-read_xml now supports large XML using ``iterparse``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-For very large XML files that can range in hundreds of megabytes to gigabytes, :func:`pandas.read_xml`
-now supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iterparse`_
-which are memory-efficient methods to iterate through XML trees and extract specific elements
-and attributes without holding entire tree in memory (:issue:`#45442`).
-
-.. code-block:: ipython
-
-    In [1]: df = pd.read_xml(
-    ...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
-    ...      iterparse = {"page": ["title", "ns", "id"]})
-    ...  )
-    df
-    Out[2]:
-                                                         title   ns        id
-    0                                       Gettysburg Address    0     21450
-    1                                                Main Page    0     42950
-    2                            Declaration by United Nations    0      8435
-    3             Constitution of the United States of America    0      8435
-    4                     Declaration of Independence (Israel)    0     17858
-    ...                                                    ...  ...       ...
-    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
-    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
-    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
-    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
-    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450
-
-    [3578765 rows x 3 columns]
-
-
-.. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
-.. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
+api_breaking_change1
+^^^^^^^^^^^^^^^^^^^^
 
 .. _whatsnew_150.api_breaking.api_breaking2: