From b2a4f729bb24cc40e9c9e60cc6f6676c8a0cef82 Mon Sep 17 00:00:00 2001
From: Dan Kenefick
Date: Thu, 27 Apr 2017 20:55:39 -0400
Subject: [PATCH 1/4] ENH: pandas read_* wildcard #15904

---
 doc/source/cookbook.rst | 44 ++++++++++++++++++++++++++++++++++++++---
 doc/source/io.rst       |  8 ++++++++
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
index 8fa1283ffc924..f2478f7a4d5b7 100644
--- a/doc/source/cookbook.rst
+++ b/doc/source/cookbook.rst
@@ -910,9 +910,6 @@ The :ref:`CSV ` docs
 `appending to a csv
 `__
 
-`how to read in multiple files, appending to create a single dataframe
-`__
-
 `Reading a csv chunk-by-chunk
 `__
 
@@ -943,6 +940,47 @@ using that handle to read.
 `Write a multi-row index CSV without writing duplicates
 `__
 
+.. _cookbook.csv.multiple_files:
+
+Reading multiple files to create a single DataFrame
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all
+of the individual frames into a list, and then combine the frames in the list using ``pd.concat``:
+
+.. ipython:: python
+
+   for i in range(3):
+       data = pd.DataFrame(np.random.randn(10, 4))
+       data.to_csv('file_{}.csv'.format(i))
+
+   frames = []
+   files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
+   for f in files:
+       frames.append(pd.read_csv(f))
+   result = pd.concat(frames)
+
+You can use the same approach to read all files matching a pattern. Here is an example using ``glob``:
+
+.. ipython:: python
+
+   import glob
+   frames = []
+   for f in glob.glob('file_*.csv'):
+       frames.append(pd.read_csv(f))
+   result = pd.concat(frames)
+
+This performs significantly better than using ``pd.append`` to add each of the files to an existing DataFrame.
+Finally, this strategy will work with the other ``read_`` functions described in the :ref:`io docs`.
+
+.. ipython:: python
+   :suppress:
+
+   for i in range(3):
+       os.remove('file_{}.csv'.format(i))
+
+Parsing date components in multi-columns
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Parsing date components in multi-columns is faster with a format
 
 .. code-block:: python
diff --git a/doc/source/io.rst b/doc/source/io.rst
index 2b3d2895333d3..1919aefc94da9 100644
--- a/doc/source/io.rst
+++ b/doc/source/io.rst
@@ -1439,6 +1439,14 @@ class of the csv module. For this, you have to specify ``sep=None``.
    print(open('tmp2.sv').read())
    pd.read_csv('tmp2.sv', sep=None, engine='python')
 
+.. _io.multiple_files:
+
+Reading multiple files to create a single DataFrame
+'''''''''''''''''''''''''''''''''''''''''''''''''''
+
+It's best to use ``pd.concat`` to combine multiple files, rather than ``pd.append``.
+See the :ref:`cookbook` for an example.
+
 .. _io.chunking:
 
 Iterating through files chunk by chunk

From ad88cd9b628edd0dcf9a4d469fec8a0c6bfaab7a Mon Sep 17 00:00:00 2001
From: Dan Kenefick
Date: Fri, 28 Apr 2017 07:40:19 -0400
Subject: [PATCH 2/4] ENH: pandas read_* wildcard #15904

---
 doc/source/cookbook.rst | 14 +++++---------
 doc/source/io.rst       |  2 +-
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
index f2478f7a4d5b7..c324b4e19672c 100644
--- a/doc/source/cookbook.rst
+++ b/doc/source/cookbook.rst
@@ -946,7 +946,7 @@ Reading multiple files to create a single DataFrame
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all
-of the individual frames into a list, and then combine the frames in the list using ``pd.concat``:
+of the individual frames into a list, and then combine the frames in the list using :func:`pd.concat`:
 
 .. ipython:: python
 
@@ -956,9 +956,7 @@ of the individual frames into a list, and then combine the frames in the list us
 
    frames = []
    files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
-   for f in files:
-       frames.append(pd.read_csv(f))
-   result = pd.concat(frames)
+   result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
 
 You can use the same approach to read all files matching a pattern. Here is an example using ``glob``:
 
@@ -966,12 +964,10 @@ You can use the same approach to read all files matching a pattern. Here is an
 
    import glob
    frames = []
-   for f in glob.glob('file_*.csv'):
-       frames.append(pd.read_csv(f))
-   result = pd.concat(frames)
+   files = glob.glob('file_*.csv')
+   result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
 
-This performs significantly better than using ``pd.append`` to add each of the files to an existing DataFrame.
-Finally, this strategy will work with the other ``read_`` functions described in the :ref:`io docs`.
+Finally, this strategy will work with the other ``read_*(...)`` functions described in the :ref:`io docs`.
 
 .. ipython:: python
    :suppress:
diff --git a/doc/source/io.rst b/doc/source/io.rst
index 1919aefc94da9..9692766505d7a 100644
--- a/doc/source/io.rst
+++ b/doc/source/io.rst
@@ -1444,7 +1444,7 @@ class of the csv module. For this, you have to specify ``sep=None``.
 Reading multiple files to create a single DataFrame
 '''''''''''''''''''''''''''''''''''''''''''''''''''
 
-It's best to use ``pd.concat`` to combine multiple files, rather than ``pd.append``.
+It's best to use :func:`~pandas.concat` to combine multiple files.
 See the :ref:`cookbook` for an example.
 
 .. _io.chunking:

From 645b86c95b22c347c8e76274f0ee21752b1622bf Mon Sep 17 00:00:00 2001
From: Dan Kenefick
Date: Fri, 28 Apr 2017 07:46:56 -0400
Subject: [PATCH 3/4] ENH: pandas read_* wildcard #15904

---
 doc/source/cookbook.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
index c324b4e19672c..92d62c7289be6 100644
--- a/doc/source/cookbook.rst
+++ b/doc/source/cookbook.rst
@@ -967,7 +967,7 @@ You can use the same approach to read all files matching a pattern. Here is an
    files = glob.glob('file_*.csv')
    result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
 
-Finally, this strategy will work with the other ``read_*(...)`` functions described in the :ref:`io docs`.
+Finally, this strategy will work with the other ``pd.read_*(...)`` functions described in the :ref:`io docs`.
 
 .. ipython:: python
    :suppress:

From b2ab07ad4f7f7a71ffccc4661ec9f1a1ffc880d1 Mon Sep 17 00:00:00 2001
From: Dan Kenefick
Date: Fri, 28 Apr 2017 21:29:19 -0400
Subject: [PATCH 4/4] ENH: pandas read_* wildcard #15904

---
 doc/source/cookbook.rst | 2 --
 1 file changed, 2 deletions(-)

diff --git a/doc/source/cookbook.rst b/doc/source/cookbook.rst
index 92d62c7289be6..8466b3d3c3297 100644
--- a/doc/source/cookbook.rst
+++ b/doc/source/cookbook.rst
@@ -954,7 +954,6 @@ of the individual frames into a list, and then combine the frames in the list us
       data = pd.DataFrame(np.random.randn(10, 4))
       data.to_csv('file_{}.csv'.format(i))
 
-   frames = []
    files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
    result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
 
@@ -963,7 +962,6 @@ You can use the same approach to read all files matching a pattern. Here is an
 
 .. ipython:: python
 
    import glob
-   frames = []
    files = glob.glob('file_*.csv')
    result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
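
The recipe these patches converge on can be sketched as a self-contained script. This is a sketch of the documented pattern, not part of the patch itself; the use of a temporary directory and the sorting of the glob results are assumptions added here so the example is reproducible and leaves no files behind:

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd

# Work in a temporary directory (an assumption for this sketch) so the
# example does not scatter files in the current directory.
tmpdir = tempfile.mkdtemp()

# Create three small CSV files, as in the cookbook example.
for i in range(3):
    data = pd.DataFrame(np.random.randn(10, 4))
    data.to_csv(os.path.join(tmpdir, 'file_{}.csv'.format(i)))

# Combine every matching file with a single pd.concat call, rather than
# appending to a DataFrame inside the loop. Sorting is an added assumption:
# glob does not guarantee any particular file order.
files = sorted(glob.glob(os.path.join(tmpdir, 'file_*.csv')))
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(result.shape)  # 3 files x 10 rows each -> 30 rows
```

The same shape works with the other `pd.read_*` readers (`read_json`, `read_table`, ...): build a list comprehension of frames and hand it to `pd.concat` once, with `ignore_index=True` to avoid duplicate index labels from the individual files.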