ENH: pandas read_* wildcard #15904 #16166

Merged (4 commits) on Apr 30, 2017

Conversation

dwkenefick
Contributor

  • closes ENH: pandas read_* wildcard #15904
  • tests added / passed (N/A for docs)
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry (N/A for docs)

codecov bot commented Apr 28, 2017

Codecov Report

Merging #16166 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #16166   +/-   ##
=======================================
  Coverage   90.87%   90.87%           
=======================================
  Files         162      162           
  Lines       50816    50816           
=======================================
  Hits        46178    46178           
  Misses       4638     4638
Flag        Coverage Δ
#multiple   88.65% <ø> (ø) ⬆️
#single     40.33% <ø> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 075eca1...b2a4f72. Read the comment docs.

codecov bot commented Apr 28, 2017

Codecov Report

Merging #16166 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16166      +/-   ##
==========================================
- Coverage   90.87%   90.86%   -0.01%     
==========================================
  Files         162      162              
  Lines       50816    50819       +3     
==========================================
- Hits        46178    46176       -2     
- Misses       4638     4643       +5
Flag        Coverage Δ
#multiple   88.64% <ø> (-0.01%) ⬇️
#single     40.33% <ø> (-0.01%) ⬇️

Impacted Files                  Coverage Δ
pandas/plotting/_converter.py   63.54% <0%> (-1.82%) ⬇️
pandas/core/frame.py            97.58% <0%> (-0.01%) ⬇️
pandas/core/dtypes/api.py       100% <0%> (ø) ⬆️
pandas/api/types/__init__.py    100% <0%> (ø) ⬆️
pandas/core/dtypes/common.py    93.5% <0%> (+0.25%) ⬆️
pandas/core/generic.py          91.63% <0%> (+0.32%) ⬆️

Powered by Codecov. Last update 075eca1...b2ab07a. Read the comment docs.


import glob
frames = []
for f in glob.glob('file_*.csv'):
Contributor

idiomatically

result = pd.concat([pd.read_csv(f) for f in glob.glob('file_*.csv')], ignore_index=True)

Usually you want ignore_index=True here.
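
To illustrate the point about `ignore_index=True`: each frame returned by `pd.read_csv` carries its own 0-based index, so a plain `pd.concat` repeats the labels. A minimal sketch, with two made-up in-memory frames standing in for the files:

```python
import pandas as pd

# Two small frames standing in for the result of pd.read_csv on two files
# (the values are made up for illustration).
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# Without ignore_index, each frame keeps its own 0-based labels.
stacked = pd.concat([a, b])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# With ignore_index=True, the result gets a fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate labels are legal, but label-based lookups such as `stacked.loc[0]` then return several rows, which is rarely what is wanted after stacking files.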

Contributor Author

added

result = pd.concat(frames)

This performs significantly better than using ``pd.append`` to add each of the files to an existing DataFrame.
Finally, this strategy will work with the other ``read_`` functions described in the :ref:`io docs<io>`.
Contributor

-> pd.read_*(..)

Contributor Author

added, but with (...) to be consistent with a later usage.
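
As a rough sketch of how the same strategy carries over to other `pd.read_*(...)` functions, the reader can be passed in as a parameter. The helper name `read_many`, the file names, and the data below are hypothetical and not part of this PR; `pd.read_json` is used only as one example of another reader.

```python
import glob
import os
import tempfile

import pandas as pd


def read_many(pattern, reader=pd.read_csv, **kwargs):
    """Hypothetical helper: read every file matching `pattern` with `reader`
    and stack the results into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # glob order is arbitrary, so sort
    return pd.concat([reader(p, **kwargs) for p in paths], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # Write two small JSON files to read back (contents are made up).
    for i in range(2):
        df = pd.DataFrame({"a": [i, i + 10]})
        df.to_json(os.path.join(tmp, "part_{}.json".format(i)), orient="records")

    # Extra keyword arguments are forwarded to the reader.
    combined = read_many(
        os.path.join(tmp, "part_*.json"), reader=pd.read_json, orient="records"
    )

print(combined["a"].tolist())
```

Note that `pd.concat` raises a `ValueError` when given an empty list, so a pattern that matches no files fails loudly rather than returning an empty frame.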

frames.append(pd.read_csv(f))
result = pd.concat(frames)

This performs significantly better than using ``pd.append`` to add each of the files to an existing DataFrame.
Contributor

pd.append is not a thing. Remove the first sentence in any event.

Contributor Author

Removed here and in io.rst
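
For context, pandas has no top-level `pd.append` function; the removed sentence presumably meant the `DataFrame.append` method, which was later deprecated and then removed in pandas 2.0. The sketch below, using made-up in-memory frames, contrasts growing a frame inside a loop with collecting the pieces and concatenating once. It makes no timing claim; it only shows that both routes give the same result, while the first rebuilds the accumulated rows on every iteration.

```python
import pandas as pd

# Three small frames standing in for files that were already read
# (the values are made up for illustration).
pieces = [pd.DataFrame({"x": [i, i + 1]}) for i in range(3)]

# Growing a frame step by step: every pd.concat call builds a new frame
# that contains all rows accumulated so far.
grown = pieces[0]
for piece in pieces[1:]:
    grown = pd.concat([grown, piece], ignore_index=True)

# Collecting first and concatenating once builds the result in a single step.
at_once = pd.concat(pieces, ignore_index=True)

print(grown.equals(at_once))
print(hasattr(pd, "append"))
```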


frames = []
files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
for f in files:
Contributor

see below for the idiom to do this

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all
of the individual frames into a list, and then combine the frames in the list using ``pd.concat``:
Contributor

@jreback jreback Apr 28, 2017

use :func:`pd.concat`

Contributor Author

Changed here and in io.rst

@jreback jreback added the Docs label Apr 28, 2017
data = pd.DataFrame(np.random.randn(10, 4))
data.to_csv('file_{}.csv'.format(i))

frames = []
Contributor

You can delete this frames now.

Contributor Author

Removed, thanks.
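
For reference, a self-contained sketch of the full round trip once the unused `frames` list is gone. The diff context above shows only the two lines that build and write each file, so the `for i in range(3)` loop and the temporary directory are assumptions made here to keep the example runnable.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write three small example files, as in the snippet under review.
    for i in range(3):
        data = pd.DataFrame(np.random.randn(10, 4))
        data.to_csv(os.path.join(tmp, "file_{}.csv".format(i)))

    files = sorted(glob.glob(os.path.join(tmp, "file_*.csv")))

    # to_csv wrote the index as the first column, so index_col=0 reads it
    # back as the index rather than as an extra "Unnamed: 0" data column.
    result = pd.concat(
        [pd.read_csv(f, index_col=0) for f in files], ignore_index=True
    )

print(result.shape)  # (30, 4)
```

Without `index_col=0`, each frame read back would have five columns, because `to_csv` writes the row index by default.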

.. ipython:: python

import glob
frames = []
Contributor

Same with this frames

Contributor Author

Removed

Contributor

jreback commented Apr 28, 2017

I think this should have a pointer from the io.rst/read_csv docs as well (maybe a mini section), or it could entirely exist there (and have a pointer from the cookbook)

@dwkenefick
Contributor Author

@jreback There should be a small section in io.rst that references this. See here. If you had something different in mind let me know - happy to make the change.

@TomAugspurger TomAugspurger merged commit de87344 into pandas-dev:master Apr 30, 2017
@TomAugspurger
Contributor

@dwkenefick thanks! The doc build should be done in 20-30 minutes, if you want to check the output here

cbertinato pushed a commit to cbertinato/pandas that referenced this pull request May 1, 2017
* DOC: pandas read_* wildcard pandas-dev#15904

Added example in cookbook about reading multiple files into a dataframe.
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
* DOC: pandas read_* wildcard pandas-dev#15904

Added example in cookbook about reading multiple files into a dataframe.
Successfully merging this pull request may close these issues.

ENH: pandas read_* wildcard
3 participants