Feature request: read_csv/read_table/read_fwf - read multiple files with the same structure, applying the same parameters (skiprows, skipfooter, nrows) #12618

Closed
maxu777 opened this issue Mar 14, 2016 · 5 comments
Labels
API Design Docs IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label

Comments

@maxu777
Contributor

maxu777 commented Mar 14, 2016

Hello,

I think it would make sense for read_csv(), read_table(), and read_fwf() to be able to read multiple files that share the same structure. Reading multiple files into a single string can be tricky, especially when every file has header line(s) and you want to use the parameters skiprows, skipfooter, and nrows.
The logic for skiprows, skipfooter, and nrows is already implemented, so IMO it shouldn't be very difficult. The header (if one exists) should be read and parsed from the first file only.

Of course one can always do something like:

df = pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

but it's not very efficient when working with big files.
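For reference, here is a small sketch of what the concat approach does when the same skiprows/skipfooter arguments are applied to every file. The two in-memory "files" and their layout (one junk line, a header, data rows, one footer line) are made up for illustration; note that skipfooter is only supported by the Python parser engine, hence engine="python".

```python
import io

import pandas as pd

# Two hypothetical inputs with the same layout: a junk first line,
# a header row, data rows, and a trailing footer line.
raw_a = "generated by tool\nx,y\n1,2\n3,4\nEOF\n"
raw_b = "generated by tool\nx,y\n5,6\n7,8\nEOF\n"

# The same parsing parameters are applied to every file.
# skipfooter requires the Python engine.
kwargs = dict(skiprows=1, skipfooter=1, engine="python")

frames = [pd.read_csv(io.StringIO(raw), **kwargs) for raw in (raw_a, raw_b)]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Each file's header is parsed separately here and the frames are aligned by column name, so this matches the requested behavior in its result, though not in its memory use.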

In recent days there have been plenty of similar questions on stackoverflow.com asking how to merge CSV files with the same structure.

Thank you!

@jreback
Contributor

jreback commented Mar 14, 2016

None of the read_* functions work on a glob, but it is pretty trivial to do so yourself. You could add this to the cookbook docs; I will accept a PR for that. I suppose in theory we could wrap this, but it would be a bit of an effort.

In [6]: pd.options.display.max_rows=10

In [7]: DataFrame(np.random.randn(100000,2)).to_csv('file1.csv')

In [8]: DataFrame(np.random.randn(100000,2)).to_csv('file2.csv')

In [9]: pd.concat([ pd.read_csv(f) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[9]: 
        Unnamed: 0         0         1
0                0  0.431748 -0.558788
1                1  0.849501 -0.174770
2                2 -1.493231  0.160548
3                3  2.021627 -0.365592
4                4  1.640202 -0.595996
...            ...       ...       ...
199995       99995  0.612665  0.599660
199996       99996 -0.644949 -1.029477
199997       99997  0.615948 -0.666595
199998       99998  0.450607 -0.275538
199999       99999  1.181064 -0.051599

[200000 rows x 3 columns]

In [10]: pd.concat([ pd.read_csv(f, index_col=0) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[10]: 
               0         1
0       0.431748 -0.558788
1       0.849501 -0.174770
2      -1.493231  0.160548
3       2.021627 -0.365592
4       1.640202 -0.595996
...          ...       ...
199995  0.612665  0.599660
199996 -0.644949 -1.029477
199997  0.615948 -0.666595
199998  0.450607 -0.275538
199999  1.181064 -0.051599

[200000 rows x 2 columns]
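A cookbook-style version of the session above might look like the sketch below. The helper name read_csv_glob is hypothetical, not a pandas function. It sorts the matches because glob.glob returns them in arbitrary order, and it raises early when nothing matches, since pd.concat would otherwise fail on an empty list with a less helpful message.

```python
import glob

import pandas as pd


def read_csv_glob(pattern, **kwargs):
    """Read every CSV matching `pattern` with the same kwargs and stack them.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Sort so the row order of the result is reproducible.
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path, **kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

With the files from the session above, read_csv_glob('file*.csv', index_col=0) would be the equivalent of In [10].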

@jreback jreback added IO Data IO issues that don't fit into a more specific label API Design IO CSV read_csv, to_csv labels Mar 14, 2016
@jreback jreback added this to the Someday milestone Mar 14, 2016
@TomAugspurger
Contributor

This was a dupe of #15904 and closed by #16166

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Someday May 22, 2017
@anmyachev
Contributor

@jreback, @TomAugspurger Hello, I could add support for reading from multiple files, whether given as a list of files or as a wildcard. Although this is syntactic sugar, it would not take much effort to maintain and would save users from writing boilerplate code. What do you think?

@jreback
Contributor

jreback commented Jan 27, 2021

It would be a good feature; however, error handling would have to be very obvious.

We would likely want to extend this to other read_* functions, so ideally it's a somewhat generic solution.

Please open a new issue for discussion.
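As a rough illustration of those two points (a generic solution and obvious error handling), a user-side wrapper could take any read_* function and report which file failed. The name read_many and its signature are assumptions made for this sketch; nothing like it is confirmed to exist in pandas.

```python
import pandas as pd


def read_many(reader, paths, **kwargs):
    """Apply `reader` (e.g. pd.read_csv or pd.read_fwf) to each path and stack.

    Hypothetical helper for illustration; not part of pandas.
    """
    frames = []
    for path in paths:
        try:
            frames.append(reader(path, **kwargs))
        except Exception as exc:
            # Name the offending file so the failure is easy to locate.
            name = getattr(reader, "__name__", repr(reader))
            raise ValueError(f"{name} failed on {path!r}: {exc}") from exc
    if not frames:
        raise ValueError("no input files were given")
    return pd.concat(frames, ignore_index=True)
```

Passing the reader as an argument keeps the wrapper independent of any one format, at the cost of assuming that every reader accepts a path as its first argument.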

@anmyachev
Contributor

#39435


5 participants