netCDF IO support #5487
Interested in contributing to make this happen?
Certainly willing to contribute on this. My proposal was meant to see if this would be a feature that folks were interested in. I have some project-specific implementations that could be generalized, but the effort may be best served by a discussion on how to approach this. Presumably, following along the lines of the HDF5 implementation would be ideal.
Is it tabular/columnar data? If so, can you map netCDF data types directly to numpy dtypes? If so, seems like it's pretty clear what should happen.
netCDF is quite similar to the way HDF5 / PyTables work. I think you could make this very similar, as a separate module. I think you could maybe use
@jtratner - N-dimensional homogeneous arrays, the netCDF4 package takes care of the loading into numpy arrays. @jreback - I'll take a look at replicating the I'll take a first stab at it over the next week or so and report back.
looking forward to seeing what you come up with.
@jhamman gr8....be sure to add the dep (netCDF ?) to provide plenty of test cases! start with a smaller feature set, you can build/add over time. pls hook up to travis!
@jhamman You may also be interested in the xray package that @jreback references. It's partly built on pandas and has support for n-dimensional, gridded, datatypes (the initial goal was to naturally represent netcdf3/4, which it does). You may find some of the code in there useful for your read/write functionality. Note that we use the netCDF4 library, which supports netcdf3 + 4, not any HDF5 library.
One tricky aspect here is dealing with large variables. It's possible (indeed, somewhat common) to have netCDF files with variables too big to fit into memory. Libraries like netCDF4-python (or xray, for that matter) let you use slicing syntax to load only part of a variable. I may be mistaken, but I don't think pandas has any support (yet) for objects that don't fit entirely into memory.
what is a 'large variable' ? you can easily load part of an object via query or slicing, see docs: http://pandas-docs.github.io/pandas-docs-travis/io.html#io-hdf5
@jreback You're right, I stand corrected.
Have there been any other discussions on this recently? I've been using
To add to what @jhamman writes, NetCDF is a highly structured data format that isn't a great fit for existing pandas data structures, except in some special cases. This would be a little different if we had working nd panels, but you'd still end up reading a netcdf file into a dict of nd panels -- pretty ugly and not very usable. Pandas itself is never going to have the right data structures for netcdf. This is, of course, the motivation for why I wrote xray. I'm not saying xray is the end-all solution here, but it adds no additional dependencies (beyond itself and a library to read netcdf files) and makes it quite straightforward to read netcdf into pandas, with explicit choices about how to handle edge cases along the way. That's all you could hope for from an implementation in pandas itself.
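To make the mismatch concrete: a netCDF variable is an N-dimensional array over named dimensions, while a table wants one row per coordinate combination. The following is only a stdlib sketch of that flattening, not how xray or pandas implement it; `flatten_variable`, the dimension names, and the numbers are all made up for illustration.

```python
from itertools import product


def flatten_variable(name, dims, coords, values):
    """Turn an N-d variable into long-form rows, one per coordinate combination.

    dims:   tuple of dimension names, outermost first
    coords: dict mapping each dimension name to its list of coordinate values
    values: nested lists shaped like the dimensions
    """
    rows = []
    index_ranges = [range(len(coords[d])) for d in dims]
    for idx in product(*index_ranges):
        cell = values
        for i in idx:  # walk down the nested lists to a single value
            cell = cell[i]
        row = {d: coords[d][i] for d, i in zip(dims, idx)}
        row[name] = cell
        rows.append(row)
    return rows


# Hypothetical 2 x 3 grid (time x station); the values are invented.
rows = flatten_variable(
    "temperature",
    dims=("time", "station"),
    coords={"time": ["2013-11-01", "2013-11-02"], "station": ["A", "B", "C"]},
    values=[[10.0, 11.5, 9.0], [12.0, 13.5, 8.5]],
)
```

Each row carries one value plus its coordinates, so the dimension columns could serve as a MultiIndex. Variables that live on different dimensions do not share such an index, which is one of the edge cases that needs an explicit choice.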
I was just thinking, maybe we could add somewhere in the IO docs a reference to this? (saying that if you want to import NetCDF -> use xray, and you can always convert that (or part of the data) to pandas dataframes)
Thanks @shoyer for elaborating on my point. @jorisvandenbossche - seems reasonable to point to xray as a way to get netCDF data into pandas.
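For readers landing here, the route suggested above might look roughly like the sketch below. It assumes the package discussed in this thread as xray under its later name, `xarray`, with a netCDF backend installed. `read_netcdf_sketch` is a hypothetical helper, not a pandas function.

```python
def read_netcdf_sketch(path, variables=None):
    """Hypothetical helper (not a pandas API): netCDF file -> pandas DataFrame."""
    # Imported lazily so the optional dependency is only required when called.
    import xarray as xr  # assumption: xray under its later name

    with xr.open_dataset(path) as ds:
        if variables is not None:
            # Keep only the requested data variables before converting.
            ds = ds[list(variables)]
        # One row per coordinate combination, indexed by the dimensions.
        return ds.to_dataframe()
```

A call might look like `df = read_netcdf_sketch("example.nc", variables=["temperature"])`, where the file and variable names are placeholders. Selecting variables first matters because the conversion pulls the data into memory and indexes the frame by the product of the dataset's dimensions.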
As discussed in pandas-dev#5487. This would also be a nice place to mention other packages that connect pandas DataFrames with other file formats. Right now, the only one I can think of off-hand is `root_pandas` (pandas-dev#9378, CC @ibab), but I'm sure there are more.
@jorisvandenbossche good idea, just made a PR for that (#10027)
I'd like to propose `read_netcdf` (netCDF: Network Common Data Form), a new pandas I/O API feature with a top-level interface similar to the other reader functions. This format is widely used in the scientific community. Furthermore, netCDF4 is implemented on top of the HDF5 library, making this a natural extension to functionality already in the API. Most likely this would sit on top of the existing Python/numpy interface to netCDF, and because each variable's metadata is stored in the file header, no complicated parsing would be necessary. Multidimensional variables could be handled in a similar manner as HDF.
This may have been brought up in the past but my search here and on the google didn't bring anything up.