netCDF IO support #5487

Closed
jhamman opened this issue Nov 10, 2013 · 17 comments
Labels: API Design, Enhancement, IO Data, IO HDF5

Comments

@jhamman

jhamman commented Nov 10, 2013

I'd like to propose read_netcdf, a new pandas I/O API feature for netCDF (Network Common Data Form), with a top-level interface similar to the other reader functions. This format is widely used in the scientific community. Furthermore, netCDF4 is implemented on top of the HDF5 library, making this a natural extension to functionality already in the API.

Most likely this would sit on top of the existing Python/numpy interface to netCDF, and because each variable's metadata is stored in the file header, no complicated parsing would be necessary. Multidimensional variables could be handled in a similar manner as hdf.

This may have been brought up in the past, but my search here and on Google didn't bring anything up.

@jtratner
Contributor

Interested in contributing to make this happen?

@jhamman
Author

jhamman commented Nov 13, 2013

Certainly willing to contribute on this. My proposal was meant to see if this would be a feature that folks were interested in. I have some project-specific implementations that could be generalized, but the effort may be best served by a discussion on how to approach this. Presumably, following along the lines of the HDF5 application would be ideal.

@jtratner
Contributor

Is it tabular/columnar data?

If so, can you map netCDF data types directly to numpy dtypes? If so, seems like it's pretty clear what should happen.

@jreback
Contributor

jreback commented Nov 13, 2013

@jhamman

netCDF is quite similar to the way HDF5 / PyTables work. I think you could make this very similar, as a separate module. I think you could maybe use HDFStore as a base class.

@jhamman
Author

jhamman commented Nov 14, 2013

@jtratner - The data are N-dimensional homogeneous arrays; the netCDF4 package takes care of loading them into numpy arrays.

@jreback - I'll take a look at replicating the HDFStore features in terms of the netCDF4 package. I actually don't think this should be all that difficult.

I'll take a first stab at it over the next week or so and report back.
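
To make the tabular question concrete, here is a minimal sketch of how an N-dimensional homogeneous array could be flattened into a pandas object with one index level per dimension. It is not code from this thread: the helper name `ndarray_to_series`, the dimension names, and the sample values are all hypothetical, and only numpy and pandas are assumed.

```python
import numpy as np
import pandas as pd


def ndarray_to_series(values, coords, name=None):
    """Flatten an N-d array into a Series indexed by one level per dimension.

    `coords` maps each dimension name to its coordinate labels, in axis order.
    (Hypothetical helper for illustration; not part of pandas or netCDF4.)
    """
    values = np.asarray(values)
    dims = list(coords)
    expected_shape = tuple(len(coords[dim]) for dim in dims)
    if values.shape != expected_shape:
        raise ValueError(f"shape {values.shape} does not match coords {expected_shape}")
    # from_product varies the last level fastest, matching C-order ravel().
    index = pd.MultiIndex.from_product([coords[dim] for dim in dims], names=dims)
    return pd.Series(values.ravel(), index=index, name=name)


# Made-up stand-in for a (time, lat, lon) variable read from a netCDF file.
temperature = np.arange(24, dtype="float64").reshape(2, 3, 4)
coords = {
    "time": pd.date_range("2013-11-10", periods=2),
    "lat": [10.0, 20.0, 30.0],
    "lon": [100.0, 110.0, 120.0, 130.0],
}

series = ndarray_to_series(temperature, coords, name="temperature")
frame = series.unstack("lon")  # rows: (time, lat), columns: lon
```

The long (Series) form keeps every dimension as an index level; `unstack` then pivots one dimension into columns when a 2-D table is more convenient.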

@jtratner
Contributor

looking forward to seeing what you come up with.

@jreback
Contributor

jreback commented Nov 14, 2013

@jhamman gr8....be sure to add the dep (netCDF ?) to ci/requirements. you don't have to add to all, but make sure at least 2.7 and 3.3 (you can add where pytables is tested). you can add for different versions (if that matters).

provide plenty of test cases! start with a smaller feature set, you can build/add over time.

pls hook up to travis!

@ebrevdo

ebrevdo commented Feb 25, 2014

@jhamman You may also be interested in the xray package that @jreback references. It's partly built on pandas and has support for n-dimensional, gridded, datatypes (the initial goal was to naturally represent netcdf3/4, which it does). You may find some of the code in there useful for your read/write functionality. Note that we use the netCDF4 library, which supports netcdf3 + 4, not any HDF5 library.

@jhamman
Author

jhamman commented Feb 25, 2014

Thanks @ebrevdo . This looks very promising.

In fact, it may make sense, as @jreback suggests, to use some of xray's back-ends in pandas.

I'll poke around xray and see how it all works.

@shoyer
Member

shoyer commented Feb 26, 2014

One tricky aspect here is dealing with large variables. It's possible (indeed, somewhat common) to have netCDF files with variables too big to fit into memory. Libraries like netCDF4-python (or xray, for that matter) let you use slicing syntax to load only part of a variable. I may be mistaken, but I don't think pandas has any support (yet) for objects that don't fit entirely into memory.

@jreback
Contributor

jreback commented Feb 26, 2014

what is a 'large variable' ? you can easily load part of an object via query or slicing, see docs: http://pandas-docs.github.io/pandas-docs-travis/io.html#io-hdf5
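
For readers who have not used it, a minimal sketch of the partial loading mentioned above might look like the following. It assumes PyTables is installed (pandas needs it for `HDFStore`); the file name, the key `"df"`, and the sample data are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(10.0)})

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with pd.HDFStore(path) as store:
    # The "table" format is required for on-disk queries.
    store.put("df", df, format="table")

    # Query: only rows whose index satisfies the condition are read.
    subset = store.select("df", where="(index >= 3) & (index < 6)")

    # Slicing by row position: read rows 0 and 1 only.
    first = store.select("df", start=0, stop=2)
```

In both calls only the selected rows are returned as a DataFrame, so the full object never has to be held in memory at once.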

@shoyer
Member

shoyer commented Feb 26, 2014

@jreback You're right, I stand corrected.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jhamman
Author

jhamman commented Apr 29, 2015

@shoyer and @jreback,

Have there been any other discussions on this recently? I've been using xray to handle most of my netCDF I/O and the xray.Dataset.to_pandas() method has been meeting all my needs in terms of reading netCDF data into Pandas data types. I'm not sure I see much benefit in building a separate netCDF reader for Pandas given the functionality that xray provides.

@shoyer
Member

shoyer commented Apr 29, 2015

To add to what @jhamman writes, NetCDF is a highly structured data format that isn't a great fit for existing pandas data structures, except in some special cases. This would be a little different if we had working nd panels, but you'd still end up reading a netcdf file into a dict of nd panels -- pretty ugly and not very useable. Pandas itself is never going to have the right data structures for netcdf.

This is, of course, the motivation for why I wrote xray. I'm not saying xray is the end-all solution here, but it adds no additional dependencies (beyond itself and a library to read netcdf files) and makes it quite straightforward to read netcdf into pandas, with explicit choices about how to handle edge cases along the way. That's all you could hope for from an implementation in pandas itself.
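
As a rough illustration of that workflow, here is a sketch using the package under its later name, xarray (xray was renamed after this thread). The dataset is built in memory so the example does not need a netCDF file; the variable names, coordinates, and the file path in the comment are hypothetical. The calls shown (`to_dataframe`, `to_pandas`, `to_series`) come from current xarray and may not match the xray API of 2015 exactly.

```python
import numpy as np
import pandas as pd
import xarray as xr  # the package was called "xray" when this thread was written

# Made-up dataset standing in for the contents of a netCDF file.
ds = xr.Dataset(
    {
        "temperature": (("time", "station"), np.arange(6.0).reshape(3, 2)),
        "elevation": (("station",), [120.0, 450.0]),
    },
    coords={
        "time": pd.date_range("2015-04-29", periods=3),
        "station": ["a", "b"],
    },
)
# With a real file this would instead be something like:
#     ds = xr.open_dataset("example.nc")  # hypothetical path

# Variables can have different dimensions, so convert them one at a time
# and choose the pandas layout explicitly.
long_form = ds["temperature"].to_dataframe()  # MultiIndex (time, station)
wide_form = ds["temperature"].to_pandas()     # 2-D variable: time x station
elevation = ds["elevation"].to_series()       # 1-D variable: Series by station
```

Converting variable by variable is one way to make the "explicit choices" explicit: each netCDF variable ends up in whichever pandas structure matches its dimensionality.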

@shoyer shoyer closed this as completed Apr 29, 2015
@jorisvandenbossche
Member

I was just thinking, maybe we could add somewhere in the IO docs a reference to this? (saying that if you want to import NetCDF -> use xray, and you can always convert that (or part of the data) to pandas dataframes)

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Apr 29, 2015
@jhamman
Author

jhamman commented Apr 29, 2015

Thanks @shoyer for elaborating on my point.

@jorisvandenbossche - seems reasonable to point to xray as a way to get netCDF data into pandas.

shoyer added a commit to shoyer/pandas that referenced this issue Apr 30, 2015
As discussed in pandas-dev#5487.

This would also be a nice place to mention other packages that connect the
pandas DataFrames with other file formats. Right now, the only one I can think
of off-hand is `root_pandas` (pandas-dev#9378, CC @ibab), but I'm sure there are more.
@shoyer
Member

shoyer commented Apr 30, 2015

@jorisvandenbossche good idea, just made a PR for that (#10027)
