New data format: ROOT files #9378

Closed
ibab opened this issue Jan 30, 2015 · 17 comments
Labels
IO Data (IO issues that don't fit into a more specific label), Needs Discussion (Requires discussion from core team before further action)

Comments

@ibab

ibab commented Jan 30, 2015

Hi,
I've recently written a tiny python package for loading/saving ROOT files as pandas DataFrames: root_pandas.
ROOT is the main data format used by particle physicists.

Do you think it might be worth adding read_root and to_root functions to pandas itself?
This could convince a lot of physicists to give pandas a try.
The dependencies it would add could be handled like it's currently done with hdf5.

ROOT performs well in comparison with hdf5 and could be a useful addition to pandas, even ignoring the fact that it is very popular in physics.

If there's interest, I would polish my code and create a pull request.

@jreback
Contributor

jreback commented Jan 30, 2015

Is this a format that the HDF5 files are set up in?

Kind of like a convention for naming and structure?

jreback added the "IO Data" label (IO issues that don't fit into a more specific label) on Jan 30, 2015
@ibab
Author

ibab commented Jan 30, 2015

It's a completely different format that was designed from the ground up for physics experiments. I think it might actually predate HDF5.

Here's an info page on the file format: https://root.cern.ch/drupal/content/root-files-1

@jreback
Contributor

jreback commented Jan 30, 2015

So I think a nice way to do this would be to create pandas-io (like we are doing in #8961)

where we can put the handling for 'non-core' IO packages, so that an import would inject them into the pandas namespace, but it would be optional (and updated independently of pandas). Maybe something like

import pandas as pd
from pandas_io import root_format  # proposed package; the import would register the ROOT methods
df.to_root()
pd.from_root()

would work
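
The injection mechanism proposed here can be illustrated without pandas. This is only a sketch under assumptions: `Frame` is a stand-in for `pandas.DataFrame`, and `register_io_method` and the body of `to_root` are hypothetical names that do not belong to pandas or to any existing pandas-io package.

```python
class Frame:
    """Stand-in for pandas.DataFrame (assumption, keeps the sketch self-contained)."""

    def __init__(self, data):
        self.data = data


def register_io_method(cls, name):
    """Return a decorator that attaches a function to `cls` as a method called `name`."""
    def decorator(func):
        setattr(cls, name, func)
        return func
    return decorator


# In the proposal, this registration would run when the optional package is imported.
@register_io_method(Frame, "to_root")
def to_root(frame, path):
    # Hypothetical writer: a real one would serialize `frame.data` to a ROOT file.
    return f"would write {len(frame.data)} rows to {path}"


df = Frame([1, 2, 3])
result = df.to_root("out.root")
```

The method only exists after the registering module has been imported, which is the optional behaviour described above.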

@ibab
Author

ibab commented Jan 30, 2015

That sounds like the right place to put a package like this.
It might be even more convenient if pandas would detect the presence of pandas-io and import all new formats automatically.

@jreback
Contributor

jreback commented Jan 30, 2015

@ibab if you'd be interested in this, then I can set up the repo and you can port things.

@jorisvandenbossche ?

@ibab
Author

ibab commented Jan 31, 2015

@jreback Sure, thanks! I'll be happy to port it over.

@TomAugspurger
Contributor

We should also make sure the new methods are into aware. I don't believe it will be too difficult since anything we read will go through a DataFrame, so there should be just one interface (or two if you split out chunked).

@shoyer
Member

shoyer commented Feb 17, 2015

It might be even more convenient if pandas would detect the presence of pandas-io and import all new formats automatically.

This is an intriguing idea, and certainly would be a win for interactive use. And I really like supporting method chaining. But there are a few tradeoffs to consider here:

  1. I don't see any way to add methods like to_root dynamically (e.g., only if ROOT is installed) without triggering potentially expensive imports at import time for pandas (see #9482, "CLN: make sure that we don't have extraneous imports").
  2. It's slightly un-Pythonic to have implicit imports from different modules. On the other hand, adding methods to pandas.DataFrame when root_pandas is imported is also not very explicit.
  3. The bigger issue is that injected methods are hard to discover or track down if they're not in pandas itself. At the very least, users expect that DataFrame methods should be documented in our API docs.

Considering these factors, it seems like creating stubs like to_root in pandas itself would be ideal. We could outsource the actual implementation to pandas-io or whatever, and refer to those external sources in the pandas documentation. pandas-io could even replace entirely generic stubs like DataFrame.to_root(*args, **kwargs) with an appropriate signature at import time.
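
A rough sketch of the stub idea follows. Everything in it is an assumption made for illustration: `Frame` stands in for `pandas.DataFrame`, and `install_root_support` and the `key` parameter are hypothetical. It shows a generic stub that fails with a readable error until an external package swaps in an implementation with a proper signature.

```python
import inspect


class Frame:
    """Stand-in for pandas.DataFrame (assumption)."""

    def to_root(self, *args, **kwargs):
        """Generic stub; the documentation would point to the external package."""
        raise ImportError("to_root requires an external ROOT I/O package")


def install_root_support(cls):
    """What an external package might do when it is imported (hypothetical)."""
    def to_root(self, path, key="default"):
        # Hypothetical implementation; a real one would write a ROOT file.
        return (path, key)
    cls.to_root = to_root


df = Frame()
try:
    df.to_root("out.root")
    stub_raised = False
except ImportError:
    stub_raised = True

# After the external package is "imported", the stub is replaced.
install_root_support(Frame)
signature = str(inspect.signature(Frame.to_root))
written = df.to_root("out.root")
```

Because the stub lives on the class, it stays discoverable and documentable even when the external package is not installed.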

@ibab
Author

ibab commented Feb 23, 2015

Adding stubs directly to pandas looks like a good idea.
The only problem I see is that pandas-io would then have to wait for new pandas releases to add new data formats.

shoyer added a commit to shoyer/pandas that referenced this issue Apr 30, 2015
As discussed in pandas-dev#5487.

This would also be a nice place to mention other packages that connect the
pandas DataFrames with other file formats. Right now, the only one I can think
of off-hand is `root_pandas` (pandas-dev#9378, CC @ibab), but I'm sure there are more.
@KonstantinSchubert

Hi,
I would be interested to know if there are any updates to this?

@jreback
Contributor

jreback commented Jul 28, 2015

as discussed above, this could be added to a pandas-io type of package if @ibab were interested in doing this

@tunnell

tunnell commented Jul 28, 2015

+1. Is @ibab interested in help?

@ibab
Author

ibab commented Jul 28, 2015

Sure, I'd be glad to help. Does a pandas-io package already exist somewhere? Or are there existing plans for how it should work?

Something like

from pandas_io import root_format
df.to_root()
pd.from_root()

should be doable.

Instead of requiring a hard dependency on root_numpy (the Cython-based package I'm using to read the data), I could raise a "You need root_numpy" type of error.
It might be problematic that

from pandas_io import something

would always trigger lots of imports, though.
So maybe

from pandas_io.root import read_root

would be better.

What do you think?

@shoyer
Member

shoyer commented Jul 28, 2015

A pandas-io package does not currently exist.

The standard way to handle hard dependency issues is to import packages inside the functions that call them, e.g., only import root_numpy inside read_root.

I don't think it's a good idea to monkey patch dataframes. I would rather suggest that you use functions, possibly with DataFrame.pipe, e.g., df.pipe(to_root).

@ibab
Author

ibab commented Jul 29, 2015

I've set up a repository at https://github.com/ibab/pandas-io.

I've decided to go with pandas.io.external as the package name and copied over my root_pandas code.
A typical usage example would be

from pandas.io.external import read_root
df = read_root('in.root')

root_numpy is imported in the read_root and to_root functions, which raise

ImportError: You need the root_numpy package for ROOT integration

if the package can't be found.
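
The deferred-import pattern described here could look roughly like the following. This is a sketch, not the code in the repository: the helper `_require` is a hypothetical name, and the conversion to a DataFrame is deliberately left out.

```python
import importlib


def _require(module_name, message):
    """Import an optional dependency by name, raising a readable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(message) from exc


def read_root(path):
    # The optional dependency is imported only when the function is called,
    # so importing this module stays cheap and never fails on its own.
    root_numpy = _require(
        "root_numpy",
        "You need the root_numpy package for ROOT integration",
    )
    # Conversion to a DataFrame is omitted; this sketch only shows the import guard.
    raise NotImplementedError("conversion omitted in this sketch")


# Demonstrate the guard with a module name that should not exist.
try:
    _require("surely_missing_module_xyz", "You need the surely_missing_module_xyz package")
    error_message = None
except ImportError as exc:
    error_message = str(exc)
```

Chaining with `from exc` keeps the original traceback available for debugging while showing the user the short message.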

I've removed the monkey patch on DataFrame, so users will have to call to_root(df, 'out.root') or use df.pipe.

I'm very open to changing any aspect of the package, if someone has an idea how it should be changed or improved.

@jorisvandenbossche
Member

I am not sure I see the benefit of having such a pandas-io package.

The difference for users is not that big:

from pandas_root import read_root   # instead of: from pandas.io.external import read_root
df = read_root('in.root')

and

from pandas_root import to_root   # instead of: from pandas.io.external import to_root
to_root(df, 'out.root')
df.pipe(to_root, 'out.root')

In any case, @ibab, thanks for exploring this! I just think we should first discuss it more thoroughly (and exploring it can help the discussion).

Some quick thoughts:

Pro:

  • discoverability (only one package to install, and there you can see options with tab completion)
  • more uniformity between external io providers
  • ...

Con:

  • harder to maintain, less ownership than with separate packages (of course, we can have a maintainer per subpackage)
  • different subpackages of varying quality (what if the maintainer of one of the io providers is not active anymore?)
  • installing multiple packages is not that hard anymore these days
  • ...

If discoverability is the main reason, I think there are also other ways to handle this (better promoting ecosystem packages in the docs, website, add it to the io docs, ..)

jreback added the "Needs Discussion" label (Requires discussion from core team before further action) on Jul 29, 2015
@jbrockmendel
Member

Closing and adding to a tracker issue #30407 for IO format requests, can re-open if interest is expressed.

8 participants