Dispatching fromfile and tofile #490

Closed
jakirkham opened this issue Oct 4, 2022 · 10 comments
Labels
API extension Adds new functions or objects to the API.

Comments

@jakirkham
Member

A common, simple case that NumPy provides is the ability to read & write a pure binary file. Even some simple file formats are little more than a header (and/or a footer) with binary array data included. Having functionality to load & save arrays generally as binary files would be quite helpful for getting users started in these cases.

@rgommers
Member

rgommers commented Oct 4, 2022

@jakirkham thanks for the suggestion. This seems like a scope expansion that's better left alone. From https://data-apis.org/array-api/latest/purpose_and_scope.html: "The following topics are out of scope: I/O, ...". There's a huge number of data formats, no common API for them across libraries, plus the semantics of data loaders are very hard to specify.

There's another reason why it is fine to have I/O routines be out of scope: you should never need to call I/O routines from within libraries. It's end users that load data (with a specific array library), and then pass that array to library functions.

fromfile also isn't an important function. I'd say that if we wanted to care about I/O, we should worry about CSV, JSON, Zarr, HDF5, and other heavily used formats first.

@rgommers rgommers added the API extension Adds new functions or objects to the API. label Oct 4, 2022
@jakirkham
Member Author

Yeah, for clarity, I wasn't suggesting we handle formats or encoding/decoding here. Just suggesting we have a way to load very simple files into memory and represent them as an array of the desired type, or write that array back to disk.

Zarr, in particular (given it was mentioned), uses very simple binary files in its representation of array data. So the request as-is would be useful here.

With CSV, both fromfile and tofile support a sep (separator), which can be used to read and write CSV files (or other text files).
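As a rough illustration of the two modes discussed here, the sketch below round-trips a small array through NumPy's existing `numpy.fromfile` and `ndarray.tofile`, once as raw binary and once as text with `sep=","`. The file names and the sample array are made up for the example, and it only covers a flat sequence of values, not a multi-row CSV with headers.

```python
import os
import tempfile

import numpy as np

data = np.arange(6, dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    # Raw binary round trip: the file holds only the array's bytes,
    # so the dtype has to be supplied again when reading.
    bin_path = os.path.join(tmp, "data.bin")  # hypothetical file name
    data.tofile(bin_path)
    from_bin = np.fromfile(bin_path, dtype=np.float64)

    # Text round trip: a non-empty `sep` switches both calls to text mode.
    txt_path = os.path.join(tmp, "data.csv")  # hypothetical file name
    data.tofile(txt_path, sep=",")
    from_txt = np.fromfile(txt_path, dtype=np.float64, sep=",")
```

Neither file stores shape or dtype information, which is part of why the thread treats this as a building block rather than a file format.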

JSON and HDF5 are probably complicated enough that implementing them in the spec wouldn't make sense. That said, the simple ability to read binary data into an in-memory array could be a useful first step toward decoding and encoding these.

@seberg
Contributor

seberg commented Oct 5, 2022

Might not be high priority at this point? If a library doesn't ditch NumPy completely, they can read both binary and text files using it and then convert using asarray(), which is guaranteed to work as it supports the buffer protocol.

What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly. That would e.g. allow loading in a mmap.mmap with no-copy semantics (if supported by the array object).
Reinterpreting binary buffers seems fairly "core" to me (of course you could create a trivial library around memoryview to do it via the buffer protocol).
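To make the no-copy idea concrete, here is a minimal standard-library sketch of reinterpreting a memory-mapped file through the buffer protocol with `memoryview.cast`. It is not an array API proposal: the temporary file and the sample values are invented for the example, and native byte order is assumed on both the write and the read side. In NumPy, `numpy.frombuffer` plays a comparable role.

```python
import mmap
import struct
import tempfile

values = (1.0, 2.0, 3.0, 4.0)

with tempfile.TemporaryFile() as f:
    # Write four native-endian doubles with no header.
    f.write(struct.pack("4d", *values))
    f.flush()

    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # cast() reinterprets the mapped bytes as doubles without copying.
        # Both views must be released before the mmap can be closed,
        # which the nested `with` takes care of.
        with memoryview(mapped) as raw, raw.cast("d") as doubles:
            n_items = len(doubles)
            loaded = doubles.tolist()  # copy out only to inspect the values
```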

HDF5 seems more like a protocol issue? Let the HDF5 library export the buffer protocol (and/or DLPack) and you already have support (this may well already be the case).

Reading (simple?) CSV may be core. But it is not trivial to implement unless you are OK with relying on the builtin csv or NumPy.

@jakirkham
Member Author

jakirkham commented Oct 5, 2022

It is worth noting numpy.fromfile already supports a like=... argument for array dispatching and tofile is simply a method on an array. So it might not even be much spec work to formalize this (if we are ok with the current state of things)

@seberg
Contributor

seberg commented Oct 5, 2022

If we are to design a new API, I would tend to remove the double meaning from fromfile: it does both binary and bytes (not even encoded bytes?), which seems awkward.
But from your current perspective, I suspect copying NumPy more or less is worthwhile anyway.

There are competing interests, I think. For this use-case, it would be nice to formalize a clean API somewhere that may not match NumPy 100% (even if end-user orientated maybe).
But for data-api, I guess the focus is more on minimal and library-oriented functionality, where this doesn't have high priority (and maybe would have to be an optional extension).

@jakirkham
Member Author

Maybe we could narrow the focus to binary files and use something like frombinfile and tobinfile.

If we wanted text, that could be another API like fromtxtfile and totxtfile.

@jakirkham
Member Author

What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly.

Agreed. Forgot to mention above that view is tracked in issue #266. It's currently scheduled for 2022, but I don't think we've discussed it (unless I missed it or am forgetting, which could very well be the case :).

@shoyer
Contributor

shoyer commented Oct 5, 2022

What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly. That would e.g. allow loading in a mmap.mmap with no-copy semantics (if supported by the array object). Reinterpreting binary buffers seems fairly "core" to me (of course you could create a trivial library around memoryview to do it via the buffer protocol).

+1. I would certainly consider adding something like tobytes/frombytes before tofile/fromfile.
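For a sense of what such a pair looks like, the standard-library `array.array` type already has `tobytes` and `frombytes` methods. The sketch below is only an analogy using that type with made-up values; it does not show anything the array API standard defines.

```python
from array import array

original = array("d", [1.5, 2.5, 3.5])

# Serialize to raw native-endian bytes: no header, no dtype, no shape.
payload = original.tobytes()

# Rebuild by declaring the element type up front, then decoding the bytes.
restored = array("d")
restored.frombytes(payload)
```

As with the file-based functions, the element type travels out of band, so the reader has to know it in advance.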

@jakirkham
Member Author

Based on our discussion, for simple binary loading users may be better off implementing a file object (maybe io.RawIOBase). Then one can rely on things like read and then use view to cast (or perhaps use readinto to read directly into the intended buffer). Also write/writelines can be used to write array(s) out.
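The file-object route can be sketched with the standard library alone. In this example `io.BytesIO` stands in for a real binary file, a `bytearray` stands in for the array's memory, and `memoryview.cast` stands in for the `view` operation; all three are assumptions made for illustration, as are the sample values.

```python
import io
import struct

# Stand-in for an opened binary file; any object with readinto() would do.
source = io.BytesIO(struct.pack("3i", 10, 20, 30))

# Preallocate the destination and let readinto() fill it in place.
buffer = bytearray(3 * struct.calcsize("i"))
n_read = source.readinto(buffer)

# Reinterpret the raw bytes as native ints without copying.
with memoryview(buffer) as raw, raw.cast("i") as ints:
    loaded = ints.tolist()

# Writing goes the other way: hand the raw bytes to write().
sink = io.BytesIO()
sink.write(buffer)
```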

@kgryte
Contributor

kgryte commented Jun 29, 2023

Based on the above discussion, we're not likely to move forward with fromfile and tofile at this time, and IO is generally out-of-scope. I'll go ahead and close this issue.

@kgryte kgryte closed this as completed Jun 29, 2023
5 participants