Dispatching `fromfile` and `tofile` #490

jakirkham · 2022-10-04T05:34:59Z

A common, simple case that NumPy provides is the ability to read & write a pure binary file. Even some simple file formats are little more than a header (and/or a footer) with binary array data included. Having functionality to load & save arrays generally as binary files would be quite helpful for getting users started in these cases.

rgommers · 2022-10-04T08:32:22Z

@jakirkham thanks for the suggestion. This seems like a scope expansion that's better left alone. From https://data-apis.org/array-api/latest/purpose_and_scope.html: "The following topics are out of scope: I/O, ...". There's a huge amount of data formats, no common API for them across libraries, plus the semantics of data loaders are very hard to specify.

There's another reason why it is fine to have I/O routines be out of scope: you should never need to call I/O routines from within libraries. It's end users that load data (with a specific array library), and then pass that array to library functions.

fromfile also isn't an important function. I'd say that if we wanted to care about I/O, we should worry about CSV, JSON, Zarr, HDF5, and other heavily used formats first.

jakirkham · 2022-10-04T09:00:41Z

Yeah for clarity wasn't suggesting we handle formats or encoding/decoding here. Just suggesting we have a way to load very simple files into memory and represent it as an array of the desired type or write it back to disk.

Zarr, in particular (given it was mentioned), uses very simple binary files in its representation of array data. So the request as-is would be useful here.

With CSV, both fromfile and tofile support a sep (separator), which can be used to read and write CSV files (or other text files).
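For illustration, a minimal sketch of the `sep` behaviour described above, assuming a one-dimensional float array and a scratch path made up for this example. In text mode `tofile` writes a flat, separator-delimited sequence and `fromfile` reads it back as a 1-D array, so shape and any header row are not preserved.

```python
import os
import tempfile

import numpy as np

a = np.array([1.5, 2.0, 3.25])

# Hypothetical scratch path, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "values.csv")

# A non-empty `sep` switches tofile/fromfile from binary to text mode.
a.tofile(path, sep=",")

# Text mode reads a flat sequence of numbers; the shape is not stored.
b = np.fromfile(path, dtype=np.float64, sep=",")
```

This only covers a single row of numbers; a multi-column CSV with a header would still need something like the builtin csv module.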

JSON and HDF5 are probably complicated enough that implementing them in the spec wouldn't make sense. That said, the simple ability to read binary data into an in-memory array could potentially be a useful first step for decoding and encoding these.

seberg · 2022-10-05T06:35:00Z

Might not be high priority at this point? If a library doesn't ditch NumPy completely, they can read both binary and text files using it and then convert using asarray(), which is guaranteed to work as it supports the buffer protocol.

What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly. That would e.g. allow loading in a mmap.mmap with no-copy semantics (if supported by the array object).
Reinterpreting binary buffers seems fairly "core" to me (of course you could create a trivial library around memoryview to do it via the buffer protocol).
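A rough sketch of what this looks like with NumPy today, assuming a small scratch file of little-endian float64 values created just for the example. `np.frombuffer` accepts any object exposing the buffer protocol, including an `mmap.mmap`, and reinterprets the bytes in place instead of copying them.

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical scratch file holding four little-endian float64 values.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(4, dtype="<f8").tofile(path)

with open(path, "rb") as f:
    # The mapping stays valid after the file object is closed.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Reinterpret the mapped bytes as float64 without copying them.
arr = np.frombuffer(mm, dtype="<f8")
total = float(arr.sum())
borrows_memory = not arr.flags.owndata  # the array does not own its data

# Drop the array before closing the mapping so nothing refers to unmapped memory.
del arr
mm.close()
```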

HDF5 seems more like a protocol issue? Let the HDF5 library export the buffer protocol (and/or DLPack) and you already have support (this may well already be the case).

Reading (simple?) CSV may be core. But it is not trivial to implement unless you are OK with relying on the builtin csv or NumPy.

jakirkham · 2022-10-05T07:33:48Z

It is worth noting numpy.fromfile already supports a like=... argument for array dispatching and tofile is simply a method on an array. So it might not even be much spec work to formalize this (if we are ok with the current state of things)
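A small sketch of the current NumPy behaviour referred to here, assuming a scratch file created for the example. A plain NumPy array stands in for the `like` reference object; with another library's array that implements `__array_function__`, the call would be dispatched to that library instead.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "ints.bin")
np.arange(3, dtype=np.int32).tofile(path)

# `like` takes a reference array; a NumPy array is used here as a stand-in.
ref = np.empty(0)
out = np.fromfile(path, dtype=np.int32, like=ref)

# tofile is a method on the array object, so no function-level dispatch is involved.
out.tofile(path)
```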

seberg · 2022-10-05T07:44:45Z

If we are to design new API, I would tend to remove the double meaning from fromfile: It does both binary and bytes (not even encoded bytes?) which seems awkward.
But from your current perspective, I suspect copying NumPy more or less is worthwhile anyway.

There are competing interests, I think. For this use-case, it would be nice to formalize a clean API somewhere that may not match NumPy 100% (even if end-user oriented maybe).
But for data-api, I guess the focus is more on minimal and library-oriented functionality, where this doesn't have high priority (and maybe would have to be an optional extension).

jakirkham · 2022-10-05T10:28:21Z

Maybe we could narrow the focus to binary files and use something like frombinfile and tobinfile.

If we wanted text, that could be another API like fromtxtfile and totxtfile.

jakirkham · 2022-10-05T22:10:34Z

> What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly.

Agreed. Forgot to mention above that view is tracked in issue #266. It's currently scheduled for 2022, but I don't think we've discussed it (unless I missed it or am forgetting, which could very well be the case :).

shoyer · 2022-10-05T22:23:39Z

> What we are missing to some degree may be the frombuffer/view capability of reinterpreting binary buffers directly. That would e.g. allow loading in a mmap.mmap with no-copy semantics (if supported by the array object). Reinterpreting binary buffers seems fairly "core" to me (of course you could create a trivial library around memoryview to do it via the buffer protocol).

+1 I would certainly consider adding something like tobytes/frombytes before tofile/fromfile.
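For reference, a minimal sketch of the round trip using NumPy's closest existing counterparts, `ndarray.tobytes` and `np.frombuffer` (NumPy has no function literally named `frombytes`; that name is only the suggestion above). The array contents are made up for the example.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.int16)

# tobytes returns a C-order copy of the data: 4 elements * 2 bytes each.
raw = a.tobytes()

# Raw bytes carry no dtype or shape, so both have to be supplied again.
b = np.frombuffer(raw, dtype=np.int16).reshape(a.shape)
```

Because `bytes` is immutable, the array returned by `np.frombuffer` here is read-only.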

jakirkham · 2022-10-06T18:05:00Z

Based on our discussion, for simple binary loading users may be better off implementing a file object (maybe io.RawIOBase). Then one can rely on things like read and then use view to cast (or perhaps use readinto to read directly into the intended buffer). Also write/writelines can be used to write array(s) out.
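A rough sketch of that approach with NumPy and the builtin file object, assuming a scratch path and float32 data chosen for the example. An ordinary `open()` file is used here in place of a custom io.RawIOBase subclass; the calls on it (`write`, `read`, `readinto`) are the same.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch path, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "floats.bin")
src = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# write() accepts bytes-like objects; expose the array as a flat byte view.
with open(path, "wb") as f:
    f.write(memoryview(src).cast("B"))

# Option 1: read() the raw bytes, then use view to cast to the target dtype.
with open(path, "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)
via_view = raw.view(np.float32)

# Option 2: preallocate the array and let readinto() fill its memory directly.
dst = np.empty(3, dtype=np.float32)
with open(path, "rb") as f:
    nread = f.readinto(memoryview(dst).cast("B"))
```

The second option avoids the intermediate `bytes` object, at the cost of needing to know the element count up front.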

kgryte · 2023-06-29T08:52:29Z

Based on the above discussion, we're not likely to move forward with fromfile and tofile at this time, and IO is generally out-of-scope. I'll go ahead and close this issue.

jakirkham mentioned this issue Oct 4, 2022

Overload numpy.fromfile() and cupy.fromfile() rapidsai/kvikio#135

Merged

rgommers added the API extension label Oct 4, 2022

kgryte closed this as completed Jun 29, 2023
