# A PyTorch - NumPy compatibility layer

**Authors:**
* @ev-br
* @lezcano
* @rgommers

## Summary

This RFC describes a proposal for a translation layer from NumPy into PyTorch.
In simple terms, this accounts for implementing most of NumPy's API (`ndarray`,
the `numpy`, `numpy.linalg`, `numpy.fft` modules, etc.) using `torch.Tensor`
and PyTorch ops as backend.

This project has one main goal, as per the
[initial design document](https://docs.google.com/document/d/1gdUDgZNbumFORRcUaZUVw790CtNYweAM20C1fbWMNd8):
1. Make TorchDynamo understand NumPy calls

The work is being done at [numpy_pytorch_interop](https://github.com/Quansight-Labs/numpy_pytorch_interop/).

## Motivation

### An introductory example

Let's start with some examples.

Consider the following snippet:

```python
import numpy as np

x = np.random.randn(3, 4)
y = np.random.randn(4, 3)
z = np.dot(x, y)
w = z.sum()
```

When we trace this program with the compat layer, the semantics of the
program would stay the same, but the implementation would be equivalent to

```python
import torch

x = torch.randn(3, 4, dtype=torch.float64)
y = torch.randn(4, 3, dtype=torch.float64)
z = torch.matmul(x, y)
w = z.sum()
```

Now consider a toy function that mixes NumPy and PyTorch:

```python
import torch
import numpy as np

def fn(x, y):
    return np.multiply(x, y).sum()

x = torch.randn(5)
y = torch.randn(5)
result = fn(x, y)
t_results = torch.empty(5)
t_results[0] = result  # store the result in a torch.Tensor
```

This code mixing NumPy and PyTorch already works, as `torch.Tensor` implements
the `__array__` method. For it to work manually with the compatibility layer, we
would need to wrap and unwrap the inputs / outputs. This could be done by
modifying `fn` as

```python
import torch_np as tnp

def fn(x, y):
    x = tnp.asarray(x)
    y = tnp.asarray(y)
    ret = tnp.multiply(x, y).sum()
    return ret.tensor.numpy()
```

This process would be done automatically by TorchDynamo, so we would simply need to write

```python
@torch.compile
def fn(x, y):
    return np.multiply(x, y).sum()
```

### The observable behavior

The two main ideas driving the design of this compatibility layer were the following:

1. The behavior of the layer should be as close to that of NumPy as possible
2. The layer follows NumPy master

The following design decisions follow from these:

**Default dtypes**. One of the issues that most often bites users when moving their
codebase from NumPy to JAX is the default dtype changing from `float64` to
`float32`. So much so that this is noted as one of
[JAX's sharp edges](https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#double-64bit-precision).
Following the spirit of making everything match NumPy by default, we choose the
NumPy defaults whenever the `dtype` was not chosen in a factory function.
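
As a quick sketch of the intended default behavior (assuming the compat module is importable as `torch_np`; the commented dtypes are illustrative):

```python
import torch
import torch_np as tnp

tnp.ones(3)    # dtype defaults to float64, matching NumPy
torch.ones(3)  # dtype defaults to torch.float32, PyTorch's usual default
```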

**TODO(Lezcano)**: I just realized that we do not have a clean way to change
the default dtypes of `torch_np` to those from PyTorch. We should implement
that utility flag, similar to
[`torch.set_default_dtype`](https://pytorch.org/docs/stable/generated/torch.set_default_dtype.html).
Perhaps call it `torch_np.use_torch_defaults()`, and then add a way for users
to be able to set their own int/float/complex defaults.

**TODO(Lezcano)**: Do we use the defaults just in factory functions, or do we also
use them anywhere else? -> Check

**NumPy scalars**. NumPy's type system is tricky. At first sight, it looks
quite a bit like PyTorch's, but it has a few more dtypes, like `np.uint16` or
`np.longdouble`. Upon closer inspection, one finds that it also has
[NumPy scalar](https://numpy.org/doc/stable/reference/arrays.scalars.html) objects.
NumPy scalars are similar to Python scalars, but with a fixed width. NumPy scalars
are NumPy's preferred return class for reductions and other operations that
return just one element. NumPy scalars do not play particularly well with
computations on devices like GPUs, as they live on the CPU. Implementing NumPy
scalars would mean that we need to synchronize after every `sum()` call, which
is less than ideal. Instead, whenever a NumPy scalar would be returned, we
return a 0-D tensor, as PyTorch does.
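
For illustration, a sketch of the difference (the commented return types are the relevant part; the `torch_np` import name is the compat module):

```python
import numpy as np
import torch_np as tnp

type(np.asarray([1.0, 2.0]).sum())   # numpy.float64: a NumPy scalar, on the CPU
type(tnp.asarray([1.0, 2.0]).sum())  # a 0-D tnp.ndarray backed by a torch.Tensor
```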

**Type promotion**. Another not-so-well-known fact of NumPy's casting system is
that it is data-dependent. Python scalars can be used in pretty much any NumPy
operation: any operation that accepts a 0-D array also accepts a Python scalar.
If you provide an operation with a Python scalar, it will be cast to the
smallest dtype that can represent it, and it will then participate in type
promotion, allowing for some rather interesting behaviour:
```python
>>> np.asarray([1], dtype=np.int8) + 127
array([-128], dtype=int8)
>>> np.asarray([1], dtype=np.int8) + 128
array([129], dtype=int16)
```
This data-dependent type promotion will be deprecated in NumPy 2.0, and will be
replaced with [NEP 50](https://numpy.org/neps/nep-0050-scalar-promotion.html).
As such, to be forward-looking and for simplicity, we chose to implement the
type promotion behaviour proposed in NEP 50, which is much closer to that of
PyTorch.

Note that the decision to go with NEP 50 complements the previous one of
returning 0-D arrays in place of NumPy scalars since, currently, 0-D arrays do
not participate in type promotion in NumPy (but they will in NumPy 2.0):
```python
int64_0d_array = np.array(1, dtype=np.int64)
np.result_type(np.int8, int64_0d_array) == np.int8
```
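
PyTorch's scalar promotion is already value-independent, which is the behavior NEP 50 standardizes. A small sketch of the semantics we target:

```python
import torch

x = torch.tensor([1], dtype=torch.int8)
(x + 2).dtype    # torch.int8: the Python scalar adapts to the tensor's dtype
(x + 127).dtype  # torch.int8 as well; the result dtype does not depend on the value
```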

**Versioning**. It should be clear from the previous points that NumPy has a
fair amount of questionable and legacy pain points. As such, we decided that,
rather than trying to fight these, we would declare that the compat layer
follows the behavior of NumPy's master branch. Given the stability of NumPy's
API and how battle-tested its main functions are, we do not expect this to
become a big maintenance burden. If anything, it should make our lives easier,
as some parts of NumPy will soon be simplified, and we will not need to
implement them, as described above.

## The `torch_np` module

The bulk of the work went into implementing a system that allows us to
implement NumPy operations in terms of those of PyTorch. The main design goals
were:
1. Implement *most* of NumPy's API
2. Preserve NumPy semantics as much as possible

We say *most* of NumPy's API because NumPy's API is not only massive, but also
there are parts of it which cannot be implemented in PyTorch. For example,
NumPy has support for arrays of string, datetime, structured, and other dtypes.
Negative strides are another example of a feature that is simply out of scope.
We put together a list of things that are out of the scope of this project in the
[following issue](https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/73).

For the bulk of the functions, we started by prioritizing the most common
NumPy operations.

The second point of preserving NumPy semantics as much as possible will be used
in the sequel to discuss some points like the default dtypes that are used
throughout the implementation.

**Visibility of the module**. For simplicity, this RFC assumes that the
`torch_np` module will not be public, as the decision for it to be made public
was met with different opinions. We discuss these in the "Unresolved Questions"
section.

### Annotation-based preprocessing

NumPy accepts virtually anything that smells like an array as input to its
operators:

```python
>>> np.concatenate([(1, 2), (3, 4), (5, 6)])
array([1, 2, 3, 4, 5, 6])
```

To implement NumPy in terms of PyTorch, for any operation we would need to put
the inputs into tensors, perform the operations, and then wrap the tensor into
a `torch_np.ndarray` (more on this class later).

To avoid all this code repetition, we implement the functions in two steps: we
first define the function in terms of `torch.Tensor`s, annotating its inputs,
and we then wrap it with a decorator that takes care of gathering all the
inputs at runtime and normalizing them according to their annotations.

We currently have four annotations (and small variations of them):
- `ArrayLike`: The input can be a `torch_np.ndarray`, a list of lists, a
  scalar, or anything that NumPy would accept. It returns a `torch.Tensor`.
- `DTypeLike`: Takes a `torch_np` dtype and returns a PyTorch dtype.
- `AxisLike`: Takes anything that can be accepted as an axis (e.g. a tuple or
  an `ndarray`) and returns a tuple.
- `OutArray`: Asserts that the input is a `torch_np.ndarray`. This is used
  to implement the `out` arg.

Note that none of the code here makes use of NumPy. We are writing
`torch_np.ndarray` above to make our intent more explicit, but there
shouldn't be any ambiguity here.
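
To make the two-step scheme concrete, here is a minimal sketch of how such a normalizing decorator could look. The names (`normalizer`, `normalize`) and the simplified positional-argument handling are illustrative, not the actual implementation:

```python
import functools
import typing

import torch

ArrayLike = typing.NewType("ArrayLike", object)  # marker annotation

class ndarray:
    """Thin wrapper around a torch.Tensor (see the next section)."""
    def __init__(self, tensor):
        self.tensor = tensor

def normalize(value, annotation):
    # ArrayLike inputs are moved into torch.Tensors; others pass through
    if annotation is ArrayLike:
        if isinstance(value, ndarray):
            return value.tensor
        return torch.as_tensor(value)
    return value

def normalizer(func):
    # Gather the annotated inputs at runtime and normalize each of them
    @functools.wraps(func)
    def wrapper(*args):
        hints = {k: v for k, v in func.__annotations__.items() if k != "return"}
        args = tuple(normalize(a, t) for a, t in zip(args, hints.values()))
        return ndarray(func(*args))  # wrap the resulting tensor back
    return wrapper

@normalizer
def diag(v: ArrayLike, k: int = 0):
    return torch.diag(v, k)

d = diag([[1, 2], [3, 4]])  # a list of lists is accepted, as in NumPy
```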

**OBS(Lezcano)**: `DTypeLike` should be `Optional[DTypeLike]`.

### The `ndarray` class

Once the free functions are implemented, implementing a `torch_np.ndarray`
class that wraps a `torch.Tensor` (exposed through its `.tensor` attribute)
is rather simple. We simply register all the free functions as methods or
dunder methods, as appropriate. We also forward the properties of the wrapper
to those of the underlying PyTorch tensor, and we are done.
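
A minimal sketch of this registration, under the same illustrative names as above (the real class covers many more methods and properties):

```python
import torch

class ndarray:
    def __init__(self, tensor):
        self.tensor = torch.as_tensor(tensor)

    @property
    def shape(self):
        # Properties are forwarded to the underlying torch.Tensor
        return tuple(self.tensor.shape)

# Free functions of the module, implemented in terms of PyTorch ops
def add(x, y):
    return ndarray(torch.add(x.tensor, y.tensor))

def sum(x):
    return ndarray(torch.sum(x.tensor))

# Register the free functions as methods / dunder methods on the class
ndarray.__add__ = add
ndarray.sum = sum

a = ndarray([1, 2]) + ndarray([3, 4])  # dispatches to the free function add
total = a.sum()                        # and reductions work as methods
```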

### Testing

274
295
The testing of the framework was done via ~~ copying~~ vendoring tests from the
275
296
NumPy test suit. Then, we would replace the NumPy imports for imports with
276
- ` torch.numpy ` . The failures on these tests were then triaged and discussed the
297
+ ` torch_np ` . The failures on these tests were then triaged and discussed the
277
298
priority of fixing each of them.
278
299
279
300
In the (near) future, we plan to get some real world examples and run them
280
301
through the library, to test its coverage and correctness.
281
302
282
- ## Limitations
283
-
284
- One of the known limitations of this approach is the efficiency in eager.
285
- Similar to PrimTorch, sometimes we needed to work around some limitations of
286
- PyTorch (e.g. support for some operations for ` float16 ` ) or some ways PyTorch
287
- deviates from NumPy by implementing things manually calling several ` torch `
288
- operations. This, when executed in eager mode and, in particular, on CUDA
289
- devices, will result on a perf-hit. To alleviate this, we tried to dispatch
290
- NumPy functions to PyTorch functions with as few indirections as possible, to
291
- alleviate the number of kernels called when executed on eager mode.
303
+ ### Limitations
292
304
293
- There are some known limitations. Some of them are tracked in the second part
294
- of the [ OP of this issue] ( https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/73 ) .
305
+ A number of known limitations are tracked in the second part of the
306
+ [ OP of this issue] ( https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/73 ) .
295
307
There are some more in [ this issue] ( https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/86 ) .
296
308
When landing all this, we will create a comprehensive document with the differences
between NumPy and `torch_np`.

### Beyond Plain NumPy

**GPU**. The current implementation has only been implemented and tested on
CPU. We expect GPU coverage to be as good as the CPU coverage, to the extent
that PyTorch's GPU ops match their CPU counterparts. If the original tensors
are on the GPU, the whole execution should be performed on the GPU.

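For instance, once the TorchDynamo bindings described below land, something along these lines should run entirely on the GPU (a sketch; `torch.compile` is the assumed tracing entry point):

```python
import torch
import numpy as np

@torch.compile
def fn(x, y):
    return np.multiply(x, y).sum()

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
result = fn(x, y)  # the NumPy calls are traced into PyTorch ops on the GPU
```
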
**TODO(Lezcano)**. We should probably also run the tests on CUDA.

**Gradients**. We have not tested gradient tracking either, as we have yet to
find some good examples on which to test it, but it should be a simple
corollary of all this effort. If the original tensors fed into the function
have `requires_grad=True`, they will track the gradients of the internal
implementation, and the user can then differentiate through the NumPy code.

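A sketch of what differentiating through NumPy code would look like, under the same assumptions as the GPU example above:

```python
import torch
import numpy as np

@torch.compile
def fn(x, y):
    return np.multiply(x, y).sum()

x = torch.randn(3, requires_grad=True)
y = torch.randn(3)
loss = fn(x, y)
loss.backward()  # gradients flow through the traced NumPy code
print(x.grad)    # equals y, since d(sum(x*y))/dx = y
```
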
**TODO(Lezcano)**. Picking up simple NumPy programs from the internet would be good for these autograd tests.

### Bindings to TorchDynamo

The bindings for NumPy at the TorchDynamo level are currently being developed at [#95849](https://github.com/pytorch/pytorch/pull/95849).

## Unresolved Questions

A question was left open in the initial discussion: should the module `torch_np` be publicly exposed as `torch.numpy` or not?

A few arguments in favor of making it public:
* People could use it in their NumPy programs just by changing the import to
  `import torch.numpy as np`. This could be a selling point similar to JAX's
  `jax.numpy`, which could incentivize adoption.
* People would not need to use the whole PyTorch 2.0 stack to start using
  PyTorch in their codebases.
  * See [this experiment in scikit-learn](https://github.com/scikit-learn/scikit-learn/pull/25956),
    where they got a 7x speed-up on CPU on a layer just by using `torch.linalg`.
* Since the layer is rather thin and in pure Python, if there are bugs,
  external contributors could easily help fix them or extend the supported
  functionality.

A few arguments against:
* The compat layer introduces a number of type conversions that may produce somewhat
  slow code when used in eager mode.
  * [Note] Keeping this in mind, we tried to use as few operators as possible
    in the implementations, to make it reasonably fast in eager mode.
* Exposing `torch.numpy` would create a less performant secondary entry point
  to many of the functions in PyTorch. This could be a trap for new users.