
Subtensor RV lift does not work correctly with Categorical RVs #230

Closed
jessegrabowski opened this issue Jan 2, 2023 · 7 comments · Fixed by #281
Labels
bug Something isn't working

Comments

@jessegrabowski
Member

jessegrabowski commented Jan 2, 2023

Describe the issue:

Originally raised on discourse here.

It appears that when pm.Categorical has observed data with missing values, the variable is re-instantiated inside model.make_obs_var with the wrong number of categories. This example shows that the new number of categories is indeed controlled by the shape of the data:

import pymc as pm
import numpy as np

# No error
data = np.ma.masked_equal([1, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("hi_idx", p=[0.1, 0.2, 0.7], observed=data)
pm.draw(idx, 100).max()
>>> Out: 2.0

data = np.ma.masked_equal([1, -1], -1)
with pm.Model():
    idx = pm.Categorical("hi_idx", p=[0.1, 0.2, 0.7], observed=data)
pm.draw(idx, 100).max()
>>> Out: 1.0

If the data are longer than the number of classes, the code will error out, as shown below:

Reproducible code example:

import pymc as pm
import numpy as np

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("hi_idx", p=[0.1, 0.2, 0.7], observed=data)
pm.draw(idx, 100).max()

Error message:

IndexError: index 5 is out of bounds for axis 0 with size 3
Apply node that caused the error: AdvancedSubtensor1(TensorConstant{[0.1 0.2 0.7]}, TensorConstant{[5 6]})
Toposort index: 1
Inputs types: [TensorType(float64, (3,)), TensorType(uint8, (2,))]
Inputs shapes: [(3,), (2,)]
Inputs strides: [(8,), (1,)]
Inputs values: [array([0.1, 0.2, 0.7]), array([5, 6], dtype=uint8)]
Outputs clients: [[categorical_rv{0, (1,), int64, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x12AD72820>), TensorConstant{(1,) of 2}, TensorConstant{4}, AdvancedSubtensor1.0)]]

Backtrace when the node is created (use PyTensor flag traceback__limit=N to make it longer):
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3194, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3373, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/var/folders/7b/rzxy96cj0w751_6td3g2yss00000gn/T/ipykernel_96907/3148429013.py", line 3, in <module>
    idx = pm.Categorical(f"hi_idx", p=[0.1, 0.2, 0.7], observed=data)
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/pymc/distributions/distribution.py", line 457, in __new__
    return super().__new__(cls, name, *args, **kwargs)
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/pymc/distributions/distribution.py", line 310, in __new__
    rv_out = model.register_rv(
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/pymc/model.py", line 1348, in register_rv
    rv_var = self.make_obs_var(rv_var, observed, dims, transform)
  File "/Users/jessegrabowski/mambaforge/envs/econ/lib/python3.9/site-packages/pymc/model.py", line 1425, in make_obs_var
    (missing_rv_var,) = local_subtensor_rv_lift.transform(fgraph, fgraph.outputs[0].owner)

HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

PyMC version information:

PyMC version: 5.0.1
PyTensor version: 2.8.11

Context for the issue:

No response

@jessegrabowski jessegrabowski added the bug Something isn't working label Jan 2, 2023
@jessegrabowski
Member Author

jessegrabowski commented Jan 3, 2023

I believe this is actually a pytensor bug in pytensor.tensor.random.rewriting.basic.local_subtensor_rv_lift. I guess I'll open a separate issue over there.

I may have spoken too soon. 'local_subtensor_rv_lift' is where the Categorical's p parameters are masked away, but I'm not sure that function is doing anything wrong. The root of the problem seems to be that the parameter p of the Categorical distribution doesn't automatically carry shape information the way the parameters of other distributions do. It seems like a special case needs to be added somewhere to handle this, but I'm not positive where it should go.
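
A small sketch of what this means in practice (a hypothetical check, comparing against pm.Normal, whose vector parameter does add batch dimensions):

import pymc as pm

# a length-3 mu gives three independent Normals...
print(pm.draw(pm.Normal.dist(mu=[0.0, 1.0, 2.0])).shape)      # (3,)
# ...but a length-3 p is a single Categorical's class probabilities, not a batch
print(pm.draw(pm.Categorical.dist(p=[0.1, 0.2, 0.7])).shape)  # ()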

@jessegrabowski jessegrabowski changed the title BUG: Number of classes in pm.Categorical is inferred from the data shape when observed has missing values BUG: Class probabilities in pm.Categorical are split into observed and missing when observed has missing values Jan 3, 2023
@ricardoV94
Member

ricardoV94 commented Feb 28, 2023

As you first guessed, the issue is in local_subtensor_rv_lift, which naively tries to split the p parameter as if it had ndim_params=0 (i.e. one entry per independent dimension). The splitting should ignore the last dimension, so if you have p=[[.5, .5], [.25, .75]] you split into two RVs, one with p=[.5, .5] and another with p=[.25, .75].
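
For illustration, a minimal numpy sketch of the intended split (using the values above, not the rewrite's actual code):

import numpy as np

p = np.array([[0.5, 0.5], [0.25, 0.75]])  # one row of class probabilities per draw

# index only the batch (first) dimension, leaving the core (last) dimension intact
p_for_rv0 = p[0]  # [0.5, 0.5]
p_for_rv1 = p[1]  # [0.25, 0.75]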

Categorical is a bit unique here because it's one of the few (maybe the only?) univariate distributions with a vector parameter in the core case.
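
For reference, a quick way to see this (a sketch assuming the ndim_supp / ndims_params attribute names of recent pytensor RandomVariable ops):

import pymc as pm

rv = pm.Categorical.dist(p=[0.1, 0.2, 0.7])
print(rv.owner.op.ndim_supp)     # 0 -> each draw is a scalar, i.e. univariate
print(rv.owner.op.ndims_params)  # p has one core dimension -> a vector parameter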

By the way, fixing the bug may be related to #49

@ricardoV94
Member

Moving this to Pytensor

@ricardoV94 ricardoV94 transferred this issue from pymc-devs/pymc Feb 28, 2023
@ricardoV94 ricardoV94 changed the title BUG: Class probabilities in pm.Categorical are split into observed and missing when observed has missing values Subtensor RV lift does nto work correctly with Categorical RVs Feb 28, 2023
@ricardoV94 ricardoV94 changed the title Subtensor RV lift does nto work correctly with Categorical RVs Subtensor RV lift does not work correctly with Categorical RVs Feb 28, 2023
@greenguy33

Hi @ricardoV94 and @jessegrabowski, thanks for documenting this. I encountered this issue while trying to impute missing data for categorical variables. Is there a workaround available in the meantime, while the bug is being fixed?

@ricardoV94
Member

ricardoV94 commented Apr 21, 2023

@greenguy33 you can do the imputation yourself. The only thing that happens behind the scenes is that we create two Categorical variables, one fully observed and one unobserved (for the entries that were nan). You just have to split the p vectors sensibly between the two and you should get the same result.

Simplest case

import numpy as np
import pymc as pm

p = [[0.5, 0.5], [0.3, 0.7]]
data = [np.nan, 1]
with pm.Model():
    # entry 0 of data is missing -> model it as an unobserved Categorical
    cat_unobserved = pm.Categorical("cat_unobserved", p=p[0])
    # entry 1 is observed -> condition on it directly
    cat_observed = pm.Categorical("cat_observed", p=p[1], observed=data[1])
    # stitch the two back together in the original order
    cat = pm.Deterministic("cat", pm.math.stack((cat_unobserved, cat_observed)))
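
For the masked-data case from the original report, the same idea can be spelled out by hand. This is only a sketch assuming a single p vector shared by all entries; the reassembly with pt.set_subtensor and the variable names are illustrative, not PyMC's built-in imputation:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
p = [0.1, 0.2, 0.7]  # one p vector shared by all entries
missing = np.asarray(data.mask)

with pm.Model():
    # fully observed part: no missing entries, so the subtensor lift never triggers
    cat_observed = pm.Categorical("cat_observed", p=p, observed=data.compressed())
    # unobserved part: one latent Categorical per masked entry
    cat_unobserved = pm.Categorical("cat_unobserved", p=p, shape=int(missing.sum()))
    # reassemble the full vector in the original order
    full = pt.zeros(data.shape, dtype="int64")
    full = pt.set_subtensor(full[np.flatnonzero(~missing)], cat_observed)
    full = pt.set_subtensor(full[np.flatnonzero(missing)], cat_unobserved)
    cat = pm.Deterministic("cat", full)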

@ricardoV94
Member

@greenguy33 the fix is included in the latest release; you should be able to use it in a couple of hours.

@greenguy33

@ricardoV94 thanks very much! Will give it a try!
