Allow encoding of non-numbers (e.g. missingness, deletions, ranges) #195

krivard · 2020-08-23T17:51:09Z

Since moving to a data versioning scheme, there is no longer any way to remove a row from COVIDcast without removing all previous versions of that row as well (so that it's as if it was never published at all). This is hazardous -- leaving the row in is inaccurate, and removing the row gives forecasters access to future-privileged information that will not match realtime usage.

We are developing a survey of different kinds of missingness and deletions that occur in the different COVIDcast sources here to help spec out an encoding system.

Some additional conversation on this is in a thread on the first set of performance fixes, but it looks like the column additions mentioned there didn't actually make it into staging this time around.

RoniRos · 2020-12-16T16:42:02Z

Suppressed/censored values are a common facet of epidemiological data, so we need to build support for it into Epidata. This means a variety of non-numeric values. In the case of HHS's Hospitalization By Facility data stream, it will be a value that means "a number in the range [0-3]". There are cases where zero could be singled out, e.g. in Change Healthcare data we have a value for "a number in the range [1-3]". In addition to supporting numerical ranges, there are a variety of other possible values ("censored", "unreported", "N/A", etc.).

Implementation:

My suggestion is that we maintain an open-ended and ever growing list of special values. To start with, we'll have all the types of missingness and deletions that we have encountered, as well as all types of numeric ranges we encountered or expect to encounter (e.g. [0-3], [1-3] etc.). Our code would then be able to ask about specific values and do the right thing with them.

As far as how to represent these values in a database, there are likely known solutions that have been worked out for any self-respecting DB that supports missingness. But since I don't know what they are, let me suggest something very basic: We probably don't want to use a separate non-numeric stream of values, because it will double the size of the database. Instead, we should expropriate a range of numerical values. With an integer stream, it would have been ideal to expropriate the 1000 largest negative or positive numbers represented by a 64 bit integer, but this is too tied to low level architecture. Instead, we can expropriate the smallest 1000 negative numbers, namely -1 thru -1000. This will require that if we actually use negative integers in our streams, we substract 1000 from them before storage, and add 1000 upon retrieval. This could be generalized to floating point data as well.

krivard · 2020-12-16T17:56:09Z

We'll have to think carefully about how these non-numeric values should be interpreted by downstream clients. As I understand it, data frame implementations in both R and Python require all values in a column to have the same data type. It's not clear how or if users will expect non-numeric values to propagate into data frames.

RoniRos · 2020-12-17T02:02:30Z

I was thinking this representation will live only in the DB, not in the served data. Everything should be handled under the hood in the API. Namely, once the data is extracted from the DB, every value is split into two values: a regular numerical value, and a 'case' indicator. One value of the case indicator will be "normal", meaning interpret the numerical value as usual. The other cases may say the data is 'missing', 'N/A', 'deleted', 'range[1-3]', 'range[<1000]', etc., in which case the numerical value is ignored.

The proposal is essentially to encode two values with a single variable. Otherwise, there will be a huge waste of DB space. I suppose the alternative is compression of sequences of DB values, but it's hard to see how that will work with the non-numerical values are interspersed randomly. Another solution is to use a bit-based data representation with the first bit determining the meaning of the other bits.

How are different types of missingness represented in commercial DB packages? @krivard @dfarrow0

krivard · 2020-12-17T21:00:08Z

I am not aware of any commercial package that represents more than one kind of missingness. Everybody has a null/nil/na value, but only one. Some brief googling turns up articles on how to handle missingness in feature vectors for machine learning models, and guidance for data scientists to always recode missing-value constants (like -999999) to a more appropriate (ie NULL) value before proceeding with their analysis.

dfarrow0 · 2020-12-17T21:22:15Z

How are different types of missingness represented in commercial DB packages?

i don't have much experience with that. where i work, the common pattern is to define non-primitive data types using the protobuf language. your proposal would be something like this:

message Interval {
  double lower = 1;
  double upper = 2;
  bool lower_inclusive = 3;
  bool upper_inclusive = 4;
}

message Value {
  oneof {
    double value = 1;
    Interval interval = 2;
    bool missing = 3;
    bool deleted = 4;
  }
}

in the above, a value could be 3.1415 or missing or deleted or [0, 3] or [0, 1000) etc etc. so while it's a rich representation, it's more complicated to store and query. i don't know of any databases that handle protobuf messages (although they may exist). you might could do something equivalent with json and nosql (maybe mongo?). but i think you would lose a lot of the performance you get from a relational database like mariadb. in any case, the end user would still need to handle special values somehow.

RoniRos · 2020-12-18T01:01:39Z

Thank you both. If I understand correctly David's protobuf example, the main relative advantage of what I am proposing is saving in storage space. But maybe that's not a significant consideration.

sgratzl · 2021-06-17T14:47:57Z

fixed with #417

krivard added the project proposal This is a big project deserving of a full project requirements doc label Oct 13, 2020

krivard mentioned this issue Oct 22, 2020

Minimum value thresholding and megaFIPS aggregation for low counts cmu-delphi/covidcast-indicators#361

Open

krivard added this to the December OKRs milestone Nov 12, 2020

SumitDELPHI added the Engineering Used to filter issues when synching with Asana label Dec 2, 2020

RoniRos changed the title ~~Explicitly encode missingness and deletions~~ Allow encoding of non-numbers (e.g. missingness, deletions, ranges) Dec 16, 2020

dshemetov mentioned this issue Feb 17, 2021

NAN coding database and acquisition changes #417

Merged

5 tasks

sgratzl closed this as completed Jun 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow encoding of non-numbers (e.g. missingness, deletions, ranges) #195

Allow encoding of non-numbers (e.g. missingness, deletions, ranges) #195

krivard commented Aug 23, 2020 •

edited by sync-by-unito bot

Loading

RoniRos commented Dec 16, 2020

krivard commented Dec 16, 2020

RoniRos commented Dec 17, 2020

krivard commented Dec 17, 2020

dfarrow0 commented Dec 17, 2020

RoniRos commented Dec 18, 2020

sgratzl commented Jun 17, 2021

Allow encoding of non-numbers (e.g. missingness, deletions, ranges) #195

Allow encoding of non-numbers (e.g. missingness, deletions, ranges) #195

Comments

krivard commented Aug 23, 2020 • edited by sync-by-unito bot Loading

RoniRos commented Dec 16, 2020

krivard commented Dec 16, 2020

RoniRos commented Dec 17, 2020

krivard commented Dec 17, 2020

dfarrow0 commented Dec 17, 2020

RoniRos commented Dec 18, 2020

sgratzl commented Jun 17, 2021

krivard commented Aug 23, 2020 •

edited by sync-by-unito bot

Loading