Commit 732d696

Merge pull request #427 from cmu-delphi/large_spikes_validator: Large spikes validator
2 parents 1304e6f + ec44327, commit 732d696

File tree

4 files changed: +369 −13 lines


validator/PLANS.md

Lines changed: 8 additions & 5 deletions

@@ -22,6 +22,7 @@
 * Most recent date seen in source data is not older than most recent date seen in reference data
 * Similar number of obs per day as recent API data (static threshold)
 * Similar average value as API data (static threshold)
+* Outliers in cases and deaths signals using [this method](https://github.com/cmu-delphi/covidcast-forecast/tree/dev/corrections/data_corrections)
 * Source data for specified date range is empty
 * API data for specified date range is empty

@@ -44,6 +45,9 @@

 ### Larger issues

+* Set up validator to use Sir-complains-a-lot alerting functionality on a signal-by-signal basis (should send alert output as a Slack message and "@" a set person), as a stop-gap before the logging server is ready
+  * This is [how Sir-CAL works](https://github.com/benjaminysmith/covidcast-indicators/blob/main/sir_complainsalot/delphi_sir_complainsalot/run.py)
+  * [Example output](https://delphi-org.slack.com/archives/C01E81A3YKF/p1605793508000100)
 * Expand framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data, e.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
 * Make backtesting framework so new checks can be run individually on historical indicator data to tune false positives, output verbosity, understand frequency of error raising, etc. Should pull data from the API the first time and save it locally in the `cache` dir.
 * Add DETAILS.md doc with detailed descriptions of what each check does and how. Will be especially important for statistical/anomaly detection checks.
@@ -52,20 +56,18 @@
 * Easier suppression of many errors at once
   * Maybe store errors as a dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then the next layer geo type, etc.
 * Nicer formatting for error "report"
+  * Potentially define a `__print__()` method in the ValidationError class
   * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
 * Check for erratic data sources that wrongly report all zeroes
   * E.g. the error with the Wisconsin data for the 10/26 forecasts
   * Wary of a purely static check for this
   * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
   * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
-  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
-* Outlier detection (in progress)
-  * Current approach is tuned to daily cases and daily deaths; use just on those signals?
-  * prophet (package) detection is flexible, but needs 2-3 months of historical data to fit on. May make sense to use if other statistical checks also need that much data.
+  * Also partially captured by outlier checking, depending on the `size_cut` setting. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
 * Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
 * If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
-  * E.g. most recent successful API pull -- might end up being a couple of weeks old
   * Currently, any API fetch problem just skips the comparative checks entirely.
+  * E.g. most recent successful API pull -- might end up being a couple of weeks old
 * Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
   * Profiling (iterate)
   * Save intermediate files?
@@ -87,3 +89,4 @@
 * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
 * Correct p-values for multiple testing
   * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
+* Use the prophet package? Would require 2-3 months of API data.
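The Bonferroni vs. Benjamini-Hochberg trade-off mentioned above can be illustrated with a minimal sketch of the BH step-up procedure. The function name and the p-values are hypothetical, not part of the validator:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected under BH FDR control.

    Step-up rule: find the largest k with p_(k) <= alpha * k / m and
    reject the k smallest p-values.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest sorted index passing its threshold
        reject[order[: k + 1]] = True
    return reject
```

Unlike Bonferroni (which just compares every p-value to `alpha / m`), the per-rank thresholds let later p-values ride on earlier significant ones, which is what makes BH less sensitive to how the "family" is chosen.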

validator/delphi_validator/datafetcher.py

Lines changed: 1 addition & 1 deletion

@@ -91,7 +91,7 @@ def fetch_api_reference(data_source, start_date, end_date, geo_type, signal_type
     ).rename(
         columns={'geo_value': "geo_id", 'stderr': 'se', 'value': 'val'}
     ).drop(
-        ['direction', 'issue', 'lag'], axis=1
+        ['issue', 'lag'], axis=1
     ).reindex(columns=column_names)

     return api_df
validator/delphi_validator/validate.py

Lines changed: 159 additions & 7 deletions

@@ -9,7 +9,6 @@
 from os.path import join
 from datetime import date, datetime, timedelta
 import pandas as pd
-
 from .errors import ValidationError, APIDataFetchError
 from .datafetcher import filename_regex, \
     read_filenames, load_csv, get_geo_signal_combos, \
@@ -608,18 +607,154 @@ def check_rapid_change_num_rows(self, df_to_test, df_to_reference, checking_date

         self.increment_total_checks()

+    def check_positive_negative_spikes(self, source_df, api_frames, geo, sig):
+        """
+        Adapt Dan's corrections package to Python (only consider spikes):
+        https://github.com/cmu-delphi/covidcast-forecast/tree/dev/corrections/data_corrections
+
+        Statistics for a right-shifted rolling window and a centered rolling window are used
+        to determine outliers for both positive and negative spikes.
+
+        As it is now, ststat will always be NaN for source frames.
+
+        Arguments:
+        - source_df: pandas dataframe of CSV source data
+        - api_frames: pandas dataframe of reference data, either from the
+        COVIDcast API or semirecent data
+        - geo: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - sig: str; signal name as in the CSV name
+        """
+        self.increment_total_checks()
+        # Combine all possible frames so that the rolling window calculations make sense.
+        source_frame_start = source_df["time_value"].min()
+        source_frame_end = source_df["time_value"].max()
+        api_frames_end = min(api_frames["time_value"].max(),
+                             source_frame_start - timedelta(days=1))
+        all_frames = pd.concat([api_frames, source_df]). \
+            drop_duplicates(subset=["geo_id", "time_value"], keep='last'). \
+            sort_values(by=['time_value']).reset_index(drop=True)
+        if "index" in all_frames.columns:
+            all_frames = all_frames.drop(columns=["index"])
+        # Tuned variables from Dan's code for flagging outliers. size_cut is a
+        # check on the minimum value reported, sig_cut is a check on the ftstat
+        # or ststat reported (t-statistics), and sig_consec is a lower threshold
+        # for flagging outliers that are next to each other.
+        size_cut = 20
+        sig_cut = 3
+        sig_consec = 2.25
+
+        # Functions mapped to rows to determine outliers based on ftstat and ststat values
+        def outlier_flag(frame):
+            if (abs(frame["val"]) > size_cut) and not (pd.isna(frame["ststat"])) \
+                    and (frame["ststat"] > sig_cut):
+                return 1
+            if (abs(frame["val"]) > size_cut) and (pd.isna(frame["ststat"])) and \
+                    not (pd.isna(frame["ftstat"])) and (frame["ftstat"] > sig_cut):
+                return 1
+            if (frame["val"] < -size_cut) and not (pd.isna(frame["ststat"])) and \
+                    not pd.isna(frame["ftstat"]):
+                return 1
+            return 0
+
+        def outlier_nearby(frame):
+            if (not pd.isna(frame['ststat'])) and (frame['ststat'] > sig_consec):
+                return 1
+            if pd.isna(frame['ststat']) and (frame['ftstat'] > sig_consec):
+                return 1
+            return 0
+
+        # Calculate ftstat and ststat values for the rolling windows; group frames by geo region
+        region_group = all_frames.groupby("geo_id")
+        window_size = 14
+        shift_val = 0
+
+        # Shift the window to match how R calculates rolling windows of even length
+        if window_size % 2 == 0:
+            shift_val = -1
+
+        # Calculate the t-statistics for the two rolling windows (centered and right-aligned)
+        all_full_frames = []
+        for _, group in region_group:
+            rolling_windows = group["val"].rolling(
+                window_size, min_periods=window_size)
+            center_windows = group["val"].rolling(
+                window_size, min_periods=window_size, center=True)
+            fmedian = rolling_windows.median()
+            smedian = center_windows.median().shift(shift_val)
+            fsd = rolling_windows.std() + 0.00001  # if std is 0
+            ssd = center_windows.std().shift(shift_val) + 0.00001  # if std is 0
+            vals_modified_f = group["val"] - fmedian.fillna(0)
+            vals_modified_s = group["val"] - smedian.fillna(0)
+            ftstat = abs(vals_modified_f) / fsd
+            ststat = abs(vals_modified_s) / ssd
+            group['ftstat'] = ftstat
+            group['ststat'] = ststat
+            all_full_frames.append(group)
+
+        all_frames = pd.concat(all_full_frames)
+        # Determine outliers in source frames only. Only the reference data from
+        # just before the start of the source data is needed, because the lead
+        # and lag outlier calculations look only one day out.
+        outlier_df = all_frames.query(
+            'time_value >= @api_frames_end & time_value <= @source_frame_end')
+        outlier_df = outlier_df.sort_values(by=['geo_id', 'time_value']) \
+            .reset_index(drop=True).copy()
+        outlier_df["flag"] = 0
+        outlier_df["flag"] = outlier_df.apply(outlier_flag, axis=1)
+        outliers = outlier_df[outlier_df["flag"] == 1]
+        outliers_reset = outliers.copy().reset_index(drop=True)
+
+        # Find the lead outliers and the lag outliers. Check that the selected row
+        # is actually a leading or lagging row for the given geo_id.
+        upper_index = list(filter(lambda x: x < outlier_df.shape[0],
+                                  list(outliers.index + 1)))
+        upper_df = outlier_df.iloc[upper_index, :].reset_index(drop=True)
+        upper_compare = outliers_reset[:len(upper_index)]
+        sel_upper_df = upper_df[upper_compare["geo_id"]
+                                == upper_df["geo_id"]].copy()
+        lower_index = list(filter(lambda x: x >= 0, list(outliers.index - 1)))
+        lower_df = outlier_df.iloc[lower_index, :].reset_index(drop=True)
+        lower_compare = outliers_reset[-len(lower_index):].reset_index(drop=True)
+        sel_lower_df = lower_df[lower_compare["geo_id"]
+                                == lower_df["geo_id"]].copy()
+
+        sel_upper_df["flag"] = 0
+        sel_lower_df["flag"] = 0
+
+        sel_upper_df["flag"] = sel_upper_df.apply(outlier_nearby, axis=1)
+        sel_lower_df["flag"] = sel_lower_df.apply(outlier_nearby, axis=1)
+
+        upper_outliers = sel_upper_df[sel_upper_df["flag"] == 1]
+        lower_outliers = sel_lower_df[sel_lower_df["flag"] == 1]
+
+        all_outliers = pd.concat([outliers, upper_outliers, lower_outliers]). \
+            sort_values(by=['time_value', 'geo_id']). \
+            drop_duplicates().reset_index(drop=True)
+
+        # Identify outliers just in the source data
+        source_outliers = all_outliers.query(
+            "time_value >= @source_frame_start & time_value <= @source_frame_end")
+
+        if source_outliers.shape[0] > 0:
+            self.raised_errors.append(ValidationError(
+                ("check_positive_negative_spikes",
+                 source_frame_start, source_frame_end, geo, sig),
+                source_outliers,
+                'Source dates with flagged outliers based on the '
+                'previous 14 days of available data'))

     def check_avg_val_vs_reference(self, df_to_test, df_to_reference, checking_date, geo_type,
                                    signal_type):
         """
         Compare average values for each variable in test dataframe vs reference dataframe.
-
         Arguments:
         - df_to_test: pandas dataframe of CSV source data
         - df_to_reference: pandas dataframe of reference data, either from the
         COVIDcast API or semirecent data
         - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
         - signal_type: str; signal name as in the CSV name
-
         Returns:
         - None
         """
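The core of the new check is the pair of rolling t-like statistics. Here is a minimal sketch of that computation on a single toy series, outside the validator; the thresholds match the ones in `check_positive_negative_spikes`, but the series is made up:

```python
import pandas as pd

size_cut, sig_cut = 20, 3  # thresholds from check_positive_negative_spikes
window_size = 14
# Shift results to match how R centers rolling windows of even length
shift_val = -1 if window_size % 2 == 0 else 0

# Flat series with one obvious spike at index 20
vals = pd.Series([5.0] * 20 + [500.0] + [5.0] * 9)

rolling = vals.rolling(window_size, min_periods=window_size)  # right-aligned
centered = vals.rolling(window_size, min_periods=window_size, center=True)

fmedian = rolling.median()
smedian = centered.median().shift(shift_val)
fsd = rolling.std() + 0.00001   # guard against zero std
ssd = centered.std().shift(shift_val) + 0.00001

ftstat = (vals - fmedian.fillna(0)).abs() / fsd
ststat = (vals - smedian.fillna(0)).abs() / ssd

# Flag large values whose centered statistic (right-aligned fallback where
# the centered one is NaN) exceeds the significance cut
flagged = (vals.abs() > size_cut) & (ststat.fillna(ftstat) > sig_cut)
```

Because the median is robust, the spike barely moves the window's center, so its deviation divided by the window's standard deviation stands out; the `size_cut` guard keeps tiny counts from being flagged on relative deviation alone.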
@@ -731,13 +866,14 @@ def validate(self, export_dir):
         Returns:
         - None
         """
+
         # Get relevant data file names and info.
+
         export_files = read_filenames(export_dir)
         date_filter = make_date_filter(self.start_date, self.end_date)

         # Make list of tuples of CSV names and regex match objects.
         validate_files = [(f, m) for (f, m) in export_files if date_filter(m)]
-
         self.check_missing_date_files(validate_files)
         self.check_settings()

@@ -747,7 +883,6 @@ def validate(self, export_dir):
         # For every daily file, read in and do some basic format and value checks.
         for filename, match in validate_files:
             data_df = load_csv(join(export_dir, filename))
-
             self.check_df_format(data_df, filename)
             self.check_bad_geo_id_format(
                 data_df, filename, match.groupdict()['geo_type'])
@@ -781,12 +916,14 @@ def validate(self, export_dir):
         date_list = [self.start_date + timedelta(days=days)
                      for days in range(self.span_length.days + 1)]

+        # Get 14 days prior to the earliest list date
+        outlier_lookbehind = timedelta(days=14)
+
         # Get all expected combinations of geo_type and signal.
         geo_signal_combos = get_geo_signal_combos(self.data_source)

         all_api_df = self.threaded_api_calls(
-            self.start_date - min(semirecent_lookbehind,
-                                  self.max_check_lookbehind),
+            self.start_date - outlier_lookbehind,
             self.end_date, geo_signal_combos)

         # Keeps script from checking all files in a test run.
@@ -821,6 +958,20 @@ def validate(self, export_dir):
                 if geo_sig_api_df is None:
                     continue

+                # Outlier dataframe
+                if (signal_type in ["confirmed_7dav_cumulative_num", "confirmed_7dav_incidence_num",
+                                    "confirmed_cumulative_num", "confirmed_incidence_num",
+                                    "deaths_7dav_cumulative_num", "deaths_cumulative_num"]):
+                    earliest_available_date = geo_sig_df["time_value"].min()
+                    source_df = geo_sig_df.query(
+                        'time_value <= @date_list[-1] & time_value >= @date_list[0]')
+                    outlier_start_date = earliest_available_date - outlier_lookbehind
+                    outlier_end_date = earliest_available_date - timedelta(days=1)
+                    outlier_api_df = geo_sig_api_df.query(
+                        'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
+                    self.check_positive_negative_spikes(
+                        source_df, outlier_api_df, geo_type, signal_type)
+
                 # Check data from a group of dates against recent (previous 7 days,
                 # by default) data from the API.
                 for checking_date in date_list:
@@ -872,6 +1023,7 @@ def validate(self, export_dir):
                     recent_df, reference_api_df, checking_date, geo_type, signal_type)

                 # Keeps script from checking all files in a test run.
+
                 if self.test_mode:
                     kroc += 1
                     if kroc == 2:
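The 14-day lookbehind slicing used above can be sketched in isolation. The frame below is a made-up stand-in for `geo_sig_api_df`, with one row per day:

```python
from datetime import datetime, timedelta
import pandas as pd

outlier_lookbehind = timedelta(days=14)

# Hypothetical reference data spanning October 2020
times = pd.date_range("2020-10-01", "2020-10-31", freq="D")
geo_sig_api_df = pd.DataFrame({"time_value": times, "val": range(len(times))})

earliest_available_date = datetime(2020, 10, 20)
outlier_start_date = earliest_available_date - outlier_lookbehind  # 2020-10-06
outlier_end_date = earliest_available_date - timedelta(days=1)     # 2020-10-19

# DataFrame.query resolves local variables via the @ prefix
outlier_api_df = geo_sig_api_df.query(
    'time_value <= @outlier_end_date & time_value >= @outlier_start_date')
```

The result holds exactly the 14 reference days immediately preceding the source data, which is what the spike check needs so that its first rolling window (`window_size = 14`) is fully populated at the first source date.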

0 commit comments