Skip to content

NAN coding database and acquisition changes #417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
May 10, 2021
Merged

NAN coding database and acquisition changes #417

merged 20 commits into from
May 10, 2021

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Feb 17, 2021

This PR begins to address issue #195. Indicator and epidata client PRs to follow. See the design doc for more info. The following changes are implemented:

  • add 3 columns to the covidcast.ddl for missingness fields
  • add acquisition validation for handling the new missing columns, with backwards compatibility for CSVs that do not have these (minimal missing code inference, either NOT_MISSING or UNKNOWN, else error)
  • add missing fields to CovidcastRow and RowValues
  • allow NULLS in the "val" column in the database
  • add integration tests for the new columns
  • missingness codes depend on a constants file in the cmu-delphi/utils repo (which is currently manually synced with the file in the indicators repo)

TODO:

  • update the Epidata client to request and receive missingness columns
  • update the API server to handle requests for and send missingness columns
  • add a complete suite of integration tests for missing codes
  • update nancodes with data ranges
  • create a survey of the geomaps in the indicators

@dshemetov dshemetov requested a review from krivard February 17, 2021 22:16
@dshemetov dshemetov changed the title NAN coding database changes NAN coding database and acquisition changes Feb 17, 2021
@chinandrew
Copy link
Contributor

could you add some integration tests for the acquisition step?

@dshemetov
Copy link
Contributor Author

what do you mean by integration tests? I'm just vaguely familiar with the term. like test communication with a database?

@chinandrew
Copy link
Contributor

what do you mean by integration tests? I'm just vaguely familiar with the term. like test communication with a database?

Yea, see the integrations/ folder for some examples. They rely on a local web server/db and can run the acquisition end to end

@dshemetov dshemetov marked this pull request as draft March 24, 2021 04:32
@krivard krivard requested a review from sgsmob March 29, 2021 13:15
@dshemetov
Copy link
Contributor Author

dshemetov commented Mar 29, 2021

OBSOLETE: Didn't need an expected geomap.

Recording some decision logs with @krivard. So there’s a bit of variation in county coverage across the indicators. A majority of the indicators have 3k+ counties, three have ~1500, ~2000, ~2500, and two outliers have ~100 and ~900.

Number of counties in signal across all time for latest issue:
{'covid-act-now pcr_specimen_positivity_rate': 3032,
 'chng smoothed_adj_outpatient_covid': 3041,
 'doctor-visits smoothed_adj_cli': 2521,
 'google-symptoms sum_anosmia_ageusia_smoothed_search': 109,
 'hospital-admissions smoothed_covid19_from_claims': 925,
 'quidel covid_ag_smoothed_pct_positive': 2074,
 'safegraph median_home_dwell_time': 3230,
 'fb-survey smoothed_wcli': 1526,
 'jhu-csse confirmed_cumulative_num': 3282,
 'usa-facts confirmed_cumulative_num': 3193}

To construct an expected geo map for an indicator, we need to consider the time window to use for it. If we use the full time window, we’ll get everything, but we might get geos that have stopped being reported. While this would add some nans to the database, this doesn’t seem to be a big issue though. As the table below shows, most counties that have reported at least once in the full time window reported at least once in the last month, which suggests that we won't have many empty counties reporting nothing but nans.

Fraction of counties in signal present in past month compared to full window:
{'covid-act-now pcr_specimen_positivity_rate': 1.0,
 'chng smoothed_adj_outpatient_covid': 0.9993423216047352,
 'doctor-visits smoothed_adj_cli': 1.0,
 'google-symptoms sum_anosmia_ageusia_smoothed_search': 1.0,
 'hospital-admissions smoothed_covid19_from_claims': 1.0,
 'quidel covid_ag_smoothed_pct_positive': 1.0,
 'safegraph median_home_dwell_time': 0.9996904024767802,
 'fb-survey smoothed_wcli': 1.0,
 'jhu-csse confirmed_cumulative_num': 1.0,
 'usa-facts confirmed_cumulative_num': 1.0}

Our current plan is to make a utility that will build custom expected geo maps for some of the indicators. Given the slow rate of change in the geos, we will schedule an update to this list once a month.

@krivard krivard requested review from nmdefries and removed request for sgsmob April 5, 2021 13:34
* add 3 missing columns to covidcast table schema
* update acquisition with validation and of new missing columns in CSVs
* update api server to serve the missing columns
* add integration tests for the new columns
* Ben Smith and Andrew Chin's review suggestions integrated
@dshemetov dshemetov marked this pull request as ready for review April 13, 2021 17:58
@dshemetov dshemetov requested a review from melange396 April 14, 2021 20:18
Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i found a couple nits, but otherwise LGTM

@dshemetov
Copy link
Contributor Author

Ooof, let's wait to merge this until next week, I found a couple things I'd like to fix.

@dshemetov
Copy link
Contributor Author

If our infra uses Docker, then a requirements file needs to be updated.

Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xlnt

@dshemetov
Copy link
Contributor Author

dshemetov commented May 5, 2021

A few things coming up while merging Flask server and working with Docker:

  • python requirements need to be updated: python-dotenv, flask, orjson, sqlalchemy==1.3.24
  • the last one because the Flask server currently depends on a <1.4 version of SqlAlchemy because of RowProxy (ping @sgratzl to double check)

I'm getting a few failing tests due to databases not existing at the moment.

@krivard
Copy link
Contributor

krivard commented May 6, 2021

Can you paste the exact errors in thread? CI doesn't run on forks so we can't see what you're seeing.

@dshemetov
Copy link
Contributor Author

Once I installed the right dependencies and added the flag
--env SQLALCHEMY_DATABASE_URI=mysql+mysqldb://user:pass@delphi_database_epidata:3306/epidata
to the Docker run string, the tests that were failing now pass with pytest, but they fail when running with py3tester. A lot of integrations tests are also failing.

@dshemetov
Copy link
Contributor Author

Once I updated the delphi_web_epidata container correctly, all the tests pass now.

@dshemetov
Copy link
Contributor Author

For future ref, this link to the CI recipe from @sgratzl was really helpful. Going to add that to the ops docs.

@krivard
Copy link
Contributor

krivard commented May 7, 2021

🙏 thank you for both going on that goose chase and ensuring future-us won't have to chase that particular goose again!

If you change this to merging into dev that will make it easier to deploy with the new release system Monday.

@dshemetov dshemetov changed the base branch from main to dev May 7, 2021 22:18
@dshemetov
Copy link
Contributor Author

We should incorporate these naming changes to delphi_utils before we release on Monday.

@krivard
Copy link
Contributor

krivard commented May 10, 2021

https://pypi.org/project/delphi-utils/0.1.1/ is up; merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants