
Pending updates to columns in db_signals.csv #1442


Closed
wants to merge 2 commits into from

Conversation

melange396
Collaborator

@melange396 commented May 18, 2024

This PR is derived from #1434; i removed all of the new columns but this should include all of the changes to the existing columns (except Available Geography, more on that in a bit).

Please let me know if we need to fix any of these -- the summary of differences appears to me to be:

  • "day" removed from Time Type column, replaced with empty string: (this seems like it was accidental)
    • dsew-cpr:confirmed_admissions_covid_1d_7dav
  • "n/a" removed from Pathogen/Disease Area column, replaced with empty string: (these seem intentional)
    • nchs-mortality:deaths_allcause_incidence_num
    • nchs-mortality:deaths_allcause_incidence_prop
    • nchs-mortality:deaths_percent_of_expected
    • safegraph-daily:completely_home_prop
    • safegraph-daily:completely_home_prop_7dav
    • safegraph-daily:full_time_work_prop
    • safegraph-daily:full_time_work_prop_7dav
    • safegraph-daily:median_home_dwell_time
    • safegraph-daily:median_home_dwell_time_7dav
    • safegraph-daily:part_time_work_prop
    • safegraph-daily:part_time_work_prop_7dav
    • safegraph-weekly:bars_visit_num
    • safegraph-weekly:bars_visit_prop
    • safegraph-weekly:restaurants_visit_num
    • safegraph-weekly:restaurants_visit_prop
  • Newline added at the end of the file (this is effectively inconsequential)

The Available Geography column has sweeping changes applied to it. In one example from chng, the text was modified from county,hhs,hrr,msa,nation,state to county, hrr (by Delphi), msa (by Delphi), state (by Delphi), hhs (by Delphi), nation (by Delphi). I believe this signifies that only county data came from the source, and that we computed the other, higher levels of geo aggregation ourselves.

This is valuable information, but i would suggest we keep the column the way it was and create a new column, called something like Geographies aggregated by Delphi or Post-aggregated geographies, that lists the geography types we computed. There are two main reasons for doing it this way: (i believe) the Signal Documentation app expects the structured comma-separated text without the extra annotations, as it was before, and representing the same information in its own column should save some space. If you agree with this, let me know, as i think i should be able to apply those changes pretty easily.

Also, some entries (quidel, for instance) have " (by Delphi)" attached to every geography in the list. That suggests to me that we did aggregations to produce county-level data from finer-grained locations, but i didn't think that was the case.
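For illustration only, the split suggested above could be sketched roughly like this. It assumes every annotated entry follows the pattern of the chng example; the helper name split_geographies is hypothetical, the new column name is just one of the candidates floated above, and sorting the output alphabetically is an assumption made to reproduce the original county,hhs,hrr,msa,nation,state ordering.

```python
import pandas as pd

TAG = "(by Delphi)"

def split_geographies(annotated):
    """Return (plain comma-separated list, list of Delphi-aggregated geographies).

    Hypothetical helper: assumes items are comma-separated and that
    Delphi-computed levels carry a trailing "(by Delphi)" annotation.
    """
    all_geos, delphi_geos = [], []
    for item in annotated.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith(TAG):
            geo = item[: -len(TAG)].strip()
            delphi_geos.append(geo)
        else:
            geo = item
        all_geos.append(geo)
    # Alphabetical order is an assumption that matches the old column's style.
    return ",".join(sorted(all_geos)), ",".join(sorted(delphi_geos))

# Toy frame holding the annotated chng example quoted in the comment.
df = pd.DataFrame({
    "Available Geography": [
        "county, hrr (by Delphi), msa (by Delphi), state (by Delphi), "
        "hhs (by Delphi), nation (by Delphi)"
    ]
})

pairs = [split_geographies(text) for text in df["Available Geography"]]
df["Available Geography"] = [plain for plain, _ in pairs]
df["Geographies aggregated by Delphi"] = [delphi for _, delphi in pairs]
print(df.iloc[0].to_dict())
```

An entry like the quidel one, where every geography is annotated, would end up with identical values in both columns, which would make that oddity easy to spot.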

…al Type', 'Time Type', 'Is Weighted', 'Is Cumulative', 'Has StdErr', 'Has Sample Size'
@melange396 melange396 added chore api change affect the API and its responses code health readability, maintainability, best practices, etc data quality labels May 18, 2024
@melange396
Collaborator Author

Just kidding! Those n/a values were not actually removed in the source spreadsheet nor in #1434 -- i inadvertently stripped them due to the way i imported the csv files... I edited the above message to strikethrough the irrelevant text.
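The comment does not say exactly how the import went wrong, but one plausible mechanism (an assumption here) is pandas' default missing-value handling: read_csv treats the literal string "n/a" as NaN unless na_filter=False is passed, and to_csv then writes NaN as an empty field. A minimal sketch with made-up CSV text:

```python
import io
import pandas as pd

# Made-up two-line CSV that mimics one row of db_signals.csv.
csv_text = "Signal,Pathogen/Disease Area\ndeaths_allcause_incidence_num,n/a\n"

# Default parsing: "n/a" is in pandas' built-in list of NA markers.
default_read = pd.read_csv(io.StringIO(csv_text))
# na_filter=False keeps every cell as the text that was in the file.
verbatim_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default_read["Pathogen/Disease Area"].isna().iloc[0])
print(verbatim_read["Pathogen/Disease Area"].iloc[0])

# Writing the default-parsed frame back out silently drops the "n/a".
round_tripped = default_read.to_csv(index=False)
print(round_tripped)
```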

@melange396
Collaborator Author

Here is some code that you can paste into a python interpreter to see the (correct) list of differences:

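import pandas as pd

# Raw-file URL template; '{}' is filled in with a branch name.
base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'

# na_filter=False keeps literal strings such as "n/a" and "" instead of converting them to NaN.
current = pd.read_csv(base_url.format('dev'), na_filter=False)
proposed = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)

# Columns that exist only in the proposed file.
new_cols = set(proposed.columns) - set(current.columns)
print(new_cols)

# Cell-by-cell comparison over the shared columns (requires both files to have the same rows in the same order).
non_matching = (proposed[current.columns] != current)
# Number of differing cells in each column.
diffs_per_col = non_matching.sum()
print(diffs_per_col)

# List the signals whose 'Time Type' value differs.
mismatched_time = pd.concat([current[['Source Subdivision', 'Signal']], non_matching[['Time Type']]], axis=1)
print(mismatched_time[mismatched_time['Time Type']])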


Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@melange396
Collaborator Author

and then the csv in this PR was produced by following the above code snippet with this:

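import os

# Keep only the existing columns and restore the original 'Available Geography' values.
intermediate = proposed[current.columns].copy()  # .copy() avoids pandas' SettingWithCopyWarning
intermediate['Available Geography'] = current['Available Geography']
intermediate.to_csv('intermediate.csv', index=False)

# pandas writes booleans as True/False, but the existing file uses TRUE/FALSE.
# Two passes are needed: adjacent matches share a comma, so a single 's///g' pass skips every other one.
for _ in range(2):
    os.system("sed -i 's/,False,/,FALSE,/g' intermediate.csv")
    os.system("sed -i 's/,True,/,TRUE,/g' intermediate.csv")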
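As a side note on why the substitutions above are run twice, the following sketch reproduces the overlap problem with Python's str.replace, which, like sed's g flag, only replaces non-overlapping matches. It also shows one possible way to avoid the post-processing altogether: reading every column as text with dtype=str so that TRUE/FALSE are never converted to booleans. The CSV text and the signal name are made up for the demonstration, and this is an alternative sketch, not what was done in this PR.

```python
import io
import pandas as pd

# Adjacent boolean fields share a comma, so the second match overlaps the first.
row = ",False,False,True,"
one_pass = row.replace(",False,", ",FALSE,")
two_passes = one_pass.replace(",False,", ",FALSE,")
print(one_pass)    # the second "False" is left untouched
print(two_passes)  # both occurrences are now upper case

# Alternative: keep every cell as text so the original spelling survives a round trip.
csv_text = "Signal,Is Weighted,Is Cumulative\nexample_signal,FALSE,TRUE\n"
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str, na_filter=False)
out = as_text.to_csv(index=False)
print(out)
```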

@melange396
Collaborator Author

the source data in the google sheet has changed since this was done; closing this PR to create a new one...

@melange396 melange396 closed this May 28, 2024
@nmdefries nmdefries deleted the piecemeal_db_signals_updates branch June 5, 2024 16:18