Google symptoms dap #28
base: main
Unfortunately, a lot of these files are too big for me to leave comments directly, so I'll put them here.
Final Report:
In the introduction, it would be good to link to the previous work establishing anosmia/ageusia as useful for a limited geographic range.
Typo: "electrical heath records" should be "electronic health records".
The violin plot shown above and the barplot shown in Appendix II indicate the difference in the spatial distribution of symptoms in the same symptom set. This leads to the fact that when training linear regression models for different counties, the number of features (symptoms) that are "actually" taken into account is different. If we draw the median of coefficients for symptoms across all the counties available for the symptom set, we simply get 0s for symptoms with low geographical coverage. It should be noted that getting median to zero is not an error message, because the default missingness of this data set is caused by the extremely low search volume.
This paragraph is hard to parse, especially if the reader hasn't gotten to the appendix yet. Essentially you're saying this, right:
"Within a symptom set, geographic availability varies widely by symptom. Because of this, and because missing values are zero-filled when a symptom set is created, some symptoms are implicitly excluded from modeling for a given county, so the coefficient for that symptom is zero in that county's model. Symptoms with high missingness will, as a result, have a median coefficient of zero; this isn't an error."
From an organizational perspective, results in the appendices should be auxiliary -- not required to understand the main report, but providing additional tangential results that someone might be interested in. If you're using results from the appendix in the main report, those results should also be in the main report.
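On the zero-median point above, a small synthetic example might make the mechanism easier for readers to see. This is only a sketch under my own assumptions, not the report's pipeline: the data are made up, the symptom columns (`symptom_a`, `symptom_b`) and the helper `fit_county` are hypothetical, and I'm using NumPy's minimum-norm least squares as a stand-in for whatever regression routine the report actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_county(X, y):
    """Fit y ~ intercept + X by least squares; return the symptom coefficients.

    np.linalg.lstsq returns the minimum-norm solution, so a column that is
    all zeros (a zero-filled symptom) receives a coefficient of zero.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]  # drop the intercept


n_counties, n_days = 5, 60
per_county = []
for county in range(n_counties):
    # Columns: symptom_a (available everywhere), symptom_b (sparse coverage).
    X = rng.uniform(1.0, 10.0, size=(n_days, 2))
    if county != 0:
        # Assumption: symptom_b is missing outside county 0 and is zero-filled.
        X[:, 1] = 0.0
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, n_days)
    per_county.append(fit_county(X, y))

per_county = np.array(per_county)
median_coefs = np.median(per_county, axis=0)
print("per-county coefficients:\n", per_county.round(3))
print("median across counties:", median_coefs.round(3))
```

Here `symptom_b` gets a nonzero coefficient only in the one county where it is observed, so its median across counties is zero even though nothing went wrong in the fit. Note that other regression routines may report an undefined coefficient rather than zero for an all-zero column, so the exact behavior depends on the tool.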
Appendix I is particularly interesting -- it makes an even stronger case for using regression over rawsum based on comparisons for non-sensory symptom sets.
Overall, looks good! Thanks for your hard work!
Add GS DAP
./scripts
Google_Symptoms_DAP_Final_Report.html is the final report.
Appendix1_Correlation_Results.html
Appendix2_Coefficients_and_Intercepts.html