pipeline for Safegraph patterns #225
Conversation
Thanks @jingjtang! I will try to complete the review of the Safegraph pipeline by early next week. I do have some early comments and concerns, based on an initial skim of the codebase.
@jingjtang in order to run the pipeline for data prior to June 2020, I uncommented line 26. The pipeline runs properly for a couple of dates, but then it hangs without having processed all the data. (Without line 26 uncommented, i.e., only running on data following June 2020, the pipeline completed and terminated correctly.)
If it helps debug, @jingjtang, I added a print(metric) inside the exporting loop. It seems that there is something about the 2019-07-29 data in particular that is causing the pipeline to hang.
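For anyone reproducing this, a minimal sketch of that kind of instrumentation; the metric names, dates, and loop structure here are illustrative, not the pipeline's actual export code.

```python
from itertools import product

METRICS = ["bars_visit", "restaurants_visit"]       # metric names used in this PR
DATES = ["2020-07-22", "2020-07-29", "2019-07-29"]  # dates discussed above

for metric, date in product(METRICS, DATES):
    # flush=True so the last message before a hang is not lost in output buffering
    print(f"exporting {metric} for {date}", flush=True)
    # ... the actual per-(metric, date) export work would go here ...
```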
I'm really confused as to why the pipeline hangs. I do notice that memory consumption for individual subprocesses is much higher for this pipeline than for the original Safegraph pipeline. The July 22nd and 29th jobs consume up to 8GB, but this memory is freed after the completion of their tasks...
Upon further investigation, there are four Python processes that appear to be sleeping: they each consume 1.5GB, but are using 0% CPU and are not bolded in top. There are another 8 Python processes that consume essentially no memory and have 0% CPU consumption (which makes sense, because I'm guessing these Python processes are blocked, and the fact that no progress is being made on them in turn blocks the rest of the pipeline). @krivard do you have any suggestions for quick ways to debug these kinds of issues?
Can you switch from pool.map to ordinary map, so you can attach pdb to inspect activity when needed? You could then use …
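A minimal sketch of that suggestion, assuming a hypothetical DEBUG flag and per-file worker function (not the pipeline's actual names): ordinary map keeps everything in the main process, so pdb can show exactly where a file stalls.

```python
from multiprocessing import Pool

DEBUG = True  # hypothetical switch; the real pipeline may choose differently

def process_file(path):
    # placeholder for the per-file processing done by the pipeline
    return path

files = ["patterns-2020-07-22.csv", "patterns-2020-07-29.csv"]  # illustrative

if DEBUG:
    # Serial execution in the main process: run under `python -m pdb run.py`
    # (or drop a breakpoint() into process_file) to inspect where it hangs.
    results = list(map(process_file, files))
else:
    with Pool(processes=4) as pool:
        results = pool.map(process_file, files)

print(results)
```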
@capnrefsmmat Thanks for those suggestions! I will use those techniques in the future; before I saw your post, I did a more rudimentary switch to direct iteration and waited for the traceback. @jingjtang I have found the source of the error. It lies in the assert statement here: https://github.com/cmu-delphi/covidcast-indicators/pull/225/files#diff-fbb4d07961dfbc089de3ac81616f7041R70-R71 This condition is not met, at least for one of the input files. @jingjtang Can you please investigate whether this assert statement should still hold for the final pipeline, and if so, why it is not being met by some of the input files?
The assert statement should hold. I will look into those files. Thanks @huisaddison
@huisaddison I checked the raw files. By definition, the sum of visits_by_day should equal raw_visit_counts, but in those raw files it does not always match. So, now I think we can delete this assert statement.
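For reference, a minimal pandas sketch of this check run over a whole file rather than as an assert; the columns visits_by_day and raw_visit_counts come from the Weekly Patterns files this PR reads, while the path and the JSON-style parsing of visits_by_day are assumptions.

```python
import json
import pandas as pd

# Hypothetical path to a single weekly patterns file; the pipeline globs many of these.
df = pd.read_csv("patterns.csv", usecols=["raw_visit_counts", "visits_by_day"])

# visits_by_day is a bracketed list of daily counts, e.g. "[3,0,1,2,0,0,4]"
summed = df["visits_by_day"].apply(lambda s: sum(json.loads(s)))

# Report violations instead of asserting, since some raw files do not satisfy equality.
mismatch = (summed != df["raw_visit_counts"]).sum()
print(f"{mismatch} of {len(df)} rows have sum(visits_by_day) != raw_visit_counts")
```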
OK, great, thanks @jingjtang! And your investigation checks out: 7/22 and 7/29 worked properly (they just happened to be the last two that succeeded before the rest failed), but 8/19 indeed failed. I will remove the assert statement from my local copy and re-run.
We're going to want to rebase this branch and force-push it to resolve the conflicts before merging to main. The procedure for this is the usual one: rebase onto main, resolve the conflicts, and force-push the branch.
At this point it's then important to check with everyone who plausibly checked out a copy of the branch, since the force-push rewrites its history.
The pipeline ran successfully in "serial mode", so the freezing in parallel mode is a more subtle bug. I am going to deliver an initial version of this data to the causal modeling team, to allow them to assess its value, and in the meantime I will switch over to assessing the new GHT data. The eventual next steps for this PR are:
Both (2) and (3) are good ideas, but implementing either one of them would result in a pipeline that runs in a reasonable amount of time in the steady state (i.e., if we are resource-constrained, only one of them is necessary).
Added code (commented) for using geomap utils.
Hi @jingjtang - a quick question before you take a vacation! I noticed that the state-level bars data has at most 42 states, based on a quick line count of the exported files.
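A minimal sketch of the kind of spot-check behind that line count; the export directory and file-name pattern are assumptions, but the geo_id column is the standard one in these export CSVs.

```python
from pathlib import Path
import pandas as pd

export_dir = Path("receiving")  # hypothetical export directory

# Count distinct states per exported state-level bars file (file pattern assumed).
for path in sorted(export_dir.glob("*_state_bars_visit_num.csv")):
    n_states = pd.read_csv(path)["geo_id"].nunique()
    print(path.name, n_states)
```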
@huisaddison Yes, I also noticed that. It should be a problem with the input data. (It happens no matter which mapping file I use.)
Thanks @jingjtang. My hypotheses are:
I'll encourage the modeling team to start with the restaurant data, and we can look into this further after your time off. Regarding the bugs in the geomapper utils (missing population, uppercase states): can you comment in the open PR (if it's under review), or open an issue / PR for these problems? (Unless @krivard suggests an alternative course of action.)
I did some more sleuthing regarding the missing states. On 2020-08-16, the missing states were:
The commonality between these states is that they all have small populations (perhaps with the exception of Washington State). Then, I took a look at the raw visit numbers for bars versus for restaurants. The number of visits to restaurants is already two orders of magnitude larger than the number of visits to bars, so it seems like for those 8-9 states that are missing, Safegraph just doesn't have enough penetration to "see" the few bar visits that may be occurring there. If we want to improve the spatial coverage of the bars indicator, we may want to take the union over multiple NAICS codes, but that may reduce the precision of the indicator.
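A minimal sketch of the union-over-NAICS-codes idea; the brand table columns follow the brand_info.csv description in this PR, and the code set is illustrative (722410 is drinking places, 722511 is full-service restaurants), not a recommendation.

```python
import pandas as pd

# Toy brand table with the columns this pipeline reads from brand_info.csv.
brand_df = pd.DataFrame({
    "safegraph_brand_id": ["SG_BRAND_a", "SG_BRAND_b", "SG_BRAND_c"],
    "naics_code": [722410, 722511, 722410],
})

# Current behaviour: one NAICS code defines the bars signal.
bars_only = brand_df[brand_df["naics_code"] == 722410]

# Broader definition: union over several eating/drinking codes, which would raise
# counts in sparse states at the cost of a less precise "bars" signal.
BAR_LIKE_NAICS = {722410, 722511}
bar_like = brand_df[brand_df["naics_code"].isin(BAR_LIKE_NAICS)]

print(len(bars_only), len(bar_like))
```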
We access the Safegraph data using an AWS key-secret pair which is valid until June 15, 2021. The AWS credentials have been issued under @huisaddison's Safegraph Data Catalog account.
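For context, a minimal boto3 sketch of this kind of key/secret access; the bucket and prefix names are placeholders rather than the actual Safegraph Data Catalog paths, and the real pipeline may fetch the files differently.

```python
import boto3

# Placeholder credentials and paths; the real values come from the
# Safegraph Data Catalog account described above and are not stored in git.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
)

resp = s3.list_objects_v2(Bucket="example-safegraph-bucket",
                          Prefix="weekly-patterns/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```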
Will this file basically be exported into the public documentation? If so, we should move this into an internal-only part of the codebase.
This file is not directly included in the public documentation, but
- it is used to help the person who ultimately drafts the public documentation, and
- we do intend to open-source this codebase some time this fall
Personal tokens required for API access are encrypted before being committed to the repository, and the keys live on the delphi server, not in git.
@jingjtang The issues with the geomap utils that you mentioned, do those only affect Quidel or do they also affect this Safegraph Patterns indicator? (Or worse, do they also affect indicators that are in production?)
@huisaddison I think it would affect the Safegraph patterns indicator, since the data is also provided at the ZIP code level. The unit test failed for a simple data frame that has integer values at the ZIP code level but floating-point values after using the functions in the geo utils to aggregate it from the ZIP code level to MSA. As for other signals: if their aggregation starts from the ZIP code level, then yes; if it starts from the county level, I'm not sure yet.
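A toy illustration (not the real crosswalk) of why a weighted ZIP-to-MSA mapping can turn integer counts into floats: a ZIP code split across MSAs contributes fractional shares of its count.

```python
import pandas as pd

# Toy crosswalk: ZIP 10001 sits entirely in MSA A, ZIP 10002 is split 60/40.
crosswalk = pd.DataFrame({
    "zip":    ["10001", "10002", "10002"],
    "msa":    ["A", "A", "B"],
    "weight": [1.0, 0.6, 0.4],
})
counts = pd.DataFrame({"zip": ["10001", "10002"], "visits": [10, 5]})

merged = counts.merge(crosswalk, on="zip")
merged["visits"] = merged["visits"] * merged["weight"]
print(merged.groupby("msa")["visits"].sum())  # MSA A: 13.0, MSA B: 2.0 -- floats
```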
@huisaddison Talked to Dmitry already. It seems this is not a problem in the geo utils, just a difference between the old mapping and the new one. So only Quidel will be affected; Safegraph patterns will not.
Hi @jingjtang - I just took a look. Can you please complete the switch to the GeoMapper utility, test the package, make the necessary changes, and re-push? I do not believe that two of the methods being used in the (currently commented-out) code are available in the current GeoMapper class: they are not showing up as attributes when I play with a GeoMapper instance. Once you amend and test the code to work with the GeoMapper utility, ping me on Slack and I will take a second look ASAP. Thank you!
metric_prop_name = "_".join([metric, "prop"])
############################ Use GeoMapper in utils #####################
# gmpr = GeoMapper()
Please uncomment this block, test (I think the methods in gmpr used here are no longer available), and repush. Ping me on Slack once this is ready and I will take another look ASAP :)
I merged the new geo utils from main into this branch and it should work now. The unit tests have been updated.
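For reference, a minimal sketch of what the GeoMapper-based ZIP-to-MSA aggregation might look like; the replace_geocode call, its arguments, and the column names are assumptions about the delphi_utils API rather than the pipeline's actual code.

```python
import pandas as pd
from delphi_utils import GeoMapper  # shared geo utilities used across indicators

gmpr = GeoMapper()

# Illustrative ZIP-level counts; the real pipeline builds these from patterns.csv.
zip_df = pd.DataFrame({
    "zip": ["15213", "15217"],
    "timestamp": pd.to_datetime(["2020-08-10", "2020-08-10"]),
    "bars_visit_num": [3, 7],
})

# Assumed API: map ZIP-level rows to MSA, summing the data columns with the
# crosswalk weights maintained inside GeoMapper.
msa_df = gmpr.replace_geocode(zip_df, "zip", "msa",
                              date_col="timestamp",
                              data_cols=["bars_visit_num"])
print(msa_df)
```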
…vidcast-indicators into safegraph_patterns
Thanks for updating the code @jingjtang! It looks like there are a few more kinks to iron out. It appears that the code is trying to set an obsolete keyword argument:
Can you fix this and I will re-review? If possible, please include the output of the included unit tests in this PR. (Hopefully the unit tests do not require a bigchunk instance to run, but if they do, I can look into creating a login on my server for you, or we can talk to Brian about getting you a bigchunk.) Thanks!
recursive=True)
process_file = partial(process, brand_df=brand_df,
                       map_df=map_df, metrics=METRICS,
Obsolete keyword argument being set here.
Also - @jingjtang if you have many other Delphi responsibilities to tend to, let me know and I'll work on this PR tonight (to take this off your plate).
@huisaddison It was fixed. Can you try it again? The unit tests are updated.
Tests pass, linter is mostly happy, pipeline runs - LGTM :)
Found typo in documentation.
code = 722511)
## Metrics, Level 2 (`m2`)
* `num`: number of new deaths on a given week
Suggested change:
* `num`: number of visits on a given week
Thanks.
Should it be "in a given week"?
The way I read it is:
If "on", it would refer to the given week as a whole.
If "in", it would refer to days within the given week.
It should be "in". Thanks.
Addresses #191
Raw data are patterns.csv from Weekly Patterns, which is updated weekly. (This pipeline can be easily fit to other datasets in Safegraph with columns like "visits_by_day".)
Exceptions in geo have not been finished. Temporarily, found more than 1.5k ZIP codes that are not included in our mapping file.
Mapping information from naics code to safegraph_brand_id is from brand_info.csv in the Places Schema datasets. We won't download the whole Places Schema datasets, only the necessary brand_info.csv files, which need to be stored in ./static/brand_info.
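A minimal sketch of building the naics_code → safegraph_brand_id mapping from the brand_info.csv files stored under ./static/brand_info; the column names are assumptions based on the description above.

```python
from pathlib import Path
import pandas as pd

brand_dir = Path("./static/brand_info")  # location described above

# Concatenate every downloaded brand_info.csv and keep the code-to-brand mapping.
brand_df = pd.concat(
    [pd.read_csv(path) for path in sorted(brand_dir.glob("*.csv"))],
    ignore_index=True,
)
naics_to_brands = (
    brand_df.groupby("naics_code")["safegraph_brand_id"].apply(list).to_dict()
)
print(naics_to_brands.get(722410, []))  # brands classified as drinking places
```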
Questions to be answered:
* … patterns.csv?
* Sum of visits_by_day is not always equal to raw_visit_counts in those raw files.