Skip to content

Diff uploads in ght #240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Aug 27, 2020
Merged

Diff uploads in ght #240

merged 16 commits into from
Aug 27, 2020

Conversation

vishakha1812
Copy link
Contributor

@vishakha1812 vishakha1812 commented Aug 24, 2020

No description provided.

@vishakha1812 vishakha1812 changed the base branch from main to deploy-google_health August 24, 2020 01:25
@vishakha1812 vishakha1812 changed the base branch from deploy-google_health to main August 24, 2020 22:11
@vishakha1812 vishakha1812 requested a review from krivard August 24, 2020 22:14
@huisaddison huisaddison requested review from taylor-arnold and huisaddison and removed request for taylor-arnold August 25, 2020 18:17
@huisaddison
Copy link
Contributor

Initial comments:

  • Some comments are inline. There seems to be a reference to cache_dir, which I think is no longer being used? And the params.json.template file should be updated with a default value for the new parameter data_dir.
  • The commit adds a bunch of data files. Do we really want to merge these onto main, or are they just being provided for the benefit of testing the pipeline?
  • Even after updating params.json, I cannot run the pipeline - Invalid bucket name. The default bucket name is an empty string. @vishakha1812 should I be using a different bucket name?
  • Linter: 9.65 / 10 ✔️
  • Unit tests: 13 / 16 pass. The failed ones may be due to the invalid bucket name?

@huisaddison
Copy link
Contributor

huisaddison commented Aug 25, 2020

@vishakha1812 has kindly provided the necessary parameters. With the new parameters, I encounter the following error when running the pipeline:

<delphi_utils.archive.S3ArchiveDiffer object at 0x7f18ab9e2f98>
INFO:Creating data from 2020-01-05 through 2020-08-21.
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/addison/ght/covidcast-indicators/google_health/delphi_google_health/__main__.py", line 11, in <module>
    run_module()  # pragma: no cover
  File "/home/addison/ght/covidcast-indicators/google_health/delphi_google_health/run.py", line 72, in run_module
    ght, start_date, end_date, static_dir=static_dir, cache_dir=cache_dir
TypeError: get_counts_states() got an unexpected keyword argument 'cache_dir'

3 / 16 of the unit tests are still failing. I have a feeling the 3 failing unit tests are connected to the above issue.

@vishakha1812 the issue is either in the function definition of pull_api.get_counts_states or its invocation in run.py. Can you please fix this? After the necessary commit, I will pull and re-test.

@vishakha1812
Copy link
Contributor Author

  • We do need a cache_dir as one of the parameters in the S3ArchiveDiffer(). Update: data_dir in .json files is added and pushed.
  • The data files have been moved to a new directory data from cache directory. The cache is now being used for a different purpose. We are pushing it to main so that one doesn't have to create a new folder and move those files every-time for testing.

@krivard
Copy link
Contributor

krivard commented Aug 25, 2020

Moving the files is fine, but it looks like this both moves them and updates them, which is not great -- every person who runs this pipeline on their own machine will generate a different set of data files due to the nondeterministic nature of GHT queries. I understand that we need to keep some files around in data so that tests work (#213 suggests revising that) but in the meantime do we have a way of preventing collisions? Is .gitignore set up to handle these files? Should it be?

@huisaddison
Copy link
Contributor

The pipeline now runs without any errors. Interestingly, my receiving/ folder has output csvs up until 20200430, then a large gap, and then output for 20200820 and 20200821. @vishakha1812 is this expected behavior?

Unit test results (19/19 passed, 5 warnings):

================================================== warnings summary ==================================================
tests/test_run.py::TestRunModule::test_match_old_raw_output
tests/test_run.py::TestRunModule::test_match_old_raw_output
tests/test_run.py::TestRunModule::test_match_old_raw_output
tests/test_run.py::TestRunModule::test_match_old_raw_output
  /home/addison/ght/covidcast-indicators/google_health/tests/test_run.py:58: FutureWarning: The 'check_less_precise' keyword in testing.assert_*_equal is deprecated and will be removed in a future version. You can stop passing 'check_less_precise' to silence this warning.
    assert_frame_equal(test_df, new_df, check_less_precise=5)

tests/test_run.py::TestRunModule::test_match_old_smoothed_output
  /home/addison/ght/covidcast-indicators/google_health/tests/test_run.py:81: FutureWarning: The 'check_less_precise' keyword in testing.assert_*_equal is deprecated and will be removed in a future version. You can stop passing 'check_less_precise' to silence this warning.
    assert_frame_equal(test_df, new_df, check_less_precise=5)

-- Docs: https://docs.pytest.org/en/stable/warnings.html

----------- coverage: platform linux, python 3.6.8-final-0 -----------
Name                                                                                                                      Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------------------------------------------------------------------------------------
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/__init__.py         8      0   100%
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/__main__.py         1      1     0%   2
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/constants.py        8      0   100%
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/export.py          16      0   100%
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/map_values.py      28      0   100%
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/pull_api.py        74      2    97%   228-231
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/run.py             72      7    90%   50-51, 55-56, 60-61, 142
/home/addison/ght/covidcast-indicators/google_health/env/lib/python3.6/site-packages/delphi_google_health/smooth.py          26      0   100%
-------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                                       233     10    96%

===================================== 19 passed, 5 warnings in 75.25s (0:01:15) ======================================
(env) [addison@bigchunk-dev-01 google_health]$ 

@krivard
Copy link
Contributor

krivard commented Aug 25, 2020

That actually sounds right -- the diff-based issue definition should only retain files in receiving, and only the lines of those files, that differ from what's already in the API.

@vishakha1812
Copy link
Contributor Author

vishakha1812 commented Aug 25, 2020

@krivard .gitignore has not been set up for that yet, I can add a line which which will ignore any changes to these files.
With reference to #213 ,if we add a static folder which will contain all the data files, wouldn't that mean that we would be testing the pipeline with the old data files and not the updated one?

@krivard
Copy link
Contributor

krivard commented Aug 25, 2020

Unit tests of the calculations should always be done against static files.

Unit tests of the file-fetching process would be nice, but so far we've attempted to assert that unit tests should be runnable with the default params file, and no keys are permitted in the params file, which means all unit tests must work without keys. We can revisit this if we run into repeated problems with data sources switching up their file format on us by surprise, but for now, we'll stick with this.

@krivard krivard requested a review from dshemetov August 26, 2020 18:06
@dshemetov
Copy link
Contributor

All tests passed on my machine! 👍

@krivard
Copy link
Contributor

krivard commented Aug 27, 2020

Looks like Jenkins also demands aws_credentials in the test params file

@vishakha1812
Copy link
Contributor Author

vishakha1812 commented Aug 27, 2020

Adding that.
But we don't have a variable for "bucket_name" in vars.yaml or should I leave it empty in the .j2 file?

@krivard krivard merged commit 049f0a5 into main Aug 27, 2020
@krivard krivard deleted the diff_ght branch September 1, 2020 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants