Skip to content

CI, ENH: Check each minimum dependency is enforced in *.yaml and environment.yml #51189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 45 commits into from
Feb 23, 2023
Merged

CI, ENH: Check each minimum dependency is enforced in *.yaml and environment.yml #51189

merged 45 commits into from
Feb 23, 2023

Conversation

kathleenhang
Copy link
Contributor

@kathleenhang kathleenhang commented Feb 6, 2023

@kathleenhang
Copy link
Contributor Author

@MarcoGorelli Could you please review my script?

Also, do I need to add test cases for this script? If so, which file would I put the test cases in? I didn't see a place for CI test cases under pandas/tests/.

@MarcoGorelli
Copy link
Member

Thanks @kathleenhang ! I'll review today

Regarding tests - there is scripts/tests, where some other CI-related scripts are tested (in general we're not too strict about testing these, though it would be better to have at least some little test)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seriously good effort here, well done @kathleenhang !

I've left some comments, but it looks like you're headed in the right direction - this is going to be really useful for preventing future CI breakages

Also, you'll need to check that

pre-commit run validate-min-versions-in-sync --all-files

runs fine

Comment on lines 59 to 121
yaml_file = yaml.safe_load(data)
yaml_deps = yaml_file["dependencies"]
res = []
[res.append(x) for x in yaml_deps if x not in res]
yaml_deps = res
for dep_line in yaml_deps:
replace_text = operator = yaml_left = ""
search_text = str(dep_line)
if str(dep_line) in exclusion_list:
continue
if ">=" in dep_line:
operator = ">="
elif "==" in dep_line:
operator = "=="
elif "=" in dep_line:
operator = "="
elif "<" in dep_line:
operator = "<"
elif ">" in dep_line:
operator = ">"
else:
operator = ""
if operator == "":
yaml_package, yaml_version = str(dep_line).strip(), None
yaml_left = yaml_package
else:
yaml_package, yaml_version = str(dep_line).strip().split(operator)
if operator == "<" or operator == ">":
if yaml_package in toml_dependencies:
if version.parse(yaml_version) <= version.parse(
toml_dependencies[yaml_package]
):
yaml_left = yaml_package
else:
yaml_left = str(dep_line) + ", "
else:
yaml_left = yaml_package + operator
if yaml_package in toml_dependencies:
if ">" in yaml_left or "<" in yaml_left:
if "," in yaml_left:
# ex: "numpy<1.24.0," + ">=" + "1.2"
replace_text = (
yaml_left + ">=" + toml_dependencies[yaml_package]
)
else:
replace_text = yaml_left + toml_dependencies[yaml_package]
# update yaml package version to TOML min version
elif yaml_version is not None:
if version.parse(
toml_dependencies[yaml_package]
) > version.parse(yaml_version):
# ex: "hypothesis>=" + "6.34.2"
replace_text = yaml_left + toml_dependencies[yaml_package]
elif version.parse(
toml_dependencies[yaml_package]
) == version.parse(yaml_version):
replace_text = dep_line
else:
# ex: "hypothesis + ">=" + 6.34.2"
replace_text = (
yaml_package + ">=" + toml_dependencies[yaml_package]
)
data = data.replace(search_text, replace_text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, but it's getting quite complicated 😄 how about taking this part out to its own function (which accepts yaml_file as an argument), and then unit-testing that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've separated this portion of code into its own function. I'm writing unit tests now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

BTW if you notice some CI failures for some unrelated jobs, that's probably exactly what they are (i.e. unrelated)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a new unit test file at scripts/tests/test_validate_min_versions_in_sync.py.

It comes with 3 additional files under scripts/tests/:

- dummy_toml_file.toml
- dummy_yaml_expected_file_1.yaml
- dummy_yaml_unmodified_file_1.yaml

The .toml file can be adjusted to change the baselines for the minimum dependencies. The dummy_yaml_unmodified_file_1.yaml can be adjusted to test different edge cases, however, the dummy_yaml_expected_file_1.yaml should be changed accordingly to reflect the proper results for the previously mentioned file, since the two should be in sync as a testing set.

I was thinking of creating more sets of expected/unmodified YAML files so that each set can test its own edge case like duplicates, same version, no version, range, etc. That way, each unit test can be smaller and more meaningful.

Currently, I think the dataset is too large to provide meaningful test assertion results.

@MarcoGorelli What are your thoughts? Feedback appreciated!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! this isn't user-facing, so arguably doesn't need too extensive testing. but yes, smaller test files would make it easier to reason about what the script is doing

@mroeschke
Copy link
Member

I think for this to pass the minimum versions will need to be added to the yaml files too e.g. package>=min_version

@mroeschke mroeschke added the Code Style Code style, linting, code_checks label Feb 9, 2023
yaml_left = yaml_package
else:
yaml_package, yaml_version = str(dep_line).strip().split(operator)
if operator == "<" or operator == ">":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pylint is suggesting to rewrite this one

 pylint...................................................................Failed
- hook id: pylint
- duration: 541.86s
- exit code: 8

************* Module scripts.validate_min_versions_in_sync
scripts/validate_min_versions_in_sync.py:93:15: R1714: Consider merging these comparisons with 'in' by using 'operator in ('<', '>')'. Use a set instead if elements are hashable. (consider-using-in)

(pylint is very slow, so it currently only runs in CI - but ruff is re-implementing a large portion of its checks, hopefully we can remove it and just use ruff soon-ish, which is blazingly fast)

@MarcoGorelli
Copy link
Member

thanks for sticking with this - looks like there's a few merge conflicts to resolve now (due to https://github.com/pandas-dev/pandas/pull/51175/files ?), if you resolve them and push then this might be close

@MarcoGorelli
Copy link
Member

thanks for updating - I'm seeing these failures in CI, looks like there's a missing operator somewhere?

pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/ci/deps/actions-310.yaml b/ci/deps/actions-310.yaml
index b81c346..7d15e4c 100644
--- a/ci/deps/actions-310.yaml
+++ b/ci/deps/actions-310.yaml
@@ -50,7 +50,7 @@ dependencies:
   - scipy>=1.7.1
   - sqlalchemy<1.4.46, >=1.4.16
   - tabulate>=0.8.9
-  - tzdata>=2022a
+  - tzdata2022.1
   - xarray>=0.21.0
   - xlrd>=2.0.1
   - xlsxwriter>=1.4.3
diff --git a/ci/deps/actions-311.yaml b/ci/deps/actions-311.yaml
index f4b9cd0..e0a6836 100644
--- a/ci/deps/actions-311.yaml
+++ b/ci/deps/actions-311.yaml
@@ -50,7 +50,7 @@ dependencies:
   - scipy>=1.7.1
   - sqlalchemy<1.4.46, >=1.4.16
   - tabulate>=0.8.9
-  - tzdata>=2022a
+  - tzdata2022.1
   - xarray>=0.21.0
   - xlrd>=2.0.1
   - xlsxwriter>=1.4.3
diff --git a/ci/deps/actions-38-minimum_versions.yaml b/ci/deps/actions-38-minimum_versions.yaml
index 7652b63..706549d 100644
--- a/ci/deps/actions-38-minimum_versions.yaml
+++ b/ci/deps/actions-38-minimum_versions.yaml
@@ -53,7 +53,7 @@ dependencies:
   - scipy=1.7.1
   - sqlalchemy=1.4.16
   - tabulate=0.8.9
-  - tzdata=2022a
+  - tzdata=2022.1
   - xarray=0.21.0
   - xlrd=2.0.1
   - xlsxwriter=1.4.3
diff --git a/ci/deps/actions-39.yaml b/ci/deps/actions-39.yaml
index 3d1d2e5..8c6427d 100644
--- a/ci/deps/actions-39.yaml
+++ b/ci/deps/actions-39.yaml
@@ -50,7 +50,7 @@ dependencies:
   - scipy>=1.7.1
   - sqlalchemy<1.4.46, >=1.4.16
   - tabulate>=0.8.9
-  - tzdata>=2022a
+  - tzdata2022.1
   - xarray>=0.21.0
   - xlrd>=2.0.1
   - xlsxwriter>=1.4.3
diff --git a/ci/deps/actions-pypy-38.yaml b/ci/deps/actions-pypy-38.yaml
index 3218ec1..1fde1e7 100644
--- a/ci/deps/actions-pypy-38.yaml
+++ b/ci/deps/actions-pypy-38.yaml
@@ -14,7 +14,7 @@ dependencies:
   # test dependencies
   - pytest>=7.0.0
   - pytest-cov
-  - pytest-asyncio
+  - pytest-asyncio>=0.17.0
   - pytest-xdist>=2.2.0
   - hypothesis>=6.34.2
 
diff --git a/environment.yml b/environment.yml
index a15997b..99c0469 100644
--- a/environment.yml
+++ b/environment.yml
@@ -53,7 +53,7 @@ dependencies:
   - scipy>=1.7.1
   - sqlalchemy>=1.4.16
   - tabulate>=0.8.9
-  - tzdata>=2022a
+  - tzdata2022.1
   - xarray>=0.21.0
   - xlrd>=2.0.1
   - xlsxwriter>=1.4.3

@MarcoGorelli
Copy link
Member

wait does tzdata use a-b-c-d-e-f-g on conda-forge, and 1-2-3-4-5-6-7 on PyPI? 🤦 🤦 🤦 never seen anything like this 😆

@mroeschke
Copy link
Member

FWIW I don't think tzdata is managed by the same owners on pypi vs conda-forge

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Looks like this might be it

There's a nitpicky CI error

 pylint...................................................................Failed
- hook id: pylint
- duration: 611.32s
- exit code: 16

************* Module scripts.validate_min_versions_in_sync
scripts/validate_min_versions_in_sync.py:127:4: C0200: Consider using enumerate instead of iterating with range and len (consider-using-enumerate)

I've also left some other comments, but content-wise, I think this looks solid

Comment on lines 65 to 66
res = []
[res.append(x) for x in yaml_dependencies if x not in res]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to find the unique dependencies? perhaps using a set would be clearer, like

res = list({x for x in yaml_dependencies})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I was using a set before but when I would run pre-commit, I get this:
TypeError: unhashable type: 'dict'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you remember which hook was throwing that? pretty sure this is valid syntax

In [15]: yaml_dependencies = [1,2,3, 3]

In [16]: list({x for x in yaml_dependencies})
Out[16]: [1, 2, 3]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check minimum version of dependencies are aligned........................Failed
- hook id: validate-min-versions-in-sync
- exit code: 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I'll take a look

Copy link
Contributor Author

@kathleenhang kathleenhang Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is because of
tandard>=0.15.2', 'dask-core', 'seaborn-base', 'moto', 'flask', 'asv>=0.5.1', 'c-compiler', 'cxx-compiler', 'black=23.1.0', 'cpplint', 'flake8=6.0.0', 'isort>=5.2.1', 'mypy=1.0', 'pre-commit>=2.15.0', 'pyupgrade', 'ruff=0.0.215', 'gitpython', 'gitdb', 'natsort', 'numpydoc', 'pydata-sphinx-theme<0.11', 'pytest-cython', 'sphinx', 'sphinx-panels', 'sphinx-copybutton', 'types-python-dateutil', 'types-PyMySQL', 'types-pytz', 'types-setuptools', 'nbconvert>=6.4.5', 'nbsphinx', 'pandoc', 'ipywidgets', 'nbformat', 'notebook>=6.0.3', 'ipykernel', 'jinja2>=3.0.0', 'markdown', 'feedparser', 'pyyaml', 'requests', {'pip': ['sphinx-toggleprompt']}]

In one of the yaml_dependencies, there is {'pip': ['sphinx-toggleprompt']}]

That's also why I added this portion:

if isinstance(dependency, dict) or dependency in EXCLUSION_LIST: continue

To ignore that nested dictionary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, thanks for explaining!

Would it work to apply

$ git diff
diff --git a/scripts/validate_min_versions_in_sync.py b/scripts/validate_min_versions_in_sync.py
index b1dc734994..6e3c93dd57 100755
--- a/scripts/validate_min_versions_in_sync.py
+++ b/scripts/validate_min_versions_in_sync.py
@@ -62,9 +62,6 @@ def pin_min_versions_to_ci_deps():
             yaml_file_data = yaml_f.read()
             yaml_file = yaml.safe_load(yaml_file_data)
             yaml_dependencies = yaml_file["dependencies"]
-            res = []
-            [res.append(x) for x in yaml_dependencies if x not in res]
-            yaml_dependencies = res
             yaml_map = get_yaml_map_from(yaml_dependencies)
             toml_map = get_toml_map_from(toml_dependencies)
             yaml_file_data = pin_min_versions_to_yaml_file(
@@ -105,6 +102,8 @@ def get_yaml_map_from(yaml_dic) -> dict[str, str]:
     for dependency in yaml_dic:
         if type(dependency) == dict or dependency in EXCLUSION_LIST:
             continue
+        if dependency in yaml_map:
+            continue
         search_text = str(dependency)
         operator = get_operator_from(search_text)
         if "," in dependency:

?

I would help if you could also add a short little docstring to pin_min_versions_to_ci_deps in which you note that pip dependencies won't be pinned (which is definitely OK for now, what you've done already adds value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, it works! I added the docstring, along with all the other requested updates.

yaml_file_data = pin_min_versions_to_yaml_file(
yaml_map, toml_map, yaml_file_data
)
os.remove(curr_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary (the next line will overwrite the file)

def get_yaml_map_from(yaml_dic) -> dict[str, str]:
yaml_map = {}
for dependency in yaml_dic:
if type(dependency) == dict or dependency in EXCLUSION_LIST:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

if isinstance(dependency, dict)

would be more idiomatic

continue
toml_version = version.parse(min_dep)
yaml_versions = clean_version_list(yaml_versions, toml_version)
[cleaned_yaml_versions.append(x) for x in yaml_versions if "-" not in x]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above - can we create this in one go instead of first creating an empty list and then appending to it within a list comprehension?

@MarcoGorelli MarcoGorelli added this to the 2.1 milestone Feb 21, 2023
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful, this might be it!

I'll do a final check tomorrow and then hopefully we can get this in

@@ -12,7 +12,8 @@ dependencies:
- pytest>=7.0.0
- pytest-cov
- pytest-xdist>=2.2.0
- pytest-asyncio>=0.17
- psutil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

psutil was recently removed from all of the these files. Could you remove them?

@pytest.mark.parametrize(
"src_toml, src_yaml, expected_yaml",
[
( # random
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of these comments could you put this comment in the file name like random_deps.toml

Also for the directory naming could you use something like scripts/tests/data/?



def clean_version_list(yaml_versions, toml_version):
for i, _ in enumerate(yaml_versions):
Copy link
Member

@mroeschke mroeschke Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i, _ in enumerate(yaml_versions):
for i in range(len(yaml_versions)):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait this undoes the pylint nit 😆

 ************* Module scripts.validate_min_versions_in_sync
scripts/validate_min_versions_in_sync.py:132:4: C0200: Consider using enumerate instead of iterating with range and len (consider-using-enumerate)

maybe we can just remove this one

toml_map = get_toml_map_from(toml_dependencies)
with open(curr_file, "w") as f:
f.write(
pin_min_versions_to_yaml_file(yaml_map, toml_map, yaml_file_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this runs on every commit, we don't necessarily want to rewrite this file every time.

@MarcoGorelli
Copy link
Member

Should be good, thanks! I've just made some minor fixups, and re-ran the script from having checked-out the yaml files to the upstream/main versions

@MarcoGorelli
Copy link
Member

Nice, let's ship this

I can see a few places where there could still be some minor cleanups, e.g.

-            for yaml_version in yaml_versions:
-                old_dep += yaml_version + ", "
-            old_dep = old_dep[:-2]
+            old_dep = old_dep + ", ".join(yaml_versions)

and

-EXCLUSION_LIST = {
-    "python=3.8[build=*_pypy]": None,
-    "tzdata": None,
-    "pyarrow": None,
-}
+EXCLUSION_LIST = frozenset(["python=3.8[build=*_pypy]", "tzdata", "pyarrow"])

If you wanted to get these in in a follow-up PR, that'd be welcome - but for now, it's OK, let's merge this so we're a bit more protected against CI failures

Thanks @kathleenhang , fantastic effort here

@MarcoGorelli MarcoGorelli merged commit 9935690 into pandas-dev:main Feb 23, 2023
@kathleenhang
Copy link
Contributor Author

@MarcoGorelli Thanks, I appreciate your help! Nice additions, I'm picking up some nifty python tricks here.

I'll be sure to add those in a follow-up PR.

@kathleenhang kathleenhang deleted the kathleenhang-50207 branch February 23, 2023 20:24
@kathleenhang kathleenhang restored the kathleenhang-50207 branch February 23, 2023 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Style Code style, linting, code_checks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI check each minimum dependency is enforced in *.yaml and environment.yml files
3 participants