Skip to content

CI check each minimum dependency is enforced in *.yaml and environment.yml files #50207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MarcoGorelli opened this issue Dec 12, 2022 · 20 comments · Fixed by #51189
Closed

CI check each minimum dependency is enforced in *.yaml and environment.yml files #50207

MarcoGorelli opened this issue Dec 12, 2022 · 20 comments · Fixed by #51189
Assignees
Labels
CI Continuous Integration good first issue

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 12, 2022

xref #50205 (comment)

If any dependency has a minimum version specified in

pandas/pyproject.toml

Lines 57 to 119 in 16b9c98

[project.optional-dependencies]
test = ['hypothesis>=5.5.3', 'pytest>=6.0', 'pytest-xdist>=1.31', 'pytest-asyncio>=0.17.0']
performance = ['bottleneck>=1.3.2', 'numba>=0.53.1', 'numexpr>=2.7.1']
timezone = ['tzdata>=2022.1']
computation = ['scipy>=1.7.1', 'xarray>=0.21.0']
fss = ['fsspec>=2021.07.0']
aws = ['s3fs>=2021.08.0']
gcp = ['gcsfs>=2021.07.0', 'pandas-gbq>=0.15.0']
excel = ['odfpy>=1.4.1', 'openpyxl>=3.0.7', 'pyxlsb>=1.0.8', 'xlrd>=2.0.1', 'xlsxwriter>=1.4.3']
parquet = ['pyarrow>=6.0.0']
feather = ['pyarrow>=6.0.0']
hdf5 = [# blosc only available on conda (https://github.com/Blosc/python-blosc/issues/297)
#'blosc>=1.20.1',
'tables>=3.6.1']
spss = ['pyreadstat>=1.1.2']
postgresql = ['SQLAlchemy>=1.4.16', 'psycopg2>=2.8.6']
mysql = ['SQLAlchemy>=1.4.16', 'pymysql>=1.0.2']
sql-other = ['SQLAlchemy>=1.4.16']
html = ['beautifulsoup4>=4.9.3', 'html5lib>=1.1', 'lxml>=4.6.3']
xml = ['lxml>=4.6.3']
plot = ['matplotlib>=3.6.1']
output_formatting = ['jinja2>=3.0.0', 'tabulate>=0.8.9']
clipboard = ['PyQt5>=5.15.1', 'qtpy>=2.2.0']
compression = ['brotlipy>=0.7.0', 'python-snappy>=0.6.0', 'zstandard>=0.15.2']
all = ['beautifulsoup4>=4.9.3',
# blosc only available on conda (https://github.com/Blosc/python-blosc/issues/297)
#'blosc>=1.21.0',
'bottleneck>=1.3.2',
'brotlipy>=0.7.0',
'fastparquet>=0.6.3',
'fsspec>=2021.07.0',
'gcsfs>=2021.07.0',
'html5lib>=1.1',
'hypothesis>=6.13.0',
'jinja2>=3.0.0',
'lxml>=4.6.3',
'matplotlib>=3.6.1',
'numba>=0.53.1',
'numexpr>=2.7.3',
'odfpy>=1.4.1',
'openpyxl>=3.0.7',
'pandas-gbq>=0.15.0',
'psycopg2>=2.8.6',
'pyarrow>=6.0.0',
'pymysql>=1.0.2',
'PyQt5>=5.15.1',
'pyreadstat>=1.1.2',
'pytest>=6.0',
'pytest-xdist>=1.31',
'pytest-asyncio>=0.17.0',
'python-snappy>=0.6.0',
'pyxlsb>=1.0.8',
'qtpy>=2.2.0',
'scipy>=1.7.1',
's3fs>=2021.08.0',
'SQLAlchemy>=1.4.16',
'tables>=3.6.1',
'tabulate>=0.8.9',
'tzdata>=2022.1',
'xarray>=0.21.0',
'xlrd>=2.0.1',
'xlsxwriter>=1.4.3',
'zstandard>=0.15.2']

then if that dependency appears in environment.yml or any of the *.yaml files, then it should either be pinned to that minimum version, or be marked as >= that minimum version

Let's add a script to automate checking this!

https://github.com/pandas-dev/pandas/blob/main/scripts/validate_min_versions_in_sync.py is kind of related


Feel free to tag me for review if you take this on

@MarcoGorelli MarcoGorelli added CI Continuous Integration good first issue labels Dec 12, 2022
@mroeschke
Copy link
Member

Notes:

  1. ci/deps/actions-*-minimum_versions.yaml should be pinned as = instead of >=
  2. IMO validate_min_versions_in_sync.py should be updated to ensure the dependency files are also in sync

Personally not sure if this is a good first issue as there are a lot of moving parts but happy to be wrong

@seanjedi
Copy link
Contributor

Hmm, I believe I understand the ask.
So essentially get all the dependencies from pyproject.toml optional-dependencies, and compare them against the dependencies within environment.yml, and if there are any dependencies that are newer within pyproject.toml tan environment.yaml, then update environment.yaml to either = or >= correct?
So if that is the case, then:

  1. how do we determine when we want to have it = or >=? Would that be depdendent on how pyproject.toml has it?
  2. what if environment.yml somehow has a newer depdency, then should be ignore it, or should we downgrade it to the version that pyproject has it at?

Also can I take this?

@MarcoGorelli
Copy link
Member Author

I'd say:

  1. if it already has = then keep as-is, else, add >=
  2. not sure I understand the question, can you show an example?

And yes, sure!

@seanjedi
Copy link
Contributor

For 2 let's say hypothetically we have a situation where
In environment.yml we have:

- numba=0.53.1

and in pyproject we have:

numba>=0.52

Should I check if the dependencies in pyproject are indeed greater than the ones in environment, or if they differ then set the dependencies to pyproject
This might be all moot, but I wanted to doublecheck since in my head the way I was thinking of comparing the two is if pyproject has a dependency greater than environment, then update environment, otherwise stay the same, or do we want the two be equal?
I think the case we have, is that we want it to be equal, so above even though environment has a great version, we should have it pinned to be numba>=0.52

@seanjedi
Copy link
Contributor

take

@MarcoGorelli
Copy link
Member Author

If there's already a pin, then as long as the version is greater than the minimum version, you can leave it as it is

If there's no pin, then just put >= and the minimum version

@seanjedi
Copy link
Contributor

So should I create a new file for this, or extend validate_min_versions_in_sync.py to be able to update these dependencies as well?

@MarcoGorelli
Copy link
Member Author

it's probably fine to extend that file if possible

@seanjedi
Copy link
Contributor

seanjedi commented Dec 13, 2022

Created a branch here to work on the issue: seanjedi-50207

@seanjedi
Copy link
Contributor

@MarcoGorelli I have a question regarding which YAML files to check
Right now I am checking this list of YAML file (these are all the .yml files I can find in both the root and ci/deps, not all of these have dependencies, that is correct right? )

Also looking at how we were checking the dependencies earlier, it seems that we were only concerned with the optional dependencies, should I only be checking optional dependencies, or if I find any dependency that is both within pyproject.toml and any yaml file, then I should pin it?

image

@MarcoGorelli
Copy link
Member Author

you only need to look at environment.yml and ci/deps/*.yaml, the other yaml files are irrelevant

I find any dependency that is both within pyproject.toml and any yaml file, then I should pin it

yup (unless it's pinned already)

@seanjedi
Copy link
Contributor

Update (Since I have not updated in a long while): I am still working on this issue, but progress has slowed down a lot since the last commit I made in my branch.
This began with the Holiday's taking my time away so that I can spend it with family.
Now it is due to graduate school, and a few requirements for graduation that appeared that I have to appeal against now due to administration.
I bring this up so that if someone sees this, I want to ensure that this issue is not dead and that if anyone else wants to work on this issue, I would be open to that (especially collaboratively), and please let me know so that I can let you know what I have worked on so that we don't end up doing the same work.

Otherwise, I will continue working on this issue when I have the chance.

@MarcoGorelli
Copy link
Member Author

cool, thanks - I've removed the assignment so it's clear this is open

@kathleenhang
Copy link
Contributor

take

@kathleenhang
Copy link
Contributor

kathleenhang commented Jan 31, 2023

@seanjedi Hi there, I saw you mentioned about working collaboratively, and I'm open to that. I did see the branch you created for this issue. Feel free to let me know if you worked on anything extra or details on how you'd like to collaborate together. I'm going to start working on this issue now.

Edit: Here is the WIP scripts/validate_min_versions_in_sync.py

Also, I have a question. What types of characters are allowed for the version portion of the YAML dependency files? I ask this because the only dependency file in ci/deps/ which breaks the script is:

ci/deps/actions-pypy-38.yaml - line 8 - contains:

- python=3.8[build=*_pypy] # TODO: use this once pypy3.8 is available

Because it contains 2 "=" signs. I am wondering if I should write code to accommodate this or if this is out of place and only digits and decimals should be accommodated for the version portion?

@kathleenhang
Copy link
Contributor

@MarcoGorelli Could you please review my script? I linked it above. I'm wondering if it will run too slow or be too risky since it would create a new file and delete the old one. I'm still working on it, but wanted to know if I'm headed in the right direction.

@MarcoGorelli
Copy link
Member Author

nice! taking a look 👀

@MarcoGorelli
Copy link
Member Author

hey @kathleenhang, looks like you're on the right path - the script doesn't run at the moment though: if you have a list, then .append will return None, you probably wanted something like

    all_yaml_files = list(YAML_PATH.iterdir())
    all_yaml_files.append(ENV_PATH)

And

            data = yaml_f.read()
            yaml_file = yaml.safe_load(yaml_f)

should probably have been

            data = yaml_f.read()
            yaml_file = yaml.safe_load(data)

?

If you can get it to run without errors, then I'll take a closer look

@kathleenhang
Copy link
Contributor

@MarcoGorelli Got it working! I commented out line 8 from ci/deps/actions-pypy-38.yaml which shows- python=3.8[build=*_pypy] # TODO: use this once pypy3.8 is available. If you run the script again, make sure to comment that line out, or the script breaks.

What should be done for these edge cases?

- pyqt5==5.15.1
- numpy<1.24.0
- sqlalchemy<1.4.46
- python=3.8[build=*_pypy] # TODO: use this once pypy3.8 is available

Also the script does mess up on a line in environment.yml: - jinja2>=3.0.0>=3.0.0. I'm planning to fix that. Other than that, it works for me.

I'm not sure if it will work for you since I am using the imported os python module. I'm not sure if that may be different on Windows due to differing paths, etc.

Let me know what you think, thanks.

@MarcoGorelli
Copy link
Member Author

thanks for updating!

so:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration good first issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants