Streamline docker usage #49981

Merged (6 commits) on Dec 2, 2022
49 changes: 10 additions & 39 deletions Dockerfile
@@ -1,42 +1,13 @@
FROM quay.io/condaforge/mambaforge
FROM python:3.10.8
WORKDIR /home/pandas

# if you forked pandas, you can pass in your own GitHub username to use your fork
# i.e. gh_username=myname
ARG gh_username=pandas-dev
ARG pandas_home="/home/pandas"
RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y build-essential

# Avoid warnings by switching to noninteractive
ENV DEBIAN_FRONTEND=noninteractive
# hdf5 needed for pytables installation
RUN apt-get install -y libhdf5-dev

# Configure apt and install packages
RUN apt-get update \
&& apt-get -y install --no-install-recommends apt-utils git tzdata dialog 2>&1 \
Member

Do you know whether those packages are really no longer needed? tzdata in particular seems to have been added on purpose to get the timezone tests working: #46219

(We copied this into the gitpod Dockerfile as well, so if it's no longer needed we can clean it up there too.)

Member Author

tzdata is already installed on the image. I'm guessing it comes either from the base python image or from the underlying base ubuntu image.
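
One way to check this claim from inside a running container is sketched below. The helper name `check_tzdata` is hypothetical and not part of this PR; it assumes the zoneinfo database lives under `/usr/share/zoneinfo`, which is the path the removed `ln -fs` line relied on.

```shell
# Hypothetical helper: report whether the zoneinfo database is present.
# The optional argument overrides the directory to inspect; the default
# is the path used by the removed "ln -fs /usr/share/zoneinfo/..." line.
check_tzdata() {
    zoneinfo_dir="${1:-/usr/share/zoneinfo}"
    if [ -e "$zoneinfo_dir/Etc/UTC" ]; then
        echo "tzdata present"
    else
        echo "tzdata missing"
    fi
}
```

Calling `check_tzdata` with no argument inside the container inspects the default path. On Debian-based images, `dpkg -s tzdata` is another way to query the package itself.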

#
# Configure timezone (fix for tests which try to read from "/etc/localtime")
&& ln -fs /usr/share/zoneinfo/Etc/UTC /etc/localtime \
&& dpkg-reconfigure -f noninteractive tzdata \
#
# cleanup
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /var/lib/apt/lists/*

# Switch back to dialog for any ad-hoc use of apt-get
ENV DEBIAN_FRONTEND=dialog

# Clone pandas repo
RUN mkdir "$pandas_home" \
&& git clone "https://github.com/$gh_username/pandas.git" "$pandas_home" \
&& cd "$pandas_home" \
&& git remote add upstream "https://github.com/pandas-dev/pandas.git" \
&& git pull upstream main

# Set up environment
RUN mamba env create -f "$pandas_home/environment.yml"

# Build C extensions and pandas
SHELL ["mamba", "run", "--no-capture-output", "-n", "pandas-dev", "/bin/bash", "-c"]
RUN cd "$pandas_home" \
&& export \
&& python setup.py build_ext -j 4 \
&& python -m pip install --no-build-isolation -e .
RUN python -m pip install --upgrade pip
RUN python -m pip install --use-deprecated=legacy-resolver \
Member Author

@WillAyd WillAyd Dec 1, 2022

--use-deprecated=legacy-resolver is in place because the new pip resolver seems to download multiple versions of each package to (I assume) check their dependencies via setup.py or pyproject.toml. Some of our dependencies (e.g. boto3) have a ton of releases, which makes for a painfully slow process.

I think the possible ways to move to the newer resolver are:

  1. CI: add minimal requirements file #48828
  2. https://pip.pypa.io/en/stable/topics/dependency-resolution/#possible-ways-to-reduce-backtracking

So either reduce the number of dependencies, add a floor to their versions, or set up a constraints file once after solving.
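
As a rough sketch of the constraints-file option (not taken from this PR): solve once, record the versions that were picked, and reuse them on later installs. The helper name `pin_dev_deps` and the file name `constraints.txt` are hypothetical; `pip freeze` and the `-c`/`--constraint` option are standard pip features.

```shell
# Hypothetical helper; both file names are assumptions, not files in this PR.
pin_dev_deps() {
    reqs="${1:-requirements-dev.txt}"
    constraints="${2:-constraints.txt}"
    # One slow solve, then record the exact versions it selected.
    python -m pip install -r "$reqs" || return 1
    python -m pip freeze --exclude-editable > "$constraints" || return 1
    echo "wrote $constraints"
}

# Later installs can reuse the pins, which should leave the resolver
# far fewer candidate versions to try:
#   python -m pip install -c constraints.txt -r requirements-dev.txt
```

The constraints file would need to be regenerated whenever the development requirements change, so this trades resolver time for some maintenance.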

Member

Or use mamba, which has a decent dependency resolver ...

-r https://raw.githubusercontent.com/pandas-dev/pandas/main/requirements-dev.txt
CMD ["/bin/bash"]
31 changes: 9 additions & 22 deletions doc/source/development/contributing_environment.rst
@@ -228,34 +228,21 @@ with a full pandas development environment.

Build the Docker image::

# Build the image pandas-yourname-env
docker build --tag pandas-yourname-env .
# Or build the image by passing your GitHub username to use your own fork
docker build --build-arg gh_username=yourname --tag pandas-yourname-env .
# Build the image
docker build -t pandas-dev .

Run Container::

# Run a container and bind your local repo to the container
docker run -it -w /home/pandas --rm -v path-to-local-pandas-repo:/home/pandas pandas-yourname-env
# This command assumes you are running from your local repo
# but if not alter ${PWD} to match your local repo path
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev

Then a ``pandas-dev`` virtual environment will be available with all the development dependencies.
When inside the running container you can build and install pandas the same way as the other methods

.. code-block:: shell

root@... :/home/pandas# conda env list
# conda environments:
#
base * /opt/conda
pandas-dev /opt/conda/envs/pandas-dev

.. note::
If you bind your local repo for the first time, you have to build the C extensions afterwards.
Run the following command inside the container::

python setup.py build_ext -j 4

You need to rebuild the C extensions anytime the Cython code in ``pandas/_libs`` changes.
This most frequently occurs when changing or merging branches.
.. code-block:: bash
python setup.py build_ext -j 4
python -m pip install -e . --no-build-isolation --no-use-pep517

*Even easier, you can integrate Docker with the following IDEs:*
