BUG: Docs won't build (S3 bucket does not exist) #56592


Closed
3 tasks done
johnstacy opened this issue Dec 21, 2023 · 11 comments · Fixed by #56762

Comments

@johnstacy

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

git clone https://github.com/pandas-dev/pandas.git build/pandas
cd build/pandas
git checkout v2.1.4

pip3 install -r requirements-dev.txt
python -m pip install -ve . --no-build-isolation --config-settings editable-verbose=true
cd doc
python make.py html --num-jobs 1

Issue Description

Built a Docker image using the provided Dockerfile. Inside the container, I ran the commands above to build the docs. The build complains about not finding an S3 bucket (s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml). I'm running with --num-jobs 1 because when I ran the build with parallel processing, it failed without saying what the actual issue was.

Expected Behavior

HTML docs build successfully.

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-14-generic
Version : #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 15:13:47 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 63.2.0
pip : 23.3.2
Cython : 0.29.33
pytest : 7.4.3
hypothesis : 6.92.1
sphinx : 6.2.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.1.9
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat: None
fastparquet : 2023.10.1
fsspec : 2023.12.2
gcsfs : 2023.12.2post1
matplotlib : 3.7.4
numba : 0.58.1
numexpr : 2.8.8
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : 1.2.6
pyxlsb : 1.0.10
s3fs : 2023.12.2
scipy : 1.11.4
sqlalchemy : 2.0.23
tables : 3.9.2
tabulate : 0.9.0
xarray : 2023.12.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@johnstacy johnstacy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 21, 2023
@mroeschke
Member

A PR to change the example to a .. code-block:: python would be welcome
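For anyone picking this up, the suggested conversion looks roughly like the following sketch (the exact snippet and options in doc/source/user_guide/io.rst may differ):

```rst
.. Before: the ipython directive executes the code at doc-build time,
.. which requires network access to the S3 bucket.

.. ipython:: python

    df = pd.read_xml(
        "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
        xpath=".//journal-meta",
    )

.. After: a code-block is rendered verbatim, so no network call is made
.. during the build (any expected output has to be pasted in manually).

.. code-block:: python

    df = pd.read_xml(
        "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
        xpath=".//journal-meta",
    )
```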

@mroeschke mroeschke added Docs good first issue and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 22, 2023
@JackCollins91
Contributor

May I take this?

@srinivaspavan9

srinivaspavan9 commented Dec 30, 2023

Hi, I started working on this issue, but I couldn't reproduce it.

This is what I got after executing
python make.py html --num-jobs 1
in the container, and the complete build finished.

Is this what I am supposed to get?

[screenshot: build output showing a completed docs build]

@JackCollins91
Contributor

JackCollins91 commented Dec 30, 2023

+1 to @srinivaspavan9. I also could not reproduce from a fresh build in a Docker container. The documents built without any error, and with no warning about an S3 bucket. I got the same deprecation warnings as above, but I assume that is a different issue(?)

System info attached. Python version is 3.10.8.
=== System Information ===.txt

Happy to continue work if there's a next step.

@johnstacy
Author

johnstacy commented Jan 2, 2024

@JackCollins91 In your system info, it shows python 3.9. Was the system info generated outside the container?

@johnstacy
Author

And to confirm, you all are using a container built using this Dockerfile? https://github.com/pandas-dev/pandas/blob/main/Dockerfile

@johnstacy
Author

Here's my stacktrace by the way...

/ci/build/pandas/doc/source/user_guide/index.rst:74: WARNING: toctree contains reference to excluded document 'user_guide/style'
WARNING: 
>>>-------------------------------------------------------------------------
Exception in /ci/build/pandas/doc/source/user_guide/io.rst at block ending on line None
Specify :okexcept: as an option in the ipython:: block to suppress this message
---------------------------------------------------------------------------
NoSuchBucket                              Traceback (most recent call last)
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:113, in _error_wrapper(func, args, kwargs, retries)
    112 try:
--> 113     return await func(*args, **kwargs)
    114 except S3_RETRYABLE_ERRORS as e:
File /usr/local/lib/python3.10/site-packages/aiobotocore/client.py:408, in AioBaseClient._make_api_call(self, operation_name, api_params)
    407     error_class = self.exceptions.from_code(error_code)
--> 408     raise error_class(parsed_response, operation_name)
    409 else:
NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist
The above exception was the direct cause of the following exception:
FileNotFoundError                         Traceback (most recent call last)
Cell In[387], line 1
----> 1 df = pd.read_xml(
      2     "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
      3     xpath=".//journal-meta",
      4 )
File /ci/build/pandas/pandas/io/xml.py:1132, in read_xml(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, dtype, converters, parse_dates, encoding, parser, stylesheet, iterparse, compression, storage_options, dtype_backend)
    888 r"""
    889 Read XML document into a :class:`~pandas.DataFrame` object.
    890 
   (...)
   1128 2  triangle      180    3.0
   1129 """
   1130 check_dtype_backend(dtype_backend)
-> 1132 return _parse(
   1133     path_or_buffer=path_or_buffer,
   1134     xpath=xpath,
   1135     namespaces=namespaces,
   1136     elems_only=elems_only,
   1137     attrs_only=attrs_only,
   1138     names=names,
   1139     dtype=dtype,
   1140     converters=converters,
   1141     parse_dates=parse_dates,
   1142     encoding=encoding,
   1143     parser=parser,
   1144     stylesheet=stylesheet,
   1145     iterparse=iterparse,
   1146     compression=compression,
   1147     storage_options=storage_options,
   1148     dtype_backend=dtype_backend,
   1149 )
File /ci/build/pandas/pandas/io/xml.py:852, in _parse(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, dtype, converters, parse_dates, encoding, parser, stylesheet, iterparse, compression, storage_options, dtype_backend, **kwargs)
    849 else:
    850     raise ValueError("Values for parser can only be lxml or etree.")
--> 852 data_dicts = p.parse_data()
    854 return _data_to_frame(
    855     data=data_dicts,
    856     dtype=dtype,
   (...)
    860     **kwargs,
    861 )
File /ci/build/pandas/pandas/io/xml.py:556, in _LxmlFrameParser.parse_data(self)
    553 from lxml.etree import iterparse
    555 if self.iterparse is None:
--> 556     self.xml_doc = self._parse_doc(self.path_or_buffer)
    558     if self.stylesheet:
    559         self.xsl_doc = self._parse_doc(self.stylesheet)
File /ci/build/pandas/pandas/io/xml.py:631, in _LxmlFrameParser._parse_doc(self, raw_doc)
    622 def _parse_doc(
    623     self, raw_doc: FilePath | ReadBuffer[bytes] | ReadBuffer[str]
    624 ) -> etree._Element:
    625     from lxml.etree import (
    626         XMLParser,
    627         fromstring,
    628         parse,
    629     )
--> 631     handle_data = get_data_from_filepath(
    632         filepath_or_buffer=raw_doc,
    633         encoding=self.encoding,
    634         compression=self.compression,
    635         storage_options=self.storage_options,
    636     )
    638     with preprocess_data(handle_data) as xml_data:
    639         curr_parser = XMLParser(encoding=self.encoding)
File /ci/build/pandas/pandas/io/xml.py:700, in get_data_from_filepath(filepath_or_buffer, encoding, compression, storage_options)
    689     filepath_or_buffer = stringify_path(filepath_or_buffer)
    691 if (
    692     isinstance(filepath_or_buffer, str)
    693     and not filepath_or_buffer.startswith(("<?xml", "<"))
   (...)
    698     or file_exists(filepath_or_buffer)
    699 ):
--> 700     with get_handle(
    701         filepath_or_buffer,
    702         "r",
    703         encoding=encoding,
    704         compression=compression,
    705         storage_options=storage_options,
    706     ) as handle_obj:
    707         filepath_or_buffer = (
    708             handle_obj.handle.read()
    709             if hasattr(handle_obj.handle, "read")
    710             else handle_obj.handle
    711         )
    713 return filepath_or_buffer
File /ci/build/pandas/pandas/io/common.py:718, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    715     codecs.lookup_error(errors)
    717 # open URLs
--> 718 ioargs = _get_filepath_or_buffer(
    719     path_or_buf,
    720     encoding=encoding,
    721     compression=compression,
    722     mode=mode,
    723     storage_options=storage_options,
    724 )
    726 handle = ioargs.filepath_or_buffer
    727 handles: list[BaseBuffer]
File /ci/build/pandas/pandas/io/common.py:420, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    415     pass
    417 try:
    418     file_obj = fsspec.open(
    419         filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
--> 420     ).open()
    421 # GH 34626 Reads from Public Buckets without Credentials needs anon=True
    422 except tuple(err_types_to_retry_with_anon):
File /usr/local/lib/python3.10/site-packages/fsspec/core.py:135, in OpenFile.open(self)
    128 def open(self):
    129     """Materialise this as a real open file without context
    130 
    131     The OpenFile object should be explicitly closed to avoid enclosed file
    132     instances persisting. You must, therefore, keep a reference to the OpenFile
    133     during the life of the file-like it generates.
    134     """
--> 135     return self.__enter__()
File /usr/local/lib/python3.10/site-packages/fsspec/core.py:103, in OpenFile.__enter__(self)
    100 def __enter__(self):
    101     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 103     f = self.fs.open(self.path, mode=mode)
    105     self.fobjects = [f]
    107     if self.compression is not None:
File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1295, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1293 else:
   1294     ac = kwargs.pop("autocommit", not self._intrans)
-> 1295     f = self._open(
   1296         path,
   1297         mode=mode,
   1298         block_size=block_size,
   1299         autocommit=ac,
   1300         cache_options=cache_options,
   1301         **kwargs,
   1302     )
   1303     if compression is not None:
   1304         from fsspec.compression import compr
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:671, in S3FileSystem._open(self, path, mode, block_size, acl, version_id, fill_cache, cache_type, autocommit, size, requester_pays, cache_options, **kwargs)
    668 if cache_type is None:
    669     cache_type = self.default_cache_type
--> 671 return S3File(
    672     self,
    673     path,
    674     mode,
    675     block_size=block_size,
    676     acl=acl,
    677     version_id=version_id,
    678     fill_cache=fill_cache,
    679     s3_additional_kwargs=kw,
    680     cache_type=cache_type,
    681     autocommit=autocommit,
    682     requester_pays=requester_pays,
    683     cache_options=cache_options,
    684     size=size,
    685 )
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:2110, in S3File.__init__(self, s3, path, mode, block_size, acl, version_id, fill_cache, s3_additional_kwargs, autocommit, cache_type, requester_pays, cache_options, size)
   2108         self.details = s3.info(path)
   2109         self.version_id = self.details.get("VersionId")
-> 2110 super().__init__(
   2111     s3,
   2112     path,
   2113     mode,
   2114     block_size,
   2115     autocommit=autocommit,
   2116     cache_type=cache_type,
   2117     cache_options=cache_options,
   2118     size=size,
   2119 )
   2120 self.s3 = self.fs  # compatibility
   2122 # when not using autocommit we want to have transactional state to manage
File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1651, in AbstractBufferedFile.__init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
   1649         self.size = size
   1650     else:
-> 1651         self.size = self.details["size"]
   1652     self.cache = caches[cache_type](
   1653         self.blocksize, self._fetch_range, self.size, **cache_options
   1654     )
   1655 else:
File /usr/local/lib/python3.10/site-packages/fsspec/spec.py:1664, in AbstractBufferedFile.details(self)
   1661 @property
   1662 def details(self):
   1663     if self._details is None:
-> 1664         self._details = self.fs.info(self.path)
   1665     return self._details
File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)
File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result
File /usr/local/lib/python3.10/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:1328, in S3FileSystem._info(self, path, bucket, key, refresh, version_id)
   1323         raise translate_boto_error(e, set_cause=False)
   1325 try:
   1326     # We check to see if the path is a directory by attempting to list its
   1327     # contexts. If anything is found, it is indeed a directory
-> 1328     out = await self._call_s3(
   1329         "list_objects_v2",
   1330         self.kwargs,
   1331         Bucket=bucket,
   1332         Prefix=key.rstrip("/") + "/" if key else "",
   1333         Delimiter="/",
   1334         MaxKeys=1,
   1335         **self.req_kw,
   1336     )
   1337     if (
   1338         out.get("KeyCount", 0) > 0
   1339         or out.get("Contents", [])
   1340         or out.get("CommonPrefixes", [])
   1341     ):
   1342         return {
   1343             "name": "/".join([bucket, key]),
   1344             "type": "directory",
   1345             "size": 0,
   1346             "StorageClass": "DIRECTORY",
   1347         }
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:348, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    346 logger.debug("CALL: %s - %s - %s", method.__name__, akwarglist, kw2)
    347 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 348 return await _error_wrapper(
    349     method, kwargs=additional_kwargs, retries=self.retries
    350 )
File /usr/local/lib/python3.10/site-packages/s3fs/core.py:140, in _error_wrapper(func, args, kwargs, retries)
    138         err = e
    139 err = translate_boto_error(err)
--> 140 raise err
FileNotFoundError: The specified bucket does not exist
<<<-------------------------------------------------------------------------
Exception occurred:
  File "/usr/local/lib/python3.10/site-packages/IPython/sphinxext/ipython_directive.py", line 584, in process_input
    raise RuntimeError(
RuntimeError: Unexpected exception in `/ci/build/pandas/doc/source/user_guide/io.rst` line None
The full traceback has been saved in /tmp/sphinx-err-hidb_hlu.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
+ /usr/local/bin/ninja

@JackCollins91
Contributor

Hi @johnstacy and @srinivaspavan9

Yes, confirmed: the Docker container is the same one you've used.

Thanks for the stack trace. The traceback points at doc/source/user_guide/io.rst, which has several S3 bucket URLs in its examples.

However, I was able to build the docs without an issue, and the code blocks seem to be running and working. Is it possible this is actually caused by some other kind of web connection error, or by some issue with your particular machine accessing S3?

Perhaps try from a different machine if possible?

[screenshot: docs building successfully]

@mroeschke If we cannot replicate the issue, is there still a desire to make a PR for either of the following?

  1. convert the example code blocks in doc/source/user_guide/io.rst from .. ipython:: python to .. code-block:: python

OR

  2. change :okwarning: to :okexcept:
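For reference, option 2 would be a one-line change to the directive options (a sketch; the real block in io.rst may carry other options):

```rst
.. ipython:: python
    :okexcept:

    df = pd.read_xml(
        "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
        xpath=".//journal-meta",
    )
```

Note that :okexcept: only suppresses the failure (the build continues and the rendered output shows the traceback), whereas the code-block conversion avoids executing the example at all.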

@mroeschke
Member

Yes there's still a desire to convert this to a code-block. The docs should not need to make network calls to build.

@johnstacy
Author

I am seeing the error both from my machine at home and when building in a CI pipeline, so I'm basically ruling out a connection issue. Also, on my machine, I pulled the AWS CLI container and was able to anonymously access the bucket with --no-sign-request, which makes sense since it's public. This got me thinking that perhaps one of the libraries changed, and maybe the read_xml call needs something in storage_options to tell it to use anonymous access. Playing with that now...

I'm curious if anybody has tried building a brand new image from the Dockerfile and using that or if you're using old images.
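The anonymous-access idea above can be sketched as follows. To be clear, this is a hypothesis, not a confirmed fix: pandas forwards storage_options to fsspec/s3fs, where anon=True means unsigned requests (the equivalent of the AWS CLI's --no-sign-request). The runnable part below just demonstrates the same read_xml call shape against a local stand-in document, with no network access.

```python
import io

import pandas as pd

# Hypothesis from the comment above: pass anonymous access through
# storage_options, which pandas hands to fsspec/s3fs unchanged.
# (Untested against the real bucket; this is the call shape, not a fix.)
#
#   df = pd.read_xml(
#       "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
#       xpath=".//journal-meta",
#       storage_options={"anon": True},  # s3fs: send unsigned requests
#   )

# Self-contained local example of the same read_xml call, using a tiny
# stand-in XML document instead of the S3 object.
xml = """<?xml version="1.0"?>
<article>
  <front>
    <journal-meta>
      <journal-id>pmc</journal-id>
      <issn>1234-5678</issn>
    </journal-meta>
  </front>
</article>"""

# parser="etree" avoids the lxml dependency for this small demo.
df = pd.read_xml(io.StringIO(xml), xpath=".//journal-meta", parser="etree")
print(df)
```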

@JackCollins91
Contributor

Thanks for the thorough information @johnstacy, and thanks also for checking multiple machines. I'm also puzzled about why we cannot recreate it, although the suggested changes above should fix it regardless.

I'll take the step of creating a completely fresh instance on a new machine, running exactly your provided Docker container, and seeing if that changes the result.

I'd also be interested to hear anything from your investigation on --no-sign-request.

JackCollins91 added a commit to JackCollins91/pandas_jco that referenced this issue Jan 14, 2024
JackCollins91 pushed a commit to JackCollins91/pandas_jco that referenced this issue Jan 14, 2024
For each S3 bucket code block, ideally we show what the output would be, but without making an actual call. Unfortunately, for several of the S3 buckets, there are issues with the code, which we must fix in another commit or PR.

For now, the two S3 examples that do work, we edit to make the code block show what the output would have been if it had run successfully.

Find details on issues in conversation on PR pandas-dev#56592
mroeschke pushed a commit that referenced this issue Jan 15, 2024
* Update io.rst

Make consistent with other s3 bucket URL examples and avoid doc build error when problem with s3 url.

* Update io.rst

Make example consistent with other code block examples

* Update v2.3.0.rst

* imitating interactive mode

For each S3 bucket code block, ideally we show what the output would be, but without making an actual call. Unfortunately, for several of the S3 buckets, there are issues with the code, which we must fix in another commit or PR.

For now, the two S3 examples that do work, we edit to make the code block show what the output would have been if it had run successfully.

Find details on issues in conversation on PR #56592

* Update io.rst

Code still doesn't run, but at least unmatched } is no longer the issue.

* Update v2.3.0.rst

avoids unnecessary file change in PR

* Update io.rst

Rollback changes to one of the examples (out of scope)

* Update io.rst

* Update io.rst

---------

Co-authored-by: JackCollins1991 <[email protected]>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this issue May 7, 2024