Add interactive terminal to pandas website #46682

Closed
datapythonista opened this issue Apr 7, 2022 · 23 comments · Fixed by #47428

@datapythonista
Member

We discussed in the past making the pandas examples in the documentation runnable. The original idea was to use Binder, which requires a decent amount of hosting, besides setting things up on our end.

There is now a new alternative based on WebAssembly: JupyterLite. The idea is that there is no backend to run the code; instead, a WebAssembly Python interpreter in the client's browser executes it.

NumPy is already using this on their home page. The terminal takes a few seconds to load, but after that it seems to work just fine, and it already has pandas installed in the environment, so import pandas works.

It would be nice to get the same for pandas. I'd start by adding a new section for the interactive terminal to our Getting started page and see how it works. Once we've got this working fine, we can consider adding it to the home page, making examples runnable...

@bennaaym

bennaaym commented Apr 8, 2022

Hey @datapythonista, I took a look at NumPy's website, and they integrate their interactive shell by using an iframe to access the JupyterLite interactive shell. I tried to import pandas using their shell, but I ran into some errors.

  • Error 1:
    error1

  • Error 2:
    error2
    PS: the URL works fine when using jupyter notebook

Implementing the same shell as NumPy's is relatively simple. However, I'm not sure it will perform the same for pandas.

@datapythonista I'm interested in this issue and I'd like to hear your feedback on the points above.

@datapythonista
Member Author

Thanks for having a look @bennaaym, good findings. I think there are two separate problems.

  • JupyterLite is probably shipped with an environment that is as small as possible, since it needs to load in the client browser, and the bigger it is, the slower it is to load. Your first error is not an error but a warning, caused by the lzma package not being installed in the environment. Also, it looks like urllib.request.urlopen is not recognizing the https protocol, which I assume is because another package is missing.
  • If using http instead of https, I get this error: URLError: <urlopen error [Errno 23] Host is unreachable>. It feels like the WebAssembly environment is a sandbox without Internet access.
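
One way to check which of these pieces is missing is to probe the interpreter directly. The sketch below uses only the standard library. The `probe_environment` helper is a hypothetical name, and tying the missing https support to the `ssl` module is an assumption, not something confirmed in this thread.

```python
import importlib
import urllib.request


def probe_environment():
    """Report which optional standard-library pieces are usable here."""
    report = {}
    for name in ("lzma", "ssl"):
        try:
            importlib.import_module(name)
            report[name] = True
        except ImportError:
            report[name] = False
    # urllib.request only defines HTTPSHandler when SSL support is available.
    report["https_handler"] = hasattr(urllib.request, "HTTPSHandler")
    return report


print(probe_environment())
```

Running this in both a regular Python install and the browser shell would show which entries differ between the two environments.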

For the first set of errors, I think we may need to generate our own environment with what we need. I'm not sure about having access to the Internet. If that's not an option, I guess we could have a small csv file in the environment, or just build the data with the constructor as we do in most examples.

@jtpio do you have thoughts on this?

@jtpio
Contributor

jtpio commented Apr 8, 2022

Hi!

Right, urlopen might be tricky to get to work in the browser for now. Some related discussion in the Pyodide repo: pyodide/pyodide#398

But it's possible to use the fetch from the browser as an alternative:

import pandas as pd
from js import fetch

URL = "https://gist.github.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"

res = await fetch(URL)
text = await res.text()

filename = 'data.csv'

with open(filename, 'w') as f:
    f.write(text)

data = pd.read_csv(filename)  # the linked iris.csv appears to be comma-separated, so the default separator is used
data

Sample from jupyterlite/jupyterlite#119 (comment)

Which gives the following:

image

Agreed, it's not as natural as passing the URL to pandas directly, but it allows pulling data from other places.

@jtpio
Contributor

jtpio commented Apr 8, 2022

Also for the pandas docs, we could probably use jupyterlite-sphinx directly: https://jupyterlite-sphinx.readthedocs.io/en/latest/

Since the getting started page seems to be located here in this repo? https://github.com/pandas-dev/pandas/blob/main/doc/source/getting_started/index.rst

@datapythonista
Member Author

Thanks @jtpio, that's very useful to know.

There are two getting started pages, one in the website and one in the docs. For now we'll add this to the website, which is not Sphinx but Markdown: https://github.com/pandas-dev/pandas/blob/main/web/pandas/getting_started.md

Using fetch sounds good, but we can't use it directly, since what we want is to let users play a bit with a very simple pandas example. So we need to have the import and some very simple code to give them data, like the original pandas.read_csv example. Not sure if we could try to monkeypatch urllib to use fetch, so the read_csv code works.
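
As a rough illustration of the monkeypatching idea, the sketch below swaps `urllib.request.urlopen` for a replacement while `pd.read_csv` is called with a URL. It assumes pandas routes http(s) reads through `urllib.request.urlopen`, which is an implementation detail that may change. `download_text`, `FakeResponse`, `patched_urlopen` and the example URL are all hypothetical; in JupyterLite, `download_text` is where a browser-backed download would go, while here it serves an in-memory string so the sketch runs anywhere.

```python
import io
import urllib.request
from unittest import mock

import pandas as pd

# Hypothetical stand-in for remote content (made-up data, made-up URL).
FAKE_REMOTE = {"https://example.invalid/cities.csv": "city,temp\nOslo,4\nLima,19\n"}


def download_text(url):
    """Placeholder for a browser-backed download; returns the body as text."""
    return FAKE_REMOTE[url]


class FakeResponse(io.BytesIO):
    """Minimal response: readable, usable in a `with` block, has `headers`."""

    headers = {}


def patched_urlopen(request, *args, **kwargs):
    # The caller may pass a urllib Request object instead of a plain string.
    url = getattr(request, "full_url", request)
    return FakeResponse(download_text(url).encode("utf-8"))


# The user-facing code stays the regular pandas call; only urlopen is replaced.
with mock.patch.object(urllib.request, "urlopen", patched_urlopen):
    df = pd.read_csv("https://example.invalid/cities.csv")

print(df)
```

The appeal of this approach is that the example shown to users needs no changes. The cost is that it depends on pandas internals and on a synchronous download being available in the browser.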

But probably better to start with something easier. What I'd do for now is to create a simple dataframe with the constructor, so users can play a bit with it. And once that's working, we can consider improvements. What do you think @bennaaym?

@bennaaym

bennaaym commented Apr 9, 2022

@datapythonista sounds good. So something like the below example is the target for now, right?

demo

@datapythonista
Member Author

Yes, I'd start with something like that. Maybe a dataframe with a few more columns and samples, so it's still simple but users can do a groupby or use a date function. But that's the idea.
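
A minimal sketch of what such a starter dataframe could look like. The column names and values are made up for illustration and are not the data used on the website.

```python
import pandas as pd

# Small made-up dataset built with the constructor, so no file or network is needed.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima", "Oslo", "Lima"],
        "date": pd.to_datetime(
            ["2022-01-15", "2022-01-20", "2022-02-10",
             "2022-02-25", "2022-03-05", "2022-03-30"]
        ),
        "temp_c": [-3.0, 24.0, -1.0, 25.0, 1.0, 26.0],
        "rain_mm": [40, 1, 35, 0, 50, 2],
    }
)

# groupby: average temperature per city
mean_temp = df.groupby("city")["temp_c"].mean()

# date function: month name for each row
df["month"] = df["date"].dt.month_name()

print(mean_temp)
print(df)
```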

@psychemedia
Contributor

@datapythonista FWIW, I have tried to collect various workarounds for loading files etc here on the off-chance they might suggest a way to you of simplifying .read_csv() and .to_csv() demos.

@datapythonista
Member Author

Thanks @psychemedia, that looks good. The problem here is that we don't just want to read a csv, but to show users how to do it in pandas. So the code must be the one we want users to learn: the regular pandas code, with no change at all. The example will therefore be just the original failing one, and what we need to change, if anything, are the packaged files that JupyterLite is using, for example by monkeypatching the downloading of files. Or, if it makes sense (not sure it does), we could update pandas so it can use different libraries to download files from the Internet (not sure if there is any other use case where this would be useful).

@psychemedia
Contributor

@datapythonista Does pandas work out of the can to read and write CSV files in an arbitrary pyodide environment? Is that the more general case?

@datapythonista
Member Author

> @datapythonista Does pandas work out of the can to read and write CSV files in an arbitrary pyodide environment? Is that the more general case?

I assume it doesn't. We can also add the csv file to the JupyterLite distribution, so we don't need to download it from the Internet. Maybe that makes more sense if downloading is too much trouble.

@jtpio
Contributor

jtpio commented Apr 26, 2022

@bennaaym @datapythonista this looks exciting, let me know if you need any help setting that up!

@TomAugspurger
Contributor

pandas uses fsspec for all of its I/O operations other than local and http[s]. I wonder if we could implement an fsspec backend that uses pyodide APIs to do the IO? As a prototype it might use a pyodide:// protocol, but if this worked we might consider changing our implementation to use fsspec for https URLs too.

That said, I think that the fsspec APIs that pandas calls might use some threads, which weren't supported by pyodide last I looked.
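
To make the shape of that proposal concrete, here is a toy sketch of a read-only fsspec filesystem registered under a pyodide:// protocol. It requires fsspec to be installed. `PyodideFileSystem` and `FAKE_FILES` are hypothetical names, and the in-memory dictionary stands in for whatever pyodide API would actually perform the download. It does not address the threading concern.

```python
import io

import fsspec
import pandas as pd
from fsspec.spec import AbstractFileSystem

# Hypothetical in-memory store standing in for a browser-side download.
FAKE_FILES = {"sales.csv": b"item,qty\napple,3\npear,5\n"}


class PyodideFileSystem(AbstractFileSystem):
    """Toy read-only filesystem for the pyodide:// protocol."""

    protocol = "pyodide"

    def _open(self, path, mode="rb", **kwargs):
        if mode != "rb":
            raise NotImplementedError("this sketch only supports reading")
        # A real backend would call into pyodide here instead of a dict lookup.
        return io.BytesIO(FAKE_FILES[self._strip_protocol(path)])


# Make the protocol known to fsspec for this session.
fsspec.register_implementation("pyodide", PyodideFileSystem, clobber=True)

with fsspec.open("pyodide://sales.csv", "rb") as f:
    df = pd.read_csv(f)

print(df)
```

The sketch opens the file through fsspec explicitly and hands the file object to pandas, which avoids relying on how pandas dispatches URLs internally.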

cc @martindurant from fsspec.

@datapythonista
Member Author

I tested this a bit further, and it looks like fetch is restricted by CORS, so only requests to the pandas domain would work. See this code:

from js import fetch

resp = await fetch('https://<third-party-domain>')
content = await resp.text()
content

Raises this exception:

---------------------------------------------------------------------------
JsException                               Traceback (most recent call last)
Input In [7], in <cell line: 5>()
      1 from js import fetch
      3 URL = "https://pandas.pydata.org/"
----> 5 res = await fetch(URL)
      6 text = await res.text()

File /lib/python3.10/asyncio/futures.py:284, in Future.__await__(self)
    282 if not self.done():
    283     self._asyncio_future_blocking = True
--> 284     yield self  # This tells Task to wait for completion.
    285 if not self.done():
    286     raise RuntimeError("await wasn't used with future")

File /lib/python3.10/asyncio/tasks.py:304, in Task.__wakeup(self, future)
    302 def __wakeup(self, future):
    303     try:
--> 304         future.result()
    305     except BaseException as exc:
    306         # This may also be a cancellation.
    307         self.__step(exc)

File /lib/python3.10/asyncio/futures.py:201, in Future.result(self)
    199 self.__log_traceback = False
    200 if self._exception is not None:
--> 201     raise self._exception
    202 return self._result

JsException: TypeError: Failed to fetch

The example works fine when the URL is in the same domain as the page. I'd be +1 on making pandas IO functions work with fetch if this worked for any URL. But to me, it doesn't seem worth making pandas more complex just to have a page in the website with runnable code.

@ambit741235WHJR

Hi, can I contribute to this issue, if that's not a problem?

Also, could you please guide me on this issue, as I'm new here?

@psychemedia
Contributor

I note that the JupyterLite PR jupyterlite/jupyterlite#655 means that pandas can now read and write CSV files into the JupyterLite filesystem. This means that demos that ship CSV data files should work with the current pandas package.

The simplest way I've found to download files is to use something like the following, although this does require the use of the pyodide package:

import pyodide

with open("test.txt", "w") as f:
    f.write(pyodide.open_url(URL).read())  # URL: address of the file to download

@jtpio
Contributor

jtpio commented Jun 15, 2022

> I note that the JupyterLite PR jupyterlite/jupyterlite#655 means that pandas can now read and write CSV files into the JupyterLite filesystem. This means that demos that ship CSV data files should work with the current pandas package.

Yes this will make it possible to use pd.read_csv() directly.
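
For illustration, this is the kind of unmodified pandas code that becomes possible once files are readable from the kernel's filesystem. The sketch writes a small made-up CSV into a temporary directory to stand in for a file shipped with the deployment. The file names and contents are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a data file bundled with the deployment (made-up contents).
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "example.csv"
csv_path.write_text("name,score\nada,9\ngrace,7\n")

# Regular pandas calls, with no fetch or other workaround.
df = pd.read_csv(csv_path)
df["passed"] = df["score"] >= 8
df.to_csv(workdir / "result.csv", index=False)

print(pd.read_csv(workdir / "result.csv"))
```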

Will make a new release when this PR gets merged, and will let you know here.

@jtpio
Contributor

jtpio commented Jun 17, 2022

FYI a new release of jupyterlite is out with support for opening files from the Python kernel: https://github.com/jupyterlite/jupyterlite/releases/tag/v0.1.0b9

image

See this example notebook for more information: https://github.com/jupyterlite/jupyterlite/blob/main/examples/pyolite/virtual_drive.ipynb

@jtpio
Contributor

jtpio commented Jun 17, 2022

Linking to the repo / JupyterLite deployment used for the Sympy website (https://www.sympy.org/en/shell.html) for reference: https://github.com/sympy/live. They enable some optimizations to only deploy the REPL app which could be relevant for pandas as well.

There could be a similar repo in the pandas-dev organization on GitHub so you have full control on the settings and the content (example notebooks, csv files).

@hamedgibago

take

@hamedgibago

I ran the fetch-based example that @jtpio posted above. On the first call, JupyterLite returned a network error, but on the second call it returned the data correctly. Is that expected? Another problem is that JupyterLite is slow.
image

@jtpio
Contributor

jtpio commented Jun 20, 2022

Thanks @hamedgibago for working on this.

With the latest release of JupyterLite it shouldn't be necessary to manually fetch the file anymore (see #46682 (comment)).

Probably the simplest would be to create a custom JupyterLite deployment as detailed in https://jupyterlite.readthedocs.io/en/latest/quickstart/deploy.html. And then add the example csv file to the contents.

@jtpio
Contributor

jtpio commented Jun 20, 2022

Just opened #47428 to get this started.
