Feature to read csv from hdfs:// URL #18199
Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole. If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.
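For concreteness, a minimal sketch of what an hdfs3-based read could look like, assuming a reachable namenode; the host, port, and path below are placeholders, not anything pandas ships:

```python
import pandas as pd
from hdfs3 import HDFileSystem

# Placeholder connection details; real values come from the cluster config.
hdfs = HDFileSystem(host="namenode", port=8020)

# hdfs3 file objects are file-like, so read_csv can consume them directly.
with hdfs.open("/tmp/example.csv", "rb") as f:
    df = pd.read_csv(f)
```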
How should this be implemented? Should there also be a check analogous to `_is_s3_url` (pandas/io/common.py, line 94 at 2f9d4fb)?
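As a rough illustration, such a helper might simply mirror `_is_s3_url`; the name `_is_hdfs_url` is hypothetical here, not a function pandas actually has:

```python
from urllib.parse import urlparse


def _is_hdfs_url(url) -> bool:
    """Check for an hdfs:// URL (hypothetical helper, mirroring _is_s3_url)."""
    try:
        return urlparse(url).scheme == "hdfs"
    except Exception:
        return False
```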
I believe we can use either hdfs3 (similar to s3fs) and/or pyarrow for this; it would be similar to the way we handle s3 at the moment.
So, here is a quick comparison: these seem to be the active options (each had its latest release in 2017). Since pandas already has a pyarrow engine for Parquet, using pyarrow with the native libhdfs looks like the most universal option.
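For comparison with the hdfs3 sketch above, the pyarrow route via its (legacy) libhdfs binding might look like this; again, host, port, and path are placeholders:

```python
import pandas as pd
import pyarrow as pa

# pa.hdfs.connect is pyarrow's legacy libhdfs binding (deprecated in
# recent releases in favor of pyarrow.fs.HadoopFileSystem).
fs = pa.hdfs.connect(host="namenode", port=8020)

with fs.open("/tmp/example.csv", "rb") as f:
    df = pd.read_csv(f)
```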
This would be a great feature to have in pandas. Is it still being worked on?
@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort.
Hi @jreback, I want to work on this PR.
I am not sure what the appropriate library for reading from HDFS is; otherwise this would be similar to how we implement other readers, e.g. GCS.
Tests can be done in a similar manner to here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_gcs.py
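A mocked test in the spirit of test_gcs.py could look roughly like the sketch below, once hdfs:// support exists. The patch target depends on which HDFS library the implementation ends up using; `hdfs3.HDFileSystem` is just an assumption here:

```python
import io

import pandas as pd
import pandas._testing as tm


def test_read_csv_hdfs(monkeypatch):
    df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    class MockHDFileSystem:
        def __init__(self, *args, **kwargs):
            pass

        def open(self, path, mode="rb"):
            # Serve the expected CSV regardless of the requested path.
            return io.BytesIO(df1.to_csv(index=False).encode("utf-8"))

    # Hypothetical patch target; adjust to whatever library is chosen.
    monkeypatch.setattr("hdfs3.HDFileSystem", MockHDFileSystem, raising=False)

    df2 = pd.read_csv("hdfs://namenode/tmp/example.csv")
    tm.assert_frame_equal(df1, df2)
```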
This is how I managed to read from HDFS:

```python
import os

import pandas as pd
import pydoop.hdfs as hd

# Point pydoop at the cluster's Hadoop configuration.
os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)
```
Here you can find some code samples.
Hey @jreback
Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around for a while, but fsspec will need its shim rewritten against the newer filesystem API that pyarrow is building, once that is stable. Hopefully, this shouldn't affect users.
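For completeness, a sketch against that newer pyarrow filesystem API; host, port, and path are placeholders, and the exact wiring through fsspec's shim may differ:

```python
import pandas as pd
from pyarrow import fs

# The newer-generation API that replaces pa.hdfs.connect.
hdfs = fs.HadoopFileSystem("namenode", port=8020)

# open_input_stream returns a file-like object that read_csv can consume.
with hdfs.open_input_stream("/tmp/example.csv") as f:
    df = pd.read_csv(f)
```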
Hi @jreback
We don't run any containers as part of the CI; this is just mocked, which I think is fine. If we really want full testing, we would need to spin up a new Azure job for this (not against it, but a bit of overkill). That said, if you want to, go ahead.
Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131
When running pandas in AWS, the following works perfectly fine:
But running the following does not:
It would be a good user experience to allow the hdfs:// scheme too, similar to how http, ftp, s3, and file are valid schemes right now.
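The original snippets did not survive the page capture; a hedged reconstruction of the contrast, with made-up bucket, host, and paths:

```python
import pandas as pd

# Works today: s3:// URLs are dispatched to s3fs under the hood.
df = pd.read_csv("s3://my-bucket/data/file.csv")

# Fails: hdfs:// is not a recognized scheme, so this raises an error.
df = pd.read_csv("hdfs://namenode:8020/data/file.csv")
```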