Add HDFS reading #18568
pandas/io/hdfs.py (Outdated)

    except:
        raise ImportError("The hdfs3 library is required to handle hdfs files")

    if compat.PY3:
these should be moved to pandas/compat/__init__.py
(and same change in pandas.io.s3fs)
pandas/io/parquet.py (Outdated)

@@ -25,6 +25,10 @@ def get_engine(engine):
    except ImportError:
        pass

    raise ImportError("unable to find a usable engine\n"
                      "tried using: pyarrow, fastparquet\n\n"
why are you changing this? IIRC this is a separate issue; make it a separate PR
""" s3 support for remote file interactivity """
from pandas import compat
try:
    import hdfs3
you need to add hdfs3 to some of the builds in pandas/ci (putting it on the same builds as s3fs is fine)
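Concretely, the CI change could be as small as adding hdfs3 next to s3fs in one of the existing requirements files under ci/. The file name below is illustrative; the actual file is whichever build already installs s3fs.

```
# ci/requirements-3.6.run  (illustrative file name)
s3fs
hdfs3
```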
except ImportError:
    need_text_wrapping = (BytesIO,)
    pass
you need
need_text_wrapping = tuple(need_text_wrapping)
before handles
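The reviewer's point can be sketched in isolation: `need_text_wrapping` is accumulated as a list because optional libraries contribute their file classes conditionally, but it must be converted to a tuple before use, since `isinstance()` accepts a tuple of types, not a list. Stdlib classes stand in here for the s3fs/hdfs3 file classes of the real diff.

```python
from io import BufferedReader, BytesIO

# Built up as a list because entries are appended conditionally
# (in the real diff: s3fs's and hdfs3's file classes, when installed).
need_text_wrapping = [BytesIO]
need_text_wrapping.append(BufferedReader)

# isinstance() requires a tuple of types; a list raises TypeError.
need_text_wrapping = tuple(need_text_wrapping)

assert isinstance(BytesIO(b"data"), need_text_wrapping)
```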
doc/source/whatsnew/v0.22.0.txt (Outdated)

@@ -155,6 +155,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
- :func:`read_csv` now supports reading from hdfs by giving a URL of the form "hdfs:///tmp/data.csv". The hadoop configs are found automatically where possible; a namenode can also be specified using the format "hdfs://namenodehost:namenodeport/tmp/data.csv"
add in the Other Enhancements section. Add documentation in io.rst (where the s3 docs are found)
@AbdealiJK I cherry-picked your first commit about the parquet error message (as this was an easy change) and merged that in a separate PR (#18717). Can you update the hdfs part according to the comments?
Thanks Jeff. Modifying this is on my ToDo; a bit pre-occupied at the moment. Will do it soon.
Now the following will work:

- If hdfs3 is not installed, throws: `ImportError: The hdfs3 library is required to handle hdfs files`
- If hdfs3 is installed but libhdfs3 is not installed, throws: `ImportError: Can not find the shared library: libhdfs3.so`
- If hdfs3 is installed, this works: `pd.read_csv("hdfs://localhost:9000/tmp/a.csv")`
- If hdfs3 is installed and HADOOP_CONF_DIR is set, this works: `HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/` with `pd.read_csv("hdfs:///tmp/a.csv")`
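The two URL forms above differ only in whether a namenode is embedded in the URL. A hypothetical sketch (not the PR's actual code) of how the pieces fall out of `urlparse`; an empty host signals falling back to the configuration found via HADOOP_CONF_DIR:

```python
from urllib.parse import urlparse


def split_hdfs_url(url):
    """Split an hdfs:// URL into (host, port, path).

    host and port come back as None for the "hdfs:///tmp/a.csv" form,
    meaning: use the hadoop configuration discovered via HADOOP_CONF_DIR.
    Illustrative helper, not the PR's actual implementation.
    """
    parsed = urlparse(url)
    if parsed.scheme != "hdfs":
        raise ValueError("not an hdfs:// URL: %r" % url)
    return parsed.hostname, parsed.port, parsed.path


# Explicit namenode form:   split_hdfs_url("hdfs://localhost:9000/tmp/a.csv")
# Config-discovery form:    split_hdfs_url("hdfs:///tmp/a.csv")
```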
Force-pushed from 21f8326 to a55e848.
Hello @AbdealiJK! Thanks for updating the PR.
For the unit test: I need some help in setting up HDFS inside Travis. Pandas does not seem to have its own docker setup, so I am a little confused about this. Manually, I've tested the following:
Codecov Report
@@ Coverage Diff @@
## master #18568 +/- ##
==========================================
- Coverage 91.64% 91.58% -0.06%
==========================================
Files 154 155 +1
Lines 51428 51459 +31
==========================================
+ Hits 47129 47131 +2
- Misses 4299 4328 +29
Continue to review full report at Codecov.
If you can rebase: this would require a testing setup similar to this: https://github.com/dask/dask/tree/master/continuous_integration/hdfs
Closing as stale. Nice idea, but we have to be able to automatically test this.
Add HDFS reading using libhdfs3

- closes #18199 (Feature to read csv from hdfs:// URL)
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry