Add HDFS reading #18568

Closed
wants to merge 1 commit into from

Conversation

AbdealiLoKo
Contributor

@AbdealiLoKo AbdealiLoKo commented Nov 29, 2017

except:
raise ImportError("The hdfs3 library is required to handle hdfs files")

if compat.PY3:
Contributor

these should be moved to pandas/compat/__init__.py (and same change in pandas.io.s3fs)
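The guard the reviewer wants centralised could look roughly like this (a minimal sketch; the helper name import_hdfs3 and its placement in pandas/compat/__init__.py are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of centralising the optional-dependency import guard
# in pandas/compat/__init__.py, as the reviewer suggests.
def import_hdfs3():
    try:
        import hdfs3
    except ImportError:
        raise ImportError("The hdfs3 library is required to handle hdfs files")
    return hdfs3

# callers in pandas/io/hdfs.py would then do something like:
# hdfs3 = compat.import_hdfs3()
```

This keeps the error message in one place, so pandas.io.s3 and pandas.io.hdfs can share the same pattern.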

@@ -25,6 +25,10 @@ def get_engine(engine):
except ImportError:
pass

raise ImportError("unable to find a usable engine\n"
"tried using: pyarrow, fastparquet\n\n"
Contributor

why are you changing this? IIRC this is a separate issue, make it a separate PR

""" hdfs support for remote file interactivity """
from pandas import compat
try:
import hdfs3
Contributor

you need to add hdfs3 to some builds in pandas/ci (putting it on the same builds as s3fs is fine)

except ImportError:
need_text_wrapping = (BytesIO,)
pass

Contributor

you need

need_text_wrapping = tuple(need_text_wrapping)

before handles (isinstance requires a class or tuple of classes, not a list)
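A minimal sketch of why the tuple() conversion matters (the hdfs3 HDFile handling here is an assumption based on the PR discussion, not the merged pandas code):

```python
from io import BytesIO

# Build the set of file types that need text wrapping as a list,
# appending optional-dependency types only when they import cleanly.
need_text_wrapping = [BytesIO]

try:
    from s3fs import S3File
    need_text_wrapping.append(S3File)
except ImportError:
    pass

try:
    from hdfs3 import HDFile  # assumed import path
    need_text_wrapping.append(HDFile)
except ImportError:
    pass

# isinstance() rejects lists, so convert before any isinstance checks.
need_text_wrapping = tuple(need_text_wrapping)

print(isinstance(BytesIO(), need_text_wrapping))  # True
```

Without the final conversion, `isinstance(buf, need_text_wrapping)` would raise `TypeError: isinstance() arg 2 must be a type or tuple of types`.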

@@ -155,6 +155,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
- :func:`read_csv` now supports reading from HDFS via URLs such as "hdfs:///tmp/data.csv". The Hadoop configuration is discovered automatically; the namenode can also be given explicitly, using the format "hdfs://namenodehost:namenodeport/tmp/data.csv"
Contributor

add this in the Other Enhancements section. Add documentation in io.rst (where the s3 docs are found)

@jreback jreback added Enhancement IO Network Local or Cloud (AWS, GCS, etc.) IO Issues labels Nov 29, 2017
@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Dec 10, 2017
@jorisvandenbossche
Member

jorisvandenbossche commented Dec 10, 2017

@AbdealiJK I cherry-picked your first commit about the parquet error message (as this was an easy change) and merged that in a separate PR (#18717).

Can you update the hdfs part according to the comments?

@jorisvandenbossche jorisvandenbossche removed the IO Parquet parquet, feather label Dec 10, 2017
@AbdealiLoKo
Contributor Author

AbdealiLoKo commented Dec 11, 2017 via email

Now the following will work:

If hdfs3 is not installed, it throws:
  ImportError: The hdfs3 library is required to handle hdfs files

If hdfs3 is installed but libhdfs3 is not installed, it throws:
  ImportError: Can not find the shared library: libhdfs3.so

If hdfs3 is installed, this works:
  pd.read_csv("hdfs://localhost:9000/tmp/a.csv")

If hdfs3 is installed and HADOOP_CONF_DIR is set, this works:
  HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/
  pd.read_csv("hdfs:///tmp/a.csv")
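The URL dispatch presumably mirrors pandas' existing s3 handling, keying on the URL scheme; a minimal sketch of that check (the function name is_hdfs_url is an assumption, not necessarily the PR's exact code):

```python
from urllib.parse import urlparse

def is_hdfs_url(url):
    """Return True if *url* uses the hdfs:// scheme."""
    try:
        return urlparse(url).scheme == "hdfs"
    except (TypeError, AttributeError):
        return False

# Both URL forms from the comment above are recognised:
print(is_hdfs_url("hdfs://localhost:9000/tmp/a.csv"))  # True
print(is_hdfs_url("hdfs:///tmp/a.csv"))                # True (host taken from HADOOP_CONF_DIR)
print(is_hdfs_url("/tmp/a.csv"))                       # False
```

Note that "hdfs:///tmp/a.csv" (empty netloc) is still a valid hdfs URL, which is what lets the Hadoop config supply the namenode.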
@pep8speaks

Hello @AbdealiJK! Thanks for updating the PR.

Line 106:5: E722 do not use bare except'

Line 5:1: E722 do not use bare except'

@AbdealiLoKo AbdealiLoKo changed the title Fix parquet error and add HDFS reading Add HDFS reading Dec 16, 2017
@AbdealiLoKo
Contributor Author

For the unit test: I need some help in setting up HDFS inside Travis.
The way other orgs are doing this is by using their own Docker images: see hdfs3's CI for an example.

Pandas does not seem to have its own Docker setup ... so I'm a little confused about this.

Manually I've tested the following:

  • If hdfs3 is not installed, it throws:
    ImportError: The hdfs3 library is required to handle hdfs files
  • If hdfs3 is installed but libhdfs3 is not installed, it throws:
    ImportError: Can not find the shared library: libhdfs3.so
    See installation instructions at http://hdfs3.readthedocs.io/en/latest/install.html
  • If hdfs3 is installed, this works: pd.read_csv("hdfs://localhost:9000/tmp/a.csv")
  • If hdfs3 is installed, pd.read_csv("hdfs:///tmp/a.csv") works once export HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/ has been run

@codecov

codecov bot commented Dec 16, 2017

Codecov Report

Merging #18568 into master will decrease coverage by 0.05%.
The diff coverage is 41.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18568      +/-   ##
==========================================
- Coverage   91.64%   91.58%   -0.06%     
==========================================
  Files         154      155       +1     
  Lines       51428    51459      +31     
==========================================
+ Hits        47129    47131       +2     
- Misses       4299     4328      +29
Flag Coverage Δ
#multiple 89.45% <41.02%> (-0.04%) ⬇️
#single 40.82% <23.07%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/hdfs.py 0% <0%> (ø)
pandas/io/s3.py 88.23% <100%> (+3.23%) ⬆️
pandas/io/common.py 70.11% <70.58%> (+0.62%) ⬆️
pandas/compat/__init__.py 58.69% <75%> (-0.08%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c28b624...a55e848.

@jreback
Contributor

jreback commented Feb 1, 2018

If you can rebase: this would require a testing setup similar to this:

https://github.com/dask/dask/tree/master/continuous_integration/hdfs

@jreback
Contributor

jreback commented Mar 16, 2018

closing as stale. nice idea, but we have to be able to automatically test this.

@jreback jreback closed this Mar 16, 2018
@jreback jreback added this to the No action milestone Mar 16, 2018

Successfully merging this pull request may close these issues.

Feature to read csv from hdfs:// URL
4 participants