Add HDFS reading #18568

Closed
wants to merge 1 commit into from

Conversation

AbdealiLoKo
Contributor

@AbdealiLoKo AbdealiLoKo commented Nov 29, 2017

except:
raise ImportError("The hdfs3 library is required to handle hdfs files")

if compat.PY3:
Contributor

these should be moved to pandas/compat/__init__.py (and same change in pandas.io.s3fs)
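The guard the reviewer wants centralised could look roughly like this (a minimal sketch; the helper name import_hdfs3 and its placement in pandas/compat/__init__.py are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of centralising the optional-dependency import guard
# in pandas/compat/__init__.py, as the reviewer suggests.
def import_hdfs3():
    try:
        import hdfs3
    except ImportError:
        raise ImportError("The hdfs3 library is required to handle hdfs files")
    return hdfs3

# callers in pandas/io/hdfs.py would then do something like:
# hdfs3 = compat.import_hdfs3()
```

This keeps the error message in one place, so pandas.io.s3 and pandas.io.hdfs can share the same pattern.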

@@ -25,6 +25,10 @@ def get_engine(engine):
except ImportError:
pass

raise ImportError("unable to find a usable engine\n"
"tried using: pyarrow, fastparquet\n\n"
Contributor

why are you changing this? IIRC this is a separate issue, make it a separate PR

""" hdfs support for remote file interactivity """
from pandas import compat
try:
import hdfs3
Contributor

you need to add hdfs3 to some builds in pandas/ci (putting it on the same builds as s3fs is fine)

except ImportError:
need_text_wrapping = (BytesIO,)
pass

Contributor

you need

need_text_wrapping = tuple(need_text_wrapping)

before handles (isinstance requires a class or tuple of classes, not a list)
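A minimal sketch of why the tuple() conversion matters (the hdfs3 HDFile handling here is an assumption based on the PR discussion, not the merged pandas code):

```python
from io import BytesIO

# Build the set of file types that need text wrapping as a list,
# appending optional-dependency types only when they import cleanly.
need_text_wrapping = [BytesIO]

try:
    from s3fs import S3File
    need_text_wrapping.append(S3File)
except ImportError:
    pass

try:
    from hdfs3 import HDFile  # assumed import path
    need_text_wrapping.append(HDFile)
except ImportError:
    pass

# isinstance() rejects lists, so convert before any isinstance checks.
need_text_wrapping = tuple(need_text_wrapping)

print(isinstance(BytesIO(), need_text_wrapping))  # True
```

Without the final conversion, `isinstance(buf, need_text_wrapping)` would raise `TypeError: isinstance() arg 2 must be a type or tuple of types`.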

@@ -155,6 +155,7 @@ I/O
^^^

- :func:`read_html` now rewinds seekable IO objects after parse failure, before attempting to parse with a new parser. If a parser errors and the object is non-seekable, an informative error is raised suggesting the use of a different parser (:issue:`17975`)
- :func:`read_csv` now supports reading from HDFS via URLs such as "hdfs:///tmp/data.csv". The Hadoop configuration is discovered automatically; the namenode can also be given explicitly, using the format "hdfs://namenodehost:namenodeport/tmp/data.csv"
Contributor

add this in the Other Enhancements section. Add documentation in io.rst (where the s3 docs are found)

@jreback jreback added Enhancement IO Network Local or Cloud (AWS, GCS, etc.) IO Issues labels Nov 29, 2017
@jorisvandenbossche jorisvandenbossche added the IO Parquet parquet, feather label Dec 10, 2017
@jorisvandenbossche
Member

jorisvandenbossche commented Dec 10, 2017

@AbdealiJK I cherry-picked your first commit about the parquet error message (as this was an easy change) and merged that in a separate PR (#18717).

Can you update the hdfs part according to the comments?

@jorisvandenbossche jorisvandenbossche removed the IO Parquet parquet, feather label Dec 10, 2017
@AbdealiLoKo
Contributor Author

AbdealiLoKo commented Dec 11, 2017 via email

Now the following will work:

If hdfs3 is not installed, it throws:
  ImportError: The hdfs3 library is required to handle hdfs files

If hdfs3 is installed but libhdfs3 is not installed, it throws:
  ImportError: Can not find the shared library: libhdfs3.so

If hdfs3 is installed, this works:
  pd.read_csv("hdfs://localhost:9000/tmp/a.csv")

If hdfs3 is installed and HADOOP_CONF_DIR is set, this works:
  HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/
  pd.read_csv("hdfs:///tmp/a.csv")
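The URL dispatch presumably mirrors pandas' existing s3 handling, keying on the URL scheme; a minimal sketch of that check (the function name is_hdfs_url is an assumption, not necessarily the PR's exact code):

```python
from urllib.parse import urlparse

def is_hdfs_url(url):
    """Return True if *url* uses the hdfs:// scheme."""
    try:
        return urlparse(url).scheme == "hdfs"
    except (TypeError, AttributeError):
        return False

# Both URL forms from the comment above are recognised:
print(is_hdfs_url("hdfs://localhost:9000/tmp/a.csv"))  # True
print(is_hdfs_url("hdfs:///tmp/a.csv"))                # True (host taken from HADOOP_CONF_DIR)
print(is_hdfs_url("/tmp/a.csv"))                       # False
```

Note that "hdfs:///tmp/a.csv" (empty netloc) is still a valid hdfs URL, which is what lets the Hadoop config supply the namenode.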
@pep8speaks

Hello @AbdealiJK! Thanks for updating the PR.

Line 106:5: E722 do not use bare except'

Line 5:1: E722 do not use bare except'

@AbdealiLoKo AbdealiLoKo changed the title Fix parquet error and add HDFS reading Add HDFS reading Dec 16, 2017
@AbdealiLoKo
Contributor Author

For the unit test: I need some help in setting up HDFS inside Travis.
The way other orgs are doing this is by using their own Docker images: see hdfs3's CI for an example.

Pandas does not seem to have its own Docker setup ... so I'm a little confused about this.

Manually I've tested the following:

  • If hdfs3 is not installed, it throws:
    ImportError: The hdfs3 library is required to handle hdfs files
  • If hdfs3 is installed but libhdfs3 is not installed, it throws:
    ImportError: Can not find the shared library: libhdfs3.so
    See installation instructions at http://hdfs3.readthedocs.io/en/latest/install.html
  • If hdfs3 is installed, this works: pd.read_csv("hdfs://localhost:9000/tmp/a.csv")
  • If hdfs3 is installed, pd.read_csv("hdfs:///tmp/a.csv") works once export HADOOP_CONF_DIR=/usr/local/Cellar/hadoop/2.7.0/libexec/etc/hadoop/ has been run

@codecov

codecov bot commented Dec 16, 2017

Codecov Report

Merging #18568 into master will decrease coverage by 0.05%.
The diff coverage is 41.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18568      +/-   ##
==========================================
- Coverage   91.64%   91.58%   -0.06%     
==========================================
  Files         154      155       +1     
  Lines       51428    51459      +31     
==========================================
+ Hits        47129    47131       +2     
- Misses       4299     4328      +29
Flag Coverage Δ
#multiple 89.45% <41.02%> (-0.04%) ⬇️
#single 40.82% <23.07%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/io/hdfs.py 0% <0%> (ø)
pandas/io/s3.py 88.23% <100%> (+3.23%) ⬆️
pandas/io/common.py 70.11% <70.58%> (+0.62%) ⬆️
pandas/compat/__init__.py 58.69% <75%> (-0.08%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c28b624...a55e848.

@jreback
Contributor

jreback commented Feb 1, 2018

If you can rebase: this would require a testing setup similar to this:

https://github.com/dask/dask/tree/master/continuous_integration/hdfs

@jreback
Contributor

jreback commented Mar 16, 2018

closing as stale. nice idea, but we have to be able to automatically test this.

@jreback jreback closed this Mar 16, 2018
@jreback jreback added this to the No action milestone Mar 16, 2018

Successfully merging this pull request may close these issues.

Feature to read csv from hdfs:// URL
4 participants