DISCUSS: move online data reader subpackages to separate project? #8961
I was trying to think through the workflow of such a migration, and it hit me that there's one thing all those subprojects to be moved away get for free under the umbrella of pandas: the utilities, aka the boring stuff, like setup scripts, docs, tests, benchmarks & CI infrastructure. Maybe we could think of a way to move crucial parts of those into a subrepo(-s) so that a fix in one place can be easily propagated everywhere. Also, as food for thought, I think there are a lot of non-remote
I am not in favor of moving any core IO packages, e.g. stata, clipboard, excel. One of the main points of pandas is that it provides a consistent interface for IO.
FTR,
@immerrr hah.. I knew you were going to say that. However, these are system-level for the most part and generally included on Linux/Mac/Windows (except PyQt). So not that big of a deal.
Also, it depends on what you call "core". I always thought of pandas core as being containers + time utils + split/apply/combine + high-perf queries/computations. The rest could (should) just feed data in or out via a high-throughput API. That being said, I agree that a unified API is good, so it should be sorted out before starting to pull the pandas codebase apart. And yes, for practicality reasons, the most frequently used IO modules could be considered core too; I just don't have the data to figure out which of them are most popular.
Maybe we should limit the discussion first to the remote data readers? As the other
I think absolutely these sub-projects should use the existing tests (and pandas testing infrastructure).
It's not obvious from the phrasing — I hope you mean that they should re-use the infrastructure, e.g. the build matrix, but that they shouldn't run the tests of core pandas, nor should core pandas run the tests of its "plugin" packages. Which brings me to another issue, which I'll describe in a separate GH issue.

Speaking of the build matrix, I'd expect these packages to be tested against pandas master and at least the last pandas major release. Or maybe two. The second one would ensure that when rushing out a fix for a last-minute incompatible change in a newly released core version, one doesn't break the latest major release from just a few days earlier and force users to upgrade both pandas-io-foo and pandas in lockstep. But then again, if one wants to use a bleeding-edge version of the pandas-io-foo package, they should be able to do the same for pandas itself.

Another issue is that I don't know of a way to include one YAML file inside another, which combines badly with the fact that .travis.yml must reside in the repository root. I'm not yet sure how to make travis/appveyor.yml generic enough to require minimal (ideally, none whatsoever?) attention after a project is bootstrapped, and yet rely on some common submodule so that exact versions can be changed with a single commit to a "shared" repo and a single submodule update. Symlinking .travis.yml to a submodule might do, but I'm not sure that works on Windows.
Hello, I've been working on a more object-oriented approach to DataReader than the actual code. See this: https://github.com/femtotrader/pandas_datareaders This is just a friendly "fork", please excuse me. I would be very pleased to have your feedback about this.
+1, it sucks that someone (stuck in the dark ages) using 0.10 or something can't use data readers (if something's changed in the API since then) without upgrading and potentially breaking/changing lots of their code. I suspect for the most part the datafeed codebase isn't using any wild/deprecated commands that aren't tested elsewhere in pandas... so IMO testing against master isn't actually that important (and not doing so makes things much easier), just pin the pandas version - e.g. For backwards compat these have to be dependencies of (and depend on*) pandas, right? I think the easiest is to add this right here in the pydata umbrella group, or create another group here; that way it can seem more "official". *though it may be interesting to lift that restriction, and have pandas as a soft dependency (like https://github.com/changhiskhan/poseidon does). Does this make sense? @femtotrader did you copy and paste the classes and tests from pandas, or is this something different? (I worry a little about changing to a requests dependency at the same time as migrating, but +1 on using requests).
@hayd this is something different, as (except for Yahoo Options) in data.py https://github.com/pydata/pandas/blob/master/pandas/io/data.py almost everything is a function... no classes. Some people consider that we shouldn't depend on
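The class-based design being contrasted with the function-only `data.py` isn't quoted here; a minimal offline sketch of what such a layer could look like is below. `DataReaderBase` and `CannedReader` are illustrative names, not the actual `pandas_datareaders` API, and the canned rows stand in for a network fetch.

```python
class DataReaderBase:
    """Sketch: shared plumbing lives in a base class; each data source
    subclasses it instead of reimplementing fetch/parse as free functions."""

    def __init__(self, symbols):
        self.symbols = symbols

    def fetch(self):
        raise NotImplementedError  # real readers would hit the network here

    def parse(self, raw):
        raise NotImplementedError  # per-source response parsing

    def read(self):
        # the single public entry point, shared by all sources
        return self.parse(self.fetch())


class CannedReader(DataReaderBase):
    """Stand-in 'source' returning canned rows so the sketch runs offline."""

    def fetch(self):
        return [("2014-12-01", 10.0), ("2014-12-02", 10.5)]

    def parse(self, raw):
        return dict(raw)  # date -> close


quotes = CannedReader("FAKE").read()
print(quotes["2014-12-02"])  # 10.5
```

A real source would only override `fetch`/`parse`, so URL handling, retries, and caching could live once in the base class.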
One important consideration is the recruiting effect on newer contributors. I have points on each of the three, speaking as a rookie myself. I comment on the docs as well, since @jreback pulled that into this thread.

Carve out Documentation
My recommendation: if anything gets carved out, documentation should be the highest priority.

Carve out Data Sources
My recommendation: add a note to the docs advising that maintainers are wanted. And, as they step up, the sources can be carved out one by one. For instance, I could likely learn, and volunteer, to be a maintainer for the World Bank stuff, but I wouldn't necessarily want to be a maintainer for the other stuff.

Carve out IO
My recommendation: it has its benefits, but I think the costs outweigh them. I think the "dark ages" reference above has merit, but it is trumped by how complicated pandas would become.
@jnmclarty The problem with carving out docs is IMO that docs and code are "hard-linked" (changes in the code API mean changing the docs... at the same time). I think @jreback is talking about docs for these sub-projects (not pandas docs in general/core). Note: you can change the docs without compiling; you can even do it within GitHub! I envision users of older pandas (dark ages) being able to monkey-patch (depending on their code):
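The monkey-patch snippet itself didn't survive in this thread; one way it could work is sketched below. `oldpkg` stands in for an old pandas and a stub module stands in for the split-out readers, so the sketch runs without either installed; a real shim would assign `pandas_datareader`'s module into `sys.modules` under `"pandas.io.data"`.

```python
import importlib
import sys
import types

# Stand-in parent packages, so this runs without pandas installed.
sys.modules.setdefault("oldpkg", types.ModuleType("oldpkg"))
sys.modules.setdefault("oldpkg.io", types.ModuleType("oldpkg.io"))

# The replacement module (stand-in for the separately released readers).
shim = types.ModuleType("oldpkg.io.data")
shim.DataReader = lambda symbol, source: f"{symbol} from {source} via new readers"

# The monkey-patch itself: later imports of oldpkg.io.data resolve to the shim.
sys.modules["oldpkg.io.data"] = shim

data = importlib.import_module("oldpkg.io.data")  # returns the shim
print(data.DataReader("AAPL", "yahoo"))
# AAPL from yahoo via new readers
```

The same trick works because the import system checks `sys.modules` before looking for files on disk, which is what lets an external package shadow a stale bundled module.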
This reminds me of @jreback's presentation "pandas is pydata middleware". Ripping it out and seeing what happens: https://github.com/hayd/pandas_data_readers (it works without change for pandas 0.13-0.15; there are a few compat import issues pre-0.13, see https://travis-ci.org/hayd/pandas_data_readers/builds/43218562). Edit: it may be easy to also break apart the compat module - that would make fixing earlier versions easy - not sure if that's feasible though.
@hayd Fair point on the docs. I admit I'm relatively new to OS planning concepts like this. I do find it all pretty interesting.
I don't think *Excluding big-endian issues, which are probably not important for users of big-endian systems, assuming these exist.
For doc "Read the docs" is very convenient https://readthedocs.org/. With webhook you can compile doc on server side when you commit changes. |
@femtotrader and for a
@jreback how would readthedocs for a datareader package sit in the pandas docs (url-wise)? Tbh we could just keep the docs as they are on the pandas side; mostly, updates to datareaders are just fixes to external APIs? From the (pretty trivial) exercise I did splitting out datareaders, I think we should split it - it means people can use "old" pandas and up-to-date (working) datareaders. I think we should do the same for... Not sure how this fits with @femtotrader's thoughts?
You really want my opinion ;-) But yes, I think we should split datareaders into a separate GitHub project, a separate pip package, a separate testing and continuous-integration process, and have an easy doc build using readthedocs. Maybe we should do this in 2 steps: first move, keeping exactly the same code. That's just my opinion. PS: I can grant access to either http://pandas-datareaders.readthedocs.org/en/latest/ or https://github.com/femtotrader/pandas_datareaders
ok, let's create another package under pydata.
I also forgot to say that I can also grant access to https://pypi.python.org/pypi/pandas_datareaders (even ownership to @wesm and @jreback, since they are the pandas PyPI package owners). I just want this part of pandas to be improved. Please just tell me what to do now. I will be able to help Friday and part of next week. Maybe we could move the docs first.
@femtotrader what's the backwards-compat situation with using pandas_datareaders vs pandas? That is: what we want IMO is to just drop pandas_datareaders (or whatever) into pandas.io, so that we don't break users' code (whilst it's independent, for now, they must change their imports). Make sense? (also, does your lib work on 3.x? you should add that to travis.) IMO docs can be last, since if we were to change this it could remain completely behind the scenes. I guess I don't know how much API you've changed/improved and so what a good migration strategy is.
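One mechanical way to keep an eye on that backwards compat would be an API-parity check between the old module and the new package, the kind of thing a CI job could run. This is a sketch with stand-in modules, not code from either codebase; `missing_api` and the two dummy modules are assumptions for illustration.

```python
import types


def public_names(mod):
    """Public attributes a module exposes (underscore names ignored)."""
    return {name for name in dir(mod) if not name.startswith("_")}


def missing_api(old_mod, new_mod):
    """Names the old module exposed that the new package dropped."""
    return sorted(public_names(old_mod) - public_names(new_mod))


# Stand-in modules so the sketch runs anywhere; in practice old_mod would
# be pandas.io.data and new_mod the first split-out release.
old = types.ModuleType("old")
old.DataReader = lambda *args: None
old.Options = lambda *args: None

new = types.ModuleType("new")
new.DataReader = lambda *args: None

print(missing_api(old, new))  # ['Options']
```

An empty list from `missing_api` would mean old import sites keep working name-for-name; it deliberately says nothing about signatures or behavior.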
Needs some readme + docs/readthedocs love... I totally missed
I can look at the docs if you want. But some other things to discuss:
@jorisvandenbossche No, I was thinking of it as a dependency (e.g. have...). These are... Similarly, I think extracting out pandas.compat could be useful for other packages (which depend on pandas). Edit: thinking about it, I agree gbq doesn't belong in pandas_datareader, but I think it could easily live in a separate package.
In my mind I also think that
all of the modules moved to
agree with @jreback on About having
see issue here: pydata/pandas-datareader#15 in a nutshell. I think it's important that the first release (say 0.1) of
@jreback any reason for pandas not to do the import for you going forward?
Either as a dependency or as a soft dependency (e.g. throw an error that you need to
I think a soft dep is fine
@jreback sounds good! |
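The soft-dependency idea agreed on above could look roughly like this. It is a sketch, not the actual pandas code: `soft_import`, the demo module name, and the message wording are all assumptions; the point is that a missing reader package produces an actionable error instead of a bare ImportError.

```python
import importlib


def soft_import(module_name, pip_name):
    """Import module_name, or raise an ImportError telling the user
    exactly what to pip-install (a sketch of the 'soft dep' idea)."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            f"{module_name} is now a separate package; "
            f"install it with: pip install {pip_name}"
        ) from None


# Demo with a name that is certainly not installed, so the message shows:
try:
    soft_import("no_such_reader_pkg", "pandas-datareader")
except ImportError as exc:
    print(exc)
# no_such_reader_pkg is now a separate package; install it with: pip install pandas-datareader
```

A stub left behind at the old import path could call a helper like this, so existing code keeps working once the new package is installed.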
Another question: what about the docs? (I started with that in pydata/pandas-datareader#18, but was not fully sure about the path to follow.) Should they also stay in the pandas docs? Or refer to the pandas-datareader docs? Which import do we use there? The 'real'
closed by #10870 |
Opening an issue to discuss this: should we think about moving the functionality to read online data sources to a separate package?
This was mentioned recently here #8842 (comment) and #8631 (comment)
Some reasons why we would want to move it:
Some questions that come to mind:
- Do we move all of `io.data`? (the `DataReader` function for Yahoo and Google Finance, FRED and Fama/French, and the `Options` class)
- What about `io.wb` (certain World Bank data) and `io.ga` (google analytics interface)?

Pinging some of the people who have worked on these subpackages (certainly add others if you know):
@dstephens99 @MichaelWS @kdiether @cpcloud @vincentarelbundock @jnmclarty
@jreback @immerrr @TomAugspurger @hayd @jtratner