Support ndjson -- newline delimited json -- for streaming data. #9180
Comments
something like this, perhaps:

def iterndjson(df):
    generator = df.iterrows()
    ndjson = []
    row = True
    while row:
        try:
            row = next(generator)
            ndjson.append(row[1].to_dict())
        except StopIteration:
            row = None
    return ndjson

> df = pd.DataFrame({'one': [1,2,3,4], 'two': [3,4,5,6]})
> iterndjson(df)
[{'one': 1, 'two': 3},
 {'one': 2, 'two': 4},
 {'one': 3, 'two': 5},
 {'one': 4, 'two': 6}] |
have a look here http://pandas.pydata.org/pandas-docs/stable/io.html#json |
this would also be great for things like BigQuery, which outputs JSON files as new line delimited JSON. The issue is that it's not actually valid JSON (since it ends up as multiple objects). @Karissa maybe you could hack around this by reading the file row by row, using ujson/json to read each row into a python dictionary and then passing the whole thing to the DataFrame constructor? |
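A minimal sketch of the row-by-row workaround described above, assuming each non-empty line holds exactly one JSON object. The helper name `read_ndjson` is hypothetical, and the standard-library `json` module stands in for ujson:

```python
import io
import json

import pandas as pd


def read_ndjson(handle):
    """Parse one JSON object per non-empty line and build a DataFrame.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    records = [json.loads(line) for line in handle if line.strip()]
    return pd.DataFrame(records)


# In-memory stand-in for a file opened with open(path, encoding="utf-8").
sample = io.StringIO('{"one": 1, "two": 3}\n{"one": 2, "two": 4}\n')
df = read_ndjson(sample)
```

This holds every parsed record in memory before building the frame, so it suits files that fit in RAM rather than unbounded streams.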
@mrocklin and I looked into this. The simplest solution we came up with was loading the file into a buffer, adding the appropriate commas and brackets, then passing it back to read_json. Below are a few timings on this approach; the current implementation of read_json seems to be a bit slower than ujson, so we felt the simplicity of this approach didn't make anything too slow.
I'll try to get a patch together unless someone thinks there is a better solution. The notion would be to add a flag 'line=True' to the reader. |
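The buffer approach could look roughly like the sketch below. It is an illustration under the assumption that no JSON object spans more than one line, not the patch that was eventually written; `ndjson_to_json_array` is a hypothetical name:

```python
import io

import pandas as pd


def ndjson_to_json_array(text):
    """Join newline-delimited JSON objects with commas and wrap them in brackets."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "[" + ",".join(lines) + "]"


raw = '{"one": 1, "two": 3}\n{"one": 2, "two": 4}\n'

# The rewritten text is an ordinary JSON array of records, which read_json accepts.
df = pd.read_json(io.StringIO(ndjson_to_json_array(raw)), orient="records")
```

The whole input is rewritten in memory first, which keeps the code short at the cost of one extra copy of the data.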
@Karissa is there any difference between ndjson and jsonlines (http://jsonlines.org/)? I've never heard of ndjson, but it seems to be the same thing. |
Looks like jsonlines includes a few more rules, including a specification about UTF-8 encoding. http://jsonlines.org/ vs http://ndjson.org/ |
Hi, is there a way to efficiently Right now I just Thanks! |
you would have to show an example |
can you cat |
Here. |
This won't be that efficient, but it's reasonably idiomatic.
|
I don't think there is an easy way to do this ATM. In pandas2 this will be more built in. cc @wesm |
It would be much more sustainable to do this kind of parsing with RapidJSON in C++; I think we should be able to deliver this en route to pandas2 |
Just checked on my production dataset:
First, fortunately it actually gets the job done better than using Second, unfortunately it is definitely less efficient than using |
@iFantastic yes, unfortunately our handling of newline delimited JSON is not great right now; stay tuned to wesm/pandas2#71 |
@jreback thanks for your time, very helpful! |
@jreback I have 1 more question: Is it possible to achieve formatting like yours
but using I mean, to get nicely formatted column names like age, description, price instead of age.string, description.nULL, description.string, price.string in the columns? Thanks! |
@iFantastic you can just post-process the columns
|
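One way such post-processing might look, as a sketch only: the frame below is made up from the column names quoted in the question, and it assumes the `.nULL` column is entirely empty so it can be dropped before the type suffix is stripped.

```python
import pandas as pd

# Hypothetical frame using the type-suffixed column names from the question.
df = pd.DataFrame(
    {
        "age.string": ["12", "7"],
        "description.nULL": [None, None],
        "description.string": ["red bike", None],
        "price.string": ["100", "40"],
    }
)

# Drop columns that hold no values at all (assumption: ".nULL" columns are empty).
cleaned = df.dropna(axis="columns", how="all")

# Keep only the part of each column name before the first dot.
cleaned = cleaned.rename(columns=lambda name: name.split(".", 1)[0])
```

If a `.nULL` column did carry values, stripping the suffix would leave two columns with the same name, so that case would need an explicit merge step instead.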
@jreback Thanks! |
For ordinary ndjson files (like those output by BigQuery), I use the

import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)

In Python 2, you'd probably want to use |
FYI.

import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True) |
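For the streaming use case this issue asks about, `read_json` also takes a `chunksize` argument together with `lines=True`, which returns an iterator of DataFrames rather than one frame. A small round-trip sketch, using an in-memory buffer in place of a real file (note that `to_json` only accepts `lines=True` when `orient="records"`):

```python
import io

import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [3, 4, 5, 6]})

# With no path given, to_json returns the text; lines=True requires orient="records".
text = df.to_json(orient="records", lines=True)

# chunksize makes read_json yield DataFrames of at most two rows each.
chunks = list(pd.read_json(io.StringIO(text), lines=True, chunksize=2))
```

Collecting the chunks into a list is only for demonstration; in a real stream you would process each chunk inside a `for` loop and let it go.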
Hey all,
I'm a developer on dat project (git for data) and we are building a python library to interact with the data store.
Everything in dat is streaming, and we use newline delimited json as the official transfer format between processes.
Take a look at the specification for newline delimited json here
Does pandas support this yet, and if not, would you consider adding a to_ndjson function to the existing output formats? For example, the following table:
Would be converted to
For general streaming use cases, it might be nice to also consider other ways of supporting this format, like a generator function that outputs ndjson-able objects
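A generator along these lines could be sketched as follows. `iter_ndjson` and its `chunk_rows` parameter are hypothetical names, not pandas APIs; the sketch serializes a slice of rows at a time with `to_json(orient="records", lines=True)` and yields the resulting lines one by one:

```python
import pandas as pd


def iter_ndjson(df, chunk_rows=1000):
    """Yield one JSON text line per row, serializing chunk_rows rows at a time.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    for start in range(0, len(df), chunk_rows):
        block = df.iloc[start:start + chunk_rows]
        # to_json with lines=True emits one JSON object per row, newline separated.
        yield from block.to_json(orient="records", lines=True).splitlines()


df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [3, 4, 5, 6]})
lines = list(iter_ndjson(df, chunk_rows=3))
```

A consumer could write each yielded line followed by "\n" to a socket or pipe, so only one chunk of serialized text is held in memory at a time.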