- Using the Python ijson parser
All Python code is Python 3.5+.
A few months ago, I had to quickly extract a small amount of data from a large, deeply nested JSON file and export it to CSV. I was working in C++ and Python on this project, so my first attempts to extract the data used the Python json module and the Pandas JSON reader. The file was too big and too deeply nested for either tool to work well. The solution? The ijson parser.
I can’t share the original, so I’m using the JSON file sf-city-lots-json as an example. Structurally, the JSON is simpler, but the file is of similar size and has enough structure to be interesting to parse.
Taking a brute-force approach, I explored the file directly using the parse function, which generates (prefix, event, value) tuples: a prefix giving the node’s location in the JSON object tree, a parse event, and the associated value.
import ijson

def parse_json(json_filename):
    with open(json_filename, 'rb') as input_file:
        # load json iteratively
        parser = ijson.parse(input_file)
        for prefix, event, value in parser:
            print('prefix={}, event={}, value={}'.format(prefix, event, value))

if __name__ == '__main__':
    parse_json('./data/citylots_small.json')
Prefixes describe the location of keys or names in the object tree. Events report value types, mark the start and end of arrays (start_array, end_array) and objects (start_map, end_map), and mark keys (map_key).
The ijson parser works efficiently with large JSON files by leveraging generator iterators and yield expressions to avoid loading the entire structure into memory at once. The object tree is also an efficient way to represent nested JSON.
This is the beginning of the JSON file:
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "MAPBLKLOT": "0001001", "BLKLOT": "0001001", "BLOCK_NUM": "0001", "LOT_NUM": "001", "FROM_ST": "0", "TO_ST": "0", "STREET": "UNKNOWN", "ST_
TYPE": null, "ODD_EVEN": "E" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -122.422003528252475, 37.808480096967251, 0.0 ], [ -122.422076013325281, 37.80883501981508
5, 0.0 ], [ -122.421102174348633, 37.808803534992904, 0.0 ], [ -122.421062569067274, 37.808601056818148, 0.0 ], [ -122.422003528252475, 37.808480096967251, 0.0 ] ] ] } }
...
The output from the script is pretty verbose. Breaking it down, the first few lines of the file are:
{
"type": "FeatureCollection",
"features": [
Which is parsed as:
prefix=, event=start_map, value=None
prefix=, event=map_key, value=type
prefix=type, event=string, value=FeatureCollection
prefix=, event=map_key, value=features
prefix=features, event=start_array, value=None
This is an example object from the “features” array:
{ "type": "Feature", "properties": { "MAPBLKLOT": "0001001", "BLKLOT": "0001001", "BLOCK_NUM": "0001", "LOT_NUM": "001", "FROM_ST": "0", "TO_ST": "0", "STREET": "UNKNOWN", "ST_TYPE": null, "ODD_EVEN": "E" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -122.422003528252475, 37.808480096967251, 0.0 ], [ -122.422076013325281, 37.808835019815085, 0.0 ], [ -122.421102174348633, 37.808803534992904, 0.0 ], [ -122.421062569067274, 37.808601056818148, 0.0 ], [ -122.422003528252475, 37.808480096967251, 0.0 ] ] ] } }
The parser output describes all the keys in the object hierarchy, their types and associated values.
prefix=features.item, event=start_map, value=None
prefix=features.item, event=map_key, value=type
prefix=features.item.type, event=string, value=Feature
prefix=features.item, event=map_key, value=properties
prefix=features.item.properties, event=start_map, value=None
prefix=features.item.properties, event=map_key, value=MAPBLKLOT
prefix=features.item.properties.MAPBLKLOT, event=string, value=0001001
prefix=features.item.properties, event=map_key, value=BLKLOT
prefix=features.item.properties.BLKLOT, event=string, value=0001001
prefix=features.item.properties, event=map_key, value=BLOCK_NUM
prefix=features.item.properties.BLOCK_NUM, event=string, value=0001
prefix=features.item.properties, event=map_key, value=LOT_NUM
prefix=features.item.properties.LOT_NUM, event=string, value=001
prefix=features.item.properties, event=map_key, value=FROM_ST
prefix=features.item.properties.FROM_ST, event=string, value=0
prefix=features.item.properties, event=map_key, value=TO_ST
prefix=features.item.properties.TO_ST, event=string, value=0
prefix=features.item.properties, event=map_key, value=STREET
prefix=features.item.properties.STREET, event=string, value=UNKNOWN
prefix=features.item.properties, event=map_key, value=ST_TYPE
prefix=features.item.properties.ST_TYPE, event=null, value=None
prefix=features.item.properties, event=map_key, value=ODD_EVEN
prefix=features.item.properties.ODD_EVEN, event=string, value=E
prefix=features.item.properties, event=end_map, value=None
prefix=features.item, event=map_key, value=geometry
prefix=features.item.geometry, event=start_map, value=None
prefix=features.item.geometry, event=map_key, value=type
prefix=features.item.geometry.type, event=string, value=Polygon
prefix=features.item.geometry, event=map_key, value=coordinates
prefix=features.item.geometry.coordinates, event=start_array, value=None
prefix=features.item.geometry.coordinates.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422003528252475
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808480096967251
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422076013325281
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808835019815085
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.421102174348633
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808803534992904
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.421062569067274
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808601056818148
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422003528252475
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808480096967251
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item, event=end_array, value=None
prefix=features.item.geometry.coordinates, event=end_array, value=None
prefix=features.item.geometry, event=end_map, value=None
prefix=features.item, event=end_map, value=None
The parsed output for the end of the file:
]
}
It closes the “features” array and the top-level object:
prefix=features, event=end_array, value=None
prefix=, event=end_map, value=None
Now that I understand the JSON structure and how the prefixes are formatted, I can extract a specific item from the file. For example, if I wanted property lot numbers:
import ijson

def extract_lot_numbers(json_filename):
    with open(json_filename, 'rb') as input_file:
        lot_numbers = ijson.items(input_file, 'features.item.properties.LOT_NUM')
        for number in lot_numbers:
            print('Lot number: {}'.format(number))

if __name__ == '__main__':
    extract_lot_numbers('./data/citylots.json')
Which outputs:
Lot number: 001
Lot number: 001
Lot number: 002
Lot number: 001
...
I can also filter the objects generated by ijson using a generator expression. For example, if I wanted all the properties with street type “ST”:
import ijson

def extract_properties(json_filename):
    with open(json_filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'features.item.properties')
        # a generator expression keeps the filtering lazy
        street_properties = (o for o in objects if o['ST_TYPE'] == 'ST')
        for prop in street_properties:
            print('Property on a street (ST): {}'.format(prop))

if __name__ == '__main__':
    extract_properties('./data/citylots.json')
Which outputs:
Property on a street (ST): {'MAPBLKLOT': '0005001', 'BLKLOT': '0005001', 'BLOCK_NUM': '0005', 'LOT_NUM': '001', 'FROM_ST': '206', 'TO_ST': '286', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0005001', 'BLKLOT': '0005001', 'BLOCK_NUM': '0005', 'LOT_NUM': '001', 'FROM_ST': '206', 'TO_ST': '286', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0006001', 'BLKLOT': '0006001', 'BLOCK_NUM': '0006', 'LOT_NUM': '001', 'FROM_ST': '350', 'TO_ST': '366', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0007001', 'BLKLOT': '0007001', 'BLOCK_NUM': '0007', 'LOT_NUM': '001', 'FROM_ST': '2936', 'TO_ST': '2936', 'STREET': 'HYDE', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
...
This tutorial is nice! How can I get all the JSON records (full JSON objects) from a very large JSON file where there is a particular key (field)?
This was awesome, thank you! I cannot find any documentation on the ijson library… probably I don’t know where to look. But this will teach me how to use it!
I am looking to parse Twitter JSON data running into hundreds of thousands of tweets, around 2 GB in size. There are 4 key elements that I need to extract, but my computer hangs while processing such large data. I have come to know that ijson can solve this. Can you comment on this?
Because ijson does not load an entire data set into memory (it uses Python generators), it will handle large datasets without hanging. You’ll want to be strategic about how you iterate over your data and find the elements you need to extract. Fewer steps and less data loaded into memory while iterating is best.
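For example, if the tweets are stored in a top-level JSON array, a minimal sketch might look like this (the file name and key names are hypothetical, since I don’t know your data’s layout):

import ijson

# Minimal sketch with hypothetical names: assumes the tweets are objects
# in a top-level JSON array and that these four keys exist in each tweet.
def extract_tweet_fields(json_filename):
    with open(json_filename, 'rb') as input_file:
        # 'item' addresses each element of a top-level array
        for tweet in ijson.items(input_file, 'item'):
            yield (tweet.get('id'), tweet.get('created_at'),
                   tweet.get('user', {}).get('screen_name'),
                   tweet.get('text'))

if __name__ == '__main__':
    for fields in extract_tweet_fields('./data/tweets.json'):
        print(fields)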
Thanks for this, Ayla! I have a problem that I could not handle myself, and I was wondering if you could point me in the right direction. I have a gigantic JSON file (120 GB) in which every object has the same attributes. I’m pasting only one object below as an example.
{"archived":false,"author":"WGReddit","author_created_utc":1508554552,"author_flair_background_color":null,"author_flair_css_class":null,"author_flair_richtext":[],"author_flair_template_id":null,"author_flair_text":null,"author_flair_text_color":null,"author_flair_type":"text","author_fullname":"t2_ffhn7qm","author_patreon_flair":false,"body":">**old**","can_gild":true,"can_mod_post":false,"collapsed":false,"collapsed_reason":null,"controversiality":0,"created_utc":1541030400,"distinguished":null,"edited":false,"gilded":0,"gildings":{"gid_1":0,"gid_2":0,"gid_3":0},"id":"e8tki4s","is_submitter":false,"link_id":"t3_9syrca","no_follow":true,"parent_id":"t1_e8sgiw7","permalink":"\/r\/movies\/comments\/9syrca\/pluto_tv_is_having_an_all_day_marathon_of_the\/e8tki4s\/","removal_reason":null,"retrieved_on":1544848333,"score":5,"send_replies":true,"stickied":false,"subreddit":"movies","subreddit_id":"t5_2qh3s","subreddit_name_prefixed":"r\/movies","subreddit_type":"public"}
I only need the “author”, “author_flair_text” and “body” attributes for a machine learning project, but I couldn’t figure out how to do it. Can you suggest how? I’m a total noob and just started learning Python 4 months ago.
Thanks in advance!
It’s definitely possible to do with ijson. For example, if your objects are in an array, similar to:
{
    "records": [
        {
            "archived": false,
            "author": "WGReddit",
            "author_created_utc": 1508554552,
            "author_flair_background_color": null,
            "author_flair_css_class": null,
            "author_flair_richtext": [],
            "author_flair_template_id": null,
            "author_flair_text": null,
            "author_flair_text_color": null,
            "author_flair_type": "text",
            "author_fullname": "t2_ffhn7qm",
            "author_patreon_flair": false,
            "body": ">**old**",
            "can_gild": true,
            "can_mod_post": false,
            "collapsed": false,
            "collapsed_reason": null,
            "controversiality": 0,
            "created_utc": 1541030400,
            "distinguished": null,
            "edited": false,
            "gilded": 0,
            "gildings": {
                "gid_1": 0,
                "gid_2": 0,
                "gid_3": 0
            },
            "id": "e8tki4s",
            "is_submitter": false,
            "link_id": "t3_9syrca",
            "no_follow": true,
            "parent_id": "t1_e8sgiw7",
            "permalink": "/r/movies/comments/9syrca/pluto_tv_is_having_an_all_day_marathon_of_the/e8tki4s/",
            "removal_reason": null,
            "retrieved_on": 1544848333,
            "score": 5,
            "send_replies": true,
            "stickied": false,
            "subreddit": "movies",
            "subreddit_id": "t5_2qh3s",
            "subreddit_name_prefixed": "r/movies",
            "subreddit_type": "public"
        },
        ...
    ]
}
you can get the “author”, “author_flair_text” and “body” attributes using the ijson parser. It doesn’t build full Python objects and works as a generator, so the memory footprint isn’t very big. This gist is a naive example of how to write the data out to a CSV file.
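In miniature, the idea is something like this (a naive sketch, assuming the “records” wrapper above; file names are placeholders):

import csv
import ijson

# Naive sketch, assuming the "records" wrapper shown above;
# the file names below are placeholders.
WANTED = {
    'records.item.author': 'author',
    'records.item.author_flair_text': 'author_flair_text',
    'records.item.body': 'body',
}

def comments_to_csv(json_filename, csv_filename):
    with open(json_filename, 'rb') as input_file, \
         open(csv_filename, 'w', newline='') as output_file:
        writer = csv.writer(output_file)
        writer.writerow(['author', 'author_flair_text', 'body'])
        row = {}
        for prefix, event, value in ijson.parse(input_file):
            if prefix in WANTED:
                row[WANTED[prefix]] = value
            elif event == 'end_map' and prefix == 'records.item':
                # one record finished: flush the collected fields
                writer.writerow([row.get('author'),
                                 row.get('author_flair_text'),
                                 row.get('body')])
                row = {}

if __name__ == '__main__':
    comments_to_csv('./data/comments.json', './data/comments.csv')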
Unfortunately, iterating over gigantic JSON will take a long time even on a powerful machine. You could try to find tools that read the JSON objects in batches, or sample the data to get a much smaller data set to start with. Better yet, if you’re getting the data through an API, you could try paging through it. This article from the Machine Learning Mastery blog may help: 7 Ways to Handle Large Data Files for Machine Learning.
Hi,
I have a large JSON file which contains many lists, and I have to fetch some data from them, but I can’t do it without applying 5 for loops, which is very inefficient. Is there any way to do it efficiently? I am giving a small snippet of my JSON below.
JSON file:
{
    "p": [
        {
            "id": 1,
            "q": [
                {
                    "c": [
                        {
                            "i": []
                        }
                    ]
                }
            ]
        }
    ]
}
I have to fetch the ‘i’ list from the JSON, but with so many nested lists it becomes difficult to extract. Please help!
This snippet of code prints out ijson’s parsing of your JSON:
import ijson

# open the file in a context manager so it is closed after parsing
with open('nested_test.json', 'rb') as input_file:
    parser = ijson.parse(input_file)
    for prefix, event, value in parser:
        print(prefix, event, value)
The ‘i’ list gets parsed as p.item.q.item.c.item.i:
(ml_env) ayla@ayla-xps:~/dev/tmp$ python test_ijson_nested.py
start_map None
map_key p
p start_array None
p.item start_map None
p.item map_key id
p.item.id number 1
p.item map_key q
p.item.q start_array None
p.item.q.item start_map None
p.item.q.item map_key c
p.item.q.item.c start_array None
p.item.q.item.c.item start_map None
p.item.q.item.c.item map_key i
p.item.q.item.c.item.i start_array None
p.item.q.item.c.item.i end_array None
p.item.q.item.c.item end_map None
p.item.q.item.c end_array None
p.item.q.item end_map None
p.item.q end_array None
p.item end_map None
p end_array None
end_map None
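Knowing that prefix, ijson.items can then pull each ‘i’ list out directly. A sketch against the same test file:

import ijson

# pull each 'i' list directly using the prefix found above
with open('nested_test.json', 'rb') as input_file:
    for i_list in ijson.items(input_file, 'p.item.q.item.c.item.i'):
        print(i_list)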
Hi Ayla!
Great article, it has been extremely helpful for me in a couple of projects I am developing.
I am having some trouble exploiting ijson’s features, as the main use case for the items method is array-like JSON, and I have just a massive string.
Have you dealt with long-format JSONs before?
I am wondering if there is any way to get ijson to give me just the leaves under each directory, one by one, without having to reconstruct the JSON from its ‘prefix’, ‘event’ and ‘value’.
Thank you! I’m glad you found it helpful. I think the example in the ijson README with the description starting with “Most common usage is having ijson yield native Python objects out of a JSON stream located under a prefix” will help you, since it shows how to get Python objects from a known path.
Thanks Ayla, that helped.
Though I am stuck while reading a GeoJSON file.
Following your blog, I wrote this:
with open('head_bh_vaishali_farmSegments.json', 'rb') as input_file:
    objects = ijson.items(input_file, 'features.item.geometry')
    for o in objects:
        print(o)
which outputs this:
{'type': 'Polygon', 'coordinates': [[[Decimal('85.279564118794596'), Decimal('25.850682127125399')], [Decimal('85.279570883678701'), Decimal('25.850688220372199')], [Decimal('85.279719711130596'), Decimal('25.850688220372199')]
As you can see, it wraps each value in Decimal, and that makes my code crash.
objects = ijson.items(input_file, 'features.item.properties')
Does it load properties from all the items in the features list?
Hi,
Thanks for putting this together. I am new to Python, and I haven’t found any useful detail on ijson for a newbie anywhere but here. I am trying to load a large JSON file (800 MB) into a dataframe. The regular method ends up crashing my container.
So I found out about ijson. I tried your code on my file, which looks like the sample below but with more attributes. I was able to print the first record from the JSON, but after that I got an error: raise common.JSONError("Additional data"). I am guessing it has to do with my file, since there are no [] brackets around the individual records. I have no control over the format of the JSON file. Is there any way I can make it work with your code?
{“ID”: “2342”, “code”:398, “action”: “EXECUTED”}
{“ID”: “1142”, “code”:333, “action”: “DELETED”}
{“ID”: “0042”, “code”:898, “action”: “ADDED”}
Any suggestion would be really appreciated!
Cheers
The ijson stable version definitely needs the file to contain valid JSON. There’s a new release candidate (https://github.com/ICRAR/ijson/tree/v3.0rc1) that supports using your own reader and then parsing using coroutines, but it looks like that needs valid JSON too. Your file would be valid JSON if these objects were in an array ([] brackets around individual comma-separated records). You could try to correct the file using Python, or with command-line tools like awk or sed.
It may be reasonable to read lines of your file using generators and use the Python json module instead. It may be slow, but you wouldn’t have to keep the file contents in memory. Code is in this gist: https://gist.github.com/a-y-khan/d9eacb817c812fcbb201522ec12047b8.
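In miniature, that approach looks something like this (a sketch assuming one JSON object per line, as in your sample; the file name is a placeholder):

import json

# read one JSON object per line so the whole file never sits in memory
def read_records(json_filename):
    with open(json_filename, 'r') as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == '__main__':
    for record in read_records('./data/records.json'):
        print(record['ID'], record['code'], record['action'])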
Another possibility is to use Dask bag (https://docs.dask.org/en/latest/bag-creation.html). It can parse records from a file like the one you describe. Assuming your JSON records are in a simple text file and the Pandas dataframe created from those records can fit into memory, this gist shows how to use Dask bag to get a Pandas dataframe: https://gist.github.com/a-y-khan/9d3913552624c62fc2ac882c025882ee. If the Pandas dataframe can’t fit into memory, this thread may be helpful: https://github.com/dask/dask/issues/1651.
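Roughly, the Dask bag version might look like this (again a sketch, assuming one JSON record per line and that the final dataframe fits in memory; the file name is a placeholder):

import json
import dask.bag as db

# build a bag of records lazily, one JSON object per line
bag = db.read_text('./data/records.json').map(json.loads)
# convert to a Dask dataframe, then materialize it as Pandas
df = bag.to_dataframe()
pandas_df = df.compute()
print(pandas_df.head())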
Hi,
Thanks for your help. I solved it by setting the property “MultipleLines=True”. The JSON part is working fine now, no memory issues. But on the other side, reading from the database is causing the memory problem, since there is one table with 560 million records from which I am trying to bring just two columns into a dataframe. I’ll try Dask to make this work.