Using the Python ijson parser

Published November 2, 2017 in programming

All Python code is Python 3.5+.

A few months ago, I had to quickly extract a small amount of data from a large and deeply nested JSON file and export it to CSV. I was working in C++ and Python on this project, so my first attempts to extract the data used the Python json module and the Pandas JSON reader. The file was too big and too deeply nested for either tool to work well. The solution? The ijson parser.

I can’t share the original, so I’m using the JSON file sf-city-lots-json as an example. Structurally, the JSON is simpler, but the file is of similar size and has enough structure to be interesting to parse.

Taking a brute-force approach, I explored the file directly using the parse function, which yields (prefix, event, value) tuples describing each element’s location in the JSON object tree, its type, and its value.

import ijson


def parse_json(json_filename):
    with open(json_filename, 'rb') as input_file:
        # load json iteratively
        parser = ijson.parse(input_file)
        for prefix, event, value in parser:
            print('prefix={}, event={}, value={}'.format(prefix, event, value))


if __name__ == '__main__':
    parse_json('./data/citylots_small.json')

Prefixes describe the location of keys or names in the object tree. Events report value types, mark the start and end of arrays (start_array, end_array) and objects (start_map, end_map), and mark keys (map_key).

The ijson parser works efficiently with large JSON files by leveraging generator iterators and yield expressions, so the entire structure is never loaded into memory at once. The object tree of prefixes is also an efficient way to navigate nested JSON.
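
To illustrate working with these event tuples before moving on, here is a minimal sketch (a hypothetical summarize_features helper run against the same file) that counts features and collects the distinct street types by matching on prefixes, without building any objects in memory:

import ijson


def summarize_features(json_filename):
    feature_count = 0
    street_types = set()
    with open(json_filename, 'rb') as input_file:
        for prefix, event, value in ijson.parse(input_file):
            # each feature starts a new object under the features array
            if prefix == 'features.item' and event == 'start_map':
                feature_count += 1
            # ST_TYPE values appear as string events at this prefix (or null when missing)
            elif prefix == 'features.item.properties.ST_TYPE' and event == 'string':
                street_types.add(value)
    print('features: {}, street types: {}'.format(feature_count, sorted(street_types)))


if __name__ == '__main__':
    summarize_features('./data/citylots_small.json')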

This is the beginning of the JSON file:

{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "MAPBLKLOT": "0001001", "BLKLOT": "0001001", "BLOCK_NUM": "0001", "LOT_NUM": "001", "FROM_ST": "0", "TO_ST": "0", "STREET": "UNKNOWN", "ST_
TYPE": null, "ODD_EVEN": "E" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -122.422003528252475, 37.808480096967251, 0.0 ], [ -122.422076013325281, 37.80883501981508
5, 0.0 ], [ -122.421102174348633, 37.808803534992904, 0.0 ], [ -122.421062569067274, 37.808601056818148, 0.0 ], [ -122.422003528252475, 37.808480096967251, 0.0 ] ] ] } }
...

The output from the script is pretty verbose. Breaking it down, the first few lines of the file are:

{
"type": "FeatureCollection",
"features": [

Which is parsed as:

prefix=, event=start_map, value=None
prefix=, event=map_key, value=type
prefix=type, event=string, value=FeatureCollection
prefix=, event=map_key, value=features
prefix=features, event=start_array, value=None

This is an example object from the “features” array:

{ "type": "Feature", "properties": { "MAPBLKLOT": "0001001", "BLKLOT": "0001001", "BLOCK_NUM": "0001", "LOT_NUM": "001", "FROM_ST": "0", "TO_ST": "0", "STREET": "UNKNOWN", "ST_TYPE": null, "ODD_EVEN": "E" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -122.422003528252475, 37.808480096967251, 0.0 ], [ -122.422076013325281, 37.808835019815085, 0.0 ], [ -122.421102174348633, 37.808803534992904, 0.0 ], [ -122.421062569067274, 37.808601056818148, 0.0 ], [ -122.422003528252475, 37.808480096967251, 0.0 ] ] ] } }

The parser output describes all the keys in the object hierarchy, their types and associated values.

prefix=features.item, event=start_map, value=None
prefix=features.item, event=map_key, value=type
prefix=features.item.type, event=string, value=Feature
prefix=features.item, event=map_key, value=properties
prefix=features.item.properties, event=start_map, value=None
prefix=features.item.properties, event=map_key, value=MAPBLKLOT
prefix=features.item.properties.MAPBLKLOT, event=string, value=0001001
prefix=features.item.properties, event=map_key, value=BLKLOT
prefix=features.item.properties.BLKLOT, event=string, value=0001001
prefix=features.item.properties, event=map_key, value=BLOCK_NUM
prefix=features.item.properties.BLOCK_NUM, event=string, value=0001
prefix=features.item.properties, event=map_key, value=LOT_NUM
prefix=features.item.properties.LOT_NUM, event=string, value=001
prefix=features.item.properties, event=map_key, value=FROM_ST
prefix=features.item.properties.FROM_ST, event=string, value=0
prefix=features.item.properties, event=map_key, value=TO_ST
prefix=features.item.properties.TO_ST, event=string, value=0
prefix=features.item.properties, event=map_key, value=STREET
prefix=features.item.properties.STREET, event=string, value=UNKNOWN
prefix=features.item.properties, event=map_key, value=ST_TYPE
prefix=features.item.properties.ST_TYPE, event=null, value=None
prefix=features.item.properties, event=map_key, value=ODD_EVEN
prefix=features.item.properties.ODD_EVEN, event=string, value=E
prefix=features.item.properties, event=end_map, value=None
prefix=features.item, event=map_key, value=geometry
prefix=features.item.geometry, event=start_map, value=None
prefix=features.item.geometry, event=map_key, value=type
prefix=features.item.geometry.type, event=string, value=Polygon
prefix=features.item.geometry, event=map_key, value=coordinates
prefix=features.item.geometry.coordinates, event=start_array, value=None
prefix=features.item.geometry.coordinates.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422003528252475
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808480096967251
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422076013325281
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808835019815085
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.421102174348633
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808803534992904
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.421062569067274
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808601056818148
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item.item, event=start_array, value=None
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=-122.422003528252475
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=37.808480096967251
prefix=features.item.geometry.coordinates.item.item.item, event=number, value=0
prefix=features.item.geometry.coordinates.item.item, event=end_array, value=None
prefix=features.item.geometry.coordinates.item, event=end_array, value=None
prefix=features.item.geometry.coordinates, event=end_array, value=None
prefix=features.item.geometry, event=end_map, value=None
prefix=features.item, event=end_map, value=None

This is the end of the file:

]
}

Which closes up the “features” array and the top-level object, and is parsed as:

prefix=features, event=end_array, value=None
prefix=, event=end_map, value=None

Now that I understand the JSON structure and how the prefixes are formatted, I can extract a specific item from the file. For example, if I wanted property lot numbers:

import ijson


def extract_lot_numbers(json_filename):
    with open(json_filename, 'rb') as input_file:
        lot_numbers = ijson.items(input_file, 'features.item.properties.LOT_NUM')
        for number in lot_numbers:
            print('Lot number: {}'.format(number))


if __name__ == '__main__':
    extract_lot_numbers('./data/citylots.json')

Which outputs:

Lot number: 001
Lot number: 001
Lot number: 002
Lot number: 001
...

I can also filter the objects generated by ijson using a generator expression. For example, if I wanted all the properties with street type “ST”:

import ijson


def extract_properties(json_filename):
    with open(json_filename, 'rb') as input_file:
        objects = ijson.items(input_file, 'features.item.properties')
        street_properties = (o for o in objects if o['ST_TYPE'] == 'ST')
        for prop in street_properties:
            print('Property on a street (ST): {}'.format(prop))


if __name__ == '__main__':
    extract_properties('./data/citylots.json')

Which outputs:

Property on a street (ST): {'MAPBLKLOT': '0005001', 'BLKLOT': '0005001', 'BLOCK_NUM': '0005', 'LOT_NUM': '001', 'FROM_ST': '206', 'TO_ST': '286', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0005001', 'BLKLOT': '0005001', 'BLOCK_NUM': '0005', 'LOT_NUM': '001', 'FROM_ST': '206', 'TO_ST': '286', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0006001', 'BLKLOT': '0006001', 'BLOCK_NUM': '0006', 'LOT_NUM': '001', 'FROM_ST': '350', 'TO_ST': '366', 'STREET': 'JEFFERSON', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
Property on a street (ST): {'MAPBLKLOT': '0007001', 'BLKLOT': '0007001', 'BLOCK_NUM': '0007', 'LOT_NUM': '001', 'FROM_ST': '2936', 'TO_ST': '2936', 'STREET': 'HYDE', 'ST_TYPE': 'ST', 'ODD_EVEN': 'E'}
...
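
The original goal was to export the extracted data to CSV, so here is a minimal sketch (with a hypothetical export_street_properties helper and an example street_properties.csv output path) that streams the same filtered properties straight into a CSV file with csv.DictWriter:

import csv

import ijson


def export_street_properties(json_filename, csv_filename):
    with open(json_filename, 'rb') as input_file, \
            open(csv_filename, 'w', newline='') as output_file:
        objects = ijson.items(input_file, 'features.item.properties')
        street_properties = (o for o in objects if o['ST_TYPE'] == 'ST')
        writer = None
        for prop in street_properties:
            if writer is None:
                # use the first record's keys as the CSV header
                writer = csv.DictWriter(output_file, fieldnames=list(prop.keys()))
                writer.writeheader()
            writer.writerow(prop)


if __name__ == '__main__':
    export_street_properties('./data/citylots.json', './data/street_properties.csv')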

26 comments

Zack - April 5, 2018

Thanks for this!

    Ayla Khan - April 5, 2018

    You’re welcome!

    Sidney Raimann - October 28, 2018

    I like this site because there is so much useful material on here :D.

Srini - May 25, 2018

This tutorial is nice! How can I get all the JSON records (full JSON objects) from a very large JSON file where there is a particular key (field)?

    Ayla Khan - May 25, 2018

    If you know the particular field you’re looking for, the approach I took in extract_lot_numbers could work.
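
    For example, a minimal sketch against the citylots data that keeps a whole feature object only when a particular field (ST_TYPE here, purely as an example) is present and non-null:

    import ijson

    with open('./data/citylots.json', 'rb') as input_file:
        # keep only the full feature objects whose properties contain a non-null ST_TYPE
        for feature in ijson.items(input_file, 'features.item'):
            if feature['properties'].get('ST_TYPE') is not None:
                print(feature)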

Geoff - October 5, 2018

This was awesome, thank you! I cannot find any documentation on the ijson library… probably don’t know where to look. But this will teach me how to use it!

ambijat - November 11, 2018

I am looking at parsing Twitter JSON data running to hundreds of thousands of tweets, with a size of around 2 GB. There are 4 key elements that I need to extract, but my computer hangs while processing such a large data set. I have come to know that ijson can solve this. Can you reflect on this?

    Ayla Khan - November 11, 2018

    Because ijson does not load an entire data set into memory (it uses Python generators), it will handle large datasets without hanging. You’ll want to be strategic about how you iterate over your data and find the elements you need to extract. Fewer steps and less data loaded into memory while iterating is best.
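
    As a rough sketch (the tweets.json filename, the top-level-array 'item' prefix and the field names are all assumptions about your data), extracting a few keys per tweet could look like this:

    import ijson

    # hypothetical field names; replace with the 4 elements you actually need
    fields = ('id', 'created_at', 'text', 'user')

    with open('tweets.json', 'rb') as input_file:
        # assumes the file is one top-level JSON array of tweet objects
        for tweet in ijson.items(input_file, 'item'):
            print({key: tweet.get(key) for key in fields})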

Benoni - January 12, 2019

Thanks for this, Ayla! I have a problem that I could not handle myself, and I was wondering if you could point me in the right direction. I have a gigantic JSON file (120 GB) in which every object has the same attributes. I’m pasting only one object below as an example.

{"archived":false,"author":"WGReddit","author_created_utc":1508554552,"author_flair_background_color":null,"author_flair_css_class":null,"author_flair_richtext":[],"author_flair_template_id":null,"author_flair_text":null,"author_flair_text_color":null,"author_flair_type":"text","author_fullname":"t2_ffhn7qm","author_patreon_flair":false,"body":">**old**","can_gild":true,"can_mod_post":false,"collapsed":false,"collapsed_reason":null,"controversiality":0,"created_utc":1541030400,"distinguished":null,"edited":false,"gilded":0,"gildings":{"gid_1":0,"gid_2":0,"gid_3":0},"id":"e8tki4s","is_submitter":false,"link_id":"t3_9syrca","no_follow":true,"parent_id":"t1_e8sgiw7","permalink":"\/r\/movies\/comments\/9syrca\/pluto_tv_is_having_an_all_day_marathon_of_the\/e8tki4s\/","removal_reason":null,"retrieved_on":1544848333,"score":5,"send_replies":true,"stickied":false,"subreddit":"movies","subreddit_id":"t5_2qh3s","subreddit_name_prefixed":"r\/movies","subreddit_type":"public"}

I only need the “author”, “author_flair_text” and “body” attributes for a machine learning project, but I couldn’t figure out how to do it. Can you imagine how? I’m a total noob and just started learning Python 4 months ago.

Thanks in advance!

    Ayla Khan - January 20, 2019

    It’s definitely possible to do with ijson. For example, if your objects are in an array, similar to:

    {
    "records": [
    {
    "archived": false,
    "author": "WGReddit",
    "author_created_utc": 1508554552,
    "author_flair_background_color": null,
    "author_flair_css_class": null,
    "author_flair_richtext": [],
    "author_flair_template_id": null,
    "author_flair_text": null,
    "author_flair_text_color": null,
    "author_flair_type": "text",
    "author_fullname": "t2_ffhn7qm",
    "author_patreon_flair": false,
    "body": ">**old**",
    "can_gild": true,
    "can_mod_post": false,
    "collapsed": false,
    "collapsed_reason": null,
    "controversiality": 0,
    "created_utc": 1541030400,
    "distinguished": null,
    "edited": false,
    "gilded": 0,
    "gildings": {
    "gid_1": 0,
    "gid_2": 0,
    "gid_3": 0
    },
    "id": "e8tki4s",
    "is_submitter": false,
    "link_id": "t3_9syrca",
    "no_follow": true,
    "parent_id": "t1_e8sgiw7",
    "permalink": "/r/movies/comments/9syrca/pluto_tv_is_having_an_all_day_marathon_of_the/e8tki4s/",
    "removal_reason": null,
    "retrieved_on": 1544848333,
    "score": 5,
    "send_replies": true,
    "stickied": false,
    "subreddit": "movies",
    "subreddit_id": "t5_2qh3s",
    "subreddit_name_prefixed": "r/movies",
    "subreddit_type": "public"
    },
    ...
    ]
    }

    you can get the “author”, “author_flair_text” and “body” attributes using the ijson parser. It doesn’t build the whole tree of Python objects and uses a generator, so the memory footprint isn’t very big. This gist is a naive example of how to write out the data to a CSV file.
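
    A minimal sketch along those lines (assuming the "records" wrapper shown above, with example comments.json and comments.csv filenames):

    import csv

    import ijson

    with open('comments.json', 'rb') as input_file, \
            open('comments.csv', 'w', newline='') as output_file:
        writer = csv.writer(output_file)
        writer.writerow(['author', 'author_flair_text', 'body'])
        for record in ijson.items(input_file, 'records.item'):
            # keep only the three attributes needed for the ML project
            writer.writerow([record.get('author'),
                             record.get('author_flair_text'),
                             record.get('body')])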

    Unfortunately, iterating over gigantic JSON will take a long time even if you have a powerful machine to work with. You could try to find tools to read the JSON objects in batches or sample the data to get a much smaller data set to start. Better yet, if you’re getting the data through an API, you could try paging through the data. This article from the Machine Learning Mastery blog may help: 7 Ways to Handle Large Data Files for Machine Learning.

Bhavuk - May 30, 2019

Hi,
I have a large JSON file which contains many lists, and I have to fetch some data from them, but I am not able to do it without applying 5 for loops, which is very inefficient. So is there any way to do it efficiently? I am giving a small snippet of my JSON.

JSON file:
{
"p": [
{
"id": 1,
"q": [
{
"c": [
{
"i": []
}
]
}
]
}
]
}

I have to fetch the ‘i’ list from the JSON, and since there are so many lists it becomes difficult to extract. Please help.

    Ayla Khan - June 24, 2019

    This snippet of code prints out ijson’s parsing of your JSON:


    import ijson

    parser = ijson.parse(open('nested_test.json', 'r'))
    for prefix, event, value in parser:
        print(prefix, event, value)

    The ‘i’ list gets parsed as p.item.q.item.c.item.i:

    (ml_env) ayla@ayla-xps:~/dev/tmp$ python test_ijson_nested.py
    start_map None
    map_key p
    p start_array None
    p.item start_map None
    p.item map_key id
    p.item.id number 1
    p.item map_key q
    p.item.q start_array None
    p.item.q.item start_map None
    p.item.q.item map_key c
    p.item.q.item.c start_array None
    p.item.q.item.c.item start_map None
    p.item.q.item.c.item map_key i
    p.item.q.item.c.item.i start_array None
    p.item.q.item.c.item.i end_array None
    p.item.q.item.c.item end_map None
    p.item.q.item.c end_array None
    p.item.q.item end_map None
    p.item.q end_array None
    p.item end_map None
    p end_array None
    end_map None
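
    With that prefix, ijson.items can pull the ‘i’ lists directly, without nested loops (a minimal sketch against the same nested_test.json):

    import ijson

    with open('nested_test.json', 'rb') as input_file:
        # yields each 'i' value as a plain Python list, however deeply it is nested
        for i_list in ijson.items(input_file, 'p.item.q.item.c.item.i'):
            print(i_list)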

Francesco - September 30, 2019

Hi Ayla!
Great article, it has been extremely helpful for me in a couple of projects I am developing.

I am having some trouble exploiting ijson’s features, as the main use case for the items method is array-like JSONs, and I have just a massive string.
Have you dealt with long-format JSONs before?
I am wondering if there is any way to get ijson to give me just the leaves under each directory, one by one, without having to reconstruct the JSON back from its ‘prefix’, ‘event’ and ‘value’.

    Ayla Khan - October 15, 2019

    Thank you! I’m glad you found it helpful. I think the example in the ijson README with the description starting with “Most common usage is having ijson yield native Python objects out of a JSON stream located under a prefix” will help you, since it shows how to get Python objects from a known path.
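
    For reference, newer ijson releases (3.x) also provide kvitems, which yields the (key, value) pairs under a prefix one at a time without rebuilding whole objects; a sketch reusing the citylots prefix from the article:

    import ijson

    with open('./data/citylots.json', 'rb') as input_file:
        # each iteration gives one leaf key and its parsed value
        for key, value in ijson.kvitems(input_file, 'features.item.properties'):
            print(key, value)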

Pawan Dubey - February 4, 2020

Thanks Ayla, that helped.

Though I am stuck while reading a geojson file.

As I have followed your blog, I wrote this:

with open('head_bh_vaishali_farmSegments.json', 'rb') as input_file:
    objects = ijson.items(input_file, 'features.item.geometry')
    for o in objects:
        print(o)

which outputs this:

{'type': 'Polygon', 'coordinates': [[[Decimal('85.279564118794596'), Decimal('25.850682127125399')], [Decimal('85.279570883678701'), Decimal('25.850688220372199')], [Decimal('85.279719711130596'), Decimal('25.850688220372199')]

As you can see, it adds `Decimal` in front of each value, and that makes my code crash.

    Ayla Khan - February 4, 2020

    Can you post a sample of the JSON you’re trying to parse?

Pawan Dubey - February 4, 2020

objects = ijson.items(input_file, 'features.item.properties')

does it load properties from all the items in the features list?

    Ayla Khan - February 4, 2020

    It does, but lazily, since currently all the ijson backends implement ijson.items using a generator.

Shah - February 21, 2020

Hi,

Thanks for putting this together. I am new to Python and I haven’t found any useful detail, for a newbie, on ijson anywhere but here. I am trying to load a large JSON file (800 MB) into a dataframe. The regular method ends up crashing my container.

So I found out about ijson. I tried your code on my file, which looks like the sample below but with more attributes. I was able to print the first record from the JSON, but after that I got an error saying raise common.JSONError("Additional data"). I am guessing it has to do with my file, since there are no [] brackets around the individual records. I have no control over the format of the JSON file. Is there any way I can make it work with your code?

{"ID": "2342", "code":398, "action": "EXECUTED"}
{"ID": "1142", "code":333, "action": "DELETED"}
{"ID": "0042", "code":898, "action": "ADDED"}

Any suggestion would be really appreciated!

Cheers

    Ayla Khan - February 24, 2020

    The ijson stable version definitely needs the file to contain valid JSON. There’s a new release candidate (https://github.com/ICRAR/ijson/tree/v3.0rc1) that supports using your own reader and then parsing using coroutines, but it looks like that needs valid JSON too. Your file would be valid JSON if these objects were in an array ([] brackets around individual comma-separated records). You could try to correct the file using Python, or command-line tools like awk or sed.

      Ayla Khan - February 24, 2020

      It may be reasonable to read lines of your file using generators and use the Python json module instead. It may be slow, but you wouldn’t have to keep file contents in memory. Code is in this gist: https://gist.github.com/a-y-khan/d9eacb817c812fcbb201522ec12047b8.
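
      A minimal sketch of that line-by-line approach (assuming one JSON object per line, with an example records.json filename and the keys from your sample):

      import json


      def read_records(json_filename):
          # yields one parsed record per line, never holding the whole file in memory
          with open(json_filename, 'r') as input_file:
              for line in input_file:
                  line = line.strip()
                  if line:
                      yield json.loads(line)


      for record in read_records('records.json'):
          print(record['ID'], record['action'])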

        Ayla Khan - February 24, 2020

        Another possibility is to use Dask bag (https://docs.dask.org/en/latest/bag-creation.html). It has the ability to parse records from a file like you describe. Assuming your JSON records are in a simple text file and the Pandas dataframe created from those records can fit into memory, this gist shows how to use Dask bag to get a Pandas dataframe: https://gist.github.com/a-y-khan/9d3913552624c62fc2ac882c025882ee. If the Pandas dataframe can’t fit into memory, this thread may be helpful: https://github.com/dask/dask/issues/1651.
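
        For completeness, a rough sketch of the Dask bag route (example records.json filename, and assuming the resulting Pandas dataframe fits in memory):

        import json

        import dask.bag as db

        # read one JSON record per line, parse lazily, then convert to a dataframe
        records = db.read_text('records.json').map(json.loads)
        df = records.to_dataframe().compute()  # compute() returns a Pandas dataframe
        print(df.head())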

          Sayyed Shah - February 29, 2020

          Hi,
          Thanks for your help. I solved it by setting the property "MultipleLines=True". The JSON part is working fine now, with no memory issues. But on the other side, reading from the database is causing a memory problem, since there is one table with 560 million records which I am trying to bring, just two columns, into a dataframe. I’ll try Dask to make this work.

          Ayla Khan - March 5, 2020

          I’m glad you found a solution to the JSON record problem! Dask could be a good way to scale up.

Jeff - January 25, 2023

How do you skip over TypeErrors when there are empty lists?

This is my error: TypeError: can only concatenate str (not "list") to str

    Ayla Khan - January 30, 2023

    It’s hard to say without seeing your data or where the code is failing. I don’t suppose you’ve tried the ijson.dump tool to get the parsing method results? Also, are you using a high- or low-level interface?
