How I parsed huge JSON files into a SQLite database in under a second using the power of the Crystal language

Recently I worked on a project with my friend David Colombo. He needed to take huge JSON files, map them onto an object, and then copy the data from that object into a SQLite database. This post describes the technical challenge of taking large amounts of JSON data and inserting it into a SQLite database without long wait times.

Background

Previously, David had created a parser in NodeJS that read through the JSON files containing structured data and inserted the data into SQLite. These JSON files contained thousands of keys and were hundreds of megabytes each. Unfortunately, due to limitations of the language and some anti-patterns that JavaScript allows, the script took 14 hours to complete its task. He then attempted to write a parser in Python, but it was limited to 9999 keys and couldn't meet the project's expectations (in retrospect I'm sure there's a library or a way to get around that limit; I'm not a Python developer and can't comment much on it). He sent me a message asking if I was still working with Ruby and whether it would be any faster. I explained that while I could write a script in Ruby just fine, I couldn't predict Ruby's performance ahead of time, and that I had been experimenting with a language called Crystal (a language with Ruby-like syntax and C-like performance). I asked if he'd be open to trying Crystal instead of Ruby, and since I'd been learning it anyway, I decided to give the rewrite a shot.

A few requirements

We wanted the code to be long-lived and to require minimal changes (preferably none) to the project's dependencies. We also wanted to avoid third-party libraries (since we'd risk having to replace them if their maintainers abandoned them), so sticking to Crystal's standard library was a goal. The standard library doesn't include database drivers, however, so we decided to use the official shard for SQLite; it's maintained by the Crystal core team, so it will probably be maintained well enough.
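For reference, pulling in the official shard is a small addition to the project's shard.yml (this snippet is illustrative; adjust it to your project's layout), followed by a `shards install`:

```yaml
# shard.yml
dependencies:
  sqlite3:
    github: crystal-lang/crystal-sqlite3
```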

An object to represent the JSON Files

To maximize performance I decided to use nested structs to contain the data, using Crystal's JSON.mapping() (https://crystal-lang.org/api/0.28.0/JSON.html#mapping). The data wouldn't be changed, just copied into SQLite, and since structs are allocated on the stack rather than the garbage-collected heap, the drawbacks of structs were worth it in exchange for the performance benefits.
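Before the full mapping, here is a minimal, self-contained sketch of how JSON.mapping() works in Crystal 0.28 (the Person struct and its fields are invented for illustration): each entry declares a property name, its type, whether it may be missing (nilable), and optionally the JSON key it maps from.

```crystal
require "json"

struct Person
  JSON.mapping(
    "name": {key: "Name", type: String, nilable: true},
    "age": {type: Int32, nilable: true}
  )
end

# The "Name" key in the JSON maps onto the name property.
person = Person.from_json(%({"Name": "Ada", "age": 36}))
puts person.name # Ada
puts person.age  # 36
```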

module MyProgram
  extend self

  struct CVE_Data_Entity
    struct CVE_Items
      struct CVE
        struct DataMeta
          JSON.mapping(
            "id": {key: "ID", type: String, nilable: true},
            "assigner": {key: "ASSIGNER", type: String, nilable: true}
          )
        end

        struct Affects
          struct Vendor
            struct VendorData
              struct Product
                struct Data
                  struct Version
                    struct Data
                      JSON.mapping(
                        "version_value": {type: String, nilable: true},
                        "version_affected": {type: String, nilable: true},
                      )
                    end

                    JSON.mapping(
                      "version_data": {type: Array(Data), nilable: true},
                    )
                  end

                  JSON.mapping(
                    "product_name": {type: String, nilable: true},
                    "version": {type: Version, nilable: true}
                  )
                end

                JSON.mapping(
                  "product_data": {type: Array(Data), nilable: true},
                )
              end

              JSON.mapping(
                "vendor_name": {type: String, nilable: true},
                "product": {type: Product, nilable: true},
              )
            end

            JSON.mapping(
              "vendor_data": {type: Array(VendorData), nilable: true},
            )
          end

          JSON.mapping(
            "vendor": {type: Vendor, nilable: true},
          )
        end

        struct Problemtype
          struct Data
            struct Description
              JSON.mapping(
                "lang": {type: String, nilable: true},
                "value": {type: String, nilable: true}
              )
            end

            JSON.mapping(
              "description": {type: Array(Description), nilable: true}
            )
          end

          JSON.mapping(
            "problemtype_data": {type: Array(Data), nilable: true},
          )
        end

        struct References
          struct Data
            JSON.mapping(
              "url": {type: String, nilable: true},
              "name": {type: String, nilable: true},
              "refsource": {type: String, nilable: true},
              "tags": {type: Array(String), nilable: true},
            )
          end

          JSON.mapping(
            "reference_data": {type: Array(Data), nilable: true},
          )
        end

        struct Description
          struct Data
            JSON.mapping(
              "lang": {type: String, nilable: true},
              "value": {type: String, nilable: true},
            )
          end

          JSON.mapping(
            "description_data": {type: Array(Data), nilable: true}
          )
        end

        JSON.mapping(
          "data_type": {type: String, nilable: true},
          "data_format": {type: String, nilable: true},
          "data_version": {type: String, nilable: true},
          "cve_data_meta": {key: "CVE_data_meta", type: DataMeta, nilable: true},
          "affects": {type: Affects, nilable: true},
          "problemtype": {type: Problemtype, nilable: true},
          "references": {type: References, nilable: true},
          "description": {type: Description, nilable: true},
        )
      end

      struct Configurations
        struct Nodes
          struct CPE
            JSON.mapping(
              "vulnerable": {type: Bool, nilable: true},
              "cpe23Uri": {type: String, nilable: true},
            )
          end

          JSON.mapping(
            "operator": {type: String, nilable: true},
            "cpe_match": {type: Array(CPE), nilable: true},
          )
        end

        JSON.mapping(
          "cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
          "nodes": {type: Array(Nodes), nilable: true},
        )
      end

      struct Impact
        struct BaseMetricV3
          struct CvssV3
            JSON.mapping(
              "version": {type: String, nilable: true},
              "vectorString": {type: String, nilable: true},
              "attackVector": {type: String, nilable: true},
              "attackComplexity": {type: String, nilable: true},
              "privilegesRequired": {type: String, nilable: true},
              "userInteraction": {type: String, nilable: true},
              "scope": {type: String, nilable: true},
              "confidentialityImpact": {type: String, nilable: true},
              "integrityImpact": {type: String, nilable: true},
              "availabilityImpact": {type: String, nilable: true},
              "baseScore": {type: Float64, nilable: true},
              "baseSeverity": {type: String, nilable: true},
            )
          end

          JSON.mapping(
            "cvssV3": {type: CvssV3, nilable: true},
            "exploitabilityScore": {type: Float64, nilable: true},
            "impactScore": {type: Float64, nilable: true},
          )
        end

        struct BaseMetricV2
          struct CvssV2
            JSON.mapping(
              "version": {type: String, nilable: true},
              "vectorString": {type: String, nilable: true},
              "accessVector": {type: String, nilable: true},
              "accessComplexity": {type: String, nilable: true},
              "authentication": {type: String, nilable: true},
              "confidentialityImpact": {type: String, nilable: true},
              "integrityImpact": {type: String, nilable: true},
              "availabilityImpact": {type: String, nilable: true},
              "baseScore": {type: Float64, nilable: true},
            )
          end

          JSON.mapping(
            "cvssV2": {type: CvssV2, nilable: true},
            "severity": {type: String, nilable: true},
            "exploitabilityScore": {type: Float64, nilable: true},
            "impactScore": {type: Float64, nilable: true},
            "acInsufInfo": {type: Bool, nilable: true},
            "obtainAllPrivilege": {type: Bool, nilable: true},
            "obtainUserPrivilege": {type: Bool, nilable: true},
            "obtainOtherPrivilege": {type: Bool, nilable: true},
            "userInteractionRequired": {type: Bool, nilable: true},
          )
        end

        JSON.mapping(
          "baseMetricV2": {type: BaseMetricV2, nilable: true},
          "baseMetricV3": {type: BaseMetricV3, nilable: true},
        )
      end

      JSON.mapping(
        "cve": {type: CVE, nilable: true},
        "configurations": {type: Configurations, nilable: true},
        "impact": {type: Impact, nilable: true},
        "publishedDate": {type: String, nilable: true},
        "lastModifiedDate": {type: String, nilable: true},
      )
    end

    JSON.mapping(
      "cve_data_type": {key: "CVE_data_type", type: String, nilable: true},
      "cve_data_format": {key: "CVE_data_format", type: String, nilable: true},
      "cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
      "cve_data_numberofcves": {key: "CVE_data_numberOfCVEs", type: String, nilable: true},
      "cve_data_timestamp": {key: "CVE_data_timestamp", type: String, nilable: true},
      "cve_items": {key: "CVE_Items", type: Array(CVE_Items), nilable: true},
    )
  end
end

Improving the JSON parsing time

In Crystal's development mode, parsing one of the JSON files (2018.json, containing around 200 MB of data) into the object took 30 seconds; in release mode it took 5 seconds. That performance was pretty good already, but I wanted it faster, since the datasets would only get larger over time. The first thing I tried was changing the class to a struct (which is what's used now); that had minimal impact on performance. Next I changed from the File.open() method to the File.read() method, which improved the file read speed and brought the parse time down to under a second. The most likely explanation is that File.read loads the whole file into a String that the JSON parser can scan in memory, whereas parsing from an open file streams bytes from the IO as it goes. There are probably more file optimizations we could try, although that's a topic of its own.
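A stripped-down sketch of the two approaches (the Item struct, file path, and file contents are invented for illustration; the real project parses the full CVE mapping above):

```crystal
require "json"

struct Item
  JSON.mapping("name": {type: String, nilable: true})
end

path = "items_example.json"
File.write(path, %([{"name": "a"}, {"name": "b"}]))

# Streaming: the parser pulls bytes from the IO as it scans the JSON.
from_io = File.open(path) { |file| Array(Item).from_json(file) }

# In-memory: read the whole file into a String first, then parse it.
from_string = Array(Item).from_json(File.read(path))

puts from_io.size     # 2
puts from_string.size # 2
File.delete(path)
```

Both produce the same result; on the multi-hundred-megabyte files, the in-memory variant was the one that got parsing under a second.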

Inserting the data into SQLite

Gathering the data into an object was only half the challenge; next I needed an efficient way to bulk-insert it into SQLite. At first I tried iterating over the various arrays and issuing a lot of individual queries. On my Mac that still took around five minutes, and much longer (it never actually completed) on a Linux laptop. I then learned I could group these queries into one bulk transaction. I came up with the following code, which ran in under a second.

require "json"
require "sqlite3"
require "./cve_data_entity.cr"

module MyProgram
  VERSION = "0.1.0"

  filepath = "./src/example-json-files/example-full-2019-dataset.json"
  myobject = CVE_Data_Entity.from_json(File.read(filepath))

  DB.open "sqlite3://./src/example-json-files/dbname.sqlt" do |db|
    db.transaction do |tx|
      # db.transaction issues BEGIN for us and commits at the end of the block
      # (or rolls back on an exception), so no explicit begin call is needed.
      myobject.try(&.cve_items).try(&.each do |item|
        # Insert General Information into the Database
        cve_item_id = item.try(&.cve).try(&.cve_data_meta).try(&.id) || "Not available"
        data_type = item.try(&.cve).try(&.data_type) || "Not available"
        data_format = item.try(&.cve).try(&.data_format) || "Not available"
        data_version = item.try(&.cve).try(&.data_version) || "Not available"
        published_date = item.try(&.publishedDate) || "Not available"
        last_modified_date = item.try(&.lastModifiedDate) || "Not available"
        tx.connection.exec("INSERT INTO GENERAL_INFORMATION (\"ID\", \"DATA_TYPE\", \"DATA_FORMAT\", \"DATA_VERSION\", \"PUBLISHDATE\", \"LASTMODIFIEDDATE\") VALUES (?, ?, ?, ?, ?, ?)", [cve_item_id, data_type, data_format, data_version, published_date, last_modified_date])

      end)
      tx.commit
    end
  end
end

Admittedly, the .try() method calls can get a bit messy, and we're looking into cleaner ways to write this kind of code. One recommendation was to port the JSON .dig() method to my struct; I might attempt that in the future. If the try logic is difficult to read right now, read about capturing blocks and procs in the Crystal documentation first and it will make more sense. This pattern doesn't scale well to larger amounts of code, but other than the readability issues, there was no huge performance impact.
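As a tiny illustration of the pattern (the Meta and Cve types here are invented for this example), each .try returns nil as soon as any link in the chain is nil, which is what lets missing JSON data fall through to the || default:

```crystal
# Hypothetical two-level nesting, mirroring
# item.try(&.cve).try(&.cve_data_meta).try(&.id) from the real code.
struct Meta
  getter id : String?

  def initialize(@id : String?)
  end
end

struct Cve
  getter meta : Meta?

  def initialize(@meta : Meta?)
  end
end

with_id    = Cve.new(Meta.new("CVE-2019-0001"))
without_id = Cve.new(nil)

puts with_id.try(&.meta).try(&.id) || "Not available"    # CVE-2019-0001
puts without_id.try(&.meta).try(&.id) || "Not available" # Not available
```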

In conclusion

By writing our parser and database insertion logic in Crystal we were forced to use better coding practices, we learned about SQLite transactions, and we cut our database's build time from about 14 hours to under a second. If you have a similar challenge in your organization, consider trying to solve it with Crystal.
