XML & JSON Processing

October 1, 2021

XML processing

Advantages: possibly robust to transfer errors (a damaged file can perhaps be repaired?)

Disadvantages: very redundant and not line-oriented, which makes it hard to extract data without specialised tools

XPath is an expression language for selecting nodes and computing values from XML documents

Tools:

  • xmllint: only XPath 1.0, encoding problems, lacks many features of later XPath versions

  • xmlstarlet: encoding problems

  • xidel: tool of choice, supports XPath 3.0 and modern character encodings

Give examples using xidel with XPath 3.0.
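
A minimal sketch of such examples follows. The file books.xml, its contents, and the function names are invented for illustration; the sketch assumes xidel is installed and that its -s (silent), --input-format and -e (extract) options behave as described in the xidel documentation.

```shell
# Create a small sample document (invented data, for illustration only).
cat > books.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book year="2001"><title>Alpha</title></book>
  <book year="2015"><title>Beta</title></book>
</library>
EOF

# XPath 1.0-style path: the text of every <title> element.
book_titles() {
  xidel -s --input-format=xml "$1" -e '//book/title'
}

# XPath 3.0: "!" (simple map) evaluates the right-hand side once per book,
# and "||" concatenates strings.
book_labels() {
  xidel -s --input-format=xml "$1" -e '//book ! (title || " (" || @year || ")")'
}

# XPath 2.0+ function: join all titles into one comma-separated string.
book_title_list() {
  xidel -s --input-format=xml "$1" -e 'string-join(//book/title, ", ")'
}
```

With these definitions loaded, book_labels books.xml should print one line per book combining its title and year. The "!" and "||" operators are not available in XPath 1.0, so the second query is not expected to work with xmllint.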


JSON processing

JSON is a lightweight data-interchange format that is commonly used to return data from API calls.

The typical structure of a Tweet returned by the Twitter API is shown below; it is documented at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.

{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {   
  },
  "entities": {
    "hashtags": [      
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [     
    ]
  }
}

jq is a stream editor for JSON data (CITE). It uses filters to extract and transform values from JSON input. We extract the raw tweet text by running the following jq command in the terminal.

# tweet.json is a placeholder name for the file holding the API response
jq -r '.text | gsub("[\\n\\t]"; "")' tweet.json > text
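
The same filter style extends to nested objects and arrays. The sketch below is illustrative only: tweet.json is a placeholder file holding a trimmed-down copy of the Tweet above, and the function names are hypothetical. It assumes jq 1.5 or later, built with regular-expression support for gsub.

```shell
# Write a trimmed-down tweet to a sample file (tweet.json is a placeholder name).
cat > tweet.json <<'EOF'
{
  "id_str": "850006245121695744",
  "text": "1/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps://t.co/XweGngmxlP",
  "user": { "screen_name": "TwitterDev" },
  "entities": {
    "hashtags": [],
    "urls": [ { "url": "https://t.co/XweGngmxlP" } ]
  }
}
EOF

# Nested field: follow the object path user -> screen_name.
tweet_author() {
  jq -r '.user.screen_name' "$1"
}

# Array iteration: "[]" emits each element, so this prints one URL per line.
tweet_urls() {
  jq -r '.entities.urls[].url' "$1"
}

# Several fields as one tab-separated row. Newlines and tabs in the text are
# replaced by a space here (not removed) so that adjacent words stay separated.
tweet_row() {
  jq -r '[.id_str, .user.screen_name, (.text | gsub("[\\n\\t]"; " "))] | @tsv' "$1"
}
```

Replacing newlines with a space rather than the empty string is a design choice of this sketch: it keeps each tweet on a single line without fusing the last word of one line with the first word of the next.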