Advantages: robust to transfer errors? (can be repaired?)
Disadvantages: very verbose and redundant; not line-oriented, which makes it hard to extract data without specialised tools
XPath is a query and expression language for selecting nodes and computing values in XML documents
Tools:
xmllint: supports only XPath 1.0, has encoding problems, and lacks many features of later XPath versions
xmlstarlet: encoding problems
xidel: tool of choice, supports XPath 3.0 and modern character encodings
Give examples using xidel with XPath 3.0.
JSON is a text-based data interchange format that is commonly used to return data from API calls.
The typical structure of a Tweet returned by the Twitter API is documented at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json and looks as follows:
{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {
  },
  "entities": {
    "hashtags": [
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [
    ]
  }
}
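jq is a command-line processor for JSON, comparable to a stream editor such as sed for plain text (CITE). It applies filters to extract and transform data from JSON input. The following command, run in the terminal, reads Tweet objects from standard input, takes the text field of each, removes newlines and tabs, and writes the raw text to the file text.
jq -r '.text | gsub("[\\n\\t]"; "")' > text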
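To illustrate how the filter behaves, the sketch below runs it on a cut-down Tweet object. The file name sample.json is hypothetical, and the object keeps only a few of the fields from the structure shown earlier. Two further filters on the same input are included as assumptions about what one might typically extract: one follows nested keys to the screen name, the other iterates over the urls array.

```shell
#!/bin/sh
# Write a reduced Tweet object to a scratch file (hypothetical file name).
cat > sample.json <<'EOF'
{
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": { "screen_name": "TwitterDev" },
  "entities": { "urls": [ { "url": "https:\/\/t.co\/XweGngmxlP" } ] }
}
EOF

# Raw text with newlines and tabs removed (the same filter as above).
text=$(jq -r '.text | gsub("[\\n\\t]"; "")' sample.json)

# Follow nested keys: the author's screen name.
screen_name=$(jq -r '.user.screen_name' sample.json)

# Iterate over an array: one shortened URL per output line.
urls=$(jq -r '.entities.urls[].url' sample.json)

printf '%s\n' "$text" "$screen_name" "$urls"
```

Because the replacement string is empty, the URL is joined directly to the word before it; using " " as the replacement would keep the two separated.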