NLP Tools using simple word counting techniques

October 1, 2021

AWK

awk is a programming language that was explicitly designed for text processing (CITE). It is old (CITE) and very fast (CITE comparison to C?).

It does not need to be compiled, arrays can be instantiated on the fly, and its syntax is very similar to bash.

‘awk -f’ allows running awk programs that are stored in files.
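
As a minimal illustration (the file names count.awk and words.txt are made up, and words.txt is assumed to contain one word per line), the following program counts word frequencies with an array that is created on the fly:

#count.awk: count how often each word occurs
{ count[$1]++ }
END { for (w in count) print w, count[w] }

It can then be run without any compilation step:

awk -f count.awk words.txt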

Tokenizer

To split a text into discrete “words” (more formally, we call these tokens), we employ a tokenizer. FILE should contain one text per line; we therefore made sure with jq above that no line contains a “\n” (newline) character. We usually store our data as “tab-separated values” (TSV), so to avoid problems along the way we also remove the “\t” (tab) character. We use regular expressions (“Regular expression,” 2020) to remove hyperlinks, user mentions (on Twitter), punctuation, quotes and additional special characters.

awk -v ORS='\t' -F' ' '{
  gsub(/^"|"$|""|\.|,|!|\?|:|;|‘|"|'\''|&amp|“|”|-|➡️|→|↓|http[a-zA-Z0-9:\/.]+|#[a-zA-Z0-9]+|@[a-zA-Z0-9:_'\''’]+|…/, "")
  gsub(/\s{2,}/, " ")
  for (i = 1; i <= NF; i++) print tolower($i)
  printf "\n"
}' FILE | sed 's/\t$//' > FILE_CONTAINING_TOKENIZED_TEXT

Exact Matching

Stemming

We stem text using the “Porter stemmer” (Porter, 1980/2006).

An awk implementation can be downloaded from https://tartarus.org/martin/PorterStemmer/ as awk.txt and used in the following way.

awk -F"\t" -f awk.txt FILE_CONTAINING_TOKENIZED_TEXT >
  FILE_CONTAINING_STEMMED_TEXT
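
To get an idea of what stemming does, these are typical reductions performed by the Porter stemmer (examples from Porter, 1980/2006):

caresses   -> caress
ponies     -> poni
relational -> relate
connected, connecting, connection -> connect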

Dictionary Matching Script

This script performs exact matching of one of the dictionary files of Section @ref(sentlex) against a file that contains one tokenized text per line. Its output is the average valence, arousal and dominance value per token for each line.

Copy this into a file called “dictmatching.awk”.

#!/usr/bin/gawk -f

#the field separator (FS) reads the input file tab-delimited
#the output field separator (OFS) writes it tab-delimited
BEGIN {
    FS = "\t"
    OFS = "\t"
}

{
    #NR==FNR is an awk trick to make sure that
    #we stay in the first file here, our lexicon file
    if (NR == FNR) {
        valence[$2] = $3 + 0
        arousal[$2] = $4 + 0
        dominance[$2] = $5 + 0
        words[$2]
        #we want to stop processing this line here if we are in the first file:
        #this builds the lexicon and goes to the next line
        next
    }
    #we reset the sums to 0
    sum_hits["valence"] = 0
    sum_hits["arousal"] = 0
    sum_hits["dominance"] = 0
    #NF refers to the Number of Fields
    #if the line of text is empty:
    if (NF == 0) {
        print "NA", "NA", "NA"
        next
    }
    #we loop over each field (one field = one token)
    for (i = 1; i <= NF; i++) {
        #if there is an exact match, we add the word's valence,
        #arousal and dominance values to the running sums
        #in case we stemmed the text before (we normally do so),
        #make sure that the lexicon is stemmed too and that it doesn't
        #contain duplicates; if it does, combine the duplicates and
        #take their mean value, for example
        if ($i in words) {
            sum_hits["valence"] = sum_hits["valence"] + valence[$i]
            sum_hits["arousal"] = sum_hits["arousal"] + arousal[$i]
            sum_hits["dominance"] = sum_hits["dominance"] + dominance[$i]
        }
    }
    #we print the results
    print sum_hits["valence"] / NF, sum_hits["arousal"] / NF, \
    sum_hits["dominance"] / NF
}

And run it.

awk -f dictmatching.awk STEMMED_LEXICON_FILE FILE_CONTAINING_STEMMED_TEXT >
  FILE_SENTIMENT_ANALYSED
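
The script expects STEMMED_LEXICON_FILE to be tab-separated, with the (stemmed) word in the second column and its valence, arousal and dominance values in columns three to five, roughly like this (the words and numbers below are made up):

1	happi	8.21	6.05	6.63
2	terribl	1.93	4.60	3.25

The output file then contains three tab-separated columns per line of text: the mean valence, arousal and dominance over all tokens of that line.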

Wildcard Matching

Some dictionaries (such as LIWC) only need tokenization as a preprocessing step and no stemming. Instead, they rely on wildcard matching: an entry such as “happi*” matches every token that starts with “happi”. We can take advantage of awk's built-in string functions to perform this kind of prefix matching:

#!/usr/bin/gawk -f

# how the maximum entry length of our dictionary was determined:
# awk -F"\t" '{print length($1)}' liwc_german_2007_prosocial | \
#   sort -n | uniq
# for this german liwc subset the maximum length is 25

#we set the output record separator, needed when printing an array
BEGIN {
    FS = "\t"
    ORS = "\t"
    #gawk-specific: fix a common iteration order for all for-in loops,
    #so the header and the per-line counts come out in the same column order
    PROCINFO["sorted_in"] = "@ind_num_asc"
    maxchar = 0
    minchar = 100
}

{
    #NR==FNR is the same trick as before
    if (NR == FNR) {
        #if it contains a wildcard at the end
        #we do special substring matching later
        if ($1 ~ /\*$/) {
            #we remove the wildcard at the end
            #of the liwc word with substr(ing)
            pattern[tolower(substr($1, 1, length($1) - 1))][$2]
            #note: we store the entry length without the trailing wildcard
            if (length($1) - 1 > maxchar) {
                maxchar = length($1) - 1
            }
            if (length($1) - 1 < minchar) {
                minchar = length($1) - 1
            }
        } else {
            pattern_exact[tolower($1)][$2]
        }
        categories[$2 + 0]
        #for the first file, we just build our lookup table
        #and forget about the stuff below
        next
    }
    if (FNR == 1) {
        for (e in categories) {
            print e
        }
        printf "number_tokens\n"
    }
    for (e in categories) {
        sum_hits[e] = 0
    }
    #we loop over each field (one field = one token)
    for (i = 1; i <= NF; i++) {
        #if there is an exact match,
        #we count 1 for each category a word belongs to
        #and go to the next increment in the loop
        if ($i in pattern_exact) {
            for (l in pattern_exact[$i]) {
                sum_hits[l]++
            }
            continue
        }
        #if there is no exact match,
        #we shorten the token by one character per iteration
        #and try to match it against the wildcard entries;
        #if there is a match we do the usual counting
        #and break this character-reducing loop,
        #which means we find the longest matching prefix
        #in the first iteration the token is cut down
        #to the length of the longest entry in our wildcard dictionary,
        #and we stop once we have reached the length of the
        #shortest entry in the wildcard dictionary
        for (s = maxchar; s >= minchar; s--) {
            if (substr($i, 1, s) in pattern) {
                for (l in pattern[substr($i, 1, s)]) {
                    sum_hits[l]++
                }
                break
            }
        }
    }
    #we print the results
    for (e in sum_hits) {
        print sum_hits[e]
    }
    printf "%d\n", NF
}
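
Assuming the script above is saved as, say, “wildcardmatching.awk” (the file name is our choice) and the LIWC-style dictionary is tab-separated with the (possibly wildcarded) word in the first column and a category number in the second, it is run analogously to the exact-matching script:

awk -f wildcardmatching.awk LIWC_DICTIONARY_FILE FILE_CONTAINING_TOKENIZED_TEXT >
  FILE_LIWC_ANALYSED

The first line of the output is a header with one column per category plus “number_tokens”; every following line contains the corresponding hit counts and the number of tokens for one text.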

References

Porter, M. (2006). An algorithm for suffix stripping. Program, 40(3), 211–218. doi: 10.1108/00330330610681286 (Original work published 1980)

Regular expression. (2020, January 28). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Regular_expression&oldid=938008315