Useful Commands

October 1, 2021

In the course of my PhD studies, many little tasks came up that can be solved efficiently with one-liners on the command line. In the following, I want to give a (by no means exhaustive) list of those little helpers, in the hope that they may be as useful to others as they have been to me.

Bash shebang line
* #!/bin/bash

Make a script executable
* chmod u+x <FILE>

Count lines and words
* lines: wc -l <FILE>
* words: wc -w <FILE>
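
As a quick illustration (the file path here is just for the demo; reading via `<` keeps the filename out of wc's output):

```shell
# Create a small sample file and count its lines and words.
printf 'one two\nthree\n' > /tmp/wc_demo.txt
lines=$(wc -l < /tmp/wc_demo.txt)
words=$(wc -w < /tmp/wc_demo.txt)
echo "$lines lines, $words words"
```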

Compare two files (output lines in the second that aren't contained in the first)
* comm -13 <(sort <FILE1>) <(sort <FILE2>)

Appending to file
* >> <FILE>

Overwriting file
* > <FILE>

Get attribute from list of jsons
* jq ".ATTR" <FILE>

Remove double quotes
* tr -d '"'
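
This is handy after jq, which prints strings quoted by default; a small demo (variable names are just illustrative):

```shell
# Strip the double quotes from a jq-style quoted string.
quoted='"alice"'
clean=$(printf '%s' "$quoted" | tr -d '"')
echo "$clean"
```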

Twitter error codes
* 50: User not found. Corresponds to HTTP 404.
* 63: User has been suspended. Corresponds to HTTP 403; the account's information cannot be retrieved.

Get Processes
* ps -eaf | grep <USERNAME>

Subset tsv
* awk -F $'\t' '$2 > 0.5 && $2 !~ /^Error:/' <TSVFILE>
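
A minimal sketch with a throwaway TSV (the `Error:` row is dropped by the second condition, the `0.1` row by the first):

```shell
# Two-column TSV: keep rows whose 2nd field exceeds 0.5 and is not an error message.
printf 'a\t0.9\nb\t0.1\nc\tError: x\n' > /tmp/subset_demo.tsv
kept=$(awk -F $'\t' '$2 > 0.5 && $2 !~ /^Error:/' /tmp/subset_demo.tsv)
echo "$kept"
```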

Get file size
* ls -sh

Print from a given line number to the end of the file
* sed -n '<linenumber>,$p' <FILE>

Output only unique values
* awk '!_[$0]++' <FILE>
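
Unlike `sort -u`, this keeps the first occurrence of each line in its original position; a quick demo:

```shell
# Duplicates are dropped while the original order is preserved.
printf 'x\ny\nx\nz\ny\n' > /tmp/dedup_demo.txt
uniq_lines=$(awk '!_[$0]++' /tmp/dedup_demo.txt)
echo "$uniq_lines"
```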

Json values to tsv
* jq --raw-output '"\(.<first>)\t\(.<second>)"'

Case-insensitive grep:
* grep -i

Remove users with errors
* awk '!/^Error/' <FILE> > <FILECHECKED>

Write to table
* jq --raw-output '"\(.user.id_str)\t\(.scores.universal)"' <FILE> > <TABLEFILE>

Filter error messages and display user ids that have to be tried again
* awk '/^Error/' <FILE> | awk '!_[$0]++' | grep -v '^Error: Not\|Error: user' | awk '{print $NF}'

Get current path
* pwd

Create textfile
* cat > <FILE>

Search directory of text files for pattern
* grep --exclude=*.* -rnw . -e "Error"

Count number of files in directory
* find . -type f | wc -l

Sort directory chronologically
* ls -t (most recent on top)
* ls -tr (most recent at bottom)

Split file in x number of lines per file
* split -l <x> -a 1 <FILE> <prefix>
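
For example, splitting a 6-line file into 2-line chunks; with GNU split, `-a 1` gives single-letter suffixes (`part_a`, `part_b`, ...; paths here are just for the demo):

```shell
# Split a 6-line file into three 2-line chunks with one-character suffixes.
mkdir -p /tmp/split_demo
printf '1\n2\n3\n4\n5\n6\n' > /tmp/split_demo/input.txt
split -l 2 -a 1 /tmp/split_demo/input.txt /tmp/split_demo/part_
n_chunks=$(ls /tmp/split_demo/part_* | wc -l)
echo "$n_chunks"
```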

Search for pattern in file names in directory
* ls -dq *pattern*

Remove folder
* rm -rf <FOLDER>

Show line numbers of occurrences
* grep -n <pattern> <FILE>

Split in x groups
* split -n l/x <FILE> -a 1 <prefix>

Set delimiter in awk
* awk -F'<delimiter>'

Find out about encoding and line endings of text file
* file <FILE>

Grep all mentions
* grep -o "\w*@\w*"

Remove mentions (@+<username>) from tsv of tweets
* gawk -F$'\t' '{gsub(/@[[:alnum:]_][[:alnum:]_]*/,""); sub(/^[ ]+/, "", $<COLUMN>); print}' <FILE>

Do the same as above, but build up tsv correctly again
* gawk -F$'\t' '{gsub(/@[[:alnum:]_][[:alnum:]_]*/,""); sub(/^[ \t]+/,"",$3);print $1"\t"$2"\t"$3}' <FILE> > <FILE_CLEANED>

Remove last line:
* head -n -1 <FILE>

If there is something not working with text files that you don't understand, it is very often an encoding issue!

Show dates connected to a file:
* stat -c '%y' <FILE>

Put every line of a file in quotes:
* awk '{print"\"" $0"\""}' <FILE>

Print lines that do not consist solely of digits:
* grep -Ev '^[0-9]+$' <FILE>

Remove quotation marks from beginning and end of lines:
* sed -e 's/^"//' -e 's/"$//' <FILE>
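
Only the outermost quotes at the very beginning and end of each line are removed; a small demo:

```shell
# Leading and trailing double quotes are stripped per line.
printf '"hello"\n"world"\n' > /tmp/quote_demo.txt
stripped=$(sed -e 's/^"//' -e 's/"$//' /tmp/quote_demo.txt)
echo "$stripped"
```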

Remove first line:
* tail -n +2 <FILE>
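
This is the classic way to drop a header row; a quick check:

```shell
# Print everything from line 2 onwards, i.e. drop the header.
printf 'header\nrow1\nrow2\n' > /tmp/tail_demo.txt
body=$(tail -n +2 /tmp/tail_demo.txt)
echo "$body"
```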

Select a column by name in awk:
* awk -F',' -vcol=\"posemo\" '(NR==1){colnum=-1;for(i=1;i<=NF;i++)if($(i)==col)colnum=i;}{print $(colnum)}' <CSVFILE>

Select several columns by name in awk:
* awk -F',' -vcols=\"posemo\",\"negemo\" '(NR==1){n=split(cols,cs,",");for(c=1;c<=n;c++){for(i=1;i<=NF;i++)if($(i)==cs[c])ci[c]=i}}{for(i=1;i<=n;i++)printf "%s" FS,$(ci[i]);printf "\n"}' <CSVFILE>

Select a column by name in awk (alternative):
* awk -F',' -v col="<NAME>" 'NR==1{for(i=1;i<=NF;i++){if($i~col){c=i;break}} print $c} NR>1{print $c}' <FILE> > <OUTPUTFILE>
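
A demo of that last variant on a toy CSV (file and column names are just for illustration; the header row is printed along with the values):

```shell
# Find the column whose header matches "score" and print it.
printf 'id,score\n1,0.5\n2,0.8\n' > /tmp/col_demo.csv
scores=$(awk -F',' -v col="score" 'NR==1{for(i=1;i<=NF;i++){if($i~col){c=i;break}} print $c} NR>1{print $c}' /tmp/col_demo.csv)
echo "$scores"
```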

Grep everything that is not a number:
* grep -E '[^0-9]'

Extract csvcolumn by name:
* csvtool namedcol <COLNAME> <FILE>

Get start time of processes:
* ps -eo pid,lstart,cmd

Put out clean csv with jq (example):
* python3 <FILE> | jq -r '"\(.id),\"\(.text)\""|gsub("\n";"\\n")' > <OUTFILE>

Grep numbers at the beginning of a line (e.g. message IDs):
* grep -o "^[0-9]\+"

Grep certain files in directory and concatenate them, then extract field with jq:
* ls | grep <PATTERN> | xargs cat | jq <FIELD>

Pause certain number (10) of jobs:
* for i in {1..10}; do kill -STOP %$i; done

Restart (to background) certain number (10) of jobs:
* for i in {1..10}; do bg %$i; done

Print header name with corresponding column number:
* head -1 <FILE> | awk -F"," '{for(i=1;i<=NF;i++)print i,$i}' | grep <COLNAME>
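
For instance, to find which column of a CSV is called `score` (demo paths and names are hypothetical; the extra awk at the end keeps only the index):

```shell
# Number each header field, grep for the wanted name, keep the index.
printf 'id,name,score\n1,a,0.3\n' > /tmp/hdr_demo.csv
col_idx=$(head -1 /tmp/hdr_demo.csv | awk -F"," '{for(i=1;i<=NF;i++)print i,$i}' | grep score | awk '{print $1}')
echo "$col_idx"
```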

Print header fields quoted and comma-separated (the leading comma has to be removed):
* head -1 <FILE> | awk -F"," '{for(i=1;i<=NF;i++)printf ",""\""$i"\""}'

Rotate PDF by 180 degrees:
* pdftk <INFILE> cat 1-endsouth output <OUTFILE>

Count blank lines:
* grep -cvP '\S' <FILE>

Get a dictionary style json to csv:
* jq -r 'to_entries[] | "\(.key),\(.value)"' <FILE>

Count total number of characters in a file (tab separators excluded):
* awk -F"\t" '{i+=((length - NF)+1)} END {print i}' <FILE>
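
The arithmetic: `length` is the full line length and a line with NF fields has NF-1 tabs, so `length - NF + 1` counts the non-tab characters. A small check (2 + 2 characters on the first line, 3 on the second):

```shell
# Sum per-line character counts, excluding the tab separators.
printf 'ab\tcd\nxyz\n' > /tmp/chars_demo.tsv
n_chars=$(awk -F"\t" '{i+=((length - NF)+1)} END {print i}' /tmp/chars_demo.tsv)
echo "$n_chars"
```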

Tokenize (on "., ", you lose those separators), remove quotes at the beginning and end of a line and double quotes within the line, and output tab-separated:
* awk -vORS='\t' -F'\\, |\\. | ' '{gsub(/^"|"$|""/,"");split($0,m); for(i in m) print $i;printf "\n"}' <FILE> > <OUTFILE>

Get permission to write on external drive:
* sudo chown user:user <PATH>
* sudo mount /dev/disk/by-label/<NAME> <PATH>

Get tweet text out of line delimited json, removing tabulators and new lines in the tweet:
* jq -r '.text| gsub("[\\n\\t]";"")' <FILE> > <TEXT>

Get regex pattern for code of LIWC category and remove last ORS with sed
* awk -F"\t" -vORS="|" '{gsub("*",".*");split($0,m); for(i in m) if(m[i]==<LIWC CATEGORY CODE>) print $1}END {ORS=""}' <FILE> | sed '$ s/.$//'

Backup to external drive
* wget -O /var/tmp/ignorelist
* rsync -aP --exclude-from=/var/tmp/ignorelist <PATH> <BACKUPPATH>

From word list, print out word and word length, sort by length
* awk -F"\t" '{gsub("*","");a[$1]=length($1)} END{for(l in a) print l,a[l]}' <FILE> | sort -nk2

Remove non-printable characters (but keep tabulator)
* tr -dc '[:print:]\t' <FILE>

Loop over content of file and get autocomplete suggestions (copy curl from google page request):
* while read p; do curl ''"$p"'&callback=callback' -o words/"$p"; done < <LIST>

Get something between something (example XML tags)
* grep -oP "(?<=<TAG>).*(?=</TAG>)" <FILE>
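
`-P` requires GNU grep built with PCRE support; the lookarounds match the tags without including them in the output. A sketch with a hypothetical tag name:

```shell
# Extract the text between <name> and </name> without the tags themselves.
printf '<name>alice</name>\n' > /tmp/tag_demo.xml
value=$(grep -oP "(?<=<name>).*(?=</name>)" /tmp/tag_demo.xml)
echo "$value"
```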

Get Boolean that tells us if a line of text contains a hyperlink
* awk '{if($0~/http[a-zA-Z0-9\:/.]+/){print 1;next} print 0}' <FILE> > <FILE>_url

Get statistics (size and number of files for google drive share)
* rclone --drive-shared-with-me size <NAME>:<FOLDER>

Get uncompressed size of zipped files in a directory
* gunzip -l *.gz

Check what rclone *would* copy
* rclone --drive-shared-with-me --dry-run copy <NAME>:<FOLDER> .

Copy directory to current folder with rclone and show progress bar
* rclone  --drive-shared-with-me copy <NAME>:<FOLDER> . -P --bwlimit 8M

Get filenames in directory ending with ".gz"
* for i in *.gz; do echo "$(basename $i)"; done

Make a directory per archive (stripping the .gz suffix; note that tr -d .gz would delete every ".", "g" and "z" in the name, so use suffix removal instead)
* for i in *.gz; do mkdir "${i%.gz}"; done

Extract each archive to its own folder
* for i in *.gz; do tar xvzf "$i" -C "${i%.gz}"; done

Count lines of file userid in all subdirectories
* for e in */userid; do wc -l $e; done

Move nested files one level up
* for i in ./**/;do mv $i**/* $i; done;

Remove now empty subfolder
* rmdir **/**/

Dates to timestamps
* date -f <DATEFILE> +%s |& sed '/date/s/.*/NA/' > <TIMESTAMPFILE>

Count all tweets
* for i in ./**/;do wc -l "$i"userid; done > wc

Clean timestamps (all errors to NA)
* awk '{if($0~/^[0-9]+$/) print $0; else print "NA";}' <TIMESTAMPFILE> > <TIMESTAMPFILE_CLEANED> &

If file exists, print folder name
* for i in **/; do test -f $i"created_at_timestamp_cleaned" && echo $i; done;

Run script for the directories where a certain file doesn't exist
* for i in **/; do [ -f $i"R/<NAME>.feather" ] || echo "$(basename $i)"; done;

Get file sizes in MB for all files in directory
* ls -l --block-size=M

Print line <NUM> of a file
* sed '<NUM>q;d' <FILE>
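
The `q` makes sed quit right after the wanted line, so large files aren't read to the end; for example:

```shell
# Print only the third line of the file.
printf 'a\nb\nc\nd\n' > /tmp/line_demo.txt
third=$(sed '3q;d' /tmp/line_demo.txt)
echo "$third"
```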

Remove single quotes from filenames containing the word LAPTOP (-n to do a dry run first)
* rename -n "s/'//" *LAPTOP*

Workflow to remove the last tabulator from awk output (mv overwrites the original, so no rm is needed):
* sed 's/\t$//' <FILE> > <FILE.TMP>
* mv <FILE.TMP> <FILE>

Run Rscript with echo (to know at which execution step the script currently works):
* Rscript -e 'source("<SCRIPT.R>",echo=T)'

Nohup and read log:
* nohup Rscript -e 'source("<SCRIPT.R>",echo=T)' & tail -f nohup.out

Run source in R with parameters:
* for i in `seq 1 7`; do Rscript -e 'x='$i -e 'source("<SCRIPT.R>", echo=TRUE)'; done;

Kill nohup loop
* ps -ef|grep <SCRIPTNAME>
* kill -9 <PID>

Create archive from files
* 7za a <ARCHIVE.7z> <FILES>

Extract archive (without recreating the folder structure)
* 7za e <ARCHIVE.7z>

Extract archive and keep folder structure
* 7za x test.7z

Multithreaded zipping
* 7za a -mm=BZip2 <FILE.BZ2> **/<FILESTOINCLUDE>*

Get list of files with size and modification date
* ls -lh out/*

Check progress via ssh
* ssh <SSH_ALIAS> 'cat <FILE>'

Edit cronjobs
* crontab -e

Count words in a PDF (via pdftotext)
* pdftotext <FILE.PDF> - | tr -d '.' | wc -w

Display tsv nicely in terminal
* less -S <FILE>

Add newline to the end of file
* sed -i -e '$a\' <FILE>

Resize multiple jpgs
* find . -maxdepth 1 -iname "*.jpg" | xargs -L1 -I{} convert -resize 30% "{}" _resized/"{}"

Show last modified files in a folder
* find . -printf '%T+ %p\n' | sort -r | head

Only transfer files
* scp `find . -maxdepth 1 -type f` <DESTINATION>

Work around "argument list too long" when moving files
* echo [0-9]* | xargs mv -t <DATA>

Get the last modified files in a directory
* find . -type f -printf '%T@ %p\n' | sort -n | tail -5 | cut -f2- -d" "

Remove last line efficiently from large file
* dd if=/dev/null of=<FILENAME> bs=1 seek=$(echo $(stat --format=%s <FILENAME> ) - $( tail -n1 <FILENAME> | wc -c) | bc )
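
The seek arithmetic is (total file size) minus (byte length of the last line, newline included), so the file is simply cut off at that offset without being rewritten. The same effect on a toy file, sketched here with GNU `truncate` and `stat` purely to make the arithmetic visible:

```shell
# New size = total bytes - bytes of last line; cutting there drops the last line.
printf 'keep1\nkeep2\nDROP\n' > /tmp/trunc_demo.txt
size=$(stat --format=%s /tmp/trunc_demo.txt)     # total bytes (17 here)
last=$(tail -n1 /tmp/trunc_demo.txt | wc -c)     # bytes of last line incl. newline (5)
truncate -s $((size - last)) /tmp/trunc_demo.txt # cut the last line off
cat /tmp/trunc_demo.txt
```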

Untar/unzip to stdout to post-process
* tar -xOzf <FILE>.tar.gz | jq .id

Download dropbox with curl

Unzip and keep original file
* gunzip -k <FILE>

Check cron log
* cat /var/mail/<USER>

Convert from day of the year to date
* date -d "2019-12-31 +90 days" +%F
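
This relies on GNU date's `-d`; since 2020 is a leap year, day 90 counted from 2019-12-31 lands on March 30:

```shell
# Day 90 of 2020: 31 (Jan) + 29 (Feb) + 30 = March 30.
day90=$(date -d "2019-12-31 +90 days" +%F)
echo "$day90"
```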

Log10 scales in ggplot2 (using the scales package)
* + scale_x_continuous(trans = log10_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
  scale_y_continuous(trans = log10_trans(), breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x)))

Useful ggplot2 theme
* theme_thesis <- function(){
    theme_bw() %+replace%
      theme(
        # line = element_line(colour = "black"),
        # text = element_text(colour = "black"),
        axis.title = element_text(size = 14),
        axis.text = element_text(colour = "black", size = 10),
        # strip.text = element_text(size = 12),
        # legend.key = element_rect(colour = NA, fill = NA),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.minor = element_line(colour = "grey90"),
        # panel.border = element_rect(fill = NA, colour = "black", size = 1),
        panel.background = element_rect(fill = "white")
        # legend.position = "none"
      )
  }
Extract pages from pdf
* pdftk <FULLPDFFILE> cat 12-15 output <OUTPDFFILE>

Delete files when the argument list is too long
* find . -maxdepth 1 -name "<PATTERN>" -print0 | xargs -0 rm

Convert YAML to JSON
* yq r <YAMLFILE> -j | jq . > <JSONFILE>

Print specific lines in other file
* grep -n <PATTERN> <FILE> | awk -F":" '{print $1}' | while read a; do awk -va=$a 'NR==a{print $0}' <OTHERFILE>; done

Print those lines much faster
* grep -n <PATTERN> <FILE> | awk -F":" '{print $1}' | while read a; do sed "${a}q;d" <OTHERFILE>; done

Jq replace tabulators and newlines
* jq -r '.body| gsub("[\\n\\t]";"")'

Batch rename (ctrl + v for block mode, highlight, then either delete or shift + i to insert text and ESC to write it to all columns)
* EDITOR="vi" qmv -f do

Git alias
* git config --global alias.add-commit '!git add . && git commit --allow-empty-message'

Unzip bz2 files
* bzip2 -dk <FILE>

Untar file
* tar -xvf <FILE>

Combine images into a PDF and lower the file size
* convert -compress jpeg -quality 70 <PATTERNSFORFILES> <OUTFILE>

Date was how long ago?
* echo $(( ( $(date +%s) - $(date -d "2020-03-11" +%s) ) /(24 * 60 * 60 ) ))
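
The same arithmetic with a fixed end date instead of "now" (assuming no DST transition falls between the two dates, so each day is exactly 86400 seconds):

```shell
# Difference of the two epoch timestamps, divided down to whole days.
days=$(( ( $(date -d "2020-03-21" +%s) - $(date -d "2020-03-11" +%s) ) / (24 * 60 * 60) ))
echo "$days"
```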

Rename wildcard matches with a suffix
* for filename in <PATTERN>; do mv "$filename" "$filename".old; done;

Compute md5 checksums for files in a directory
* find . -type f | xargs md5sum > checksum.md5

Check md5 checksums for files in a directory
* md5sum -c checksum.md5

Scan for nearby Wi-Fi networks
* sudo iwlist <DEVICE> scanning | egrep 'Cell |Frequency|Quality|ESSID'

Connect to openconnect VPN
* sudo openconnect -b <ADDRESS>

Deduplicate without sorting in awk
* awk '!seen[$0]++' <FILE>

Versions used: