In the course of my PhD studies, many little tasks came up that can be efficiently solved with one-liners on the command line. In the following, I want to give a (by no means exhaustive) list of those little helpers in the hope that they may be as useful to others as they have been for me.
Shebang
* #!/bin/bash
Make executable
* chmod u+x <FILE>
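A minimal sketch combining the two entries above (the script name hello.sh is just a placeholder):
* printf '#!/bin/bash\necho "hello"\n' > hello.sh   # shebang as the first line
* chmod u+x hello.sh
* ./hello.sh   # prints: hello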
Counting
* lines: wc -l <FILE>
* words: wc -w <FILE>
Comparing two files (outputs lines in the second that aren't contained in the first)
* grep -Fxv -f <FIRSTFILE> <SECONDFILE>
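For example, with two throwaway files (the names are just placeholders), only the line unique to the second file is printed:
* printf 'a\nb\n' > first.txt; printf 'a\nc\n' > second.txt
* grep -Fxv -f first.txt second.txt   # prints: c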
Appending to file
* >> <FILE>
Overwriting file
* > <FILE>
Get an attribute from a file of line-delimited JSON objects
* jq ".ATTR" <FILE>
Remove ""
* tr -d '"'
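The two entries above chain naturally; a small sketch with an inline JSON object instead of a file:
* echo '{"id": 1, "text": "hi"}' | jq ".text" | tr -d '"'   # prints: hi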
Twitter Error Codes
* 50: User not found (corresponds to HTTP 404).
* 63: User has been suspended (corresponds to HTTP 403); the account information cannot be retrieved.
Get Processes
* ps -eaf | grep <USERNAME>
Subset tsv
* awk -F $'\t' '$2 > 0.5 && $2 !~ /^Error:/' <TSVFILE>
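A quick illustration with an inline three-line TSV; only the row whose second column is numeric, above 0.5, and not an error message survives:
* printf '1\t0.7\n2\t0.3\n3\tError: timeout\n' | awk -F $'\t' '$2 > 0.5 && $2 !~ /^Error:/'   # prints only the first row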
Get file size
* ls -sh
Print from a line number to the end of the file
* sed -n '<linenumber>,$p' <FILE>
Output only unique values
* awk '!_[$0]++' <FILE>
Json values to tsv
* jq --raw-output '"\(.<first>)\t\(.<second>)"'
Grep ignore upper/lowercase:
* grep -i
Remove users with errors
* awk '!/^Error/' <FILE> > <FILECHECKED>
Write to table
* jq --raw-output '"\(.user.id_str)\t\(.scores.universal)"' <FILE> > <TABLEFILE>
Filter error messages and display user ids that have to be tried again
* awk '/^Error/' <FILE> | awk '!_[$0]++' | grep -v '^Error: Not\|Error: user' | awk '{print $NF}'
Get current path
* pwd
Create text file (type the content, end with Ctrl-D)
* cat > <FILE>
Search directory of text files for pattern
* grep --exclude=*.* -rnw . -e "Error"
Count number of files in directory
* find . -type f | wc -l
Sort directory chronologically
* ls -t (most recent on top)
* ls -tr (most recent at bottom)
Split a file into chunks of x lines each
* split -l <x> -a 1 <FILE> <prefix>
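For example (file names are placeholders), ten lines split into two files of five lines each:
* seq 10 > numbers.txt
* split -l 5 -a 1 numbers.txt part_   # creates part_a and part_b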
Search for pattern in file names in directory
* ls -dq *pattern*
Remove folder
* rm -rf <FOLDER>
Show line numbers of occurrences
* grep -n <pattern> <FILE>
Split a file into x chunks (at line boundaries)
* split -n l/x <FILE> -a 1 <prefix>
Set delimiter in awk
* awk -F'<delimiter>'
Find out about encoding and line endings of text file
* file <FILE>
Grep all mentions
* grep -o "\w*@\w*"
Remove mentions (@+<username>) from tsv of tweets
* gawk -F$'\t' '{gsub(/@[[:alnum:]_][[:alnum:]_]*/,""); sub(/^[ ]+/, "", $<COLUMN>); print}' <FILE>
Do the same as above, but rebuild the TSV correctly
* gawk -F$'\t' '{gsub(/@[[:alnum:]_][[:alnum:]_]*/,""); sub(/^[ \t]+/,"",$3);print $1"\t"$2"\t"$3}' <FILE> > <FILE_CLEANED>
Remove last line:
* head -n -1 <FILE>
If there is something not working with text files that you don't understand, it is very often an encoding issue!
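A sketch for that case, assuming the file turns out to be Latin-1 (adjust the source encoding to whatever file reports):
* file <FILE>   # e.g. "ISO-8859 text"
* iconv -f ISO-8859-1 -t UTF-8 <FILE> > <FILE>.utf8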
Show dates connected to a file:
* stat -c '%y' <FILE>
Put every line of a file in quotes:
* awk '{print"\"" $0"\""}' <FILE>
Filter out lines that consist only of digits (keep lines containing anything besides a number):
* grep -Ev '^[0-9]+$' <FILE>
Remove quotation marks from beginning and end of lines:
* sed -e 's/^"//' -e 's/"$//' <FILE>
Remove first line:
* tail -n +2 <FILE>
Select a column by name in awk:
* awk -F',' -vcol=\"posemo\" '(NR==1){colnum=-1;for(i=1;i<=NF;i++)if($(i)==col)colnum=i;}{print $(colnum)}' <CSVFILE>
Select multiple columns by name in awk:
* awk -F',' -vcols=\"posemo\",\"negemo\" '(NR==1){n=split(cols,cs,",");for(c=1;c<=n;c++){for(i=1;i<=NF;i++)if($(i)==cs[c])ci[c]=i}}{for(i=1;i<=n;i++)printf "%s" FS,$(ci[i]);printf "\n"}' <CSVFILE>
Select a column by name in awk (alternative, matching the name as a pattern):
* awk -F',' -v col="<NAME>" 'NR==1{for(i=1;i<=NF;i++){if($i~col){c=i;break}} print $c} NR>1{print $c}' <FILE> > <OUTPUTFILE>
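A tiny end-to-end example for the variant above, with a made-up CSV:
* printf 'id,posemo,negemo\n1,0.2,0.1\n' > demo.csv
* awk -F',' -v col="posemo" 'NR==1{for(i=1;i<=NF;i++){if($i~col){c=i;break}} print $c} NR>1{print $c}' demo.csv   # prints: posemo, then 0.2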
Grep everything that is not a number:
* grep -E '[^0-9]'
Extract csvcolumn by name:
* csvtool namedcol <COLNAME> <FILE>
Get start time of processes:
* ps -eo pid,lstart,cmd
Output a clean CSV with jq (example):
* python3 ConvertJSON.py <FILE> | jq -r '"\(.id),\"\(.text)\""|gsub("\n";"\\n")' > <OUTFILE>
Grep numbers at the beginning of a line (e.g. message IDs):
* grep -o "^[0-9]\+"
Grep certain files in directory and concatenate them, then extract field with jq:
* ls | grep <PATTERN> | xargs cat | jq <FIELD>
Pause certain number (10) of jobs:
* for i in {1..10}; do kill -STOP %$i; done
Restart (to background) certain number (10) of jobs:
* for i in {1..10}; do bg %$i; done
Print header names with their corresponding column numbers:
* head -1 <FILE> | awk -F"," '{for(i=1;i<=NF;i++)print i,$i}' | grep <COLNAME>
Print header fields quoted, with comma separators (the leading comma has to be removed):
* head -1 <FILE> | awk -F"," '{for(i=1;i<=NF;i++)printf ",""\""$i"\""}'
Rotate PDF by 180 degrees:
* pdftk <INFILE> cat 1-endsouth output <OUTFILE>
Count blank lines:
* grep -cvP '\S' <FILE>
Convert a dictionary-style JSON to CSV:
* jq -r 'to_entries[] | "\(.key),\(.value)"' <FILE>
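For example, with an inline object:
* echo '{"a":1,"b":2}' | jq -r 'to_entries[] | "\(.key),\(.value)"'   # prints: a,1 and b,2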
Count the total number of characters in a file:
* awk -F"\t" '{i+=((length - NF)+1)} END {print i}' <FILE>
Tokenize (splitting on ", ", ". " and spaces; those separators are lost), remove quotes at the beginning and end of a line as well as double quotes within it, and output tab-separated:
* awk -vORS='\t' -F'\\, |\\. | ' '{gsub(/^"|"$|""/,"");split($0,m); for(i in m) print $i;printf "\n"}' <FILE> > <OUTFILE>
Get permission to write on external drive:
* sudo chown user:user <PATH>
* sudo mount /dev/disk/by-label/<NAME> <PATH>
Get tweet text out of line delimited json, removing tabulators and new lines in the tweet:
* jq -r '.text| gsub("[\\n\\t]";"")' <FILE> > <TEXT>
Get a regex pattern for the words of a LIWC category code and remove the trailing ORS with sed
* awk -F"\t" -vORS="|" '{gsub("*",".*");split($0,m); for(i in m) if(m[i]==<LIWC CATEGORY CODE>) print $1}END {ORS=""}' <FILE> | sed '$ s/.$//'
Backup to external drive
* wget https://raw.githubusercontent.com/rubo77/rsync-homedir-excludes/master/rsync-homedir-excludes.txt -O /var/tmp/ignorelist
* rsync -aP --exclude-from=/var/tmp/ignorelist <PATH> <BACKUPPATH>
From word list, print out word and word length, sort by length
* awk -F"\t" '{gsub("*","");a[$1]=length($1)} END{for(l in a) print l,a[l]}' <FILE> | sort -t\t -nk3
Remove non-printable characters (but keep tabulator)
* tr -dc '[:print:]\t' < <FILE>
Loop over content of file and get autocomplete suggestions (copy curl from google page request):
* while read p; do curl 'http://suggestqueries.google.com/complete/search?client=chrome&q='"$p"'&callback=callback' -o words/"$p"; done < <LIST>
Extract text between two markers (e.g. XML tags)
* grep -oP "(?<=<TAG>).*(?=</TAG>)" <FILE>
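For example, with a concrete tag:
* echo '<title>Hello</title>' | grep -oP "(?<=<title>).*(?=</title>)"   # prints: Hello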
Get Boolean that tells us if a line of text contains a hyperlink
* awk '{if($0~/http[a-zA-Z0-9\:/.]+/){print 1;next} print 0}' <FILE> > <FILE>_url
Get statistics (size and number of files) for a Google Drive share
* rclone --drive-shared-with-me size <NAME>:<FOLDER>
Get uncompressed size of zipped files in a directory
* gunzip -l *.gz
Check what rclone *would* copy
* rclone --drive-shared-with-me --dry-run copy <NAME>:<FOLDER> .
Copy directory to current folder with rclone and show progress bar
* rclone --drive-shared-with-me copy <NAME>:<FOLDER> . -P --bwlimit 8M
Get filenames in directory ending with ".gz"
* for i in *.gz; do echo "$(basename $i)"; done
Make directory with loop
* for i in *.gz; do mkdir "$(basename "$i" .gz)"; done
Extract each archive to its own folder
* for i in *.gz; do tar xvfz "$i" -C "$(basename "$i" .gz)"; done
Count lines of file userid in all subdirectories
* for e in */userid; do wc -l $e; done
Move nested files one level up
* for i in ./**/;do mv $i**/* $i; done;
Remove now empty subfolder
* rmdir **/**/
Dates to timestamps
* date -f <DATEFILE> +%s |& sed '/date/s/.*/NA/' > <TIMESTAMPFILE>
Count all tweets
* for i in ./**/;do wc -l "$i"userid; done > wc
Clean timestamps (all errors to NA)
* awk '{if($0~/^[0-9]+$/) print $0; else print "NA";}' <TIMESTAMPFILE> > <TIMESTAMPFILE_CLEANED> &
If file exists, print folder name
* for i in **/; do test -f $i"created_at_timestamp_cleaned" && echo $i; done;
Run script for the directories where a certain file doesn't exist
* for i in **/; do [ -f $i"R/<NAME>.feather" ] || echo "$(basename $i)"; done;
Get file sizes in MB for all files in directory
* ls -l --block-size=M
Get line <NUM> from a file
* sed '<NUM>q;d' <FILE>
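For example:
* seq 100 | sed '42q;d'   # prints: 42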
Remove single quotation marks from filenames containing the word LAPTOP (perl rename; use -n first for a dry run)
* rename -n "s/'//" *LAPTOP*
Workflow to remove the last tabulator from awk output:
* sed s'/\t$//' <FILE> > <FILE.TMP>
* rm <FILE>
* mv <FILE.TMP> <FILE>
Run Rscript with echo (to know at which execution step the script currently works):
* Rscript -e 'source("<SCRIPT.R>",echo=T)'
Nohup and read log:
* nohup Rscript -e 'source("<SCRIPT.R>",echo=T)' & tail -f nohup.out
Run source in R with parameters:
* for i in `seq 1 7`; do Rscript -e 'x='$i -e 'source("<SCRIPT.R>", echo=TRUE)'; done;
Kill nohup loop
* ps -ef|grep <SCRIPTNAME>
* kill -9 <PID>
Create archive from files
* 7z a <ARCHIVE NAME> <FILES>
Extract archive
* 7z e <ARCHIVE NAME>
Extract archive and keep folder structure
* 7za x <ARCHIVE NAME>
Multithreaded zipping
* 7za a -mm=BZip2 <FILE.BZ2> **/<FILESTOINCLUDE>*
Get list of files with size and modification date
* ls -lh out/*
Check progress via ssh
* ssh <SSH_ALIAS> 'cat <FILE>'
Edit cronjobs
* crontab -e
Count words in a PDF (via pdftotext)
* pdftotext <FILE.PDF> - | tr -d '.' | wc -w
Display tsv nicely in terminal
* less -S <FILE>
Add newline to the end of file
* sed -i -e '$a\' <FILE>
Resize multiple jpgs
* find . -maxdepth 1 -iname "*.jpg" | xargs -L1 -I{} convert -resize 30% "{}" _resized/"{}"
Show last modified files in a folder
* find . -printf '%T+ %p\n' | sort -r | head
Only transfer files (not directories)
* scp `find . -maxdepth 1 -type f` <DESTINATION>
Work around "Argument list too long"
* echo [0-9]* | xargs mv -t <DATA>
Get the last modified files in a directory
* find . -type f -printf '%T@ %p\n' | sort -n | tail -5 | cut -f2- -d" "
Remove last line efficiently from large file
* dd if=/dev/null of=<FILENAME> bs=1 seek=$(echo $(stat --format=%s <FILENAME> ) - $( tail -n1 <FILENAME> | wc -c) | bc )
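An equivalent approach (a sketch, assuming GNU coreutils) is to truncate the file in place by the byte length of its last line:
* truncate -s $(( $(stat --format=%s <FILENAME>) - $(tail -n1 <FILENAME> | wc -c) )) <FILENAME>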
Untar/unzip to stdout to post-process
* tar -xOzf <FILE>.tar.gz | jq .id
Download from Dropbox with curl
* curl -L -o <OUTFILE> <DROPBOXLINK>
Unzip and keep original file
* gunzip -k <FILE>
Check cron log
* cat /var/mail/<USER>
Convert from day of the year to date
* date -d "2019-12-31 +90 days" +%F
Log10 axis scales in ggplot2 (needs the scales package)
* + scale_x_continuous(trans = log10_trans(),breaks = trans_breaks("log10", function(x) 10^x),labels = trans_format("log10", math_format(10^.x))) + scale_y_continuous(trans = log10_trans(),breaks = trans_breaks("log10", function(x) 10^x),labels = trans_format("log10", math_format(10^.x)))
Useful ggplot2 theme
* theme_thesis <- function(){
    theme_bw() %+replace%
      theme(
        #line = element_line(colour="black"),
        #text = element_text(colour="black"),
        axis.title = element_text(size = 14),
        axis.text = element_text(colour="black", size=10),
        #strip.text = element_text(size=12),
        #legend.key = element_rect(colour=NA, fill=NA),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.minor = element_line(colour = "grey90"),
        #panel.border = element_rect(fill = NA, colour = "black", size=1),
        panel.background = element_rect(fill = "white"),
        strip.background = element_rect(fill = "white")#,
        #legend.title = element_blank()
        #legend.position = "none"
      )
  }
Extract pages from pdf
* pdftk <FULLPDFFILE> cat 12-15 output <OUTPDFFILE>
Delete files when the argument list is too long
* find . -maxdepth 1 -name "<PATTERN>" -print0 | xargs -0 rm
From YAML to JSON
* yq r <YAMLFILE> -j | jq . > <JSONFILE>
Print lines from another file at the line numbers where a pattern matches
* grep -n <PATTERN> <FILE> | awk -F":" '{print $1}' | while read a; do awk -va=$a 'NR==a{print $0}' <OTHERFILE>; done
Print those lines much faster
* grep -n <PATTERN> <FILE> | awk -F":" '{print $1}' | while read a; do sed "${a}q;d" <OTHERFILE>; done
Jq replace tabulators and newlines
* jq -r '.body| gsub("[\\n\\t]";"")'
Batch rename (Ctrl + V for block mode, highlight, then either delete or Shift + I to insert text and Esc to apply it to all lines)
* EDITOR="vi" qmv -f do
Git alias
* git config --global alias.add-commit '!git add . && git commit --allow-empty-message'
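Usage: arguments given after the alias name are appended to the expanded command, so a message can be passed as usual:
* git add-commit -m "<MESSAGE>"
* git add-commit -m ""   # stage everything and commit with an empty message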
Unzip bz2 files
* bzip2 -dk <FILE>
Untar file
* tar -xvf <FILE>
Combine images into a PDF and lower the file size
* convert -compress jpeg -quality 70 <PATTERNSFORFILES> <OUTFILE>
How many days ago was a date?
* echo $(( ( $(date +%s) - $(date -d "2020-03-11" +%s) ) /(24 * 60 * 60 ) ))
Rename files matching a wildcard pattern by appending a suffix
* for filename in <PATTERN>; do mv "$filename" "$filename".old; done;
Compute md5 checksums for all files in a directory tree
* find . -type f | xargs md5sum > checksum.md5
Check md5 checksums for files in a directory
* md5sum -c checksum.md5
Scan for nearby Wi-Fi networks
* sudo iwlist <DEVICE> scanning | egrep 'Cell |Frequency|Quality|ESSID'
Connect to openconnect VPN
* sudo openconnect -b <ADDRESS>
Deduplicate without sorting in awk
* awk '!seen[$0]++' <FILE>
Versions used:
gawk==5.0.1
jq==1.6
xidel==0.9.8