CMSC398W: Practical Tools for Efficient Development

Course Description

This course will provide a broad overview of many common and useful tools, like the command line, Git, debuggers, build systems, and more. Through a hands-on approach, you will be introduced to a variety of tools and techniques that can immediately be applied to everyday problems. We aim to provide students with material that improves their computing ecosystem literacy and increases their efficiency as a developer.

Course Details

Course: Practical Tools For Efficient Development
Prerequisites: Minimum grade of C- in CMSC216 and CMSC250
Credits: 1
Seats: 30
Lecture Time: Friday 11:00 AM - 11:50 AM
Location: IRB 2207
Semester: Spring 2026
Course Facilitator(s): Mohammad Durrani
Faculty Advisor: Prof. Christopher Kauffman

Course Schedule

This is subject to change.

Date	Concept	Assignment
01/30/2026	The Shell	System Monitoring Project Released
02/06/2026	Shell Tools and Scripting
02/13/2026	Data Wrangling / Command-line Environment
02/20/2026	Shell "Application Day"
02/27/2026	Debugging and Profiling	System Monitoring Project Due, Pacman Project Part 1 Released
03/06/2026	Version Control (Git)
03/13/2026	Version Control (Git)
03/20/2026	Spring Break
03/27/2026	Build Systems / CI	Pacman Project Part 1 Due, Pacman Project Part 2 Released
04/03/2026	Git "Application Day"
04/10/2026	Docker
04/17/2026	Networking	Pacman Project Part 2 Due, Networking Project Released
04/24/2026	Docker/Networking Application Day
05/01/2026	ML/AI Tools
05/08/2026	Flex	Networking Project Due

Grading

Grades will be maintained on ELMS. You will be responsible for all material discussed in lecture as well as other standard means of communication (Piazza, email announcements, etc.), including but not limited to deadlines, policies, assignment changes, etc.

Any request for reconsideration of any grading on coursework must be submitted within one week of when it is returned. No requests will be considered afterwards.

Participation grades will be determined by the tracking of participation in class.

Your final course grade will be determined according to the following percentages:

Percentage	Title	Description
80%	Projects	4 major projects
15%	Application Days	Completion of Application Days assignments.
5%	Participation	Participation in class.

Late Policy

A standard late penalty of 10% per 24-hour period applies to any project submitted past the deadline. For example, a project due at 11:59 PM that is submitted at 12:02 AM the following day will automatically receive a 10% penalty. No late submissions will be accepted for the last project (to ensure that there is enough time to submit grades).

Communicating with course staff

Communication should be done over Piazza, preferably through public posts unless a private post is necessary (grading disputes, student-specific questions, etc.). Communication should primarily be done with the course facilitators (Mohammad and Karan).

Lecturers / Instructors:
- Mohammad Durrani: durranim@terpmail.umd.edu
Advisor:
- Prof. Kauffman: profk@umd.edu

Excused Absence and Academic Accommodations

See the section titled "Attendance, Absences, or Missed Assignments" available at Course Related Policies.

Disability Support Accommodations

See the section titled "Accessibility" available at Course Related Policies.

Academic Integrity

Note that academic dishonesty includes not only cheating, fabrication, and plagiarism, but also includes helping other students commit acts of academic dishonesty by allowing them to obtain copies of your work. In short, all submitted work must be your own. Cases of academic dishonesty will be pursued to the fullest extent possible as stipulated by the Office of Student Conduct. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.

AI / LLM Policy

The ultimate goal of this course is that you learn a wide breadth of tools that help you become a more efficient developer. Large Language Models (LLMs) are likely one of the tools you will have access to in your development, so reasonably, you should be able to use them in this class. However, like everything else in this course, an LLM is a tool for you to use and not something that should completely replace your learning. Thus, our policy is that you are allowed to use LLMs to clarify your understanding, ask questions, etc. (think of an LLM as a tutor), but any and all submitted work must be your own. Any violation of this policy will be escalated as per the standard University procedures.

Course Evaluations

If you have a suggestion for improving this class, don't hesitate to tell the instructors at any point during the semester. At the end of the semester, please don't forget to provide your feedback using the campus-wide CourseEvalUM system. Your comments will help make this class better.

Citation

This course pulls material heavily from "The Missing Semester of Your CS Education" from MIT, by Anish Athalye, Jon Gjengset, and Jose Javier Gonzalez Ortiz. The materials follow the license as specified by CC BY-NC-SA, as detailed here.

Setup Guide

This course is centered around the Bash shell and a Unix platform. You are free to set this up however you would like as long as you can complete all projects for the class. Below you can find setup guides for both Windows and macOS.

We will update this over the course of the semester as needed.

Windows

If you don't have WSL installed, follow these instructions on how to install WSL2 with Ubuntu 24.04 (earlier versions are okay) in Windows 10 / 11. Then, ensure you have bash installed by running sudo apt-get install bash.

macOS

zsh is the default shell on macOS. It is a separate Unix shell rather than something built on top of bash, but the two share most of their everyday syntax. For the purposes of this course, it should be fine to use.

Using The Shell

What is the shell?

Computers these days have a variety of interfaces for giving them commands; fanciful graphical user interfaces, voice interfaces, and even AR/VR are everywhere. These are great for 80% of use-cases, but they are often fundamentally restricted in what they allow you to do — you cannot press a button that isn’t there or give a voice command that hasn’t been programmed. To take full advantage of the tools your computer provides, we have to go old-school and drop down to a textual interface: The Shell.

Shell: A shell is simply a macro processor that executes commands. The term macro processor means functionality where text and symbols are expanded to create larger expressions. This is the program that is prompting you for commands and processing them.

Nearly all platforms you can get your hands on have a shell in one form or another, and many of them have several shells for you to choose from. While they may vary in the details, at their core they are all roughly the same: they allow you to run programs, give them input, and inspect their output in a semi-structured way.

In this lecture, we will focus on the Bourne Again SHell, or “bash” for short. This is one of the most widely used shells, and its syntax is similar to what you will see in many other shells. To open a shell prompt (where you can type commands), you first need a terminal. Your device probably shipped with one installed, or you can install one fairly easily.

Terminology Distinction

A Traditional Terminal is a historical computing device shaped like a desktop computer (keyboard and screen) that connected through wiring to an actual computer that was shared among terminal users. The Terminal used a communication protocol to convey information to and from the actual computer and display information on the screen. Traditional Terminals had only keyboard/typing interfaces as they predated the invention of the mouse/trackpad. Terminals had a variety of capabilities that evolved over time with later terminals adding more colors, font styles, additional characters beyond ASCII, and so forth. Wikipedia: Computer Terminal
A Modern Terminal Emulator is a graphical program that acts more or less like a Traditional Terminal and presents a primarily text-based interface that focuses on typing. Modern Terminal Emulators are software programs and there are a lot of them: Windows has CMD.exe and PowerShell built in with PuTTY and others to download, macOS has its Terminal.app built in with iTerm2 as a popular download, and Linux has dozens including xterm, GNOME Terminal, Konsole, Kitty, Ghostty, Foot, Alacritty, etc. Wikipedia: Terminal Emulator
A Shell or Command-line Interface is a program that is typically started when logging into a Terminal or Terminal Emulator. It interprets typed commands that allow the user to run programs, navigate directories, and interact with the underlying computer. Shells provide conveniences such as tab completion and command history, and often their own semi-standardized programming language, allowing "Shell Scripts" to be created. Windows has its own Shells, the traditional CMD.exe and PowerShell, while macOS and Linux have dozens of Shells in the Unix tradition, the most popular being BASH (Bourne Again Shell), ZSH (Z-Shell), and TCSH. Wikipedia: Command-Line Interface

Using the shell

A terminal (or these days, more aptly, a terminal emulator) is a wrapper program which runs a shell. When you launch the terminal, you will see a prompt that will look like this:

username@domain directory$

This is the main text based interface to the shell, and the prompt tells us some important information.

It displays the user you are currently logged in as (username), the machine you are logged into (domain), and the current directory you are in. The symbol that follows, in our case $, indicates a user-level shell, while a root shell is indicated by a hash sign (#).

Inside of this prompt, you can type a command, which will then be interpreted / executed by the shell. The most basic command is to execute a program:

cmsc398w:~$ date
Sat Nov 30 23:24:14 EST 2024
cmsc398w:~$

Here, we executed the date program, which (perhaps unsurprisingly) prints the current date and time. The shell then asks us for another command to execute. We can also execute a command with arguments:

cmsc398w:~$ echo hi
hi

In this case, we told the shell to execute the program echo with the argument hi. The echo program simply prints out its arguments. The shell parses the command by splitting it by whitespace, and then runs the program indicated by the first word, supplying each subsequent word as an argument that the program can access. If you want to provide an argument that contains spaces or other special characters (e.g., a directory named “My Photos”), you can either quote the argument with ' or " ("My Photos"), or escape just the relevant characters with \ (My\ Photos).
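To see how whitespace and quoting change what a program actually receives, here is a minimal sketch. The count_args function is a hypothetical helper written for this illustration (it is not a standard command); it simply reports how many arguments the shell handed to it.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper for illustration: it prints the
# number of arguments the shell passed to it.
count_args() {
    echo "$#"
}

count_args My Photos       # two arguments: "My" and "Photos"
count_args "My Photos"     # one argument, kept together by double quotes
count_args 'My Photos'     # one argument, kept together by single quotes
count_args My\ Photos      # one argument, the space is escaped
```

The first call prints 2 and the other three print 1, because quoting or escaping stops the shell from splitting on the space.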

But how does the shell know how to find the date or echo programs? Well, the shell is a programming environment, just like Python or Ruby, and so it has variables, conditionals, loops, and functions (next lecture!). When you run commands in your shell, you are really writing a small bit of code that your shell interprets. If the shell is asked to execute a command that doesn’t match one of its programming keywords, it consults an environment variable called $PATH that lists which directories the shell should search for programs when it is given a command:

cmsc398w:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
cmsc398w:~$ which echo
/usr/bin/echo
cmsc398w:~$ /bin/echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

When we run the echo command, the shell sees that it should execute the program echo, and then searches through the :-separated list of directories in $PATH for a file by that name. When it finds it, it runs it (assuming the file is executable; more on that later). We can find out which file is executed for a given program name using the which program. We can also bypass $PATH entirely by giving the path to the file we want to execute.
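The lookup described above can be approximated in a few lines of bash. This is only a sketch of the idea, not how bash implements it internally: find_in_path is a hypothetical helper, and it ignores details such as empty $PATH entries and the shell's cache of previously found commands.

```shell
#!/usr/bin/env bash
# find_in_path is a hypothetical helper that mimics, in simplified form,
# how a shell searches $PATH for a program name.
find_in_path() {
    local name="$1" dir
    local IFS=':'                 # split $PATH on colons, inside this function only
    for dir in $PATH; do
        if [[ -f "$dir/$name" && -x "$dir/$name" ]]; then
            echo "$dir/$name"     # the first match wins
            return 0
        fi
    done
    return 1                      # no directory contained the program
}

find_in_path ls
```

On most systems the printed path matches what which ls reports.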

Navigating in the shell

A path (in the context of the shell) is a list of directories, separated by / on Linux and macOS and \ on Windows. On Linux and macOS, the path / is the "root" of the file system, under which all directories and files lie, whereas on Windows there is one root for each disk partition (e.g., C:\). We will generally assume that you are using a Linux filesystem in this class. A path that starts with / is called an absolute path. Any other path is a relative path. Relative paths are relative to the current working directory, which we can see with the pwd command and change with the cd command. In a path, . refers to the current directory, and .. to its parent directory.

Info

cd [directory]: Changes current working directory

pwd: Prints the current working directory

cmsc398w:~/stic$ pwd
/home/mdurrani/stic
cmsc398w:~/stic$ cd /home
cmsc398w:/home$ pwd
/home
cmsc398w:/home$ cd ..
cmsc398w:/$ pwd
/
cmsc398w:/$ cd ./home
cmsc398w:/home$ pwd
/home
cmsc398w:/home$ cd mdurrani/stic
cmsc398w:~/stic$ pwd
/home/mdurrani/stic
cmsc398w:~/stic$ ../../../bin/echo hello
hello

Notice that our shell prompt kept us informed about what our current working directory was. You can configure your prompt to show you all sorts of useful information, which we will cover in a later lecture.

In general, when we run a program, it will operate in the current directory unless we tell it otherwise. For example, it will usually search for files there, and create new files there if it needs to.
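If you would like to experiment with absolute and relative paths without touching your real files, a sketch like the following works inside a throwaway directory. The directory names (projects, demo) are made up for the example, and mktemp -d is used to create the temporary location.

```shell
#!/usr/bin/env bash
# Work inside a throwaway directory so nothing on the real system is touched.
start_dir="$(pwd)"
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/projects/demo"

cd "$sandbox/projects/demo"   # absolute path: starts with /
pwd
cd ..                         # relative path: parent of the current directory
pwd
cd ./demo                     # relative path: . is the current directory
pwd
cd ../..                      # two levels up, back to the sandbox root
pwd

cd "$start_dir"               # return to where we started
rm -r "$sandbox"              # clean up
```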

To see what lives in a given directory, we use the ls command:

Info

ls [OPTION]... [FILE]...: list directory contents

cmsc398w:~/stic$ ls
README.md  SystemMonitoringSolution
cmsc398w:~/stic$ cd /
cmsc398w:/$ ls
bin
boot
dev
etc
home
...

Unless a directory is given as its first argument, ls will print the contents of the current directory. Most commands accept flags and options (flags with values) that start with - to modify their behavior. Usually, running a program with the -h or --help flag will print some help text that tells you what flags and options are available. For example, ls --help tells us:

  -l                         use a long listing format

cmsc398w:/$ ls -l /home
total 4
drwxr-x--- 67 cmsc398w cmsc398w 4096 Dec  2 19:22 cmsc398w

This gives us a bunch more information about each file or directory present. First, the d at the beginning of the line tells us that cmsc398w is a directory. Then follow three groups of three characters (rwx). These indicate what permissions the owner of the file (cmsc398w), the owning group (also named cmsc398w here), and everyone else respectively have on the relevant item. A - indicates that the given principal does not have the given permission. Above, only the owner is allowed to modify (w) the cmsc398w directory (i.e., add/remove files in it). To enter a directory, a user must have "search" (represented by "execute": x) permissions on that directory (and its parents). To list its contents, a user must have read (r) permissions on that directory. For files, the permissions are as you would expect. Notice that nearly all the files in /bin have the x permission set for the last group, "everyone else", so that anyone can execute those programs.
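As a small experiment with these permission bits, the sketch below creates a scratch file, sets its permissions with chmod, and reads them back from ls -l. The file name notes.txt is arbitrary, and cut is used only to keep the first ten characters of the listing (the type character plus the nine permission characters).

```shell
#!/usr/bin/env bash
sandbox="$(mktemp -d)"
touch "$sandbox/notes.txt"

# Owner may read and write, group may only read, everyone else gets nothing.
chmod u=rw,g=r,o= "$sandbox/notes.txt"

# The first ten characters of the long listing are the type and permission bits.
perms="$(ls -l "$sandbox/notes.txt" | cut -c1-10)"
echo "$perms"    # prints -rw-r-----

rm -r "$sandbox"
```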

Info

mv [OPTION]... SOURCE DEST or mv [OPTION]... SOURCE... DIRECTORY: Rename SOURCE to DEST, or move SOURCE(s) to DIRECTORY.

cp [OPTION]... SOURCE DEST or cp [OPTION]... SOURCE... DIRECTORY: Copy SOURCE to DEST, or multiple SOURCE(s) to DIRECTORY.
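The two forms of each command (one destination name versus a destination directory) can be tried safely in a scratch directory. The file names below are invented for the example.

```shell
#!/usr/bin/env bash
sandbox="$(mktemp -d)"
mkdir "$sandbox/archive"
echo "draft" > "$sandbox/report.txt"

cp "$sandbox/report.txt" "$sandbox/backup.txt"   # copy: both files now exist
mv "$sandbox/report.txt" "$sandbox/final.txt"    # rename: report.txt is gone

# With several sources, the last argument must be a directory.
mv "$sandbox/final.txt" "$sandbox/backup.txt" "$sandbox/archive"

listing="$(ls "$sandbox/archive")"
echo "$listing"
rm -r "$sandbox"
```

After the last mv, both backup.txt and final.txt live in the archive directory.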

If you ever want more information about a program's arguments, inputs, outputs, or how it works in general, give the man program a try. It takes as an argument the name of a program, and shows you its manual page. Press q to exit.

cmsc398w:~$ man ls

Searching man pages

Often, you will want to search for a specific flag or action within the man pages for a command. To do this, you can type /[regex] where [regex] is a valid regular expression and then hit enter. For example, if I wanted to search for the flag -B, I would type /-B. This works because manual pages are usually displayed via a pager, a program that displays text one screen at a time. The most common pager is less, and typing man less will show information about that program and its many other shortcut keys.

Other times, it may be more convenient to use the online versions of these man pages, which can be found here: https://man7.org/linux/man-pages/. Alternatively, you could use curl cheat.sh/[yourcommand], which uses cheat.sh to give you nice examples for the command you want to run.

If you are unsure of what command / function is of interest, you can search all manual pages for a word or phrase via either of man -k <query_phrase> or apropos <query_phrase> which will show one-line summaries of relevant manual pages.

mdurrani@MDXPS139380:~/stic$ apropos bash
bash (1)             - GNU Bourne-Again SHell
bashbug (1)          - report a bug in bash
screenfetch (1)      - The Bash Screenshot Information Tool

mdurrani@MDXPS139380:~/stic$ apropos gcc
avr-gcc (1)          - GNU project C and C++ compiler
gcc (1)              - GNU project C and C++ compiler
gccmakedep (1)       - create dependencies in makefiles using 'gcc -M'

TLDR pages are a nifty complementary solution that focuses on giving example use cases of a command so you can quickly figure out which options to use.

Connecting Programs

In the shell, programs have two primary "streams" associated with them: their input stream and their output stream. When the program tries to read input, it reads from the input stream, and when it prints something, it prints to its output stream. Normally, a program's input and output are both your terminal. That is, your keyboard as input and your screen as output. However, we can also rewire those streams!

The simplest form of redirection is < file and > file. These let you rewire the input and output streams of a program to a file respectively:

mdurrani@MDXPS139380:~/stic$ echo hello > hello.txt
mdurrani@MDXPS139380:~/stic$ cat hello.txt
hello
mdurrani@MDXPS139380:~/stic$ cat < hello.txt
hello
mdurrani@MDXPS139380:~/stic$ cat < hello.txt > hello2.txt
mdurrani@MDXPS139380:~/stic$ cat hello2.txt
hello
mdurrani@MDXPS139380:~/stic$

As demonstrated in the example above, cat is a program that concatenates files. When given file names as arguments, it prints the contents of each of the files in sequence to its output stream. But when cat is not given any arguments, it prints contents from its input stream to its output stream (like in the third example above). You can also use >> to append to a file, adding output to the end of an existing file instead of overwriting it.
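The difference between > and >> is easy to miss, so here is a short sketch that shows it. The file log.txt is a made-up name inside a temporary directory.

```shell
#!/usr/bin/env bash
sandbox="$(mktemp -d)"
log="$sandbox/log.txt"

echo "first"  > "$log"     # > creates the file, or empties it if it exists
echo "second" > "$log"     # empties it again, so "first" is lost
echo "third" >> "$log"     # >> adds to the end instead

contents="$(cat < "$log")" # < feeds the file to cat's input stream
echo "$contents"
rm -r "$sandbox"
```

The file ends up containing only second and third, one per line.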

Where this kind of input/output redirection really shines is in the use of pipes. The | operator lets you "chain" programs such that the output of one is the input of another:

missing:~$ ls -l / | tail -n1
drwxr-xr-x 1 root  root  4096 Jun 20  2019 var
missing:~$ curl --head --silent google.com | grep --ignore-case content-length | cut --delimiter=' ' -f2
219

This example retrieves the response headers from a website, then uses a combination of grep / cut to extract the value of the Content-Length header (the size of the response body in bytes).

Pipes allow any two programs that deal with text input/output to be combined. Their presence has led to a proliferation of "small, sharp tools" in UNIX: programs that do a specific, limited set of operations well. For text processing, the most important of these are roughly:

grep to search files for specific text
sed to do modest text manipulations
awk to do programmatic text manipulations
find to locate files with specific attributes
cut / paste to do limited field selection / addition
xargs to treat output as arguments to another command

The power of pipes becomes apparent once you get acquainted with these programs but that will take time and be the subject of later lectures.
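As a small taste of how these tools combine, the sketch below runs a pipeline over some made-up log data. The log_lines function and its contents are hypothetical; in practice the input would come from a real file or command.

```shell
#!/usr/bin/env bash
# Hypothetical log data: one "user action" pair per line.
log_lines() {
    printf '%s\n' \
        "alice login" \
        "bob login" \
        "alice upload" \
        "carol login" \
        "alice logout"
}

# grep keeps the login lines, cut keeps the first field (the user),
# sort orders the names, and xargs hands them to echo as arguments,
# which prints them on a single line.
logins="$(log_lines | grep login | cut -d' ' -f1 | sort | xargs echo)"
echo "$logins"    # prints: alice bob carol
```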

Shell Scripting and Shell Tools

Shell Scripting

In this lecture, we will present some of the basics of using bash as a scripting language along with a number of shell tools that cover several of the most common tasks that you will be constantly performing in the command line.

So far we have seen how to execute commands in the shell and pipe them together. However, in many scenarios you will want to perform a series of commands and make use of control flow expressions like conditionals or loops.

Shell scripts are the next step in complexity. Most shells have their own scripting language with variables, control flow and its own syntax. What makes shell scripting different from other scripting programming languages is that it is optimized for performing shell-related tasks. Thus, creating command pipelines, saving results into files, and reading from standard input are primitives in shell scripting, which makes it easier to use than general purpose scripting languages. For this section we will focus on bash scripting since it is the most common.

Creating a Basic Script

Start by creating a script file using any text editor. Here we will create myscript.sh.

#!/bin/bash

echo "Hello world!"

The first line, called a shebang (#!/bin/bash), tells the system to use bash to execute this script. For better portability, use #!/usr/bin/env bash, which locates bash using the system's PATH variable. The second line executes the echo program with an argument of "Hello world!".

Info

The shebang mentioned earlier is short for "Shell Bang" as the ! mark is historically referred to as a "bang" in computing circles. The first line of scripts will have this syntax, though bash may be replaced by other script interpreters depending on the programming language used in the script. Some common examples are:

Shebang	Interpreter / Language
`#!/bin/bash`	Bash shell script
`#!/bin/sh`	Traditional vanilla shell script
`#!/usr/bin/python`	Python script
`#!/usr/bin/awk -f`	AWK script

Now, make the script executable and run it.

chmod +x myscript.sh
./myscript.sh

Here, we add execute permission for all users to the myscript.sh file and then execute it. You should see "Hello world!" printed to your terminal (stdout).
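The whole create, chmod, run cycle can also be done without opening an editor. The sketch below writes the same two-line script using a here-document (a feature not covered above, used here only as a convenient way to create the file) and assumes the temporary directory allows programs to be executed from it.

```shell
#!/usr/bin/env bash
sandbox="$(mktemp -d)"
script="$sandbox/myscript.sh"

# Write the two-line script into a file; the quoted 'EOF' keeps the
# contents literal, so nothing is expanded while writing.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
echo "Hello world!"
EOF

chmod +x "$script"        # add execute permission
output="$("$script")"     # run it by path and capture its STDOUT
echo "$output"
rm -r "$sandbox"
```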

Commands will often produce output using Standard Output (STDOUT, defaults to the screen), errors through Standard Error (STDERR, defaults to the screen), accept input through Standard Input (STDIN, defaults to typed input), and a Return Code (also called Exit Code) to report errors in a more script-friendly manner. The return code is the way scripts/commands communicate the success or failure of their execution. A value of 0 usually means everything went OK; anything different from 0 means an error occurred. Commands can also be separated within the same line using a semicolon ;.
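To make the three streams and the return code more concrete, here is a minimal sketch. The report function is a hypothetical stand-in for any command that prints to both STDOUT and STDERR and then fails; the return code 3 is arbitrary.

```shell
#!/usr/bin/env bash
echo "one"; echo "two"                        # two commands on one line

true  && echo "true succeeded with code $?"   # code is 0
false || echo "false failed with code $?"     # code is 1

# report is a hypothetical command that uses both output streams and fails.
report() {
    echo "normal output"          # goes to STDOUT
    echo "something broke" >&2    # goes to STDERR
    return 3
}

code=0
captured="$(report 2> /dev/null)" || code=$?  # keep STDOUT, discard STDERR
echo "captured='$captured' code=$code"
```

Only the STDOUT text ends up in captured, while the return code 3 is still available to the script.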

Assigning Variables

To assign variables in bash, use the syntax foo=bar and access the value of the variable with $foo or ${foo}. Note that foo = bar will not work since it is interpreted as calling the foo program with arguments = and bar. In general, in shell scripts, the space character will perform argument splitting. This behavior can be confusing to use at first, so always check for that. All variables in bash will have global scope by default (unless noted otherwise).

You can also assign variables using let [expression] ... which only supports arithmetic / integer expressions. Arithmetic expressions can be evaluated and set to the value of a variable by following the syntax of var=$((5+5)).
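A short sketch of both arithmetic styles follows. The variable names are arbitrary.

```shell
#!/usr/bin/env bash
width=6
height=7

area=$((width * height))                  # arithmetic expansion; no $ needed inside
let "perimeter = 2 * (width + height)"    # let evaluates an arithmetic expression

echo "area=$area perimeter=$perimeter"    # prints: area=42 perimeter=26
```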

It is also important to note that all variables in Bash are untyped, but there is a mechanism to declare types (outside the scope of this lesson).

Strings in bash can be defined with ' and " delimiters, but they are not equivalent. Strings delimited with ' are string literals and will not substitute variable values whereas " delimited strings will.

foo=bar
echo "$foo"
# prints bar
echo '$foo'
# prints $foo

Unlike other scripting languages, bash uses a variety of special variables to refer to arguments, error codes, and other relevant variables. Below is a list of some of them. A more comprehensive list can be found here.

$0 - Name of the script
$1 to $9 - Arguments to the script. $1 is the first argument and so on.
$@ - All the arguments
$# - Number of arguments
$? - Return code of the previous command
$$ - Process identification number (PID) for the current script
!! - Entire last command, including arguments. A common pattern is to execute a command only for it to fail due to missing permissions; you can quickly re-execute the command with sudo by doing sudo !!
$_ - Last argument from the last command. If you are in an interactive shell, you can also quickly get this value by typing Esc followed by . or Alt+.
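Several of these special variables can be seen in action with a small function, since $1, $#, and $@ refer to a function's own arguments in the same way they refer to a script's. The describe_args function is a hypothetical example.

```shell
#!/usr/bin/env bash
# describe_args is a hypothetical function used to display its arguments.
describe_args() {
    echo "count: $#"
    echo "first: $1"
    local arg
    for arg in "$@"; do       # quoting "$@" keeps each argument intact
        echo "arg: $arg"
    done
}

describe_args "hello world" second third
echo "exit code of describe_args: $?"
```

Note that "hello world" is counted and printed as a single argument because it was quoted at the call site and "$@" is quoted inside the function.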

You also have another type of variable: environment variables, which serve as a way to pass information about the current environment to the program being executed. By convention, these variables are written in all caps, though they function like any other variable. You can set an environment variable using the export command, like export KEY=VALUE. To view all current environment variables, use the env command.

Some common environment variables include PATH (which tells your shell where to look for executable programs), HOME (your user's home directory), and USER (your username). Environment variables persist only for the duration of your shell session by default. To make them permanent, you can add the export commands to your shell's configuration file (like ~/.bashrc for Bash or ~/.zshrc for Zsh).
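The practical difference between a plain shell variable and an exported one shows up when another program is started. In this sketch the variable names COURSE_LOCAL and COURSE_EXPORTED are made up, and a second bash process plays the role of the program being executed.

```shell
#!/usr/bin/env bash
COURSE_LOCAL="not exported"         # plain shell variable (hypothetical name)
export COURSE_EXPORTED="exported"   # environment variable (hypothetical name)

# A child process only inherits exported variables.
child_view="$(bash -c 'echo "local=[${COURSE_LOCAL:-}] exported=[${COURSE_EXPORTED:-}]"')"
echo "$child_view"    # prints: local=[] exported=[exported]
```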

Control Flow

`if` / `else` statements

Basic if statement syntax:

if [[ condition ]]; then
    echo "Condition is true"
elif [[ another_condition ]]; then
    echo "Second condition is true"
else
    echo "No conditions were true"
fi

Common conditional tests:

-e file: File exists
-d file: Directory exists
-f file: Regular file exists
-z string: String is empty
-n string: String is not empty
str1 = str2: Strings are equal
n1 -eq n2: Numbers are equal
n1 -lt n2: Less than
n1 -gt n2: Greater than
if ! [[ expr ]]; then: Executes the expression and then negates the result
if [[ ! expr ]]; then: Negates the individual expression

Exit codes can be used to conditionally execute commands using && (and operator) and || (or operator), both of which are short-circuiting operators. The true program will always have a 0 return code and the false command will always have a 1 return code.
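The tests and the short-circuiting operators can be combined as in the sketch below. The classify function and the file names are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
sandbox="$(mktemp -d)"
touch "$sandbox/data.txt"

# && runs the right side only if the left side succeeds.
[[ -f "$sandbox/data.txt" ]] && echo "data.txt is a regular file"

# || runs the right side only if the left side fails.
[[ -d "$sandbox/missing" ]] || echo "missing is not a directory"

# classify is a hypothetical helper combining several of the tests above.
classify() {
    if [[ -z "$1" ]]; then
        echo "empty"
    elif [[ "$1" -lt 0 ]]; then
        echo "negative"
    elif [[ "$1" -eq 0 ]]; then
        echo "zero"
    else
        echo "positive"
    fi
}

classify ""; classify -5; classify 0; classify 12
rm -r "$sandbox"
```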

Info

Both of the following syntaxes are accepted for conditions in BASH scripts:

if [ condition ]; then ...; fi : Historical
if [[ condition ]]; then ...; fi : Modern

The first form, with a single pair of [ ], is the original shell syntax. Here [ is an ordinary command (another name for test): historically the shell had to start a separate program to evaluate it, and although modern shells including BASH provide it as a builtin, its arguments are still expanded like those of any other command, so an unquoted variable that is empty or contains spaces can break the test. Newer shells including BASH offer the double pair [[ ]], a keyword that the shell itself parses: variables inside it are not split into words, and it supports extras such as pattern matching. Favor the Modern version in all BASH code that you write unless the script must also run under a plain POSIX sh.
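One practical difference between the two forms is how they treat variables whose values contain spaces. The following sketch illustrates this with a made-up variable; it is meant as a demonstration rather than an exhaustive comparison.

```shell
#!/usr/bin/env bash
name="My Photos"    # a value containing a space

# Inside [[ ]], an unquoted variable is not split into separate words.
if [[ $name = "My Photos" ]]; then
    echo "[[ ]] compared the whole string"
fi

# Inside [ ], the variable must be quoted; unquoted, it would expand to
# two words and [ would complain about too many arguments.
if [ "$name" = "My Photos" ]; then
    echo "[ ] works when the variable is quoted"
fi

# [[ ]] also supports glob-style pattern matching on the right-hand side.
if [[ $name == My* ]]; then
    echo "name starts with My"
fi
```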

For loops

# Iterate over a list
for name in Alice Bob Charlie; do
    echo "Hello, $name"
done

# Iterate over files
for file in *.txt; do
    echo "Processing $file"
done

# C-style for loop
for ((i=0; i<5; i++)); do
    echo "Count: $i"
done

# counting loop via the seq command
for i in $(seq 0 5 30); do
    echo "i is $i"
done

While loops

# Basic while loop using builtin [[ ]] and -lt comparison
count=0
while [[ $count -lt 5 ]]; do
    echo "Count: $count"
    ((count++))
done

# Use more standard arithmetic comparison via (( expr ))
count=0
while (( $count < 5 )); do
    echo "Count: $count"
    ((count++))
done

# Read file line by line
while read -r line; do
    echo "Line: $line"
done < input.txt

# iterating through command line flags
while [[ $# -gt 0 ]]; do
    if [[ $1 = "--help" ]]; then
        echo "you asked for help"
        break
    fi
    shift
done

As the file-reading example indicates, loop syntax can be combined with I/O redirection and pipes via the < > | shell operators.
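Two more variations on the same idea are sketched below: a loop that reads its input from a pipe, and a loop whose entire output is redirected to a file. The fruit names and the file lines.txt are invented for the example.

```shell
#!/usr/bin/env bash
# Feed a loop from a pipeline: the loop reads each line that printf produces.
printf '%s\n' apple banana cherry | while read -r fruit; do
    echo "fruit: $fruit"
done

# Redirect everything a loop prints into a file with a single >.
sandbox="$(mktemp -d)"
for i in 1 2 3; do
    echo "line $i"
done > "$sandbox/lines.txt"

line_count="$(wc -l < "$sandbox/lines.txt")"
echo "wrote $line_count lines"
rm -r "$sandbox"
```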

Functions

Functions make your code more modular and reusable. Note that a function's definition must be placed before any calls to it. Local variables can be declared within the function definition using the local modifier (and can only be used in that function, as they have local scope). Unlike functions in most other programming languages, Bash functions cannot return arbitrary values when called. When a Bash function completes, its return value is the exit status of the last statement executed in the function (or the value given to return): 0 for success and a non-zero number in the range 1 to 255 for failure.

# Function definition
check_file() {
    local filename="$1"  # First argument
    if [[ -f "$filename" ]]; then
        echo "File exists"
        return 0
    else
        echo "File not found"
        return 1
    fi
}

# Function usage
check_file "example.txt" 
echo Function returned $?

Note several features

Functions in shell scripts do not declare a parameter list making their prototypes less informative than in modern programming languages.
The argument to the function is obtained via the $1 automatic variable; functions called with several arguments will have $2 and so on populated and the $# variable indicates how many arguments were passed.
The final echo command shows the return value of the function using the built-in $? mentioned earlier which contains the last return code from a function or child process.
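The effect of the local modifier is easiest to see side by side with a function that omits it. Both functions in this sketch are hypothetical.

```shell
#!/usr/bin/env bash
counter=10            # global variable

bump_global() {
    counter=$((counter + 1))      # no local: modifies the global
}

bump_local() {
    local counter=0               # local: shadows the global inside this function
    counter=$((counter + 1))
    echo "inside bump_local: $counter"
}

bump_global
bump_local
echo "global counter: $counter"   # prints: global counter: 11
```

Only bump_global changes the value seen by the rest of the script; the assignment inside bump_local disappears when the function returns.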

To return a non-integer value from a function, we have a few options. The simplest option is to assign the result of the function to a global variable:

#!/bin/bash

func () {
  toRet="my result"
}

func
echo $toRet

Alternatively, we can send our return value to stdout using echo or similar and use command substitution to get the output. Whenever you write $( CMD ), the shell will execute CMD, capture its output, and substitute it in place.

#!/bin/bash

func () {
  local toRet="my result"
  echo "$toRet"
}

func_result="$(func)"
echo $func_result

Since that was a huge information dump, let's see an example that showcases some of these features. It will iterate through the arguments we provide, grep for the string foobar, and append it to the file as a comment if it's not found.

#!/bin/bash

echo "Starting program at $(date)" # Date will be substituted

echo "Running program $0 with $# arguments with pid $$"

for file in "$@"; do
    grep foobar "$file" > /dev/null 2> /dev/null
    # When pattern is not found, grep has exit status 1
    # We redirect STDOUT and STDERR to a null register since we do not care about them
    if [[ $? -ne 0 ]]; then
        echo "File $file does not have any foobar, adding one"
        echo "# foobar" >> "$file"
    fi
done

Shell Globs and Script Arguments

When launching scripts, you will often want to provide arguments that are similar. Bash has ways of making this easier, expanding expressions by carrying out filename expansion. These techniques are often referred to as shell globbing.

Wildcards - Whenever you want to perform some sort of wildcard matching, you can use ? and * to match one or any amount of characters respectively. For instance, given files foo, foo1, foo2, foo10 and bar, the command rm foo? will delete foo1 and foo2 whereas rm foo* will delete all but bar.
Curly braces {} - Whenever you have a common substring in a series of commands, you can use curly braces for bash to expand this automatically. This comes in very handy when moving or converting files.

convert image.{png,jpg}
# Will expand to
convert image.png image.jpg

cp /path/to/project/{foo,bar,baz}.sh /newpath
# Will expand to
cp /path/to/project/foo.sh /path/to/project/bar.sh /path/to/project/baz.sh /newpath

# Globbing techniques can also be combined
mv *{.py,.sh} folder
# Will move all *.py and *.sh files


mkdir foo bar
# This creates files foo/a, foo/b, ... foo/h, bar/a, bar/b, ... bar/h
touch {foo,bar}/{a..h}
touch foo/x bar/y
# Show differences between files in foo and bar
diff <(ls foo) <(ls bar)
# Outputs
# 9c9
# < x
# ---
# > y
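
A low-risk way to check what a glob or brace expression will expand to is to hand it to echo before using it with rm or mv. The sketch below recreates the foo/bar files from the wildcard description in a scratch directory; the directory and variable names are illustrative.

```shell
#!/bin/bash
# Scratch directory so the demo does not touch real files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch foo foo1 foo2 foo10 bar

one_char=$(echo foo?)            # ? matches exactly one character
any_chars=$(echo foo*)           # * matches any number of characters, including none
braces=$(echo image.{png,jpg})   # braces expand even if no such files exist

echo "$one_char"    # foo1 foo2
echo "$any_chars"
echo "$braces"      # image.png image.jpg
```

Note that brace expansion is purely textual, while ? and * only expand to names of files that actually exist.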

Info

Shell Globs and Regular Expressions are related but distinct methods to specify a pattern to be matched. Globs are tailored to the most common types of file name matching, like *.txt (all text files). Regular expressions allow finer-grained control over matching at the expense of being somewhat longer to specify. Some programming languages let you use whichever is more convenient: Python, for example, has a glob library for file matching and a regular expression library in re.

Shell Check

Writing bash scripts can be tricky and unintuitive. There are tools like shellcheck that will help you find errors in your sh/bash scripts.

Additional Built-in Syntax

Most shells have additional built-in commands and capabilities that they recognize like cd / if / for / (( expr )) and so on. Bash will reveal a summary of its syntactic features by typing help with help CMD giving more information on the specific command. It is best to do some online reading to look for examples of the builtins as they can be tricky to use effectively.

>> help
...
 job_spec [&]                                  history [-c] [-d offset] [n] or history -a>
 (( expression ))                              if COMMANDS; then COMMANDS; [ elif COMMAND>
 . filename [arguments]                        jobs [-lnprs] [jobspec ...] or jobs -x com>
 :                                             kill [-s sigspec | -n signum | -sigspec] p>
 [ arg... ]                                    let arg [arg ...]
 [[ expression ]]                              local [option] name[=value] ...
 alias [-p] [name[=value] ... ]                logout [n]
 bg [job_spec ...]                             mapfile [-d delim] [-n count] [-O origin] >
 bind [-lpsvPSVX] [-m keymap] [-f filename] >  popd [-n] [+N | -N]
 break [n]                                     printf [-v var] format [arguments]
 builtin [shell-builtin [arg ...]]             pushd [-n] [+N | -N | dir]
 caller [expr]                                 pwd [-LP]
 case WORD in [PATTERN [| PATTERN]...) COMMA>  read [-ers] [-a array] [-d delim] [-i text>
 cd [-L|[-P [-e]] [-@]] [dir]                  readarray [-d delim] [-n count] [-O origin>
 command [-pVv] command [arg ...]              readonly [-aAf] [name[=value] ...] or read>
 compgen [-abcdefgjksuv] [-o option] [-A act>  return [n]
 complete [-abcdefgjksuv] [-pr] [-DEI] [-o o>  select NAME [in WORDS ... ;] do COMMANDS; >
 compopt [-o|+o option] [-DEI] [name ...]      set [-abefhkmnptuvxBCEHPT] [-o option-name>
 continue [n]                                  shift [n]
 coproc [NAME] command [redirections]          shopt [-pqsu] [-o] [optname ...]
 declare [-aAfFgiIlnrtux] [name[=value] ...]>  source filename [arguments]
 dirs [-clpv] [+N] [-N]                        suspend [-f]
 disown [-h] [-ar] [jobspec ... | pid ...]     test [expr]
 echo [-neE] [arg ...]                         time [-p] pipeline
 enable [-a] [-dnps] [-f filename] [name ...>  times
 eval [arg ...]                                trap [-lp] [[arg] signal_spec ...]
 exec [-cl] [-a name] [command [argument ...>  true
 exit [n]                                      type [-afptP] name [name ...]
 export [-fn] [name[=value] ...] or export ->  typeset [-aAfFgiIlnrtux] name[=value] ... >
 false                                         ulimit [-SHabcdefiklmnpqrstuvxPRT] [limit]
 fc [-e ename] [-lnr] [first] [last] or fc ->  umask [-p] [-S] [mode]
 fg [job_spec]                                 unalias [-a] name [name ...]
 for NAME [in WORDS ... ] ; do COMMANDS; don>  unset [-f] [-v] [-n] [name ...]
 for (( exp1; exp2; exp3 )); do COMMANDS; do>  until COMMANDS; do COMMANDS-2; done
 function name { COMMANDS ; } or name () { C>  variables - Names and meanings of some she>
 getopts optstring name [arg ...]              wait [-fn] [-p var] [id ...]
 hash [-lr] [-p pathname] [-dt] [name ...]     while COMMANDS; do COMMANDS-2; done
 help [-dms] [pattern ...]                     { COMMANDS ; }

>> help read
read: read [-ers] [-a array] [-d delim] [-i text] [-n nchars] [-N nchars] [-p prompt] [-t timeout] [-u fd] [name ...]
    Read a line from the standard input and split it into fields.
    
    Reads a single line from the standard input, or from file descriptor FD
    if the -u option is supplied.  The line is split into fields as with word
...

Limitations of Shell Scripts

As seen, the programming language understood by Bash and other shells has many of the features of other programming languages, though the syntax for them is archaic. One can accomplish a lot with shell scripts (example 1, example 2). That should not encourage you to attempt such monoliths regularly: the shell programming language lacks adequate abstraction mechanisms to scale up to large code bases and is notoriously difficult to maintain.

If a script begins to grow beyond a few dozen lines, it is a good idea to refactor and rewrite, possibly adopting a new language better suited to growing. Python is a good choice as it has specific features aimed to make shell-like scripts easy to write but also many modern features including object-oriented programming and a modules system.

Shell Tools

System Health / State

You can easily view statistics of your system like CPU usage, memory usage, running processes, and more using the top (table of processes) command. You can learn more on how to parse this output here. You can also use free to check memory usage, but note that it is not available on macOS.

Finding files

One of the most common repetitive tasks that every programmer faces is finding files or directories. All UNIX-like systems come packaged with find, a great shell tool to find files. find will recursively search for files matching some criteria. Some examples:

# Find all directories named src
find . -name src -type d
# Find all python files that have a folder named test in their path
find . -path '*/test/*.py' -type f
# Find all files modified in the last day
find . -mtime -1
# Find all .tar.gz archives with size in range 500k to 10M
find . -size +500k -size -10M -name '*.tar.gz'

Beyond listing files, find can also perform actions over files that match your query. This property can be incredibly helpful to simplify what could be fairly monotonous tasks.

# Delete all files with .tmp extension
find . -name '*.tmp' -exec rm {} \;
# Find all PNG files and convert them to JPG
find . -name '*.png' -exec convert {} {}.jpg \;
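
To experiment with these queries safely, it can help to build a small throwaway tree first. The sketch below does that under a temporary directory (all file and directory names are invented for the demo) and also shows the -exec ... + form, which passes many paths to a single command invocation instead of running the command once per file.

```shell
#!/bin/bash
# Build a small scratch tree (names are illustrative)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/app/src" "$demo_dir/lib/src" "$demo_dir/app/test"
touch "$demo_dir/app/test/check.py" "$demo_dir/app/scratch.tmp" "$demo_dir/lib/old.tmp"

# Directories named src
src_dirs=$(find "$demo_dir" -name src -type d | sort)
echo "$src_dirs"

# Python files with a folder named test in their path
test_py=$(find "$demo_dir" -path '*/test/*.py' -type f)
echo "$test_py"

# "+" batches the matched paths into as few rm invocations as possible
find "$demo_dir" -name '*.tmp' -exec rm {} +
leftover_tmp=$(find "$demo_dir" -name '*.tmp' | wc -l | tr -d ' ')
echo "tmp files left: $leftover_tmp"
```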

Despite find's ubiquitousness, its syntax can sometimes be tricky to remember. For instance, to simply find files that match some pattern PATTERN you have to execute find -name '*PATTERN*' (or -iname if you want the pattern matching to be case insensitive). You could start building aliases for those scenarios, but part of the shell philosophy is that it is good to explore alternatives. Remember, one of the best properties of the shell is that you are just calling programs, so you can find (or even write yourself) replacements for some. For instance, fd is a simple, fast, and user-friendly alternative to find. It offers some nice defaults like colorized output, default regex matching, and Unicode support. It also has, in my opinion, a more intuitive syntax. For example, the syntax to find a pattern PATTERN is fd PATTERN.

Most would agree that find and fd are good, but some of you might be wondering about the efficiency of looking for files every time versus compiling some sort of index or database for quickly searching. That is what locate is for. locate uses a database that is updated using updatedb. In most systems, updatedb is run daily via cron. Therefore one trade-off between the two is speed vs freshness. Moreover, find and similar tools can also find files using attributes such as file size, modification time, or file permissions, while locate just uses the file name. A more in-depth comparison can be found here.

Finding code

Finding files by name is useful, but quite often you want to search based on file content. A common scenario is wanting to search for all files that contain some pattern, along with where in those files said pattern occurs. To achieve this, most UNIX-like systems provide grep, a generic tool for matching patterns from the input text. grep is an incredibly valuable shell tool that we will cover in greater detail during the data wrangling lecture.

For now, know that grep has many flags that make it a very versatile tool. Some I frequently use are -C for getting Context around the matching line and -v for inverting the match, i.e. print all lines that do not match the pattern. For example, grep -C 5 will print 5 lines before and after the match. When it comes to quickly searching through many files, you want to use -R since it will Recursively go into directories and look for files for the matching string.
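
The flags above can be tried on a couple of scratch files, as in the sketch below. The file names and contents are made up for the demo, and -l (print only the names of matching files) is an extra standard grep flag not discussed above.

```shell
#!/bin/bash
# Scratch files to search (contents are illustrative)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src"
printf 'alpha\nbeta\nneedle\ngamma\ndelta\n' > "$demo_dir/src/notes.txt"
printf 'no match here\n' > "$demo_dir/other.txt"

# -C 1: one line of context on each side of the match
context=$(grep -C 1 needle "$demo_dir/src/notes.txt")
echo "$context"

# -v: every line that does NOT match
inverted=$(grep -v needle "$demo_dir/src/notes.txt")
echo "$inverted"

# -R: descend into directories; -l: list matching file names only
files=$(grep -R -l needle "$demo_dir")
echo "$files"
```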

But grep -R can be improved in many ways, such as ignoring .git folders, using multi CPU support, &c. Many grep alternatives have been developed, including ack, ag and rg. All of them are fantastic and pretty much provide the same functionality. For now I am sticking with ripgrep (rg), given how fast and intuitive it is. Some examples:

# Find all python files where I used the requests library
rg -t py 'import requests'
# Find all files (including hidden files) without a shebang line
rg -uu --files-without-match "^#\!"
# Find all matches of foo and print the following 5 lines
rg foo -A 5
# Print statistics of matches (# of matched lines and files )
rg --stats PATTERN

Note that as with find/fd, it is important that you know that these problems can be quickly solved using one of these tools, while the specific tools you use are not as important.

Finding shell commands

So far we have seen how to find files and code, but as you start spending more time in the shell, you may want to find specific commands you typed at some point. The first thing to know is that typing the up arrow will give you back your last command, and if you keep pressing it you will slowly go through your shell history. The history command will let you access your shell history programmatically. It will print your shell history to the standard output. If we want to search there we can pipe that output to grep and search for patterns. history | grep find will print commands that contain the substring "find".

In most shells, you can make use of Ctrl+R to perform backwards search through your history. After pressing Ctrl+R, you can type a substring you want to match for commands in your history. As you keep pressing it, you will cycle through the matches in your history. This can also be enabled with the UP/DOWN arrows in zsh. A nice addition on top of Ctrl+R comes with using fzf bindings. fzf is a general-purpose fuzzy finder that can be used with many commands. Here it is used to fuzzily match through your history and present results in a convenient and visually pleasing manner.

Another cool history-related trick I really enjoy is history-based autosuggestions. First introduced by the fish shell, this feature dynamically autocompletes your current shell command with the most recent command that you typed that shares a common prefix with it. It can be enabled in zsh and it is a great quality of life trick for your shell.

You can modify your shell's history behavior, like preventing commands with a leading space from being included. This comes in handy when you are typing commands with passwords or other bits of sensitive information. To do this, add HISTCONTROL=ignorespace to your .bashrc or setopt HIST_IGNORE_SPACE to your .zshrc. If you make the mistake of not adding the leading space, you can always manually remove the entry by editing your .bash_history or .zsh_history.

So far, we have assumed that you are already where you need to be to perform these actions. But how do you go about quickly navigating directories? There are many simple ways that you could do this, such as writing shell aliases or creating symlinks with ln -s, but the truth is that developers have figured out quite clever and sophisticated solutions by now.

As with the theme of this course, you often want to optimize for the common case. Finding frequent and/or recent files and directories can be done through tools like fasd and autojump. Fasd ranks files and directories by frecency, that is, by both frequency and recency. By default, fasd adds a z command that you can use to quickly cd using a substring of a frecent directory. For example, if you often go to /home/user/files/cool_project you can simply use z cool to jump there. Using autojump, this same change of directory could be accomplished using j cool.

More complex tools exist to quickly get an overview of a directory structure: tree, broot or even full fledged file managers like nnn, ranger, and midnight commander mc.

Data Wrangling

Have you ever wanted to take data in one format and turn it into a different format? Of course you have! That, in very general terms, is what this lecture is all about. Specifically, massaging data, whether in text or binary format, until you end up with exactly what you wanted.

We've already seen some basic data wrangling in past lectures. Pretty much any time you use the | operator, you are performing some kind of data wrangling. Consider a command like journalctl | grep -i intel. It finds all system log entries that mention Intel (case insensitive). You may not think of it as wrangling data, but it is going from one format (your entire system log) to a format that is more useful to you (just the intel log entries). Most data wrangling is about knowing what tools you have at your disposal, and how to combine them.

Regular Expression Refresher

Many text tools utilize regular expressions to specify patterns of interest. Hopefully this is not your first time encountering "regexes", as they often come up in beginning CS courses either as a theoretical topic, a practical tool, or both (UMD's CMSC330 covers regex theory and practice).

CAUTION: Regular expression syntax varies between tools. The refresher below emphasizes the most common syntax used by UNIX tools, but some tools may not honor all constructs. This syntax is referred to as POSIX-compatible regular expressions, which come in a basic and an extended flavor. A good example is grep, which supports the basic syntax by default and the extended set when run via grep -E. A quick example to show the difference:

# sample file contents
$ cat sample_data.txt
A line ending with foo
Another line with foo but not at the end
A bar line (with parentheses)
A foo line ending in bar
An unbalanced line ending with )
foo and bar both appear on this line

$ grep 'foo$' sample_data.txt
A line ending with foo

# normal alternation/grouping not supported in standard grep
$ grep '(foo|bar)$' sample_data.txt

# extended regex syntax does support grouping/alternation
$ grep -E '(foo|bar)$' sample_data.txt
A line ending with foo
A foo line ending in bar

# standard grep has a slightly different syntax
$ grep '\(foo\|bar\)$' sample_data.txt
A line ending with foo
A foo line ending in bar

Tools like grep, awk, sed, vim, emacs, etc. all support slightly different regular expressions with the most common features described below.

Basic Characters

Most letters and numbers in a regex pattern match themselves exactly. For example, the pattern cat matches the word "cat".

Special Characters

.: Matches any single character except newline
^: Matches the start of a line
$: Matches the end of a line
*: Matches zero or more occurrences of the previous character
+: Matches one or more occurrences of the previous character (extended syntax, e.g. grep -E)
?: Makes the previous character optional (extended syntax, e.g. grep -E)
\: Escapes special characters (use \. to match an actual period)

Character Classes

Square brackets let us define sets of characters to match:

[abc]: Matches any single character from the set (a, b, or c)
[^abc]: Matches any character except those in the set
[a-z]: Matches any lowercase letter
[A-Z]: Matches any uppercase letter
[0-9]: Matches any digit

Common Shortcuts

Some frequently-used patterns have their own shortcuts. Note that shortcuts common in other regex dialects, like \d, are not part of POSIX regular expressions; the named character classes below are used instead.

[:digit:]: Matches any digit (equivalent to [0-9])
[:alnum:]: Matches alphanumeric characters (equivalent to [A-Za-z0-9])
[:space:]: Matches any whitespace (equivalent to [ \t\r\n\v\f])
Each class name must itself be placed inside a bracket expression (e.g. [[:digit:]]).

Common Quantifiers

When you need to specify how many times something should match:

{n}: Exactly n times
{n,}: At least n times
{n,m}: Between n and m times
*: Zero or more times (equivalent to {0,})
+: One or more times (equivalent to {1,})
?: Zero or one time (equivalent to {0,1})

Examples

^[A-Z][a-z]+$
- Matches capitalized words like "Hello", "John", "America"
[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}
- Matches phone numbers in format "123-456-7890"
[[:alpha:]]+@[[:alnum:]]+\.[[:alpha:]]{2,3}
- Matches simple email addresses like "user@domain.com"
^[[:space:]]*$
- Matches lines containing only whitespace characters
^(19|20)[[:digit:]]{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
- Matches dates in YYYY-MM-DD format between 1900-2099
^#[[:xdigit:]]{6}$
- Matches hex color codes like "#FF0000", "#123ABC"
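
A quick way to sanity-check patterns like these is to feed candidate strings to grep -E and look at the exit status. The sketch below does this for three of the patterns; the helper name matches is invented here, and the phone pattern has been given ^ and $ anchors (an addition to the version above) so that it must match the whole string.

```shell
#!/bin/bash
# Hypothetical helper: succeed if string $2 matches extended regex $1
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

phone='^[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}$'
date_re='^(19|20)[[:digit:]]{2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
hex='^#[[:xdigit:]]{6}$'

matches "$phone" "123-456-7890"  && echo "phone number accepted"
matches "$date_re" "2026-02-13"  && echo "date accepted"
matches "$date_re" "2026-13-01"  || echo "month 13 rejected"
matches "$hex" "#FF0000"         && echo "hex color accepted"
matches "$hex" "#GG0000"         || echo "non-hex digits rejected"
```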

sed

The stream editor, commonly known as sed, is one of the most powerful text processing tools available in Unix-like operating systems. Created by Lee E. McMahon at Bell Labs in 1974, sed was designed as a successor to the revolutionary ed editor, bringing automation to text editing tasks. Sed is a text transformation pipeline - it reads input line by line, applies specified operations, and outputs the result to the output stream. It can either take input from the standard input stream or a file.

It has two buffers, called the pattern space and the hold space. The pattern space is a temporary buffer, the scratchpad where the current line is stored. The hold space is for longer-term storage across lines. Each command can also be given an address or address range that specifies which lines it should operate on.
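
To make the two buffers concrete, here is a classic sed one-liner that reverses the order of lines by accumulating them in the hold space. It uses the G (append hold space to pattern space) and h (copy pattern space to hold space) commands, which are not covered elsewhere in these notes, so treat it as an illustrative sketch rather than something you need to memorize.

```shell
#!/bin/bash
# 1!G : on every line except the first, append the hold space to the pattern space
# h   : copy the (growing) pattern space back into the hold space
# $p  : on the last line, print the accumulated result
reversed=$(printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p')
echo "$reversed"
```

After the last line is read, the pattern space holds every line in reverse order, so the output is three, two, one on separate lines.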

Basic `sed` Command structure

The syntax of a sed command is:

sed [options] SCRIPT INPUT

Options allow you to modify the default behavior of sed. Common options include:

-e: Allows multiple commands chained together
-i: Edit files in-place (rather than just outputting the changes to the output stream)
-n: Suppress automatic printing of pattern space
-f: Read commands from a file

A sed script consists of one or more commands to execute on the input. It can be given "on the fly" (as a string on the command line) or stored in a file, conventionally with a .sed extension. A sed command follows this syntax:

[addr]X[options]

Where addr is optional and can be a single line number, a regular expression, or a range of lines. When addr is specified, the command will only be executed on matched lines. X will be a single letter sed command. Additional options can be specified for some sed commands. Commands within a script can be separated using a ; or newlines.
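
The three address forms can be compared side by side using the p command with -n, as in the sketch below. The sample text reuses lines from sample_data.txt; the variable names are illustrative.

```shell
#!/bin/bash
sample=$(printf '%s\n' \
    'A line ending with foo' \
    'Another line with foo but not at the end' \
    'A bar line (with parentheses)' \
    'A foo line ending in bar')

by_number=$(printf '%s\n' "$sample" | sed -n '2p')        # single line number
by_regex=$(printf '%s\n' "$sample" | sed -n '/bar$/p')    # lines matching a regex
by_range=$(printf '%s\n' "$sample" | sed -n '2,3p')       # inclusive line range
chained=$(printf '%s\n' "$sample" | sed -n '1d; /foo/p')  # two commands separated by ;

echo "$by_number"
echo "$by_regex"
echo "$by_range"
echo "$chained"
```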

Substitution Command (s)

The substitution command is the most frequently used sed command. Its basic syntax is:

sed 's/pattern/replacement/flags'

The s command will attempt to match what you have in the pattern space (the current line) against your supplied regular expression (pattern). If there is a match, then that portion of your pattern space is replaced with replacement. By default, this will only operate on the first match. You can use flags to modify the behavior of the s command as well.

Flags:

g: Global replacement (replace all occurrences)
p: Print the modified line
I: Case-insensitive matching
w file: Write the result to a file

Example:

# Replace first occurrence of 'dog' with 'cat'
sed 's/dog/cat/' file.txt

# Replace all occurrences of 'dog' with 'cat'
sed 's/dog/cat/g' file.txt
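
The flags can be combined, and the p flag is most useful together with the -n option. A small sketch on made-up input:

```shell
#!/bin/bash
sample=$(printf 'dog and dog\nno pets here\nDog days\n')

# -n silences the default output, so only lines changed by s///p are printed
changed=$(printf '%s\n' "$sample" | sed -n 's/dog/cat/gp')
echo "$changed"

# Without g, only the first match on each line is replaced
first_only=$(printf '%s\n' "$sample" | sed 's/dog/cat/')
echo "$first_only"
```

Matching is case-sensitive by default, so the line starting with "Dog" is left alone in both commands.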

Here are a few more examples that illustrate uses of sed. Note that by default sed prints the altered content to the screen but options and I/O redirection can adjust this.

$ cat sample_data.txt
A line ending with foo
Another line with foo but not at the end
A bar line (with parentheses)
A foo line ending in bar
An unbalanced line ending with )
foo and bar both appear on this line

# transform foo to FUBAR
$ sed 's/foo/FUBAR/g' sample_data.txt
A line ending with FUBAR
Another line with FUBAR but not at the end
A bar line (with parentheses)
A FUBAR line ending in bar
An unbalanced line ending with )
FUBAR and bar both appear on this line

# transform either of foo or bar to FUBAR, note use of -E for
# extended regular expressions similar to grep
$ sed -E 's/(foo|bar)/FUBAR/g' sample_data.txt
A line ending with FUBAR
Another line with FUBAR but not at the end
A FUBAR line (with parentheses)
A FUBAR line ending in FUBAR
An unbalanced line ending with )
FUBAR and FUBAR both appear on this line

# put double quotes "" around instances of foo or bar using
# the grouping and 1st group match in the substitution
$ sed -E 's/(foo|bar)/"\1"/g' sample_data.txt
A line ending with "foo"
Another line with "foo" but not at the end
A "bar" line (with parentheses)
A "foo" line ending in "bar"
An unbalanced line ending with )
"foo" and "bar" both appear on this line

Delete Command (d)

The delete command will delete the current pattern space and move on to the next line.

# Delete lines containing 'pattern'
sed '/pattern/d' file.txt

# Delete lines 3 through 5
sed '3,5d' file.txt

Print Command (p)

Print out the pattern space to the standard output, usually in conjunction with the -n command-line option.

# Print lines containing 'pattern'
sed -n '/pattern/p' file.txt

Append (a) and Insert (i) Commands

Add text after or before matching lines:

# Append 'New Line' after each line containing 'pattern'
sed '/pattern/a\New Line' file.txt

# Insert 'New Line' before each line containing 'pattern'
sed '/pattern/i\New Line' file.txt

Sequencing Commands

Several sed commands can be sequenced by separating them with semicolons to produce more complex effects. The example below combines 3 substitution commands.

Replace foo with FOO
Replace bar with BAR
For lines 2 to 5 only, wrap each FOO or BAR in double quotes.

The three commands are applied in sequence to each line before it is printed.

$ cat sample_data.txt
A line ending with foo
Another line with foo but not at the end
A bar line (with parentheses)
A foo line ending in bar
An unbalanced line ending with )
foo and bar both appear on this line

# 3 sed commands sequenced
$ sed -E 's/foo/FOO/g; s/bar/BAR/g; 2,5 s/FOO|BAR/"\0"/g' sample_data.txt
A line ending with FOO
Another line with "FOO" but not at the end
A "BAR" line (with parentheses)
A "FOO" line ending in "BAR"
An unbalanced line ending with )
FOO and BAR both appear on this line

Practical Examples

# Replace all instances of 'apple' with 'orange' in a file
sed 's/apple/orange/g' input.txt > output.txt

# ----

# Comment out all lines containing a specific string
sed '/password/s/^/#/' config.txt > config_safe.txt

# Common use: Quickly commenting out sensitive configuration lines
# before sharing configs or temporarily disabling features

# ----

# Delete all lines containing "DEBUG" from a log file
sed '/DEBUG/d' app.log > production.log

# Common use: Filtering out verbose debug messages when you only need
# to see warnings and errors

In Place Modification of Files

sed prints its output to stdout (the screen) by default. A common desire is to actually modify a file using sed rather than output to the screen. Running via sed -i ... or sed --in-place will make changes to files. Both of these will overwrite the existing file but can also create a backup of the original (highly recommended).

$ cat sample_data.txt
A line ending with foo
Another line with foo but not at the end
A bar line (with parentheses)
A foo line ending in bar
An unbalanced line ending with )
foo and bar both appear on this line

# transform the data in place; no output as the file changes
$ sed --in-place=.bk 's/foo/FOO/g; s/bar/BAR/g;' sample_data.txt

# show contents of file have changed
$ cat sample_data.txt
A line ending with FOO
Another line with FOO but not at the end
A BAR line (with parentheses)
A FOO line ending in BAR
An unbalanced line ending with )
FOO and BAR both appear on this line

# show the backup file created with the specified .bk suffix
$ cat sample_data.txt.bk
A line ending with foo
Another line with foo but not at the end
A bar line (with parentheses)
A foo line ending in bar
An unbalanced line ending with )
foo and bar both appear on this line

Limitations of `sed`

sed is a powerful tool but is best used in small doses for things like

printing a specified range of lines in file via sed -n '25,70p' file.txt
substituting one string for another via s/original/transformed/g

If more than 3 or 4 operations are needed, or if conditional structure is required, it's best to consider a more structured alternative like AWK.

AWK

AWK is a specialized programming language designed for processing text data. Created by Aho, Weinberger, and Kernighan (whose initials form its name), AWK shines when working with data organized in rows and columns, such as CSV files, log files, or any structured text data.

AWK Syntax

At its core, AWK operates on records and fields, where records are lines by default and fields are "words" (whitespace-separated chunks by default). An awk command follows this format:

awk options 'pattern {actions}' input

AWK generally follows the cycle of "when you see this pattern, perform this action." If you omit the pattern, the action applies to every line. If you omit the action, AWK will print the matching lines. AWK will automatically split each record into fields, which you can reference using dollar signs.

$0 refers to the entire line
$1 refers to the first field
$2 refers to the second field

Options

There are various options you can use with AWK, below are some common ones, but you can find more here.

-F fs or --field-separator fs: sets the field separator to fs
-f source-file or --file source-file: read the awk program source from source-file instead of in the first nonoption argument
-v var=val or --assign var=val: assigns value val to the variable var

Pattern and Actions

Patterns

A regular expression enclosed in slashes (‘/’) is an awk pattern that matches every input record whose text belongs to that set.

Patterns can take any of the following forms:

BEGIN
- Executed before any of the input is read
END
- Executed after all the input is read
/regular expression/
- Standard regular expression to match
relational expressions
- <, >, <=, >=, !=, ==
pattern && pattern
pattern || pattern
pattern ? pattern : pattern
(pattern)
! pattern
pattern1, pattern2
- Range expression

Actions

Actions are enclosed in {braces} and contain all the standard assignment, conditional, control flow, etc. that exist in most languages. We won't go into the various actions here, but you can read more here.
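
Several of the pattern forms listed above can be seen working together in one small program. The sketch below runs on made-up input (names, numbers, and the START/STOP markers are all invented for the demo).

```shell
#!/bin/bash
data=$(printf 'alice 30\nbob 45\nSTART\ncarol 50\nSTOP\ndave 20\n')

report=$(printf '%s\n' "$data" | awk '
    BEGIN            { print "begin" }            # runs before any input is read
    /^bob/           { print "regex:", $1 }       # regular expression pattern
    NF == 2 && $2 > 40 { print "relational:", $1 }  # two patterns joined with &&
    /START/,/STOP/   { print "range:", $0 }       # range pattern, inclusive
    END              { print "records:", NR }     # runs after all input is read
')
echo "$report"
```

Every rule whose pattern matches is applied to a record, in the order the rules are written, so a single line can trigger more than one action.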

Input

AWK can either take input from a file provided or from the standard input stream.

Built-In Variables

AWK provides several built-in variables that make text processing easier:

NR    # Keeps track of the current line number
NF    # Tells you how many fields are in the current line
FS    # The field separator (default is whitespace)
OFS   # The output field separator (what to put between fields when printing)
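
A short sketch showing all four variables at once, using colon-separated input invented for the demo. $NF is the last field, since NF holds the number of fields.

```shell
#!/bin/bash
# -F sets FS (input separator); -v OFS=... sets the output separator
out=$(printf 'a:b:c\nd:e\n' | awk -F: -v OFS=' | ' '{ print NR, NF, $1, $NF }')
echo "$out"
```

Each output line shows the record number, the field count, the first field, and the last field, joined by the chosen output separator.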

Real-World Examples Explained

Let's look at some common text processing tasks and how AWK handles them:

Calculating Column Sums

# Sum all values in the third column
awk '{ sum += $3 } END { print sum }' data.txt

This script does something simple but powerful. For each line, it adds the value from the third column ($3) to a running total (sum). The END block runs after all lines are processed, printing the final total. Note that variables in AWK do not need to be declared before use.

Processing CSV Data

# Print records where the amount exceeds 1000
awk -F',' '$3 > 1000 { print $1, $3 }' sales.csv

Here we're doing several things at once:

-F',' tells AWK to use commas as field separators instead of spaces
$3 > 1000 is our pattern, matching lines where the third field exceeds 1000
print $1, $3 prints the first and third fields of matching lines

This is useful for tasks like finding high-value transactions in sales data or filtering large CSV files.

Log File Analysis

# Count occurrences of HTTP status codes
awk '{ codes[$9]++ }
     END { for(code in codes) print code, codes[code] }' access.log

This script demonstrates AWK's ability to handle associative arrays:

codes[$9]++ uses the ninth field (typically the HTTP status code in access logs) as an array index and counts occurrences
The END block loops through all unique codes and prints their counts

This is invaluable for analyzing web server logs or any data where you need to count occurrences of values.
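
Here is the same counting idea run on three made-up log lines, assuming an access-log layout in which the status code is the ninth whitespace-separated field. The order in which for (code in codes) visits keys is unspecified, so the output is piped through sort.

```shell
#!/bin/bash
# Invented sample lines in a common access-log style
log=$(printf '%s\n' \
    '1.2.3.4 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 512' \
    '1.2.3.5 - - [01/Jan/2026:00:00:02 +0000] "GET /missing HTTP/1.1" 404 128' \
    '1.2.3.4 - - [01/Jan/2026:00:00:03 +0000] "GET /about HTTP/1.1" 200 900')

counts=$(printf '%s\n' "$log" \
    | awk '{ codes[$9]++ } END { for (code in codes) print code, codes[code] }' \
    | sort)
echo "$counts"
```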

Data Transformation

# Convert space-separated data to CSV format
awk '{ name=$1; salary=$2; print name "," salary }' employees.txt

This script shows how AWK can transform data formats:

It takes the first two fields from each line
Stores them in variables for clarity
Prints them with a comma between them

This is useful when you need to convert data between different formats or extract specific fields from a larger dataset.

Control Structures for Complex Processing

AWK supports familiar programming constructs for more complex tasks:

# Example of control structures (these statements go inside an action's braces)
if ($3 > 1000) {
    total += $3
}

for (i=1; i<=NF; i++) {
    sum += $i
}

while (getline > 0) {
    count++
}
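
Placed inside an action, those constructs form a complete program. The sketch below sums every field of each record with a for loop and uses an if to accumulate large third-column values; the input numbers and variable names are invented for the demo.

```shell
#!/bin/bash
totals=$(printf '1 2 3\n10 20 3000\n' | awk '
{
    row = 0
    for (i = 1; i <= NF; i++) {   # loop over every field in the record
        row += $i
    }
    if ($3 > 1000) {              # conditional on the third field
        big += $3
    }
    print "row", NR, "sum", row
}
END { print "big total", big }')
echo "$totals"
```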

xargs

xargs is a utility that bridges the gap between commands that produce output and commands that expect arguments. It's particularly useful when you need to process many files or handle command line limitations.

Syntax

The basic xargs syntax is:

xargs [OPTIONS] [COMMAND]

Basic Usage Explained

Here's how xargs can help with common tasks:

Finding and Processing Files

# Find and remove all temporary files
find . -name "*.tmp" | xargs rm

This command pipeline:

find locates all files ending in .tmp
xargs takes those filenames and passes them as arguments to rm

This is more efficient than find's -exec rm {} \; form, which starts a separate rm process for every file.

Parallel Processing

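# Compress files in parallel
find . -type f | xargs -n 1 -P 4 gzip

The -P 4 option tells xargs to run up to 4 gzip processes simultaneously, and -n 1 hands each process a single file so the work is actually spread across them (without -n, xargs may pack every filename into one gzip invocation). This can significantly speed up processing on multi-core systems.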

Handling Special Characters

# Safely handle filenames with spaces
find . -type f -print0 | xargs -0 file

The -print0 and -0 options work together to handle filenames containing spaces, newlines, or other special characters safely.
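
The difference is easy to see with a file whose name contains a space. The sketch below works in a scratch directory with invented file names and deletes only the .tmp files.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.tmp" "$demo_dir/two words.tmp" "$demo_dir/keep.txt"

# NUL-delimited names survive the pipe intact, spaces and all
find "$demo_dir" -name '*.tmp' -print0 | xargs -0 rm

remaining=$(ls "$demo_dir")
echo "$remaining"
```

Without -print0 and -0, xargs would split "two words.tmp" into two separate arguments and rm would fail to find either of them.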

Practical Applications

Let's look at some real-world uses:

Batch Processing

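# Resize images in place, three files per invocation
# (mogrify edits each file in place; convert would treat the last name as the output file)
ls *.jpg | xargs -n 3 mogrify -resize 800x600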

This processes files in batches of three, which can be useful when dealing with memory-intensive operations.

Custom Commands

# Rename log files with .old extension
find . -name "*.log" | xargs -I {} mv {} {}.old

The -I {} option lets you specify where in the command to place each input item, giving you more flexibility in how you use the input.

`xargs` versus Shell Loops

There is overlap between what xargs and shell for loops can accomplish. Take the previous examples of xargs which can be done with shell loops as well.

# xargs: resize images in place, three files per mogrify invocation
ls *.jpg | xargs -n 3 mogrify -resize 800x600

# loop: resize images in place, one file per invocation
for f in *.jpg; do
  mogrify -resize 800x600 "$f"
done

# xargs: Rename log files with .old extension
find . -name "*.log" | xargs -I {} mv {} {}.old

# loop: Rename log files with .old extension
# (word splitting on $(find ...) breaks on names containing whitespace)
for f in $(find . -name "*.log"); do
    mv "$f" "$f.old"
done

When confronted with a choice between these, consider these trade-offs

xargs allows for built-in parallel execution via xargs -P, which specifies how many processes run at once. Shell for loops have no built-in equivalent.
for loops can include multiple statements, nested conditionals, variable substitutions, and other features. xargs has a much harder time with these types of activities.

Overall, xargs is best suited for a single command that is to be run on multiple inputs with limited conditional or naming tasks involved, while shell loops are preferable when the task has these complexities.

Debugging and Profiling

The Diagnostic Mindset

Software will execute exactly as instructed, regardless of the programmer's intentions. Debugging bridges the gap between intended and actual behavior, and while this process can be time-intensive, there are effective techniques for identifying and resolving issues in buggy or resource-intensive code. Debugging is often seen as a reactive process that slows down development, but systematic debugging practices shorten development cycles and reduce the time spent tracking down issues.

A Systematic Approach (Observe, Hypothesize, Test)

Effective debugging follows a methodical process rather than random trial and error. The systematic approach consists of five key steps:

Reproduce consistently - If you can't reliably reproduce the bug, you can't verify your fix. Document the exact steps, inputs, and environment that trigger the issue.
Isolate the problem - Use binary search through your code. Comment out half the functionality, see if the bug persists, and narrow down the problematic section.
Form a hypothesis - Based on symptoms and isolated code, develop a theory about what's wrong. "I think the null pointer occurs because the API returns empty data."
Test the hypothesis - Add logging, use a debugger, or write a minimal test case to prove or disprove your theory. Make one change at a time.
Fix and verify - Once confirmed, implement the fix and verify it resolves the issue without breaking other functionality. Run related tests.

Example: Suppose your web app crashes when processing user uploads. You reproduce it (step 1), isolate it to the file parsing function (step 2), hypothesize that large files cause memory issues (step 3), confirm by testing with various file sizes (step 4), and implement streaming processing instead of loading entire files into memory (step 5).
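The isolation step (step 2) can be sketched in code. This is a minimal illustration, not a real tool: the list of changes and the is_broken check are hypothetical, and the sketch assumes that once the bug is introduced it stays present in every later state.

```python
def first_bad(changes, is_broken):
    """Binary-search for the first change after which is_broken() is True.

    Assumes at least one change introduces the bug and that the bug,
    once introduced, stays present in all later states.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[: mid + 1]):
            hi = mid          # bug already present: look earlier
        else:
            lo = mid + 1      # still fine: bug was introduced later
    return changes[lo]


# Hypothetical history in which "add cache" introduced the bug
history = ["init", "add parser", "add cache", "add logging", "refactor"]
culprit = first_bad(history, lambda applied: "add cache" in applied)
print(culprit)
```

Each check halves the remaining search space, so even a long list of candidate changes needs only a handful of test runs.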

Debugging Fundamentals

Print Debugging & Logging

As Brian Kernighan noted in "Unix for Beginners" (1979), "The most effective debugging tool is still careful thought, coupled with judiciously placed print statements." The simplest approach to debugging involves adding print statements near suspected problem areas and iterating until sufficient information is gathered to identify the root cause. This method's simplicity and ease of implementation make it a preferred choice for many software engineers.

Print debugging can be enhanced by implementing proper logging instead of simple print statements. Logging systems offer several advantages over basic printing: they can output to multiple destinations including files, sockets, and remote servers, making log review more convenient than scanning terminal output. They also support severity levels (INFO, DEBUG, WARN, ERROR) for filtered output, and they establish a logging infrastructure that can serve both immediate debugging needs and long-term monitoring requirements.

If used correctly, logging can significantly increase development velocity. Below are a few tips to help make logs more useful:

Set log levels properly so that you can filter out unnecessary messages (e.g., diagnostic messages) and narrow down to the actual issue.
Many libraries support structured logging, a method of organizing log data into a structured format, making it easier to analyze and interpret. Instead of recording raw text, structured logging uses key-value pairs which provide context and additional information about the logged event. The structured nature makes it easier to search, filter and analyze logs.
You want to always be able to find the source code for any given log entry. This means using unique messages, prefixes, etc. which will help you trace a code path using log messages. This is more difficult if multiple places create the same log entry.
Use a log viewer to make it easier to process and view logs.
It can be useful (especially in a web-development context) to use a correlation id to track a request throughout the entire transaction.
Logging should not compete with your software for resources. Use log levels strategically, and prefer accumulated metrics over textual logs where appropriate.
Never log sensitive information.
Only log what you need, and expect that finding out what you actually need is an iterative process.
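The structured-logging and correlation-id tips above can be sketched with the standard logging module alone. This is one possible approach rather than the recommended one: the JsonFormatter class, the "orders" logger name, and the correlation_id field are all names made up for this illustration.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # correlation_id is our own (hypothetical) field, passed via `extra`
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")   # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False               # avoid duplicate output via the root logger


def handle_request(order_id):
    # One id per request, attached to every log entry for that request
    correlation_id = str(uuid.uuid4())
    context = {"correlation_id": correlation_id}
    logger.info("order received id=%s", order_id, extra=context)
    logger.info("order stored id=%s", order_id, extra=context)
    return correlation_id


handle_request(42)
```

Because every entry is a JSON object with the same keys, all entries for one request can be found by filtering on its correlation_id.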

Here's a practical example demonstrating different log levels:

import logging

# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

logger = logging.getLogger(__name__)

def process_user_data(user_id, data):
    logger.debug(f"Starting to process data for user {user_id}")

    if not data:
        logger.warning(f"User {user_id} submitted empty data")
        return None

    try:
        # Simulate processing
        result = {"user_id": user_id, "processed": len(data)}
        logger.info(f"Successfully processed {len(data)} items for user {user_id}")
        return result
    except Exception as e:
        logger.error(f"Failed to process data for user {user_id}: {e}")
        raise

# Run with different scenarios
process_user_data(123, ["item1", "item2", "item3"])
process_user_data(456, [])

Run this script to see all log levels:

$ python logging_demo.py

You can change the logging level to logging.INFO to filter out DEBUG messages:

# Edit the script to change level=logging.DEBUG to level=logging.INFO
# Then run again to see only INFO, WARNING, and ERROR messages
$ python logging_demo.py

Or set to logging.WARNING to see only warnings and errors. This makes it easy to control verbosity in development vs. production environments.
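Rather than editing the script each time, the level can be read from the environment. The sketch below assumes a variable named LOG_LEVEL, which is a name chosen for this example and not something the logging module looks for on its own.

```python
import logging
import os


def configure_logging(default="WARNING"):
    # LOG_LEVEL is a hypothetical variable name chosen for this sketch
    level_name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.WARNING        # fall back on unknown level names
    # force=True (Python 3.8+) replaces any previously installed handlers
    logging.basicConfig(level=level, force=True)
    return level


configure_logging()
```

A script that calls this helper at startup could then be run verbosely with LOG_LEVEL=DEBUG set in a development shell, and left at the default in production.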

External Logging

When working with external dependencies like web servers, databases, or containerized services, you'll often need to check their logs since client-side error messages may not provide enough detail. Most programs write logs to /var/log on UNIX systems, and modern containerized applications expose logs through commands like docker logs <container-name>.

Interactive Debuggers

When print debugging is not enough, you should use a debugger. Debuggers are programs that let you interact with the execution of a program, allowing you to:

Halt execution of the program when it reaches a certain line.
Step through the program one instruction at a time.
Inspect values of variables after the program has crashed.
Conditionally halt the execution when a given condition is met.
And many more advanced features.

Many programming languages come with some form of debugger.

Core Concepts (Stepping, State Inspection, Call Stack)

Debuggers allow you to pause program execution at breakpoints, then step through code line by line. Key operations include "Step Over" (execute the current line, running any function calls it contains to completion without stepping into them), "Step Into" (enter a function to debug its internals), and "Step Out" (continue until the current function returns). While paused, you can inspect the call stack (the sequence of function calls that led to the current point), examine and modify variable values, and set watch expressions to monitor specific values as the program executes.
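To make the call stack concrete, the small sketch below prints the same chain of callers that a debugger would show while paused inside the innermost function. The function names are made up for illustration.

```python
import traceback


def innermost():
    # extract_stack() returns the frames leading to this call, oldest first
    frames = traceback.extract_stack()
    return [frame.name for frame in frames]


def middle():
    return innermost()


def outer():
    return middle()


stack = outer()
print(" -> ".join(stack[-3:]))   # prints "outer -> middle -> innermost"
```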

Advanced Breakpoints (Conditional, Hit Count, Logpoints)

Conditional breakpoints extend the basic breakpoint concept by adding programmable conditions. Instead of stopping every time a particular line is reached, the debugger only halts execution when the specified condition is true. For example, a conditional breakpoint might only trigger when a variable reaches a certain value or when a specific error condition occurs. This capability is especially valuable when debugging issues that only manifest under specific circumstances or when dealing with code that executes frequently but only occasionally exhibits problematic behavior.

Hit count breakpoints pause execution only after a line has been hit a certain number of times. This is useful when debugging loops - for example, breaking only on the 100th iteration when you suspect an issue occurs after many iterations.

Logpoints (also called tracepoints) allow you to log messages to the console without stopping execution and without modifying your source code. Instead of adding print statements and rerunning your program, you can inject logging at any point during a debugging session.

Hands-On: Using Python's pdb

Python's built-in debugger pdb provides an interactive debugging experience. Here's a buggy program and how to debug it:

# buggy_calculator.py
def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

def process_data(data):
    results = []
    for item in data:
        avg = calculate_average(item['values'])
        results.append({'name': item['name'], 'average': avg})
    return results

if __name__ == '__main__':
    data = [
        {'name': 'dataset1', 'values': [1, 2, 3, 4, 5]},
        {'name': 'dataset2', 'values': []},  # This will cause an error!
        {'name': 'dataset3', 'values': [10, 20, 30]}
    ]
    print(process_data(data))

To debug with pdb, you can either:

Run with python -m pdb buggy_calculator.py
Add import pdb; pdb.set_trace() where you want to break (on Python 3.7+, the built-in breakpoint() does the same by default)

def calculate_average(numbers):
    import pdb; pdb.set_trace()  # Execution will pause here
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

Common pdb commands:

l (list) - Show current code context
n (next) - Execute current line (step over)
s (step) - Step into function calls
c (continue) - Continue execution until next breakpoint
p variable_name - Print variable value
pp variable_name - Pretty-print variable value
w (where) - Show call stack
b line_number - Set breakpoint at line number
b function_name - Set breakpoint at function
condition bp_number condition - Make breakpoint conditional
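Stepping into calculate_average for dataset2 and running p numbers shows an empty list, so the division by len(numbers) raises ZeroDivisionError. One possible fix is sketched below; it assumes that an empty dataset should produce None instead of an error, which the original program does not specify.

```python
def calculate_average(numbers):
    # Assumption for this sketch: an empty dataset has no average
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


def process_data(data):
    results = []
    for item in data:
        avg = calculate_average(item['values'])
        results.append({'name': item['name'], 'average': avg})
    return results


data = [
    {'name': 'dataset1', 'values': [1, 2, 3, 4, 5]},
    {'name': 'dataset2', 'values': []},
    {'name': 'dataset3', 'values': [10, 20, 30]},
]
print(process_data(data))
```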

Specialized Tools

System Call Tracers (strace, dtrace)

Even if what you are trying to debug is a black-box binary, there are tools that can help. Whenever a program needs to perform an action that only the kernel can carry out, it makes a system call (syscall), and there are commands that let you trace the syscalls a program makes. On Linux there is strace; macOS and BSD have dtrace. dtrace can be tricky to use because it uses its own D language, but there is a wrapper called dtruss that provides an interface more similar to strace (more details here).

Below are some examples of using strace or dtruss to show stat syscall traces for an execution of ls. For a deeper dive into strace, this article and this zine are good reads.

# On Linux
sudo strace -e lstat ls -l > /dev/null
# On macOS
sudo dtruss -t lstat64_extended ls -l > /dev/null

Here's a practical example debugging why a Python script is slow. First, create a simple script:

# slow_script.py
with open('/etc/hosts', 'r') as f:
    for line in f:
        pass

Now trace it with strace to see what system calls it makes:

$ strace -c python slow_script.py
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.21    0.000026          13         2           read
 23.68    0.000018           9         2           openat
 15.79    0.000012          12         1           write
  ...

The -c flag provides a summary. You can also see individual calls with -e trace=openat,read to focus on specific syscalls. This helps identify if your program is doing excessive file I/O, network calls, or other system operations.

Web Development Tools

Browser developer tools (Chrome DevTools, Firefox Developer Tools) are essential for web development debugging. Here's a practical walkthrough of common debugging scenarios:

Debugging JavaScript:

Open DevTools (F12 or Ctrl+Shift+I / Cmd+Option+I)
Go to the Sources tab
Find your JavaScript file in the file tree
Click on a line number to set a breakpoint
Refresh the page - execution will pause at your breakpoint
Use the controls to step through code, inspect variables in the Scope panel
Use the Console to evaluate expressions in the current context

Debugging Network Requests:

Open the Network tab
Perform the action that triggers the request (e.g., submit a form, click a button)
Click on the request to see:
- Headers (request/response headers)
- Preview (formatted response)
- Response (raw response)
- Timing (how long each phase took)
Right-click a request and select "Copy as cURL" to replay it from the command line
Use "Preserve log" to keep requests across page navigations

Common Use Cases:

API debugging: Check if your frontend is sending the correct data, verify response structure
Performance issues: Use the Network tab's timing column to find slow requests
Cookie/Storage issues: Application tab shows cookies, localStorage, sessionStorage
JavaScript errors: Console tab shows errors with stack traces; click to jump to the problematic line
Live editing: Modify CSS in the Elements tab or JavaScript in Sources to test fixes without redeploying

Performance Profiling

Even if your code functionally behaves as you would expect, that might not be good enough if it takes all your CPU or memory in the process. Algorithms classes often teach big O notation but not how to find hot spots in your programs. Since premature optimization is the root of all evil, you should learn about profilers and monitoring tools. They will help you understand which parts of your program are taking most of the time and/or resources so you can focus on optimizing those parts.

Key Profiling Areas

CPU Profiling (incl. Real/User/Sys Time)

Timing

As with debugging, in many scenarios it is enough to print the wall clock time your code took between two points. However, wall clock time can be misleading, since your computer might be running other processes at the same time or waiting for events to happen. It is common for tools to distinguish between Real, User and Sys time. In general, User + Sys tells you how much time your process actually spent on the CPU (more detailed explanation here).

Real - Wall clock elapsed time from start to finish of the program, including the time taken by other processes and time taken while blocked (e.g. waiting for I/O or network)
User - Amount of time spent in the CPU running user code
Sys - Amount of time spent in the CPU running kernel code

For example, try running a command that performs an HTTP request and prefixing it with time. Under a slow connection it can take over 2 seconds for the request to complete but the process itself will only take ~15ms of CPU user time and 12ms of kernel CPU time.
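The same gap between wall clock time and CPU time can be seen from inside a Python program. In this minimal sketch, time.sleep stands in for waiting on a slow network request, and measure is a helper name made up for the example.

```python
import time


def measure(fn):
    wall_start = time.perf_counter()    # wall clock ("real") time
    cpu_start = time.process_time()     # user + sys CPU time of this process
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping stands in for waiting on I/O: time passes, but little CPU is used
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```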

CPU Profilers

Most of the time when people refer to profilers they actually mean CPU profilers, which are the most common. There are two main types of CPU profilers: tracing and sampling profilers. Tracing profilers keep a record of every function call your program makes whereas sampling profilers probe your program periodically (commonly every millisecond) and record the program's stack. They use these records to present aggregate statistics of what your program spent the most time doing. Here is a good intro article if you want more detail on this topic.
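A toy tracing profiler shows the idea of recording every function call. The sketch below uses the standard sys.setprofile hook and only counts calls; real tracing profilers also record timings. The trace_calls helper and the fib function are names invented for this example.

```python
import sys
from collections import Counter


def trace_calls(fn, *args):
    """Run fn(*args) while counting every Python-level function call."""
    counts = Counter()

    def profiler(frame, event, arg):
        # "call" fires on entry to a Python function; other events are ignored
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(profiler)
    try:
        result = fn(*args)
    finally:
        sys.setprofile(None)   # always uninstall the hook
    return result, counts


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result, counts = trace_calls(fib, 10)
print(result, counts["fib"])
```

The hook runs on every call, which is why tracing profilers add noticeable overhead to call-heavy code, whereas a sampling profiler only interrupts the program periodically.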

In Python we can use the cProfile module to profile time per function call. Here is a simple example that implements a rudimentary grep in Python:

#!/usr/bin/env python

import sys, re

def grep(pattern, file):
    with open(file, 'r') as f:
        print(file)
        for i, line in enumerate(f.readlines()):
            pattern = re.compile(pattern)
            match = pattern.search(line)
            if match is not None:
                print("{}: {}".format(i, line), end="")

if __name__ == '__main__':
    times = int(sys.argv[1])
    pattern = sys.argv[2]
    for i in range(times):
        for file in sys.argv[3:]:
            grep(pattern, file)

We can profile this code using the following command. Analyzing the output we can see that IO is taking most of the time and that compiling the regex takes a fair amount of time as well.

$ python -m cProfile -s tottime grep.py 1000 '^(import|\s*def)[^,]*$' *.py

[omitted program output]

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8000    0.266    0.000    0.292    0.000 {built-in method io.open}
     8000    0.153    0.000    0.894    0.000 grep.py:5(grep)
    17000    0.101    0.000    0.101    0.000 {built-in method builtins.print}
     8000    0.100    0.000    0.129    0.000 {method 'readlines' of '_io._IOBase' objects}
    93000    0.097    0.000    0.111    0.000 re.py:286(_compile)
    93000    0.069    0.000    0.069    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    93000    0.030    0.000    0.141    0.000 re.py:231(compile)
    17000    0.019    0.000    0.029    0.000 codecs.py:318(decode)
        1    0.017    0.017    0.911    0.911 grep.py:3(<module>)

[omitted lines]

Notice that re.py:286(_compile) is called 93,000 times: re.compile runs once for every line of every file, even though the pattern never changes. Let's hoist it out of the loop:

def grep(pattern, file):
    regex = re.compile(pattern)  # Compile once, outside the loop
    with open(file, 'r') as f:
        print(file)
        for i, line in enumerate(f.readlines()):
            match = regex.search(line)  # Use the compiled regex
            if match is not None:
                print("{}: {}".format(i, line), end="")

After the optimization, profiling again shows:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8000    0.234    0.000    0.259    0.000 {built-in method io.open}
     8000    0.116    0.000    0.642    0.000 grep.py:5(grep)
    17000    0.098    0.000    0.098    0.000 {built-in method builtins.print}
    93000    0.067    0.000    0.067    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
     8000    0.055    0.000    0.070    0.000 {method 'readlines' of '_io._IOBase' objects}
        8    0.026    0.003    0.029    0.004 {built-in method _sre.compile}
     ...

The 93,000 calls to re.py's _compile no longer appear (the second profile shows only 8 calls to _sre.compile), and the cumulative time spent in grep dropped from 0.894 s to 0.642 s. This demonstrates how profiling identifies bottlenecks and validates optimizations.
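cProfile can also be driven from inside a program, which is handy when only one region of code is of interest. The following is a minimal sketch using the standard cProfile and pstats modules; search_lines and the input lines are made up for the example.

```python
import cProfile
import io
import pstats
import re


def search_lines(pattern, lines):
    regex = re.compile(pattern)        # compiled once, outside the loop
    return [line for line in lines if regex.search(line)]


lines = ["import os", "x = 1", "def f():"] * 1000   # made-up input

profiler = cProfile.Profile()
profiler.enable()                      # start collecting
matches = search_lines(r"^(import|\s*def)", lines)
profiler.disable()                     # stop collecting

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)   # top 5 entries by own time
report = buffer.getvalue()
print(report)
```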

A caveat of Python's cProfile profiler (and many profilers for that matter) is that they display time per function call. That can become unintuitive really fast, especially if you are using third party libraries in your code since internal function calls are also accounted for. A more intuitive way of displaying profiling information is to include the time taken per line of code, which is what line profilers do.

For instance, the following piece of Python code performs a request to a course website (MIT's Missing Semester) and parses the response to get all URLs in the page:

#!/usr/bin/env python
# urls.py
import requests
from bs4 import BeautifulSoup

# This is a decorator that tells line_profiler
# that we want to analyze this function
@profile
def get_urls():
    response = requests.get('https://missing.csail.mit.edu')
    s = BeautifulSoup(response.content, 'lxml')
    urls = []
    for url in s.find_all('a'):
        urls.append(url['href'])

if __name__ == '__main__':
    get_urls()

If we used Python's cProfile profiler we'd get over 2500 lines of output, and even with sorting it'd be hard to understand where the time is being spent. First install line_profiler (pip install line_profiler), then run it:

$ kernprof -l -v urls.py
Wrote profile results to urls.py.lprof
Timer unit: 1e-06 s

Total time: 0.636188 s
File: urls.py
Function: get_urls at line 5

Line #  Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 5                                           @profile
 6                                           def get_urls():
 7         1     613909.0 613909.0     96.5      response = requests.get('https://missing.csail.mit.edu')
 8         1      21559.0  21559.0      3.4      s = BeautifulSoup(response.content, 'lxml')
 9         1          2.0      2.0      0.0      urls = []
10        25        685.0     27.4      0.1      for url in s.find_all('a'):
11        24         33.0      1.4      0.0          urls.append(url['href'])

Memory Profiling

In languages like C or C++ memory leaks can cause your program to never release memory that it doesn't need anymore. To help in the process of memory debugging you can use tools like Valgrind that will help you identify memory leaks.

In garbage-collected languages like Python a memory profiler is still useful, because an object cannot be garbage collected as long as something still holds a reference to it. Here's an example program and its associated output when running it with memory-profiler (note the decorator, as with line_profiler). First, install it: pip install memory-profiler

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()

$ python -m memory_profiler example.py
Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a
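If installing a third-party package is not an option, the standard library's tracemalloc module gives a comparable, coarser view. Note that this swaps in a different tool from memory-profiler: it reports memory allocated by Python objects, not the total memory of the process. The list sizes below are arbitrary values for the sketch.

```python
import tracemalloc

tracemalloc.start()

a = [1] * (10 ** 6)
b = [2] * (2 * 10 ** 6)
del b                      # b's list is freed; `a` is still referenced

# current = bytes still allocated, peak = highest value seen since start()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

The peak stays well above the current value because b was alive for a moment, which matches the rise and fall visible in the memory-profiler output above.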

Here's a more realistic example showing a memory leak in a web application:

# memory_leak.py
cache = []  # Global cache that grows unbounded

@profile
def process_requests():
    for i in range(1000):
        data = fetch_data(i)  # Simulates fetching data
        cache.append(data)  # Memory leak: never cleaned up
        result = analyze(data)

def fetch_data(id):
    return [id] * 10000  # Simulate large data

def analyze(data):
    return sum(data)

if __name__ == '__main__':
    process_requests()

Running memory profiler reveals the leak:

$ python -m memory_profiler memory_leak.py
Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4     14.1 MB    0.0 MB    def process_requests():
     5     14.1 MB    0.0 MB        for i in range(1000):
     6     90.2 MB   76.1 MB            data = fetch_data(i)
     7     90.2 MB    0.0 MB            cache.append(data)  # Memory keeps growing!
     8     90.2 MB    0.0 MB            result = analyze(data)

The fix: implement a bounded cache or clear old entries:

from collections import deque

cache = deque(maxlen=100)  # Only keep last 100 items
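A quick check of how the bounded deque behaves; the small lists below are stand-ins for the fetched data in the example above.

```python
from collections import deque

cache = deque(maxlen=100)          # oldest entries are discarded automatically

for i in range(1000):
    cache.append([i] * 10)         # small stand-in for fetched data

# Only the most recent 100 entries (ids 900..999) remain
print(len(cache), cache[0][0], cache[-1][0])   # prints "100 900 999"
```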

Event & I/O Profiling

As was the case with strace for debugging, you might want to ignore the specifics of the code that you are running and treat it like a black box when profiling. The perf command abstracts CPU differences away and does not report time or memory; instead, it reports system events related to your programs. For example, perf can easily report poor cache locality, high numbers of page faults, or livelocks. Here is an overview of the command:

perf list - List the events that can be traced with perf
perf stat COMMAND ARG1 ARG2 - Gets counts of different events related to a process or command
perf record COMMAND ARG1 ARG2 - Records the run of a command and saves the statistical data into a file called perf.data
perf report - Formats and prints the data collected in perf.data

Visualizing Performance: Flame Graphs

Flame graphs are a visualization technique for understanding where your program spends time. They show the call stack on the Y-axis and time/sample count on the X-axis, making it easy to identify hot paths in your code.

How to read a flame graph:

Each box represents a function in the call stack
Width of the box = how much time was spent (or how many samples)
Height = call stack depth (functions calling other functions)
Colors are usually random (for visual distinction)
Look for wide boxes at the top - these are your bottlenecks
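The data behind a flame graph is just a count of identical stack samples. The sketch below folds a few made-up samples into the "frame;frame;frame count" lines that flamegraph.pl consumes, and computes a box width as the number of samples passing through a given stack prefix. The sample data and the width helper are hypothetical.

```python
from collections import Counter

# Made-up stack samples, outermost function first
samples = [
    ("main", "handle", "json.loads"),
    ("main", "handle", "json.loads"),
    ("main", "handle", "render"),
    ("main", "idle"),
]

# Fold identical stacks together: "main;handle;json.loads" -> 2
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(stack, count)


def width(prefix):
    """Box width: how many samples pass through this stack prefix."""
    return sum(1 for stack in samples if stack[: len(prefix)] == prefix)


print(width(("main", "handle")))
```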

Generating flame graphs with py-spy:

# Install py-spy
pip install py-spy

# Launch a Python program under py-spy and generate a flame graph
py-spy record -o profile.svg --duration 30 -- python your_script.py

# Or attach to a running process
py-spy record -o profile.svg --pid 12345

This creates an interactive SVG file you can open in your browser. Click on boxes to zoom in and see specific call paths.

Example interpretation: If you see a wide box labeled json.loads, it means your program is spending significant time parsing JSON. You might optimize by:

Parsing JSON once and caching the result
Using a faster JSON library like orjson
Reducing the amount of JSON data being parsed

Flame graphs work well with the perf command on Linux:

# Record performance data
perf record -F 99 -g python your_script.py

# Generate flame graph (requires flamegraph.pl from github.com/brendangregg/FlameGraph)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Resource Monitoring

Sometimes, the first step towards analyzing the performance of your program is to understand what its actual resource consumption is. Programs often run slowly when they are resource constrained, e.g. without enough memory or on a slow network connection.

Essential monitoring tools:

htop - Interactive process viewer showing CPU, memory, and process information in real-time

Press F6 to sort by different columns (CPU%, MEM%, TIME)
Press t to show process tree hierarchy
Press h for help with all keybindings
Use this when: Your program is slow and you want to see if it's using all CPU or running out of memory

# Example: Find which process is using the most CPU
htop  # then press F6 and select CPU%

du - Disk usage analyzer

du -h shows human-readable sizes
du -sh * shows size of each item in current directory
Use this when: Your disk is full and you need to find large directories

# Find largest directories in your home folder
du -h ~ | sort -h | tail -20

lsof - List open files and which processes have them open

lsof /path/to/file shows which process is using a file
lsof -i :8080 shows what process is using port 8080
Use this when: You get "file in use" errors or need to find what's using a port

# Find what's running on port 3000
lsof -i :3000

# Example output:
# COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
# node    12345 user   23u  IPv4 123456      0t0  TCP *:3000 (LISTEN)

Git Theory

Why Version Control?

Version control systems (VCSs) are tools used to track changes to source code (or other collections of files and folders). As the name implies, these tools help maintain a history of changes; furthermore, they facilitate collaboration. VCSs track changes to a folder and its contents in a series of snapshots, where each snapshot encapsulates the entire state of files/folders within a top-level directory. VCSs also maintain metadata like who created each snapshot, messages associated with each snapshot, and so on.

Why is version control useful? Even when you're working by yourself, it can let you look at old snapshots of a project, keep a log of why certain changes were made, work on parallel branches of development, and much more. When working with others, it's an invaluable tool for seeing what other people have changed, as well as resolving conflicts in concurrent development.

Modern VCSs also let you easily (and often automatically) answer questions like:

Who wrote this module?
When was this particular line of this particular file edited? By whom? Why was it edited?
Over the last 1000 revisions, when/why did a particular unit test stop working?

What is Git?

Git is a distributed version control system that tracks changes to files over time. Originally created for Linux kernel development, Git allows developers to maintain a complete history of their work. As a distributed system, each developer has a full copy of the entire repository on their local machine. This enables them to work independently before synchronizing their changes with remote repositories hosted on services like GitHub, facilitating effective team collaboration.

Aside: Git vs. GitHub

A common point of confusion is the difference between Git and GitHub. Git is the version control software, while GitHub is a website that provides cloud hosting for Git repositories as well as a front end to many Git features. It also includes some GitHub-specific features like issues, pull requests, and more. There are alternatives to GitHub like GitLab or Bitbucket, and there are also alternatives to Git like SVN or Mercurial.

The Data Model: How Git Stores Your Code

Over time, Git has emerged as the de facto standard for version control systems. However, many developers learn Git through memorizing commands without understanding its elegant underlying design. This approach often leads to confusion when things go wrong, as developers lack the theoretical foundation to reason about Git's behavior.

This text takes a different approach. Instead of starting with commands, we'll build understanding from the ground up by exploring Git's data model and theoretical foundations. When you understand these fundamentals, you'll be able to reason about Git's behavior rather than memorizing commands, solve complex version control problems with confidence, and develop mental models that translate across different Git workflows.

Blobs

The most basic unit in Git's data model is the blob (binary large object). A blob represents the contents of a file, stripped of all metadata. When you add a file to Git, its contents are stored as a blob, identified by a SHA-1 hash of its content. The same file content always produces the same blob hash, regardless of where it appears in your project or what you name the file. Because blobs are addressed by their content, identical file content that appears in multiple places in your project is stored only once. Blobs are also immutable: if you modify a file, Git creates a new blob, leaving the original untouched.
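The blob hash can be reproduced in a few lines. The sketch below assumes a repository using SHA-1, where Git hashes a short header ("blob", the content size, and a NUL byte) followed by the raw file bytes; git_blob_hash is a helper name made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The file name plays no role: only the bytes are hashed
print(git_blob_hash(b""))   # prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_hash(b"same content\n") == git_blob_hash(b"same content\n"))
```

The result for a real file can be compared against git hash-object <file>.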

Trees

While blobs store content, they don't maintain structure or metadata. Git uses trees to organize blobs into directories and provide metadata like file names. A tree object is essentially a snapshot of a directory structure, mapping names to blobs (for files) or other trees (for subdirectories).

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#2E7DAF',
    'primaryBorderColor': '#1B4B69',
    'mainBkg': '#FFFFFF',
    'secondBkg': '#F4F4F4',
    'lineColor': '#666666',
    'textColor': '#333333',
    'border1': '#CCCCCC',
    'border2': '#AAAAAA',
    'noteBkgColor': '#FFF9C4',
    'noteTextColor': '#333333',
    'noteBorderColor': '#E7C000'
  }
}}%%

graph TD
    Root[Tree: project_root] --> Dir1[Tree: src]
    Root --> Dir2[Tree: docs]
    Dir1 --> File1[Blob: main.c]
    Dir1 --> File2[Blob: helper.c]
    Dir2 --> File3[Blob: readme.md]
    
    style Root fill:#b8d4ff,stroke:#333
    style Dir1 fill:#b8d4ff,stroke:#333
    style Dir2 fill:#b8d4ff,stroke:#333
    style File1 fill:#f8f8f8,stroke:#333
    style File2 fill:#f8f8f8,stroke:#333
    style File3 fill:#f8f8f8,stroke:#333

History: Tracking Changes Over Time

Commits

A commit represents a snapshot of your project at a specific point in time. It's important to realize that, conceptually, Git does not store deltas between commits; each commit refers to a complete snapshot of the entire project. Each commit contains:

A pointer to the tree representing the project's state
Pointers to parent commit(s)
Metadata about who made the change and why (commit message)
A timestamp

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryBorderColor': '#1B4B69',
    'mainBkg': '#FFFFFF',
    'secondBkg': '#F4F4F4',
    'lineColor': '#666666',
    'textColor': '#333333',
    'border1': '#CCCCCC',
    'border2': '#AAAAAA',
    'noteTextColor': '#333333',
    'noteBorderColor': '#E7C000'
  }
}}%%
graph TD
    subgraph Commit
        M[Metadata:<br/>Author<br/>Date<br/>Message] 
        T[Tree]
        P[Parent Commit]
    end
    T --> B1[Blob: file1]
    T --> B2[Blob: file2]
    style M fill:#f9f,stroke:#333
    style T fill:#b8d4ff,stroke:#333
    style P fill:#f9f,stroke:#333

Once created, a commit cannot be changed without affecting all commits that come after it, since each commit is identified by a hash of its contents, including the parent commit hash.
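
The chaining effect described above can be sketched in a few lines of Python. This is a simplified model, not Git's actual commit serialization: the `commit_id` helper, the tree names, and the field layout are assumptions made for illustration. The point is only that the parent id is part of what gets hashed.

```python
import hashlib


def commit_id(tree: str, parents: list[str], author: str, message: str) -> str:
    """Hash everything a commit records, including its parents' ids."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", "", message]
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()


first = commit_id("tree-1", [], "John Doe", "initial commit")
second = commit_id("tree-2", [first], "John Doe", "add feature")

# Rewording the first commit changes its id...
first_reworded = commit_id("tree-1", [], "John Doe", "initial commit (reworded)")
# ...so a child built on top of it gets a different id as well,
# even though the child's own tree, author, and message are unchanged.
second_rebuilt = commit_id("tree-2", [first_reworded], "John Doe", "add feature")

print(first != first_reworded, second != second_rebuilt)
```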

The Commit Graph

As you make commits, Git builds a directed acyclic graph (DAG) of your project's history. In simpler terms, each snapshot in Git refers to a set of "parents": the snapshots that preceded it. Note that a snapshot can have multiple parents, for example when two branches of development are merged into a single commit.

gitGraph
    commit
    commit
    branch feature
    checkout feature
    commit
    checkout main
    commit
    merge feature
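
As a rough sketch, the graph above can be written as a Python dictionary that maps each commit to its parents. The commit ids here are invented labels. Walking the parent links from the merge commit reaches every commit from both lines of development.

```python
# Each commit id maps to the list of its parent ids (ids are made up).
parents = {
    "c1": [],
    "c2": ["c1"],
    "f1": ["c2"],        # commit on the feature branch
    "c3": ["c2"],        # commit on main after the branch point
    "m1": ["c3", "f1"],  # merge commit with two parents
}


def ancestors(commit: str) -> set[str]:
    """Return every commit reachable from `commit`, including itself."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("m1")))
```

Because links only ever point backwards to parents, the walk always terminates: that is the "acyclic" part of the DAG.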

References: Naming Points in History

References provide human-readable names to specific points in your commit history. Git uses two main types of references:

Branches: Mutable references that automatically point to the latest commit in a line of development. When you commit changes, the current branch reference updates to point to the new commit. Creating a branch is lightweight because Git just creates a new reference pointing to an existing commit.
Tags: Immutable references that permanently mark specific commits, typically used for releases (e.g., v1.0.0).

HEAD is a special reference that points to the commit you're currently working with, usually through a branch. For example, when you're working on the main branch, HEAD points to main, which points to a specific commit.
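
Here is a small Python sketch of that two-step lookup. The reference table and commit ids are hypothetical. The `ref: refs/heads/main` text mirrors what Git keeps in the `.git/HEAD` file when you are on the main branch.

```python
# Hypothetical reference table: names map to commit ids, and HEAD
# holds the name of a branch rather than a commit id.
refs = {
    "refs/heads/main": "0b142e9",
    "refs/tags/v1.0.0": "34903ef",
}
head = "ref: refs/heads/main"


def resolve_head(head: str, refs: dict[str, str]) -> str:
    """Follow HEAD to a commit id, through a branch if it names one."""
    if head.startswith("ref: "):
        return refs[head[len("ref: "):]]
    return head  # HEAD already holds a commit id directly


print(resolve_head(head, refs))
```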

Branches

Branches are Git's way of allowing parallel development streams within a single repository. At the conceptual level, a branch is simply a lightweight, movable pointer to a specific commit. This explains why creating a branch in Git is nearly instantaneous—Git only needs to write a small file containing the SHA-1 hash of a commit.

When you're working on a branch, Git updates the special HEAD reference to point to that branch. As you make new commits, the branch pointer automatically moves forward to your latest commit. This automatic movement is what makes branches so useful for isolating work—each branch maintains its own independent line of development.

gitGraph
    commit
    commit
    branch feature
    checkout feature
    commit
    commit
    checkout main
    commit
    merge feature
    commit

In the diagram above, we can see how the commit history forms a directed acyclic graph (DAG) when branches are involved. The main branch and feature branch diverge after the second commit, then proceed independently until they're merged back together.

When you merge branches, Git creates a "merge" commit with multiple parent commits (in this case, the latest commits from both branches). This preserves the complete history of development on both branches and records when and how they were integrated.

Understanding branches as simple pointers to commits in Git's object database helps explain many Git operations: creating a branch (adding a pointer), deleting a branch (removing a pointer), and merging branches (creating a commit with multiple parents and moving pointers).

Git's Working Spaces

Git works with two primary "spaces":

The Repository (.git directory): Where Git stores all history, metadata, and the database of all versions of your project
The Working Directory: Where you actually edit your files and create new content

Git transforms changes in your working directory into permanent history in your repository through a series of states and transitions. You will see this terminology used often when working with git commands.

File States

Files in your working directory can exist in several states:

Untracked: Files that Git doesn't yet manage. These are files in your working directory that have never been added to Git's version control.
Tracked: Files that Git is actively managing, which can be in three sub-states:
- Unmodified: Files that haven't changed since your last commit
- Modified: Files that have changed but haven't been staged
- Staged: Modified files that are marked for inclusion in your next commit
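
One way to picture these states is as a comparison between three views of a file: the last commit, the staging area, and the working directory. The sketch below assumes each view is a dictionary from path to some content id. The function name and the sample data are made up. It is also simplified: real Git can report a file as both staged and modified at once, while this function returns a single label.

```python
def file_state(path: str, head: dict, index: dict, workdir: dict) -> str:
    """Classify a path given content ids in HEAD, the index, and on disk."""
    if path not in index:
        return "untracked"
    if workdir.get(path) != index[path]:
        return "modified"    # changed on disk since it was last staged
    if index[path] != head.get(path):
        return "staged"      # staged content differs from the last commit
    return "unmodified"


head = {"a.py": "1", "b.py": "1", "c.py": "1"}
index = {"a.py": "1", "b.py": "1", "c.py": "2"}
workdir = {"a.py": "1", "b.py": "9", "c.py": "2", "new.py": "5"}

for path in sorted(workdir):
    print(path, file_state(path, head, index, workdir))
```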

The Staging Area

The staging area, also known as the "index", is an intermediate state between your working directory and repository. It represents the changes you're preparing to permanently record in your next commit. A common use case: if you want to include only some of your changes in a commit, you can stage just those changes and commit the others separately.

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#2E7DAF',
    'primaryBorderColor': '#1B4B69',
    'mainBkg': '#FFFFFF',
    'secondBkg': '#F4F4F4',
    'lineColor': '#666666',
    'textColor': '#333333',
    'border1': '#CCCCCC',
    'border2': '#AAAAAA',
    'noteBkgColor': '#FFF9C4',
    'noteTextColor': '#333333',
    'noteBorderColor': '#E7C000'
  }
}}%%

graph LR
    A[Working Directory] -->|git add| B[Staging Area]
    B -->|git commit| C[Repository]
    style A fill:#f9f9f9,stroke:#333
    style B fill:#b8d4ff,stroke:#333
    style C fill:#90EE90,stroke:#333

When you stage changes with git add:

Git creates new blob objects for the changed files
Git updates the staging area to point to these new blobs
When you commit, Git builds a tree from the staging area, and that tree becomes your new commit's root tree

Workflow

On disk, all Git stores are objects and references: that's all there is to Git's data model. All git commands map to some manipulation of the commit DAG by adding objects and adding/updating references.

When you work with Git, work progresses as:

Files in your working directory are tracked according to their state (untracked, modified, staged, or unmodified)
When you stage changes, Git:
- Creates immutable blobs from file contents
- Updates the staging area's tree structure
- Maintains all the metadata needed for the eventual commit
When you commit, Git:
- Creates a new commit object pointing to the staged tree
- Updates references (like your current branch) to point to the new commit
- Adds the commit to the repository's history graph
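
The steps above can be sketched as a toy object store in Python. Only the blob hashing matches real Git; the tree and commit layouts, the `add` and `commit` helper names, and the single `main` branch are simplifying assumptions for illustration.

```python
import hashlib

objects = {}                 # object id -> raw content (the object database)
index = {}                   # path -> blob id (the staging area)
branches = {"main": None}    # branch name -> commit id


def store(kind: str, data: bytes) -> str:
    """Save an object under the SHA-1 of its header plus content."""
    header = kind.encode() + b" " + str(len(data)).encode() + b"\0"
    oid = hashlib.sha1(header + data).hexdigest()
    objects[oid] = data
    return oid


def add(path: str, content: bytes) -> None:
    """Stage a file: create a blob and point the index entry at it."""
    index[path] = store("blob", content)


def commit(message: str) -> str:
    """Snapshot the index as a tree, wrap it in a commit, move the branch."""
    listing = "\n".join(f"{oid} {path}" for path, oid in sorted(index.items()))
    tree = store("tree", listing.encode())
    parent = branches["main"]
    parent_line = f"parent {parent}\n" if parent else ""
    body = f"tree {tree}\n{parent_line}\n{message}\n"
    branches["main"] = store("commit", body.encode())
    return branches["main"]


add("hello.py", b'print("Hello, world!")\n')
first = commit("added hello.py for testing")
print(branches["main"] == first, len(objects))
```

After one `add` and one `commit`, the store holds three objects (a blob, a tree, and a commit), and the branch points at the new commit.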

Whenever you're typing in any command, think about what manipulation the command is making to the underlying graph data structure. Conversely, if you're trying to make a particular kind of change to the commit DAG, e.g. "discard uncommitted changes and make the 'master' ref point to commit 5d83f9e", there's probably a command to do it (e.g. in this case, git checkout master; git reset --hard 5d83f9e).

Remotes: Collaborating Beyond Your Local Repository

Remotes are Git repositories hosted on a network or the internet that allow you to collaborate with others. A remote is essentially a copy of your repository that exists elsewhere, enabling you to push your changes to it or pull others' changes from it. Each remote has a name (commonly "origin" for the primary remote) and a URL pointing to its location.

When you clone a repository, Git automatically sets up the source as a remote called "origin." You can add multiple remotes to a single local repository, allowing you to fetch changes from or push changes to various sources.

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#2E7DAF',
    'primaryBorderColor': '#1B4B69',
    'mainBkg': '#FFFFFF',
    'secondBkg': '#F4F4F4',
    'lineColor': '#666666',
    'textColor': '#333333',
    'border1': '#CCCCCC',
    'border2': '#AAAAAA',
    'noteBkgColor': '#FFF9C4',
    'noteTextColor': '#333333',
    'noteBorderColor': '#E7C000'
  }
}}%%
graph TD
    L[Local Repository] -->|push| R1[Remote: origin]
    L -->|push| R2[Remote: upstream]
    R1 -->|fetch/pull| L
    R2 -->|fetch/pull| L
    style L fill:#b8d4ff,stroke:#333
    style R1 fill:#90EE90,stroke:#333
    style R2 fill:#90EE90,stroke:#333

Remote-tracking branches are references to the state of branches in your remote repositories. When you fetch from a remote, Git updates these remote-tracking branches to reflect the remote's state. Note that you can't modify them directly; they only change when you communicate with the remote repository. This separation provides a clear distinction between your local work and the shared history on the remote.

Understanding remote operations is crucial to Git's collaboration model:

Fetching: Downloads objects and references from a remote repository without integrating them into your working files
Pulling: Fetches from a remote repository and automatically merges the remote branch into your current branch
Pushing: Uploads your local branch commits to a remote repository, updating its references
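
The three operations can be modeled roughly as updates to two tables of references. This is a sketch under strong assumptions: there is one remote called origin, the commit ids are invented, object transfer is ignored, and `pull` assumes the integration is a simple fast-forward.

```python
# Branch name -> commit id. The ids and the single "origin" remote are made up.
remote = {"main": "c3"}
local = {"main": "c2", "origin/main": "c2"}


def fetch() -> None:
    """Refresh remote-tracking refs; local branches stay where they are."""
    for name, commit in remote.items():
        local[f"origin/{name}"] = commit


def pull(branch: str) -> None:
    """Fetch, then integrate (assumed here to be a simple fast-forward)."""
    fetch()
    local[branch] = local[f"origin/{branch}"]


def push(branch: str) -> None:
    """Move the remote's branch to the local commit and record that locally."""
    remote[branch] = local[branch]
    local[f"origin/{branch}"] = local[branch]


fetch()
print(local)   # origin/main moved, main did not
pull("main")
print(local)   # main caught up with origin/main
```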

This remote collaboration model enables distributed teams to work on the same codebase asynchronously, with each developer maintaining their own complete repository while still being able to share and integrate changes with others.

Git in Practice

Introduction / Setup

Now that you've learned about how Git actually works under the hood, we can start learning about how to use Git. There are a variety of interfaces available to interact with Git, like the command line, various GUIs, tools like GitHub, and more. In this lesson, we'll focus on the command line version. If you don't have Git installed on your machine, install it using the instructions here. You may also need to complete some first-time setup, which we will not be covering here.

Note that a lot of this content is referenced from the Pro Git book, with the respective license found here. I've also heavily referenced the excellent Beej's Guide To Git for structure and examples.

Getting Started with Repositories

The first thing you'll need to do when working with Git is to get your Git repository. This is done either by taking a local directory and converting it into a Git repository or by cloning an existing Git repository from somewhere like GitHub.

Initializing A Local Repository Using `git init`

To create a new Git repository, navigate to your desired directory and run:

git init

This creates a .git subdirectory that stores all version control data, including objects, references, and the commit history.

Cloning an Existing Repository

To copy an existing repository from services like GitHub or GitLab, use:

git clone <url>

This command first creates a new directory named after the project. Next, it downloads the complete repository history, including all commits and branches. Finally, it sets up the working directory with files from the default branch (e.g., main or master).

Git supports both HTTPS and SSH protocols for cloning, which you can choose based on your authentication requirements. While cloning is typically done once at the start of working with a repository, you can also clone specific branches if needed.

Basic Git Workflow

The basic workflow follows five simple steps that you'll repeat each time you want to make and save changes. Note that this ties back to the concepts we talked about in the Git Theory section.

First, you'll work on your files locally, making whatever changes you need to your code or documents. Think of this as your normal editing process. This is also known as making changes in your working directory.

Next, you'll tell Git which changes you want to include in your next snapshot using the "staging" process. This gives you control over exactly which modifications will be saved. This is called staging your changes.

Third, you'll create a "commit", which is taking a snapshot of your staged changes, along with a message describing what you changed and why.

Fourth, you'll send your committed changes to GitHub (or another remote repository). This is called "pushing" your changes.

Finally, you'll repeat these steps as your work progresses. Note that this isn't the only workflow that is used, just a very common one.

We've covered the theoretical idea behind all of these concepts previously, but now we'll talk about the commands for these operations.

Core Git Operations

Now that we have a repository, we'll walk through the workflow and the associated commands one by one. For this section, we'll start with an empty directory and run git init to initialize our repository.

Making Changes

Before we make any changes, we can ask Git what the current status of the local repository is by using the git status command.

❯ git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

This tells us that we are currently on the main branch. Recall from the previous lesson that branches are a reference to a specific commit and allow for parallel tracks of development. It also tells us that we haven't made any commits yet and that there aren't any changes to commit.

Let's start by creating a file called hello.py, with the contents of:

print("Hello, world!")
print("Welcome to CMSC398W!")

Once we save the file and run git status again, we see that:

❯ git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        hello.py

nothing added to commit but untracked files present (use "git add" to track)

This tells us that Git has detected that we have created a file named hello.py that is not currently tracked by Git. Recall that files in Git can be either untracked or tracked: an untracked file hasn't been added to version control, while a tracked file has. It also says that there is "nothing added to commit but untracked files present", which means that we haven't added anything to the staging area yet, so there is nothing to commit. Hopefully it's starting to get clearer why understanding the various terminologies and data models is useful when using Git. So, let's add our changes to the staging area.

Staging (git add)

We can use the git add command to stage changes to prepare to commit. We can either stage all of the changes in your directory, or stage specific files. Let's look at an example:

❯ git add hello.py
❯ git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   hello.py

This has changed the message from "Untracked files" to "Changes to be committed", which tells us what changes are currently ready to be committed: namely, the new file hello.py. If we accidentally staged the file and want to unstage it, a helpful hint is included as well.

Committing (git commit)

Now that our changes have been added to the staging area, we can commit them. Only files that are added to the staging area will be committed, so if I change a file but don't call either git add . (stage all the changes in the current directory) or git add <filename>, that change won't be committed. With each commit, you include a message that describes what changed, why the change was made, and any other useful information.

❯ git commit -m "added hello.py for testing"
[main (root-commit) 34903ef] added hello.py for testing
 1 file changed, 2 insertions(+)
 create mode 100644 hello.py

This tells us that we have committed to the main branch with the message of "added hello.py for testing". In this commit, we've changed 1 file and inserted two lines. It also says we've created file hello.py (mode 100644 just indicates the file permissions, but generally is not something you need to worry about).
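
If you are curious about the mode, the number is octal and follows the usual Unix layout: the leading digits give the file type and the last three give the permissions. As a small aside, Python's standard `stat` module can decode it:

```python
import stat

mode = 0o100644  # the mode from the commit output, read as an octal number

print(stat.S_ISREG(mode))    # whether the type bits mean "regular file"
print(stat.filemode(mode))   # the permission string in `ls -l` style
```

The second line prints `-rw-r--r--`: a regular file that its owner can read and write and everyone else can only read.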

Pushing (git push)

If we cloned our repository from a pre-existing remote repository, we could then push our changes. Since we created this repository locally, we don't have a remote to push to, but we'll work more with remotes and how to use them in a later section.

Viewing History (git log)

Now that we've added our changes and committed them, we can see all of our commits within a log, accessible (unsurprisingly) via git log.

❯ git log

commit 34903ef910501690b5c619da5378c2d4b3fd82dc (HEAD -> main)
Author: John Doe <johndoe@gmail.com>
Date:   Tue Mar 4 22:40:00 2025 -0500

    added hello.py for testing

This gives us the SHA-1 hash for the commit. Recall that everything in Git is generally addressed by its SHA-1 hash, which makes it easy to identify a particular commit if you need to go back in history. This commit is also at the "tip" of the main branch. The log tells us who made the commit, the time it was made, and the commit message. If we add another commit, the log grows like a stack, with the newest commit on top.

❯ git log

commit 0b142e96b92f9f07c54ecc3f4c22a068f0eac8ea (HEAD -> main)
Author: John Doe <johndoe@gmail.com>
Date:   Tue Mar 4 22:51:59 2025 -0500

    added goodbye

commit 34903ef910501690b5c619da5378c2d4b3fd82dc
Author: John Doe <johndoe@gmail.com>
Date:   Tue Mar 4 22:40:00 2025 -0500

    added hello.py for testing

Understanding HEAD

In our previous log outputs, you might have noticed the term HEAD appearing. HEAD is Git's way of tracking "where you are right now" in the repository's history. Most of the time, HEAD points to the name of the current branch (in our case, main), which in turn points to the latest commit on that branch.

Let's look at our current situation:

❯ git log
commit 0b142e96b92f9f07c54ecc3f4c22a068f0eac8ea (HEAD -> main)
Author: John Doe <johndoe@gmail.com>
Date:   Tue Mar 4 22:51:59 2025 -0500

    added goodbye

commit 34903ef910501690b5c619da5378c2d4b3fd82dc
Author: John Doe <johndoe@gmail.com>
Date:   Tue Mar 4 22:40:00 2025 -0500

    added hello.py for testing

Here, (HEAD -> main) tells us that HEAD is pointing to the main branch, which is at commit 0b142e9. When working with branches, we use the modern git switch command to move between them:

❯ git switch main     # Switch to main branch

We can also look at previous commits directly, which creates a "detached HEAD" state:

❯ git checkout 34903ef
Note: switching to '34903ef'.

You are in 'detached HEAD' state...

This brings us to an important concept: the "detached HEAD" state. A detached HEAD occurs when you point HEAD directly to a commit instead of a branch.

Normal HEAD:

HEAD -> main -> commit 0b142e9

Detached HEAD:

main -> commit 0b142e9
HEAD -> commit 34903ef

While in a detached HEAD state, you can look at files as they were at that commit, make experimental changes, and create new commits. However, since HEAD isn't attached to a branch, any new commits you make will be "floating" and could be lost when you switch to a different commit. If you want to keep changes made in a detached HEAD state, you should create a new branch:

❯ git switch -c old-version   # Create and switch to new branch

We'll talk more about branches in the next section.

To get back to your latest work, you can always return to your main branch:

❯ git switch main

This will reattach HEAD to the main branch and bring you back to your most recent commit.

Branching and Merging

Branching is one of Git's most powerful features, allowing developers to work on multiple versions of their code simultaneously. Building on our understanding of Git's data model, we know that branches are just pointers to specific commits in the repository's history.

Creating and Managing Branches

There are two main commands for working with branches:

git branch: Lists, creates, or deletes branches
git switch: Moves between branches (or creates and switches with -c)

To create a new branch:

git switch -c feature-branch  # Create and switch to new branch
# or
git branch feature-branch     # Create branch only
git switch feature-branch     # Switch to branch

To list all branches:

git branch  # The current branch is marked with an asterisk (*)

To delete a branch:

git branch -d feature-branch  # Safe delete (prevents deletion of unmerged changes)
git branch -D feature-branch  # Force delete (use with caution)

Understanding Branch Operations

When you create a branch, Git simply creates a new pointer to the current commit. Once you switch / checkout to the new branch, the HEAD reference is updated to point to your new branch, indicating which branch you're currently working on.

For example, if you're on the main branch and create a new feature branch:

Initially:
```
main    → commit A
HEAD    → main
```

After creating and switching to feature-branch:

main           → commit A
feature-branch → commit A
HEAD          → feature-branch

After making a new commit on feature-branch:

main           → commit A
feature-branch → commit B
HEAD          → feature-branch

gitGraph
    commit id: "A"
    branch feature-branch
    checkout feature-branch
    commit id: "B"

Merging Branches

Git provides several strategies for combining work from different branches. Understanding these strategies is crucial for maintaining a clean and manageable repository history. Types of merges include:

Fast-Forward Merge
- Occurs when there are no new commits on the target branch
- Simply moves the branch pointer forward
- Creates a linear history
- No merge commit is created

gitGraph
    commit
    commit
    branch feature
    checkout feature
    commit
    commit
    checkout main
    merge feature

This diagram shows a fast-forward merge scenario. Notice how all commits in the feature branch are direct descendants of the main branch's last commit. When merged, the main branch pointer simply moves forward to the latest feature branch commit, creating a straight line of history without any merge commits.

Three-Way Merge (Recursive)
- Used when branches have diverged
- Creates a new merge commit
- Preserves complete history of both branches
- May require conflict resolution

gitGraph
    commit
    commit
    branch feature
    checkout feature
    commit
    checkout main
    commit
    merge feature

This diagram demonstrates a three-way merge. The branches have diverged as both main and feature branches have received unique commits. The merge creates a new commit (shown as the merge node) that combines both histories. This preserves the complete development history of both branches and shows exactly where they were integrated.

There are additional merge strategies available, but we will focus on these two.

Performing A Merge

To perform a merge:

git switch main          # Switch to the target branch
git merge feature-branch # Merge changes from feature-branch

During a merge:

Git checks if a fast-forward is possible
If yes, it simply moves the target branch pointer forward to match the source branch
No new commit is created

If a fast-forward is not possible, Git performs a three-way merge. If the two branches changed different parts of the project, this happens cleanly; if they changed the same parts in different ways, the result is a merge conflict.
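
The fast-forward check boils down to an ancestry question: is the tip of the target branch already part of the source branch's history? Here is a minimal Python sketch of that decision. The commit ids and the `merge_kind` helper are invented for the example, and it only classifies the merge rather than performing one.

```python
# Commit id -> parent ids (made-up ids). `f2` descends from `b`;
# `m1` is a separate commit made on main after the branch point `b`.
parents = {"a": [], "b": ["a"], "f1": ["b"], "f2": ["f1"], "m1": ["b"]}


def is_ancestor(older: str, newer: str) -> bool:
    """Return True if `older` is reachable from `newer` via parent links."""
    stack = [newer]
    seen = set()
    while stack:
        current = stack.pop()
        if current == older:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def merge_kind(target_tip: str, source_tip: str) -> str:
    """Decide how merging `source_tip` into `target_tip` would proceed."""
    if is_ancestor(source_tip, target_tip):
        return "already up to date"
    if is_ancestor(target_tip, source_tip):
        return "fast-forward"
    return "three-way merge"


print(merge_kind("b", "f2"))    # main has no new commits of its own
print(merge_kind("m1", "f2"))   # both branches have unique commits
```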

Creating and Resolving Merge Conflicts

When Git can't automatically merge changes, it creates a merge conflict. This typically happens when one of the following things has happened.

The same part of a file was modified in different ways on both branches
A file was modified on one branch and deleted on another
A file was added with the same name but different content on both branches

When a conflict occurs, Git will mark the conflicting sections in the affected files and pause the merge process. You then have to manually resolve the conflicts (if using VSCode with the Git extension, you can use the lovely merge editor) before you can finally create a merge commit.

Example of a conflict marker:

<<<<<<< HEAD
Your changes on the current branch
=======
Changes from the branch being merged
>>>>>>> feature-branch

To resolve a conflict, follow these steps:

Open the conflicting files
Choose which changes to keep (or combine them)
Remove the conflict markers
Stage the resolved files (git add)
Complete the merge (git commit)
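
To show how mechanical the markers are, here is a small Python sketch that keeps one side of every conflict block in a piece of text. The `resolve_conflicts` helper is hypothetical and deliberately naive: in practice you will often want to combine both sides by hand rather than pick one wholesale.

```python
def resolve_conflicts(text: str, keep: str) -> str:
    """Keep one side of every conflict block; `keep` is "ours" or "theirs"."""
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    output = []
    side = None  # None outside a conflict, else "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            side = "ours"
        elif line.startswith("=======") and side == "ours":
            side = "theirs"
        elif line.startswith(">>>>>>> ") and side == "theirs":
            side = None
        elif side is None or side == keep:
            output.append(line)
    return "\n".join(output) + "\n"


conflicted = (
    "before\n"
    "<<<<<<< HEAD\n"
    "Your changes on the current branch\n"
    "=======\n"
    "Changes from the branch being merged\n"
    ">>>>>>> feature-branch\n"
    "after\n"
)
print(resolve_conflicts(conflicted, keep="ours"))
```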

Working with Remotes

A remote in Git is simply a name for a remote server that hosts a Git repository. While you could refer to remotes using their full URLs (like https://github.com/chrislgarry/Apollo-11/tree/master), Git provides a more convenient way to reference them using nicknames. The most common remote name is origin, which is automatically set when you clone a repository.

Git uses slash notation to refer to branches on remotes. For example, origin/main refers to the main branch on the remote named origin.

Viewing and Managing Remotes

To list your current remotes, use the git remote -v command:

❯ git remote -v
origin  git@github.com:mdurrani808/STIC.git (fetch)
origin  git@github.com:mdurrani808/STIC.git (push)

This output shows that we have a remote named origin that we can both fetch from and push to. While the fetch and push URLs are often the same, they can be different if needed.
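
If you ever need this information in a script, the output is easy to parse. The following Python sketch (with a hypothetical `parse_remotes` helper) turns the text above into a dictionary, keeping the fetch and push URLs separate since they are allowed to differ.

```python
def parse_remotes(output: str) -> dict[str, dict[str, str]]:
    """Turn `git remote -v` text into {name: {"fetch": url, "push": url}}."""
    remotes: dict[str, dict[str, str]] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, url, kind = line.split()
        remotes.setdefault(name, {})[kind.strip("()")] = url
    return remotes


sample = """\
origin  git@github.com:mdurrani808/STIC.git (fetch)
origin  git@github.com:mdurrani808/STIC.git (push)
"""
print(parse_remotes(sample))
```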

Adding New Remotes

Let's work through a common scenario: adding the original repository as a remote after forking a project. We'll use the Linux kernel as an example (this example was taken from Beej's guide, as detailed above).

When you fork a repository on GitHub, you create your own copy that you can modify. Initially, your fork will have one remote:

origin    git@github.com:mdurrani808/linux.git (fetch)
origin    git@github.com:mdurrani808/linux.git (push)

To keep your fork up-to-date with Linus Torvalds' original repository, you'll want to add it as a second remote:

❯ git remote add reallinux https://github.com/torvalds/linux.git

Now your remotes list will show both repositories:

origin    git@github.com:mdurrani808/linux.git (fetch)
origin    git@github.com:mdurrani808/linux.git (push)
reallinux    https://github.com/torvalds/linux.git (fetch)
reallinux    https://github.com/torvalds/linux.git (push)

Syncing with Remotes

To get changes from a remote repository:

Fetch the changes:

❯ git fetch reallinux

Merge them into your local branch:

❯ git switch master          # switch to your local branch
❯ git merge reallinux/master # merge the remote changes

When you make local commits, they'll advance your local HEAD and branch pointer while leaving the remote references (like origin/master and reallinux/master) behind. For example, after making two local commits, your log might look like this:

commit 2d7d5d (HEAD -> master)
commit cde831
commit 311eb3 (origin/master)
commit d5d2cc (reallinux/master)

To send your changes back to GitHub, use git push. After pushing, the remote reference will update:

commit 2d7d5d (HEAD -> master, origin/master)
commit cde831
commit 311eb3
commit d5d2cc (reallinux/master)

This workflow allows you to maintain your own version of the code while still being able to incorporate updates from the original repository.
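
The "ahead by N commits" idea from the logs above can be sketched in Python by comparing histories. The commit ids are the abbreviated ones from the example; the `history` and `ahead` helpers are made up for illustration and are not how Git implements the comparison internally.

```python
# Commit id -> parent ids, matching the example log (oldest at the bottom).
parents = {
    "d5d2cc": [],
    "311eb3": ["d5d2cc"],
    "cde831": ["311eb3"],
    "2d7d5d": ["cde831"],
}
refs = {
    "master": "2d7d5d",
    "origin/master": "311eb3",
    "reallinux/master": "d5d2cc",
}


def history(commit: str) -> set[str]:
    """Return the commit and everything reachable through its parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def ahead(local: str, remote: str) -> int:
    """Count commits reachable from `local` but not from `remote`."""
    return len(history(refs[local]) - history(refs[remote]))


print(ahead("master", "origin/master"))
refs["origin/master"] = refs["master"]   # what a successful push records
print(ahead("master", "origin/master"))
```

Before the push, master is two commits ahead of origin/master; afterwards the two references point at the same commit and the count drops to zero.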

Remote Tracking Branches

When you clone a repository, Git creates something called "remote-tracking branches". These are local references that represent the state of branches on your remote repositories. For example, when you clone a repository, you'll have:

main            # Your local main branch
origin/main     # Remote-tracking branch for main on origin

While this might look like just two branches, there are actually three branches involved:

Your local main branch
The remote-tracking branch origin/main on your computer
The actual main branch on the remote repository

The remote-tracking branch (origin/main) is your local copy of the remote branch's state. Git automatically updates it when you interact with the remote (through push, fetch, or pull operations).

Viewing Remote Tracking Branches

To see all your branches, including remote-tracking branches, use:

❯ git branch -avv
* main                  2d63af5 [origin/main] Latest commit message
  feature-branch        cdac325 [origin/feature] Feature work
  remotes/origin/HEAD   -> origin/main
  remotes/origin/main   2d63af5 Latest commit message
  remotes/origin/feature cdac325 Feature work

This shows your local branches (main and feature-branch), which remote branch they're tracking (shown in brackets), and the remote-tracking branches under remotes/origin/.

Setting Up Branch Tracking

When you want to push a local branch to a remote for the first time, you need to set up tracking:

git push --set-upstream origin feature-branch
# or the shorter version
git push -u origin feature-branch

This pushes your local branch to the remote and sets up tracking so future pushes and pulls know where to go. After setting up tracking, you can simply use:

git push
git pull

Pushing New Branches

When you create a new local branch and try to push it, you'll need to tell Git where to push it:

❯ git switch -c new-feature
❯ git push
fatal: The current branch new-feature has no upstream branch.
To push the current branch and set the remote as upstream, use:
    git push --set-upstream origin new-feature

Simply follow Git's suggestion to set up the tracking:

❯ git push --set-upstream origin new-feature

Managing Multiple Remotes

It's common to work with multiple remotes, especially when working with forked repositories. For example, you might have:

origin: Your fork of the repository
upstream: The original repository you forked from

To work with branches from different remotes:

Fetch changes from a remote:

❯ git fetch upstream

Create a local branch based on the remote branch:

❯ git switch -c feature upstream/feature

Push to your own remote:

❯ git push -u origin feature

Cleaning Up Remote Branches

To clean up remote-tracking branches that no longer exist on the remote:

❯ git fetch --prune origin # Prune a specific remote
❯ git fetch --prune --all  # Prune all remotes

To delete a branch on the remote:

❯ git push origin --delete feature-branch

Common Git Tasks

Comparing Changes (git diff)

The git diff command allows you to see the differences between any two states in your repository. Without any arguments, it shows unstaged changes in your working directory:

❯ git diff
diff --git a/hello.py b/hello.py
index e4445b1..f022404 100644
--- a/hello.py
+++ b/hello.py
@@ -1,2 +1,3 @@
 print("Hello, world!")
 print("Welcome to CMSC398W!")
+print("Changes not yet staged")

This output tells us several things:

The file being changed is hello.py
The lines starting with - show content being removed (none in this case)
The lines starting with + show new content being added
The @@ -1,2 +1,3 @@ indicates that we're seeing lines 1-2 in the old file and lines 1-3 in the new file
The unchanged context lines are shown without any prefix
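
The hunk header is the least obvious part of the output, so here is a short Python sketch that pulls the numbers out of it. The `parse_hunk_header` helper is hypothetical. It relies on the unified diff convention that a missing line count means 1.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line: str) -> dict[str, int]:
    """Extract start lines and line counts from a unified-diff hunk header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count = match.groups()
    return {
        "old_start": int(old_start),
        "old_count": int(old_count) if old_count is not None else 1,
        "new_start": int(new_start),
        "new_count": int(new_count) if new_count is not None else 1,
    }


print(parse_hunk_header("@@ -1,2 +1,3 @@"))
```

For the header above, this reports that the hunk covers 2 lines starting at line 1 of the old file and 3 lines starting at line 1 of the new file.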

To see staged changes that will be included in your next commit:

❯ git diff --staged
diff --git a/hello.py b/hello.py
index f022404..3d4c568 100644
--- a/hello.py
+++ b/hello.py
@@ -1,3 +1,4 @@
 print("Hello, world!")
 print("Welcome to CMSC398W!")
 print("Changes not yet staged")
+print("This change is staged")

The format is the same as before, showing us exactly what changes are staged for commit.

You can also compare branches or specific commits:

❯ git diff main feature-branch        # Compare two branches
❯ git diff HEAD~1 HEAD               # Compare with previous commit
❯ git diff 1a2b3c 4d5e6f            # Compare specific commits

File Operations

Git provides several commands for managing files in your repository. While you can use regular shell commands (mv, rm), using Git's commands ensures proper tracking of these operations.

Renaming Files

To rename a file in Git:

❯ git mv old-name.txt new-name.txt
❯ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        renamed:    old-name.txt -> new-name.txt

The status output confirms that Git has recognized this as a rename operation rather than a separate delete and add. It also provides a helpful hint about how to unstage the change if needed.

This is equivalent to:

❯ mv old-name.txt new-name.txt
❯ git rm old-name.txt
❯ git add new-name.txt

Removing Files

To remove files from both your working directory and Git's tracking:

❯ git rm filename.txt
❯ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    filename.txt

The status output shows that the file is staged for deletion in the next commit.

To stop tracking a file but keep it in your working directory:

❯ git rm --cached filename.txt
❯ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    filename.txt

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        filename.txt

Notice how the status now shows the file as both deleted (from Git's tracking) and untracked (still present in the working directory). This is particularly useful for files that were accidentally committed but should be ignored (like configuration files or build artifacts).

Stashing Changes

Git stash is a powerful feature that allows you to temporarily store modified tracked files when you need to switch contexts but aren't ready to commit.

Basic stash operations:

❯ git stash                 # Save current changes to stash
Saved working directory and index state WIP on main: 1234abc initial commit

This output confirms that your changes have been saved and indicates which branch and commit they were based on.

❯ git stash list           # View all stashed changes
stash@{0}: WIP on main: 1234abc initial commit
stash@{1}: On feature-branch: Experimental changes

The list shows all stashes, with the most recent at the top (stash@{0}). Each entry shows the branch the stash was created on, followed by either the commit it was based on (the default "WIP on" entries) or a custom message if one was given.

❯ git stash show stash@{0}  # Show contents of specific stash
 hello.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

This output summarizes which files the stash touches and how many lines changed, similar to the output of git diff --stat.

❯ git stash pop           # Apply and remove most recent stash
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   hello.py

Dropped refs/stash@{0} (1234abc5678def)

The output shows that the stashed changes have been reapplied to your working directory and the stash entry has been removed.

Amending Commits

Sometimes you need to modify your most recent commit. You can use the --amend flag for this.

❯ git commit --amend              # Update last commit message
[main 1234abc] Updated commit message
 Date: Wed Mar 5 10:00:00 2025 -0500
 1 file changed, 1 insertion(+)

The output shows the new commit hash and confirms that the commit was updated.

❯ git add forgotten-file.txt
❯ git commit --amend --no-edit
[main 5678def] Original commit message
 Date: Wed Mar 5 10:00:00 2025 -0500
 2 files changed, 1 insertion(+)
 create mode 100644 forgotten-file.txt

The output shows that the commit was updated to include the new file while keeping the original commit message. Notice that the commit hash has changed (from 1234abc to 5678def), which is why you should only amend commits that haven't been pushed to a shared repository.

Advanced Git Operations

Rebasing

Rebasing is a Git operation that allows you to modify your commit history by moving a sequence of commits to a new base commit. While merging creates a new commit that combines changes from two branches, rebasing rewrites history by creating new commits that replicate your changes on top of a different starting point.

Let's say you have a feature branch that branched off from main some time ago. Here's the initial state:

gitGraph
    commit id: "A"
    commit id: "B"
    branch feature
    commit id: "D"
    commit id: "E"
    checkout main
    commit id: "C"

If you want to include the latest changes from main (commit C) into your feature branch, you could merge main into feature, or you could rebase feature onto main. Here's how to rebase:

git switch feature     # First, switch to the branch you want to rebase
git rebase main       # Then rebase onto the target branch

This will:

Temporarily save your feature branch commits (D, E)
Move to the latest commit on main (C)
Replay your commits one by one on top of C

After the rebase, your history will look like this:

gitGraph
    commit id: "A"
    commit id: "B"
    commit id: "C"
    branch feature
    commit id: "D'"
    commit id: "E'"

Note that D' and E' are new commits that contain the same changes as D and E, but have different commit hashes because they now have a different parent commit.
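
The reason the hashes change can be sketched with a toy model. In the snippet below, `commit_id` is a hypothetical helper that hashes only the parent ID and a change label; real Git hashes a full commit object (tree, parents, author, committer, timestamps, message), but the dependence on the parent works the same way.

```python
import hashlib

def commit_id(parent_id, changes):
    """Toy commit ID: a hash over the parent ID and a label for the changes."""
    data = f"parent {parent_id}\n{changes}".encode()
    return hashlib.sha1(data).hexdigest()[:7]

# Original history: A - B on main, D - E on feature, then C added on main after B.
a = commit_id("", "A")
b = commit_id(a, "B")
c = commit_id(b, "C")
d = commit_id(b, "D")
e = commit_id(d, "E")

# Rebasing replays the same changes on top of C, producing D' and E'.
d_prime = commit_id(c, "D")
e_prime = commit_id(d_prime, "E")

print("D:", d, "-> D':", d_prime)
print("E:", e, "-> E':", e_prime)
```

Even though the changes in D and D' are identical, the different parent yields a different ID, and that difference propagates to every commit built on top of it.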

Handling Conflicts During Rebase

Sometimes when rebasing, Git can't automatically apply your changes because they conflict with changes in the base branch. When this happens, Git will pause the rebase and let you fix the conflicts:

git rebase main
# Git encounters a conflict
Auto-merging file.txt
CONFLICT (content): Merge conflict in file.txt
Failed to merge in the changes.

To resolve conflicts during a rebase:

Open the conflicting files and resolve the conflicts
Add the resolved files: git add <filename>
Continue the rebase: git rebase --continue

At any point, you can:

Continue the rebase: git rebase --continue
Skip the current commit: git rebase --skip
Abort the rebase entirely: git rebase --abort

Interactive Rebasing

Interactive rebasing is a powerful feature that lets you modify commits as they're being replayed. You start an interactive rebase by specifying how many commits back you want to modify:

git rebase -i HEAD~3    # Modify the last 3 commits

This opens your text editor with a list of commits and possible actions:

pick abc123 Add feature X
pick def456 Fix typo
pick ghi789 Add tests

# Commands:
# pick = keep this commit as is
# reword = keep the changes, but edit the commit message
# edit = pause to amend this commit
# squash = combine this commit with the previous commit
# fixup = combine this commit with the previous commit, discard this message
# drop = remove this commit

Common interactive rebase operations:

Changing a Commit Message

Change pick to reword:

reword abc123 Add feature X
pick def456 Fix typo
pick ghi789 Add tests

Save and close. Git will prompt you to edit each commit message you marked for rewording.

Combining Multiple Commits

To combine commits, change pick to squash or fixup:

pick abc123 Add feature X
squash def456 Fix typo
squash ghi789 Add tests

squash lets you edit the combined commit message
fixup discards the commit messages of the commits being combined

Before squashing:

gitGraph
    commit id: "A"
    commit id: "Feature X"
    commit id: "Fix typo"
    commit id: "Add tests"

After squashing:

gitGraph
    commit id: "A"
    commit id: "Add feature X with tests"

Removing a Commit

Change pick to drop (or just delete the line):

pick abc123 Add feature X
drop def456 Fix typo
pick ghi789 Add tests

Before dropping:

gitGraph
    commit id: "A"
    commit id: "Feature X"
    commit id: "Fix typo"
    commit id: "Add tests"

After dropping the middle commit:

gitGraph
    commit id: "A"
    commit id: "Feature X"
    commit id: "Add tests"
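
The effect of pick, squash, fixup, and drop can be sketched with a small simulation. This is a simplified model, not how Git implements rebase: `apply_todo` is a hypothetical function, commits are just a message plus a list of changed files, and squash joins messages with " / " where real Git would open an editor. reword is left out because it only changes a message interactively.

```python
def apply_todo(todo):
    """Apply a simplified rebase todo list; return (message, changes) per commit."""
    result = []  # each entry is [message, changes]
    for action, message, changes in todo:
        if action == "pick":
            result.append([message, list(changes)])
        elif action in ("squash", "fixup"):
            if not result:
                raise ValueError(f"cannot {action} without a previous commit")
            previous = result[-1]
            previous[1].extend(changes)         # the changes are always kept
            if action == "squash":
                previous[0] += " / " + message  # squash keeps this message too
        elif action == "drop":
            continue                            # commit and its changes are removed
        else:
            raise ValueError(f"unsupported action: {action}")
    return [(message, changes) for message, changes in result]

combined = apply_todo([
    ("pick", "Add feature X", ["feature.py"]),
    ("squash", "Fix typo", ["README"]),
    ("fixup", "Add tests", ["test_feature.py"]),
])
dropped = apply_todo([
    ("pick", "Add feature X", ["feature.py"]),
    ("drop", "Fix typo", ["README"]),
    ("pick", "Add tests", ["test_feature.py"]),
])
print(combined)
print(dropped)
```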

When to Use Rebase

Rebase is best used for:

Cleaning up your local commit history before sharing
Incorporating latest changes from main into your feature branch
Maintaining a linear project history

Avoid rebasing commits that you've already pushed to a shared repository, as this rewrites history and can cause problems for other developers.

Continuous Integration & Continuous Deployment

Continuous Integration (CI)

Why/How is CI used in development?

Let’s say I’m working on software that controls a robot that makes food. It already handles a range of recipes, from pasta to pancakes, and everything works pretty smoothly. But now we’ve been asked to add support for a new recipe: an omelet.

This new recipe is a little more complicated because it depends on additional timing, temperature adjustments, and the robot’s ability to flip the omelet mid-cook. It means adding new logic to our code and adjusting some test cases to make sure nothing else breaks.

Here’s how Continuous Integration (CI) helps make this kind of change easy and safe:

Step-by-Step: Adding a New Feature with CI

Get the Latest Code
- I open my development environment and grab the latest code using:
```
git pull origin main
```
- This ensures I’m working on top of the most recent version of the robot's recipe software.
Build and Test Locally
- Before making any changes, I run a local build:
  - It checks my environment is set up properly.
  - It compiles or packages the code.
  - It runs all the current automated tests.
- If anything fails, I fix it first before starting on the omelet logic. That way I know I’m not building on top of something already broken.
Add the Omelet Logic
- I implement the new cooking instructions for the robot to handle omelets.
- I write new automated tests to confirm that it cooks an omelet correctly (and doesn't mess up scrambled eggs or pancakes).
- I keep running the build and test suite often as I work to catch bugs early.
Sync Before Pushing
- Before I push my changes, I pull from the main branch again because one of my teammates just added code for a new salad recipe.
- If there are conflicts or updates, I merge them and rerun tests to make sure everything still works.
Push the Changes
- Once all tests pass locally, I push the changes to the central repo:
```
git push origin main
```
CI Server Kicks In
- A CI service like GitHub Actions, GitLab CI, or Jenkins sees the push.
- It checks out the updated code, runs the build and all tests again in a clean, isolated environment.
- This confirms the change works not just on my machine but in the shared build system too.
Get Feedback
- If everything passes, I get a notification that the omelet logic is good to go.
- If anything fails, I can review logs, fix the problem, and repeat.

What is Continuous Integration?

Continuous Integration (CI) is the practice of automatically building and testing code every time you make a change and push it to a shared repository.

CI is built around this philosophy:

"Integrate early, integrate often."

It helps development teams:

Avoid painful last-minute integrations.
Catch bugs early.
Ensure everyone is working with the latest, working codebase.
Encourage modular, testable code design.

Core Components of CI

Component	Purpose
Version Control	Store and manage source code (e.g., Git, GitHub)
CI Server	Detect changes, trigger builds, run tests (e.g., GitHub Actions)
Build System	Compile and bundle your project (e.g., `make`, `npm`, `gradle`)
Test Suite	Ensure correctness via automated unit/integration tests
Notification	Alert devs to success/failure (Slack, email, GitHub UI)

Why is CI Important?

CI keeps teams fast, clean, and confident. It:

Shortens the feedback loop.
Prevents the "merge hell" of late integration.
Helps spot bugs before they hit production.
Makes onboarding and collaboration easier.

What a CI-Ready Project Looks Like

The earlier story gives a sense of how it feels to work in a Continuous Integration environment, but to actually implement CI effectively, a few technical requirements must be met. Below are the essential foundations for a working CI system.

Everything Belongs in Version Control

Almost every team uses version control (typically Git), but to support CI, it’s not enough to just track the source code. Your repository should provide everything a developer needs to build and run the project from scratch.

A new team member should be able to:

Clone the repository on a fresh machine.
Run a single setup script or command.
Build and test the entire product with no missing files or configuration.

This includes:

Application source code
Automated test code
Database schemas & seed/test data
Build scripts
Configuration files (e.g., .env, .json, .yml)
Version-locked dependencies

While not everything has to be stored directly in the repository, it should be accessible via immutable links such as pinned dependencies or asset IDs that always resolve to the same resource.
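
One common way to make an external resource "immutable" is to record a checksum next to the reference and refuse anything that doesn't match. The sketch below assumes a hypothetical `verify_pinned` helper and an in-memory stand-in for a downloaded archive; real package managers store such hashes in their lock files.

```python
import hashlib

def verify_pinned(content: bytes, expected_sha256: str) -> bool:
    """Return True only if the content matches the checksum that was pinned."""
    return hashlib.sha256(content).hexdigest() == expected_sha256

# Stand-in for a downloaded dependency archive (hypothetical contents).
artifact = b"example dependency archive contents"

# In a real project this value would be committed to the repo, e.g. in a lock file.
pinned = hashlib.sha256(artifact).hexdigest()

print(verify_pinned(artifact, pinned))            # unchanged artifact
print(verify_pinned(artifact + b"!", pinned))     # tampered or updated artifact
```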

Don't Check In Build Outputs

Only store what’s necessary to create the product, not the product itself.

What to store:

src/
tests/
package.json, pom.xml, requirements.txt
CI configs (.github/workflows/, .gitlab-ci.yml)
Dockerfiles, setup scripts

What not to store:

Compiled binaries
Build artifacts (e.g., /dist, /build)
Generated documentation
Output from test reports or coverage tools

Build products should be reproducible and disposable. If they're hard to regenerate, that's a deeper problem CI is trying to fix.

Use a Single Source of Truth: The Mainline

CI is centered around a shared, reliable main branch. This is the source of truth for your product’s current state and should reflect the next version that’s going into production.

In Git, this is typically:

main (modern default)
master (legacy default)
trunk (used by teams practicing Trunk-Based Development)

All feature branches are temporary; their goal is to merge cleanly into the mainline after passing all automated checks.

CI only works well when:

Everyone integrates into main frequently.
Tests are run against this branch constantly.
The build is always green (or immediately fixed when not).

Automate the Build

Turning source code into a running system can sometimes involve multiple steps: compiling, copying files, generating assets, setting environment variables, loading database schemas, etc. While that can sound complex, everything involved in building the system should be automated.

“Computers are designed to perform simple, repetitive tasks. As soon as you have humans doing repetitive tasks on behalf of computers, all the computers get together late at night and laugh at you.”
— Neal Ford

Why Automate Builds?

Manual builds are error-prone and inconsistent.
CI systems need builds to be reproducible on clean machines.
Developers save time and avoid typos or environment mismatches.
New contributors should be able to clone the repo and run a single command to get started.

Build Scripts Should Live in the Repository

All build logic should be defined using text-based configuration or scripting and committed to version control. This includes:

Build tools like make, gradle, maven, npm, poetry, etc.
Shell scripts (build.sh)
CI workflow definitions (e.g., .github/workflows/build.yml)

Storing build logic as code allows:

Easy inspection
Collaboration
Clear version history and diffs
Portability across machines

Avoid tools that rely on point-and-click GUI setups for builds, as they’re hard to track and difficult to replicate reliably in CI.

Use Dependency-Aware Build Tools

While small builds can be written as shell scripts, larger projects benefit from dedicated build systems that use a dependency graph model. These systems break the build process into tasks with clearly defined inputs and outputs, which allows for:

Incremental builds: only what's changed gets rebuilt
Optimized performance: avoid re-running expensive tasks
Consistency: build steps execute in the correct order

Examples include:

make
ninja
gradle
bazel

These tools determine what needs to be rebuilt based on file modification times or content hashes. For example, if you edit a CSS file, the system might rebuild only the pages that depend on it instead of rebuilding the entire site.
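
The core idea behind these tools can be sketched in a few lines. The example below is a simplified model of a timestamp-based build (roughly the approach make takes), not any tool's actual implementation: the file names, the `graph` layout, and the integer "timestamps" are all made up for illustration, and the real build step is replaced by updating a timestamp.

```python
def out_of_date(target, deps, mtimes):
    """A target is stale if it is missing or older than any of its inputs."""
    if target not in mtimes:
        return True
    return any(mtimes[dep] > mtimes[target] for dep in deps)

def build(target, graph, mtimes, now, rebuilt):
    """Bring `target` up to date, handling its dependencies first."""
    deps = graph.get(target, [])
    for dep in deps:
        build(dep, graph, mtimes, now, rebuilt)
    if deps and out_of_date(target, deps, mtimes):
        mtimes[target] = now       # stand-in for running the real build step
        rebuilt.append(target)

# Hypothetical site: only index.html uses the stylesheet.
graph = {
    "index.html": ["index.md", "style.css"],
    "about.html": ["about.md"],
    "site.tar": ["index.html", "about.html"],
}
mtimes = {"index.md": 1, "about.md": 1, "style.css": 1,
          "index.html": 2, "about.html": 2, "site.tar": 3}

mtimes["style.css"] = 10           # simulate editing the CSS file
first = []
build("site.tar", graph, mtimes, now=11, rebuilt=first)
second = []
build("site.tar", graph, mtimes, now=12, rebuilt=second)
print(first)   # targets rebuilt after the edit
print(second)  # nothing left to do on the second run
```

Only index.html and the archive that contains it are rebuilt; about.html is untouched, and a second run does no work at all.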

A One-Command Build Rule

CI-Ready Rule of Thumb:
Anyone should be able to take a clean machine, clone the repo, and run a single command (e.g., ./build.sh or make) to build and run the entire system.

This includes:

Installing dependencies
Compiling or packaging
Spinning up a test or dev database
Running tests
Starting the server or UI

If the system supports different environments (e.g., production, staging, test), your build scripts should support targets or flags for those cases:

# Examples
make test
npm run build:dev
./build.sh --skip-tests

Make the Build Self-Testing

Compiling and packaging code is just the beginning. A build that runs but doesn’t do the right thing is just as dangerous as a build that doesn’t run at all. To support Continuous Integration, your build must also verify behavior through automated tests.

That’s what makes a self-testing build so important: the build doesn’t just assemble the code, it also automatically confirms that the system still works as expected.

Your CI build pipeline should:

Automatically run unit tests to verify logic.
Run integration tests to check component interactions.
Possibly include end-to-end tests for full workflows (when needed).
Fail the build if any test fails. 99.9% green is still red.

Most languages now offer easy-to-use testing frameworks:

Python: pytest, unittest
JavaScript/Node: jest, mocha
Java: JUnit
C#: xUnit, NUnit
Go: go test

These frameworks usually integrate directly with your CI tool. You’ll often see CI dashboards or terminals refer to a “green build” (all tests pass) or a “red build” (one or more fail).

Self-testing builds can also include:

Linters to catch code smells and enforce style (e.g., eslint, flake8, pylint)
Security scanners to find vulnerabilities (e.g., Bandit, Snyk)
Code formatters like black, prettier, or clang-format

No test suite can catch everything. Tests don’t prove the absence of bugs, but even imperfect tests that run automatically are far better than no tests at all.

The more confidence you have in your test suite, the safer it is to make changes and ship frequently.
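
The "fail the build if any test fails" rule can be sketched with Python's built-in unittest module. Everything here is hypothetical: `cook_time_seconds` stands in for real application code, and `self_testing_build` stands in for the test step of a build script. The important part is that the result is turned into an exit code a CI service can act on.

```python
import io
import unittest

def cook_time_seconds(recipe):
    """Hypothetical function under test: cooking time for a known recipe."""
    times = {"pancake": 180, "omelet": 240}
    return times[recipe]

class RecipeTests(unittest.TestCase):
    def test_omelet_time(self):
        self.assertEqual(cook_time_seconds("omelet"), 240)

    def test_unknown_recipe_is_rejected(self):
        with self.assertRaises(KeyError):
            cook_time_seconds("soup")

def self_testing_build():
    """Return a process-style exit code: 0 for a green build, 1 for a red one."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RecipeTests)
    # Capture the runner's report in memory; a real build would print it.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return 0 if result.wasSuccessful() else 1

exit_code = self_testing_build()
print("green" if exit_code == 0 else "red")
```

In a real build script the exit code would be passed to `sys.exit`, since CI services generally treat any non-zero exit status as a failed step.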

Every Push to Mainline Should Trigger a Build

If every team member is integrating changes very frequently, the shared mainline should always reflect a clean, deployable state. But real-world development isn't perfect: people forget to pull, merge conflicts happen, and environments differ.

That’s why every push to the main branch must automatically trigger a build in a clean, shared CI environment (e.g., GitHub Actions, Jenkins, CircleCI, etc.).

The CI service monitors the mainline (main, trunk, or master).
On each commit, it checks out the code, builds it, and runs tests.
If the build is green, the integration is considered successful.
If it fails, the problem can be traced directly to that recent push.

Note: While it’s possible to use CI tools to build many branches, true Continuous Integration focuses on verifying the mainline. The goal is to keep one branch working at all times, not to test in isolation.

Although CI services automate this flow, it’s also possible (though less common) to perform integrations manually on a shared machine, but that defeats the purpose of fast, reliable feedback.

Keep the Build Fast

The value of Continuous Integration comes from fast feedback. If the build takes too long, developers won’t integrate frequently or they’ll ignore the results.

The Extreme Programming (XP) rule of thumb:
A build should complete in 10 minutes or less.

Modern CI tools and cloud infrastructure make this achievable. If your build takes 30–60 minutes, you’re less likely to run it on every commit and you’ll lose the rapid feedback that CI is all about.

How to Speed Up CI Builds:

Stage the build pipeline:
- Commit build: Fast, core unit tests + build checks.
- Secondary build: Slower, full integration or end-to-end tests.
Use test doubles for slow services (e.g., in-memory DBs, API mocks).
Run expensive tests in parallel on separate CI runners.
Use caching and only rebuild parts of the system that changed.
Monitor dependency updates automatically; treat them like external contributors that could break the build.
When a slow-stage test fails, write a faster equivalent for the commit build where possible, so the same problem is caught earlier next time.

Use a Deployment Pipeline

A deployment pipeline (also called a staged build) is a CI best practice that breaks the build process into multiple sequential stages.

Typical setup:

Commit Stage — Fast checks, must pass before merging.
Acceptance Stage — Full system tests, DB integration, real environment.
Staging/Production Stage — Manual or auto-triggered deployment.

Each stage adds more certainty. Early builds give developers quick feedback. Later stages run heavier tests, without slowing everyone down.

If the late-stage build catches a bug, the commit build should be strengthened to catch it earlier next time.
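
The staged structure can be sketched as a list of checks that run in order and stop at the first failure. The stage names follow the setup above, but `run_pipeline` and the lambda checks are placeholders for illustration; in practice each stage would invoke real build, test, or deploy commands.

```python
def run_pipeline(stages):
    """Run stages in order; return (completed stage names, failed stage or None)."""
    completed = []
    for name, check in stages:
        if not check():
            return completed, name   # later stages never run
        completed.append(name)
    return completed, None

stages = [
    ("commit", lambda: True),        # fast unit tests and build checks
    ("acceptance", lambda: False),   # slower system tests; failure simulated here
    ("deploy", lambda: True),        # only reached if everything before it passed
]

completed, failed = run_pipeline(stages)
print("completed:", completed)
print("failed at:", failed)
```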

Test in a Clone of the Production Environment

A critical part of reliable CI is running your tests in an environment that matches production as closely as possible. Otherwise, bugs caused by configuration differences may slip through unnoticed.

Strategies for environment consistency:

Use Docker or other containers to ensure identical app behavior.
Match:
- OS version
- Database engine + version
- Third-party services
- Networking (IP/port settings)
- Library versions
If production has special constraints (e.g., low memory, weak CPU, flaky network), simulate those too.

A passing test in dev but a crash in prod usually means the environments weren’t aligned well enough.

By testing in a cloned environment, you eliminate an entire category of bugs related to environment mismatch.

Benefits of CI

1. Reduced Risk of Delivery Delays

Large-scale integrations (pre-release or long-lived feature branches) are unpredictable.
The longer you wait to integrate, the more difficult and time-consuming the merge becomes.
CI eliminates this by integrating small, frequent changes that are easier to manage.
Delays are replaced with quick merges and fast recoveries, often within minutes or hours, not days or weeks.
Problems surface while there’s still time to fix them, rather than during crunch time.

2. Less Time Wasted on Integration

Integration becomes routine and uneventful.
You’re always working with a fresh and known-good codebase.
Less time is spent rebasing, resolving merge conflicts, or debugging old branches.
Small, continuous merges mean you’re working with code that’s still fresh in your mind.
The development team becomes more collaborative, because source control turns into a real-time communication channel.

3. Fewer Bugs, Easier Debugging

CI enforces self-testing code: automated tests that catch bugs before they spread.
Integration bugs are caught early while the change set is small and easier to diagnose.
Diff debugging becomes easier: if a test fails, you only need to inspect the last handful of commits.
CI exposes gaps in the test suite quickly and teams are incentivized to close them.

4. Enables Safe, Continuous Refactoring

CI encourages and enables continuous improvement of the codebase structure.
Refactoring is safer because:
- Changes are kept small.
- Tests are run automatically.
- Integration is fast and frequent.
Teams don’t have to “freeze” code or avoid reworking critical modules due to fear of breaking someone else's work.
The result: healthier codebases, faster onboarding, and easier scaling of features over time.

Teams that invest more in refactoring and tests tend to deliver features faster and more reliably.

5. Release to Production Becomes a Business Decision

With a Release-Ready Mainline, any successful build is potentially shippable.
Stakeholders can decide to release new features based on business needs, not technical readiness.
This enables:
- More frequent deployments
- Faster user feedback
- Tighter customer-developer collaboration
CI removes one of the biggest blockers to fast, user-centered software development: fear of releasing unfinished or unstable code.

When Not to Use CI

With all the benefits CI offers, it’s fair to ask: Is there ever a case where we shouldn’t use Continuous Integration?

The short answer is: CI is usually worth adopting, but it does require the right context and team readiness. Without that, CI can cause more frustration than value.

When the Team Isn't Committed

CI works best when the team:

Works full-time on the product
Integrates code frequently
Collaborates closely

Not ideal for:

Projects where contributors are loosely coordinated
Teams without shared working hours or visibility into each other’s work

In these scenarios, feature branching with pull requests is often more practical. Even so, increasing integration frequency (e.g. shorter-lived branches) is still beneficial when possible.

When the Team Lacks Key Practices

Trying CI without these prerequisites usually backfires:

No self-testing code: Bugs slip through undetected
No automation: Manual builds and tests make integration painful
No fixing: Developers push broken code to mainline, constantly breaking the build

CI isn’t just a tool, it’s a workflow supported by technical practices:

Automated builds
Strong test suites
Clean, version-controlled mainline

Continuous Deployment (CD)

What is Continuous Deployment?

Continuous Deployment (CD) is the practice of automatically pushing every change that passes your CI pipeline straight to production, with no manual approval required.

While Continuous Integration ensures that your code is always tested and merged cleanly, CD goes a step further and ships that code to your users.

CI = “It works.”
CD = “Now send it!”

Example: Robot Chef Deploys Omelets

Let’s say our robot chef already supports 100 recipes. After testing a new omelet feature through CI, CD takes over:

The new omelet logic is pushed and tested
CD packages the update and deploys it to production
Users can now order omelets with no downtime or manual work

CD vs. Continuous Delivery

Term	Description
Continuous Delivery	Every change is automatically tested and ready to deploy, but requires a manual trigger to go live.
Continuous Deployment	Every change that passes CI is automatically deployed, with no human involvement.

What a CD Pipeline Looks Like

Push to Main
CI Validates with tests & build
CD Packages the app (e.g., Docker container)
CD Deploys to prod (or staging → prod)
Smoke Tests + Monitoring check production
Feature is Live (and rollback-ready!)

CD General Practices

Green Builds Are Deployable

Only merge to main when you're ready to ship. If it passes CI, it should be safe to go live.

Manage Secrets Safely

Never hardcode credentials. Use:

GitHub Actions Secrets
AWS Secrets Manager / Parameter Store
HashiCorp Vault
.env.production (encrypted)
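
Whichever store you use, application code typically receives the secret through its environment at run time. A minimal sketch, assuming a hypothetical `get_secret` helper and a made-up variable name `DEPLOY_TOKEN`:

```python
import os

def get_secret(name):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# For demonstration only: in a real pipeline the CI service or secret manager
# injects this variable, and the value never appears in the source code.
os.environ["DEPLOY_TOKEN"] = "example-token"

token = get_secret("DEPLOY_TOKEN")
print("token loaded:", len(token), "characters")  # never log the secret itself
```

Failing early on a missing secret tends to give a clearer error than letting a deploy step fail later with an authentication problem.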

Monitor Everything

Use logging and metrics tools like:

Prometheus + Grafana
Datadog / New Relic
CloudWatch / Stackdriver

Set up alerts for:

Crashes
High latency
Error spikes

Be Ready to Rollback

CD only works if bad deploys are reversible. Rollback techniques:

Docker image version re-pull
Git reverts + redeploy
Infra-level rollback (e.g., Kubernetes, Terraform)

Blue/Green Deployment Strategy

Blue/Green Deployments keep two environments:

Blue = current live version
Green = new version being tested

Flow:

Deploy to Green (new version)
Test in production-like settings
Switch traffic from Blue → Green
If issues occur, switch back to Blue
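
The flow above can be sketched as a tiny simulation. `Router`, `blue_green_deploy`, and the two check callbacks are hypothetical names for illustration; in a real setup the router would be a load balancer or service selector, and the checks would be smoke tests and monitoring.

```python
class Router:
    """Stand-in for a load balancer that sends all traffic to one environment."""
    def __init__(self, live="blue"):
        self.live = live

    def switch_to(self, env):
        previous = self.live
        self.live = env
        return previous

def blue_green_deploy(router, smoke_test, healthy_after_switch):
    """Deploy to green, switch traffic, and roll back if problems appear."""
    if not smoke_test("green"):
        return "green failed smoke tests; blue stays live"
    previous = router.switch_to("green")
    if not healthy_after_switch():
        router.switch_to(previous)   # rollback is just another traffic switch
        return "rolled back to blue"
    return "green is live"

good = Router()
good_result = blue_green_deploy(good, lambda env: True, lambda: True)

bad = Router()
bad_result = blue_green_deploy(bad, lambda env: True, lambda: False)

print(good_result, "->", good.live)
print(bad_result, "->", bad.live)
```

Because the old environment keeps running until the switch is confirmed, rolling back does not require a rebuild or redeploy.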

Benefits:

Zero-downtime deploys
Fast rollbacks
Full prod-like validation before switching
Easier risk isolation

Supported By:

AWS Elastic Beanstalk
Kubernetes (services + selectors)
Google Cloud Run / App Engine
NGINX / HAProxy (manual routing)
Spinnaker, Argo Rollouts

Sample Blue/Green Deploy Script:

steps:
  - name: Deploy to Green
    run: ./deploy.sh --env green

  - name: Run Smoke Tests
    run: ./test.sh --env green

  - name: Switch Traffic to Green
    run: ./switch-traffic.sh --from blue --to green

  - name: Monitor
    run: ./monitor.sh --env green

Sample GitHub Actions CD Workflow

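# .github/workflows/deploy.yml
name: CD Pipeline

# Run on every push to the main branch.
on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest

    steps:
      # Check out the repository so the scripts below are available.
      - uses: actions/checkout@v4
      # build.sh, test.sh, and deploy.sh are project-specific scripts.
      # A failing step stops the job, so deploy only runs after a green build and test.
      - run: ./build.sh
      - run: ./test.sh
      - run: ./deploy.sh production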

Benefits of Continuous Deployment

Benefit	Description
Fast Feedback	See how real users respond within minutes
Smaller Deploys	Easier to debug, test, and roll back
Continuous Refactoring	Enables safer, incremental changes to improve code structure
Release Becomes a Business Choice	No technical blockers delay new features

When Not to Use Continuous Deployment

While CD is powerful, it’s not always the right fit.

Situation	Reason
Weak test coverage	You might ship broken code to production
Regulated environments	Manual approval may be required by compliance
No rollback plan	CD without escape hatches = risky operations

Tip: If your team isn’t ready for full CD, start with Continuous Delivery and build up confidence from there.

References

Shyam, K. (2020). Core CI/CD Concepts: A Comprehensive Overview. Dev.to. Available at: https://dev.to/kshyam/core-cicd-concepts-a-comprehensive-overview-ma6
Shore, J. (2005). A Practical Guide to Continuous Integration. James Shore's Blog. Available at: https://www.jamesshore.com/v2/books/aoad1/ten_minute_build
Swarmia. (2021). Continuous Integration: What It Is and Why It’s Important. Swarmia Blog. Available at: https://www.swarmia.com/blog/continuous-integration/
GART Solutions. (2020). Building an Effective CI/CD Pipeline: A Comprehensive Guide. Medium. Available at: https://gartsolutions.medium.com/building-an-effective-ci-cd-pipeline-a-comprehensive-guide-bb07343973b7
Fitz, T. (2019). Timothy Fitz's Blog. Timothy Fitz's Blog. Available at: http://timothyfitz.com/blog/
Fowler, M. (2024). Continuous Integration. Martin Fowler's Blog. Available at: https://www.martinfowler.com/articles/continuousIntegration.html#BuildingAFeatureWithContinuousIntegration

Networking

As a software developer, your work constantly relies on networks. Whether you're pulling code from a Git repository, making calls to a third-party API, connecting to a database across the room or across the country, or deploying your application to cloud servers, network connectivity is the invisible thread tying it all together.

Because this reliance is so fundamental, network problems can be significant roadblocks. When services become unreachable, APIs return strange errors, or deployments fail, understanding the underlying network behavior is crucial. This section aims to demystify network troubleshooting by providing you with both the foundational concepts and the practical command-line tools needed to diagnose common issues.

Network Stack / OSI Model

Computer networking involves many complex interactions. To make sense of it all, we use layered models, like the seven-layer OSI model, as a conceptual guide. While real-world systems blend these layers, the model provides a framework for isolating problems during troubleshooting.

Layer 1: Physical: This deals with the actual hardware transmitting the signals – Ethernet cables, fiber optics, Wi-Fi radios.
- Troubleshooting Focus: Is the cable plugged in securely? Is the Wi-Fi connected and the signal strong? Are there physical hardware failures?
Layer 2: Data Link: Manages communication within a single local network segment (like all devices connected to the same Wi-Fi router or Ethernet switch). It uses MAC addresses (unique hardware identifiers).
- Troubleshooting Focus: Can my machine communicate with the local router? Are there issues with the network switch? (Tool: arp)
Layer 3: Network: Handles addressing and routing between different networks using IP addresses. This is where the internet lives.
- Troubleshooting Focus: Does my machine have a valid IP address? Can it reach the destination network? Are there routing problems along the path? (Tools: ip addr, ip route, ping, traceroute)
Layer 4: Transport: Ensures data gets delivered reliably (or not, depending on the protocol) to the correct application process on the destination host. It uses protocols like TCP and UDP, along with port numbers.
- Troubleshooting Focus: Is the correct port open on the destination? Is a firewall blocking the connection? Is the expected service (like a web server) actually running and listening? (Tool: ss)
Layers 5-7: Session, Presentation, Application: These higher layers manage communication sessions, handle data formatting (like encryption/decryption with TLS/SSL, or character encoding), and define the specific protocols applications use to talk to each other (like HTTP for web, DNS for name resolution, SMTP for email).
- Troubleshooting Focus: Is the application server responding correctly? Are there DNS resolution errors? Are there issues with TLS certificates? Is the application itself misbehaving? (Tools: dig, curl, browser developer tools)

A critical point for developers is that Layers 2 through 7 are implemented primarily in software within the operating system kernel, system libraries, and the applications themselves. Only Layer 1 is purely hardware. This means network functions are susceptible to the same issues as any software: bugs, configuration mistakes, resource limitations, and security vulnerabilities. Thinking in layers allows you to systematically ask: "Is this a physical connection problem (L1), a local network issue (L2), an internet routing problem (L3), a transport/port issue (L4), or an application-level fault (L7)?"

Addressing and Transport

For devices to find each other across the internet or even a local network, they need unique addresses. This is the role of the Internet Protocol (IP) at Layer 3.

IP Addressing (Layer 3)

You'll primarily encounter two versions of IP addresses:

IPv4: The older format, written as four numbers separated by dots (e.g., 192.168.1.101 or 8.8.8.8). While still widely used, the available pool of unique IPv4 addresses is largely depleted.
IPv6: The modern format, using longer hexadecimal numbers separated by colons (e.g., 2001:db8::1). IPv6 provides an almost unimaginably large number of addresses to accommodate future growth.

Parsing `ip addr show` Output

The ip addr show command (or the abbreviated form ip a) lets you inspect the IP addresses and network interfaces on your Linux system.

$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:2/64 scope link 
       valid_lft forever preferred_lft forever

Here's how to interpret the key parts for each interface block (like 1: lo: or 2: eth0:):

1: lo: or 2: eth0:: The first number is the interface index. The name follows (lo is the special loopback interface for local communication; eth0, ensX, enpXsY are common names for Ethernet interfaces; wlan0 or wlpXsY for wireless).
<LOOPBACK,UP,LOWER_UP>: These are flags indicating the interface's status and capabilities.
- UP: The interface is administratively enabled (often controllable via ip link set eth0 up/down).
- LOWER_UP: The physical layer (Layer 1) is connected and active (e.g., cable plugged in, Wi-Fi associated). Crucially, you need both UP and LOWER_UP for the interface to be truly operational.
- LOOPBACK: This is the loopback interface.
- BROADCAST, MULTICAST: Indicate support for these Layer 2 addressing modes.
mtu 1500: Maximum Transmission Unit - the largest packet size (in bytes) this interface can transmit without fragmentation. Ethernet standard is often 1500.
qdisc fq_codel: Queuing discipline - how the kernel manages outgoing packets. Not usually critical for basic debugging.
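state UP: The overall operational state of the interface. UP means it's ready to use. Other states include DOWN and UNKNOWN. The loopback interface normally reports UNKNOWN, as in the sample above, and that is not a fault.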
link/ether 02:42:ac:11:00:02: The Layer 2 hardware address, also known as the MAC (Media Access Control) address. This is unique to the physical network card (usually). brd ff:ff:ff:ff:ff:ff is the broadcast MAC address.
inet 172.17.0.2/16: This line shows the assigned IPv4 address.
- 172.17.0.2: The actual IPv4 address.
- /16: The subnet mask in CIDR notation. /16 corresponds to 255.255.0.0, defining the size of the local network.
- brd 172.17.255.255: The broadcast address for this subnet.
- scope global: Indicates this address is valid system-wide (other scopes include host for loopback, link for link-local addresses used only on the immediate network segment).
inet6 fe80::42:acff:fe11:2/64: This line shows an assigned IPv6 address.
- fe80::...: This is a link-local IPv6 address, automatically configured and only usable on the local network segment. Globally routable IPv6 addresses typically start with other prefixes (like 2001:...).
- /64: The IPv6 prefix length, defining the subnet size.
- scope link: Confirms this is a link-local address.

When debugging, you primarily check if the relevant interface is UP and LOWER_UP, and if it has the expected inet (IPv4) or inet6 (global scope) address assigned.
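
The relationship between an address, its prefix length, the netmask, and the broadcast address can be checked programmatically. The following is a minimal sketch using Python's standard ipaddress module, with the addresses copied from the sample eth0 block above. It only illustrates the arithmetic behind the inet and inet6 lines and is not a replacement for ip addr.

```python
import ipaddress

# Address and prefix length copied from the sample eth0 block above.
iface = ipaddress.ip_interface("172.17.0.2/16")

print(iface.ip)                         # 172.17.0.2
print(iface.network)                    # 172.17.0.0/16
print(iface.network.netmask)            # 255.255.0.0
print(iface.network.broadcast_address)  # 172.17.255.255
print(iface.network.num_addresses)      # 65536

# The fe80:: address from the same block falls in the link-local range.
v6 = ipaddress.ip_interface("fe80::42:acff:fe11:2/64")
print(v6.ip.is_link_local)              # True
```

A /16 prefix leaves 16 host bits, which is why the network contains 65536 addresses and the broadcast address ends in 255.255.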

Transport Protocols (Layer 4): TCP and UDP

Once an IP address gets your data packets to the correct destination machine, Layer 4 protocols take over to deliver that data to the specific application waiting for it. The two major protocols here are TCP and UDP.

TCP (Transmission Control Protocol) acts like a reliable courier service for your data. Before sending anything substantial, it establishes a connection with the receiving end using a three-way handshake (SYN, SYN-ACK, ACK). This ensures both sides are ready and agree to communicate. Once the connection is up, TCP manages the data flow, breaking large chunks into numbered segments, ensuring they arrive in the correct order, and retransmitting any segments that get lost or corrupted along the way. This reliability makes TCP ideal for applications where data integrity and order are critical, such as loading web pages (HTTP/HTTPS), sending emails (SMTP), transferring files (FTP), or maintaining a persistent remote connection (SSH). However, this management comes at the cost of increased overhead and latency compared to UDP.

UDP (User Datagram Protocol), in contrast, operates more like the postal service for postcards. It's a connectionless protocol, meaning it bundles data into packets (datagrams) and sends them off towards the destination IP address and port without any prior negotiation or handshake. UDP makes a "best effort" attempt to deliver the data but provides no guarantees. Packets might arrive out of order, get duplicated, or never arrive at all, and UDP itself won't try to fix these issues. This approach results in lower overhead and latency than TCP, making UDP suitable for applications where speed is important and occasional data loss can be tolerated or handled by the application itself. Common examples include DNS lookups (where a quick request-response is needed, and the client can just retry if needed), DHCP (assigning IP addresses), live video and audio streaming (where retransmitting old data is pointless), and many online games (where timely updates are more important than guaranteed delivery of every single packet).
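
The difference between the two protocols is easy to observe on the loopback interface. The sketch below is an illustration only, using Python's standard socket module; unused_port is a hypothetical helper that asks the OS for a free port and releases it, so that (very likely) nothing is listening there. The UDP send reports success even though nobody receives the datagram, while the TCP connection attempt fails because the handshake cannot complete.

```python
import socket


def unused_port() -> int:
    """Ask the OS for a free port, then release it so nothing listens on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))      # port 0 means "pick any free port"
        return s.getsockname()[1]


port = unused_port()

# UDP: no handshake, so sending to a port with no listener still "succeeds".
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
    sent = udp.sendto(b"hello", ("127.0.0.1", port))
    print(f"UDP: sent {sent} bytes, with no idea whether anyone received them")

# TCP: the handshake has to complete first, and here it cannot.
try:
    with socket.create_connection(("127.0.0.1", port), timeout=2):
        tcp_result = "connected"
except OSError as exc:
    tcp_result = f"failed: {exc.__class__.__name__}"
print(f"TCP: {tcp_result}")
```

On a typical Linux machine the TCP attempt ends with ConnectionRefusedError, which is the same condition curl or a browser reports as "connection refused".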

Ports and Sockets

Whether using TCP or UDP, applications need a way to distinguish themselves from other services running on the same machine. This is achieved using port numbers, ranging from 0 to 65535. Many common services use "well-known" ports (e.g., HTTP on TCP port 80, HTTPS on TCP port 443, DNS on UDP/TCP port 53). The specific combination of an IP address, a transport protocol (TCP or UDP), and a port number forms a unique communication endpoint known as a socket (e.g., 172.17.0.2 using TCP on port 443). When troubleshooting, verifying that the service you're trying to reach is actually listening on the expected protocol and port number on the server side is important.

Basic Connectivity Testing

Having confirmed your machine possesses valid IP addresses (using ip addr show), the next logical step in troubleshooting is to test whether you can communicate with other devices, both locally and across the internet. Two command-line tools, ping and traceroute, operate primarily at the Network Layer (Layer 3) and are commonly used for connectivity testing.

Checking Reachability with `ping`

The ping command lets you check basic network reachability. Its function is simple: it sends ICMP Echo Request packets to a specified destination host. If the destination host is reachable and configured to respond (most are, unless blocked by a firewall), it will send back an ICMP Echo Reply. Receiving these replies confirms that a Layer 3 path exists between your machine and the target.

You use ping by simply providing the hostname or IP address you want to test:

# Ping Google's public DNS server by IP address
$ ping 8.8.8.8 
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=12.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=12.2 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=13.1 ms
^C  # Press Ctrl+C to stop
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 12.238/12.616/13.112/0.371 ms

# Ping a hostname (requires DNS to work first)
$ ping google.com
PING google.com (142.250.191.174) 56(84) bytes of data.
64 bytes from lga34s35-in-f14.1e100.net (142.250.191.174): icmp_seq=1 ttl=116 time=11.8 ms
... (Ctrl+C to stop) ...

Here is what the output means:

64 bytes from ...: This indicates a successful reply was received.
icmp_seq=N: The sequence number of the packet. Should increment steadily. Gaps indicate lost packets.
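ttl=N: Time To Live. A counter decremented by each router hop; a packet whose TTL reaches zero is discarded, which prevents infinite routing loops. Not usually critical for basic debugging. The value shown is what remained on the reply when it arrived, so a value close to a common starting TTL (64, 128, or 255) suggests a nearby host, while a much lower value suggests many hops.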
time=X ms: Round Trip Time (RTT) or latency – how long it took for the request to go out and the reply to come back. This is a crucial measure of network performance.
Statistics (--- ... ping statistics ---): Summarizes the test after you stop it (Ctrl+C). Look closely at packet loss %. Any loss indicates a potential network problem.
Error Messages:
- Request timed out (Windows) or no reply lines followed by 100% packet loss (Linux): No replies were received. The host might be down, unreachable, or a firewall might be blocking the ping requests or replies.
- Destination Host Unreachable: A router along the path reported that it doesn't know how to reach the destination network or host. This often points to a routing issue closer to your end or the destination's end.
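
If you want to track these numbers over time rather than read them by eye, the same fields can be pulled out of the text. The following is a rough sketch, assuming the Linux output format shown above; summarize_ping is a hypothetical helper, and other ping implementations word their output differently.

```python
import re

# Sample transcript copied from the example above.
SAMPLE = """\
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=12.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=12.2 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=13.1 ms
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
"""


def summarize_ping(output: str) -> dict:
    """Extract reply count, sequence gaps, average RTT, and loss percentage."""
    seqs = [int(n) for n in re.findall(r"icmp_seq=(\d+)", output)]
    rtts = [float(t) for t in re.findall(r"time=([\d.]+) ms", output)]
    loss = re.search(r"([\d.]+)% packet loss", output)
    # Sequence numbers that never got a reply show up as gaps.
    missing = sorted(set(range(1, max(seqs) + 1)) - set(seqs)) if seqs else []
    return {
        "replies": len(seqs),
        "missing_seq": missing,
        "avg_rtt_ms": round(sum(rtts) / len(rtts), 2) if rtts else None,
        "loss_percent": float(loss.group(1)) if loss else None,
    }


summary = summarize_ping(SAMPLE)
print(summary)
```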

To diagnose issues systematically, run ping in a specific order, moving outwards from your own machine:

1. ping 127.0.0.1 (or ping localhost): Tests your machine's own network stack. If this fails, there's a fundamental problem with your OS networking setup.
2. ping <your_own_IP>: Tests your specific network interface configuration. Should work if step 1 works.
3. ping <your_gateway_IP>: Tests connectivity to your local router (the "gateway" connecting your local network to others). Find your gateway using ip route show | grep default. Failure here points to a problem on your immediate local network (Wi-Fi, switch, router itself).
4. ping 8.8.8.8 (or another reliable external IP): Tests basic internet connectivity beyond your local network. If this fails but the gateway ping works, the issue is likely with your router's internet connection or your ISP.
5. ping <target_hostname> (e.g., ping google.com): Tests both internet connectivity and basic DNS name resolution. If ping 8.8.8.8 works but ping google.com fails, suspect a DNS problem (which we'll cover in the next lecture).
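
The reasoning behind this ordering is "the first step that fails tells you where to look." As a small sketch of that logic (the step names and the first_failure helper are made up for illustration, and the results dictionary is filled in by hand rather than by actually running ping):

```python
# Ordered checks from the list above, paired with what a failure suggests.
LADDER = [
    ("loopback", "the OS network stack itself"),
    ("own_ip", "the local interface configuration"),
    ("gateway", "the local network (Wi-Fi, switch, or router)"),
    ("external_ip", "the router's uplink or the ISP"),
    ("hostname", "DNS name resolution"),
]


def first_failure(results: dict) -> str:
    """Return a hint for the first failing step, or a success message."""
    for step, suspect in LADDER:
        if not results.get(step, False):
            return f"{step} failed: suspect {suspect}"
    return "all steps passed: Layer 3 and DNS look healthy"


# Hand-written observations: everything works except the hostname ping.
observed = {"loopback": True, "own_ip": True, "gateway": True,
            "external_ip": True, "hostname": False}
print(first_failure(observed))
# -> hostname failed: suspect DNS name resolution
```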

Note: For testing IPv6 connectivity, use ping -6 (older systems provide a separate ping6 command instead).

Discovering the Path with `traceroute`

While ping tells you if you can reach a destination, traceroute (and similar tools like tracepath or mtr) attempts to show you the path your packets take to get there. It reveals the sequence of routers (or "hops") between your machine and the target. This is useful for identifying where a connection is failing or where significant delays are occurring.

$ traceroute google.com 
traceroute to google.com (142.250.191.174), 30 hops max, 60 byte packets
 1  _gateway (192.168.1.1)  0.530 ms  0.480 ms  0.465 ms  # Hop 1: Your local router
 2  10.x.x.x (10.x.x.x)  8.120 ms  8.050 ms  7.995 ms   # Hop 2: ISP Router 1
 3  another-isp-router.net (A.B.C.D)  9.500 ms  9.450 ms  9.400 ms # Hop 3: ISP Router 2
 4  * * *                                                # Hop 4: No reply / Timeout
 5  some-backbone-router.net (E.F.G.H)  15.200 ms  15.150 ms  15.100 ms # Hop 5
 ... (more hops) ...
12  lga34s35-in-f14.1e100.net (142.250.191.174)  11.900 ms  11.850 ms  11.800 ms # Final destination

Interpreting traceroute output:

Each numbered line represents a hop (a router).
It usually shows the hostname (if resolvable) and IP address of the router.
The three time values (e.g., 0.530 ms 0.480 ms 0.465 ms) are the RTTs for three separate probes sent to that hop. Consistent times are good; large variations might indicate instability.
* * *: This means no reply was received from that hop within the timeout period. This could be due to network congestion, packet loss, or (very commonly) a router configured not to send ICMP Time Exceeded messages (often for security reasons). A few asterisks are not always a problem, especially if the trace completes, but a long string of them, or if the trace stops there, indicates a likely point of failure or blocking.
Latency Jumps: Look for sudden, significant increases in the RTT between consecutive hops. This can pinpoint a slow link or congested router along the path.
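
Spotting a latency jump amounts to comparing each hop's RTT with the previous responding hop. The sketch below assumes the Linux traceroute layout shown above; the sample lines use made-up addresses from private and documentation ranges, and hop_rtt and biggest_jump are hypothetical helper names.

```python
import re

# Made-up sample in the same layout as the traceroute output above.
SAMPLE = """\
traceroute to example.net (198.51.100.7), 30 hops max, 60 byte packets
 1  _gateway (192.168.1.1)  0.530 ms  0.480 ms  0.465 ms
 2  10.0.0.1 (10.0.0.1)  8.120 ms  8.050 ms  7.995 ms
 3  * * *
 4  198.51.100.7 (198.51.100.7)  48.200 ms  47.900 ms  48.500 ms
"""


def hop_rtt(line: str):
    """Return (hop, average RTT in ms), (hop, None) for '* * *', or None."""
    match = re.match(r"\s*(\d+)\s", line)
    if not match:
        return None                       # header or other non-hop line
    times = [float(t) for t in re.findall(r"([\d.]+) ms", line)]
    avg = sum(times) / len(times) if times else None
    return int(match.group(1)), avg


def biggest_jump(output: str):
    """Return (increase_ms, hop) for the largest RTT increase between hops."""
    prev, best = None, None
    for line in output.splitlines():
        parsed = hop_rtt(line)
        if parsed is None or parsed[1] is None:
            continue                      # skip headers and silent hops
        hop, rtt = parsed
        if prev is not None and (best is None or rtt - prev > best[0]):
            best = (rtt - prev, hop)
        prev = rtt
    return best


increase, hop = biggest_jump(SAMPLE)
print(f"largest increase: {increase:.1f} ms, first seen at hop {hop}")
```

Treat the result as a hint rather than a verdict: a jump that does not persist at later hops often just means that one router answers probes slowly.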

Alternatives like tracepath often provide slightly simpler output. mtr (My Traceroute) combines the two tools: it continuously sends ping-like probes to each hop that traceroute would identify, giving you a live view of latency and packet loss along the entire path. It's excellent for diagnosing intermittent issues.

By using ping to check basic reachability and traceroute to inspect the path, you gain insights into Layer 3 connectivity. If ping fails, you can isolate whether the issue is local, with your gateway, your ISP, or potentially DNS. If ping works but connections are slow or unreliable, traceroute or mtr can help pinpoint the segment of the network path responsible for the delay or packet loss.

Understanding DNS (Name Resolution)

Computers communicate using numerical IP addresses, but humans prefer names like www.google.com. The Domain Name System (DNS) acts as the internet's phonebook, translating these human-friendly names into computer-friendly IP addresses. This translation, called name resolution, is critical as without it, you couldn't easily browse the web or access most online services.

The resolution process typically involves checking a local cache first, then asking a configured DNS resolver (often provided by your ISP or specified manually, visible in /etc/resolv.conf). This resolver performs the recursive lookup on your behalf: it queries the root servers, then the servers for the top-level domain (such as .com), and finally the authoritative name server responsible for the specific domain you requested.

The primary tool for interacting with DNS from the command line is dig (Domain Information Groper):

# Find the IPv4 address (A record) for a hostname
$ dig A www.google.com +short 
142.250.191.142

# Find the IPv6 address (AAAA record)
$ dig AAAA www.google.com +short
2607:f8b0:4004:834::200e

# Trace the delegation path from the root servers down (very useful for debugging!)
$ dig +trace www.google.com 
# (Output shows queries to root, .com, google.com servers)

# Query a specific DNS server (e.g., Google's public DNS)
$ dig @8.8.8.8 A www.stanford.edu +short
171.67.215.200

When debugging, if ping <IP> works but ping <hostname> fails, DNS is the prime suspect. Use dig to check if the name resolves correctly, if it returns the expected IP, or if the query times out. Also, be aware that organizations often run internal DNS servers (like with Active Directory) to resolve private hostnames not known to the public internet. Ensure you're using the appropriate DNS resolver for the name you're trying to look up.
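
Applications usually do not call dig; they ask the operating system's resolver, which also consults local sources such as /etc/hosts before querying DNS. The following minimal sketch shows that path using Python's standard socket.getaddrinfo. The resolve helper is a hypothetical name, and localhost is used so the example works without internet access.

```python
import socket


def resolve(name: str) -> list:
    """Return the unique IP addresses the system resolver gives for name."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Resolution itself failed: a name problem, not a connectivity problem.
        print(f"{name}: resolution failed ({exc})")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as text.
    return sorted({info[4][0] for info in infos})


addresses = resolve("localhost")
print(addresses)
```

If dig returns an answer for a name but this kind of lookup fails (or the reverse), the difference usually lies in local configuration such as /etc/hosts or the resolver the system is set up to use.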

Network Boundaries: NAT, Firewalls & Proxies

Your computer's network connection rarely connects directly to the public internet without intermediaries. Understanding these boundaries is crucial for debugging.

Private IPs & NAT: Most devices on home, university, or corporate networks use private IP addresses (ranges like 10.x.x.x, 172.16.x.x to 172.31.x.x, 192.168.x.x). These are not routable on the public internet. Your router performs Network Address Translation (NAT), translating your device's private IP address into the router's single public IP address when sending traffic out, and doing the reverse for incoming traffic. You can see your private IP with ip addr show and find your public IP (as seen by the outside world) using a service like curl ifconfig.me. NAT usually works transparently but can complicate hosting services or peer-to-peer connections.
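
To check whether a given address falls inside one of those private ranges (defined in RFC 1918), you can test membership against the three networks. This is a small sketch using Python's ipaddress module; is_rfc1918 is a hypothetical helper name.

```python
import ipaddress

# The three private IPv4 ranges listed above, written in CIDR notation.
PRIVATE_V4 = [ipaddress.ip_network(n) for n in
              ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(addr: str) -> bool:
    """True if addr lies inside one of the private IPv4 ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PRIVATE_V4)


for addr in ("172.17.0.2", "192.168.1.101", "8.8.8.8", "172.32.0.1"):
    print(addr, "private" if is_rfc1918(addr) else "public")
# 172.17.0.2 private
# 192.168.1.101 private
# 8.8.8.8 public
# 172.32.0.1 public
```

Note that 172.32.0.1 is public: the 172.16.0.0/12 block stops at 172.31.255.255.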

Firewalls: These act as security guards for networks or individual hosts, filtering traffic based on defined rules (e.g., blocking incoming connections on specific ports unless explicitly allowed). Firewalls can exist on your own machine (e.g., ufw on Ubuntu, firewalld on CentOS/Fedora, Windows Firewall), on your network router, or as dedicated appliances or cloud services (like AWS Security Groups). If you can ping a host but cannot connect to a specific service (e.g., a web server on port 80), a firewall is a very likely culprit. Check the firewall rules on both the client and server sides. Basic host firewall status can often be checked with commands like sudo ufw status.

Proxies: A proxy server acts as an intermediary for your network requests. Instead of connecting directly, your traffic goes to the proxy, which then forwards it to the destination. Proxies are used for various reasons like caching web content, filtering traffic, enhancing security, or bypassing regional restrictions. In corporate or university environments, you might be required to configure your system or applications (often via environment variables like HTTP_PROXY and HTTPS_PROXY) to use a proxy. If connections fail inexplicably, check if a proxy is involved and if it's configured correctly; the proxy itself could be down or misconfigured.
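
Many tools and libraries pick up these environment variables automatically, which is why a stale proxy setting can break connections in ways that look unrelated. As a small illustration, Python's standard urllib.request.getproxies reports what it would use. The proxy address below is hypothetical and is set only inside this process.

```python
import os
import urllib.request

# Hypothetical proxy address, set only for this process as an illustration.
os.environ["HTTPS_PROXY"] = "http://proxy.example.edu:3128"

# Returns a mapping of URL scheme to proxy URL, built from variables
# named <scheme>_proxy (matched case-insensitively).
proxies = urllib.request.getproxies()
print(proxies.get("https"))
```

When debugging, printing this mapping (or running env | grep -i proxy in the shell) is a quick way to confirm whether a proxy is in play at all.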

Securing Communications

Transmitting data "in the clear" over networks is insecure, making it vulnerable to eavesdropping (confidentiality breach) and modification (integrity breach). Several mechanisms add security:

HTTPS (HTTP Secure): This is standard HTTP layered over TLS/SSL (Transport Layer Security/Secure Sockets Layer) encryption. It encrypts web traffic between your browser and the server, protecting sensitive data like logins and credit card numbers. It also uses digital certificates to help verify the server's identity. You see it as the padlock icon in your browser's address bar. Operates at Layers 6/7.
VPN (Virtual Private Network): Creates an encrypted "tunnel" between your device and a VPN server, typically encrypting all your network traffic over that tunnel. This protects your data on untrusted networks (like public Wi-Fi) and can make your traffic appear to originate from the VPN server's location/IP address. Often operates at Layer 3.
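
The certificate check mentioned under HTTPS is something client libraries enforce through their TLS configuration. As a minimal sketch, Python's standard ssl module shows the two settings that matter for server identity; no connection is made here.

```python
import ssl

# Default client-side settings recommended by the ssl module.
ctx = ssl.create_default_context()

# The server must present a certificate that chains to a trusted CA...
print(ctx.verify_mode == ssl.CERT_REQUIRED)   # True
# ...and the name in that certificate must match the host being contacted.
print(ctx.check_hostname)                     # True
```

Errors such as "certificate verify failed" or "hostname mismatch" come from these two checks. Turning them off makes the error go away but also removes the identity guarantee.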

Talking Directly to Web Services (`curl`)

Sometimes, it's useful to bypass the complexities of a web browser and interact directly with web services at the HTTP level. The curl command is an incredibly powerful and versatile tool for this. It lets you make HTTP requests, view responses, and diagnose application-level issues.

Here are some essential curl uses for debugging:

# Simple GET request (shows response body)
$ curl http://example.com

# Verbose output (-v): Shows connection details, request/response headers! CRUCIAL!
$ curl -v https://api.github.com/users/octocat 
*   Trying 140.82.121.4:443...          # DNS resolved, attempting TCP connection
* Connected to api.github.com (...) port 443 (#0) # TCP connection successful
* ALPN, offering h2/http1.1
* TLSv1.3 (OUT), TLS handshake (...)     # TLS negotiation details...
* SSL connection using TLSv1.3 / ...
> GET /users/octocat HTTP/1.1            # > Shows request headers sent by curl
> Host: api.github.com
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK                       # < Shows response status line & headers
< server: GitHub.com
< content-type: application/json; charset=utf-8
< ... (other headers) ...
< 
{                                       # Response body starts here
  "login": "octocat",
  "id": 583231,
  ...
}
* Connection #0 to host api.github.com left intact # Connection kept open for possible reuse

# Show only response headers (-I, uses HEAD request)
$ curl -I https://google.com

# Make a POST request with JSON data
$ curl -X POST -H "Content-Type: application/json" \
  -d '{"name":"New Item", "value":123}' \
  http://httpbin.org/post

Using curl -v is useful because it shows:

If the DNS resolution succeeded (Trying ...).
If the TCP connection was established (Connected to ...).
If the TLS handshake (for HTTPS) completed successfully (SSL connection using ...).
The exact HTTP request headers curl sent (>).
The HTTP status code and response headers received from the server (<). (e.g., 200 OK, 404 Not Found, 500 Internal Server Error).
The response body (unless using -I).

This allows you to quickly pinpoint whether a failure occurs at the connection level, during the TLS negotiation, or if the server is returning an application-specific error status code.
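
The line prefixes make this output easy to scan mechanically: * marks curl's own progress messages, > marks what was sent, and < marks what came back. The sketch below reads a shortened copy of the transcript above and reports how far the request got. stages_reached is a hypothetical helper, and the exact wording of the * lines varies between curl versions, so treat the string matches as assumptions.

```python
# Shortened copy of the verbose transcript above (elisions kept as-is).
SAMPLE = """\
*   Trying 140.82.121.4:443...
* Connected to api.github.com (...) port 443 (#0)
* SSL connection using TLSv1.3 / ...
> GET /users/octocat HTTP/1.1
> Host: api.github.com
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=utf-8
<
"""


def stages_reached(verbose_output: str) -> dict:
    """Report which stages of the request appear in curl -v output."""
    lines = verbose_output.splitlines()
    info = [line[1:].strip() for line in lines if line.startswith("*")]
    status = [line[2:] for line in lines if line.startswith("< HTTP/")]
    return {
        "tcp_attempted": any(i.startswith("Trying") for i in info),
        "tcp_connected": any(i.startswith("Connected to") for i in info),
        "tls_established": any(i.startswith("SSL connection using") for i in info),
        # With redirects there can be several status lines; keep the last.
        "status_code": int(status[-1].split()[1]) if status else None,
    }


result = stages_reached(SAMPLE)
print(result)
```

curl writes the verbose lines to standard error, so capturing them from a script requires redirecting stderr (for example with 2>&1).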

Checking Listening Services (`ss`)

When debugging why you can't connect to a service (like a web server or database you're running), you need to verify if the service is actually running on the server and listening for incoming connections on the expected network interface and port. The ss (socket statistics) command is the modern tool for this on Linux (replacing the older netstat).

The most useful invocation for this purpose is:

ss -tulnp

Let's break down those options:

-t: Show TCP sockets.
-u: Show UDP sockets.
-l: Show only Listening sockets (sockets waiting for incoming connections).
-n: Show Numeric addresses and ports (don't try to resolve names, faster and often clearer).
-p: Show the Process (program name and PID) that owns the socket. (Often requires sudo to see processes owned by other users).

Example Output:

$ sudo ss -tulnp
State    Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process                                     
LISTEN   0        4096       127.0.0.53%lo:53          0.0.0.0:*          users:(("systemd-resolve",pid=638,fd=13))    # DNS resolver listening locally
LISTEN   0        128            0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=870,fd=3))               # SSH daemon listening on all IPv4 interfaces
LISTEN   0        511            0.0.0.0:80          0.0.0.0:*          users:(("nginx",pid=1234,fd=6))             # Nginx web server listening on port 80 (IPv4)
LISTEN   0        128               [::]:22             [::]:*          users:(("sshd",pid=870,fd=4))               # SSH daemon listening on all IPv6 interfaces
LISTEN   0        511               [::]:80             [::]:*          users:(("nginx",pid=1234,fd=7))             # Nginx web server listening on port 80 (IPv6)

State: LISTEN for TCP sockets. UDP is connectionless, so listening UDP sockets appear as UNCONN instead.
Local Address:Port: Does this match the IP address and port you expect the service to be listening on?
- 0.0.0.0:<port> means listening on that port on all available IPv4 interfaces.
- [::]:<port> means listening on that port on all available IPv6 interfaces.
- 127.0.0.1:<port> means listening only for connections originating from the local machine itself.
- <specific_IP>:<port> means listening only on that specific IP address.
Process: Does the process name match the server application you expect (e.g., nginx, httpd, postgres, python)?

If the service you're trying to connect to doesn't appear in the ss -tulnp output, it's either not running or it's misconfigured and not listening on the network as expected.
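
A common finding is that the service is running but bound to an address the client cannot reach, such as 127.0.0.1. The rules for reading the Local Address:Port column can be written down as a small function. This is a sketch with a hypothetical helper name (exposure), using Python's ipaddress module and the values from the sample output above.

```python
import ipaddress


def exposure(local: str) -> str:
    """Classify an ss 'Local Address:Port' value by who can reach it."""
    host, _, port = local.rpartition(":")
    host = host.strip("[]").split("%")[0]   # drop IPv6 brackets and %interface
    if host == "*":
        # Some ss versions print * for a wildcard bind.
        return f"port {port}: all interfaces"
    ip = ipaddress.ip_address(host)
    if ip.is_unspecified:                    # 0.0.0.0 or ::
        return f"port {port}: all IPv{ip.version} interfaces"
    if ip.is_loopback:                       # 127.0.0.0/8 or ::1
        return f"port {port}: local machine only"
    return f"port {port}: only via {ip}"


for value in ("127.0.0.53%lo:53", "0.0.0.0:22", "[::]:80", "192.168.1.10:5432"):
    print(f"{value:<20} {exposure(value)}")
```

In the sample output, the DNS stub resolver on 127.0.0.53 is reachable only from the machine itself, while sshd and nginx accept connections on every interface.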

HyperText Transfer Protocol (HTTP)

The HyperText Transfer Protocol, commonly known as HTTP, serves as the communication protocol for the World Wide Web. It operates at the Application Layer (Layer 7) of the network stack. HTTP follows a request-response model: a client, typically your web browser, sends a request message to a server asking for a resource, and the server sends back a response message, often containing the requested resource or information about the request's outcome.

HTTP Request

Every time your browser needs something from a web server, it constructs an HTTP request. This request has several parts:

Method (or Verb): This specifies the action the client wants the server to perform on the resource. Common methods include:
- GET: Retrieve a resource.
- POST: Submit data to be processed to a specified resource, often causing a change in state or side effects on the server (like submitting a form or creating a new user).
- PUT: Replace the target resource with the request payload.
- DELETE: Remove the specified resource.
- HEAD: Similar to GET, but asks for only the response headers, not the actual resource body. Useful for checking if a resource exists or getting metadata without downloading the content.
Path (URI/URL): This identifies the specific resource the client is interested in on the server. It's the part of the web address that comes after the domain name, such as /about-us.html or /api/products/42.
HTTP Version: Indicates which version of the HTTP protocol the client is using, for example, HTTP/1.1 or HTTP/2.
Headers: These are key-value pairs that provide additional information or metadata about the request. Some common examples include:
- Host: Specifies the domain name of the server. Example: Host: www.example.com
- User-Agent: Identifies the client software making the request (e.g., browser type and version). Example: User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36
- Accept: Tells the server what content types the client can understand. Example: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
- Content-Type: Specifies the media type of the request body (used with POST or PUT). Example: Content-Type: application/json
- Authorization: Carries credentials for authenticating the client with the server. Example: Authorization: Bearer your_token_here
Body (Optional): This part contains the data being sent to the server, typically used with methods like POST or PUT. For instance, when you submit a login form, the username and password might be sent in the request body. If you are interacting with a web API, the body might contain data formatted in JSON. Example JSON body: {"username": "alice", "email": "alice@example.com"}

Here's a simplified example of a GET request:

GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: MySimpleBrowser/1.0
Accept: text/html

HTTP Response

After receiving and processing a request, the server sends back an HTTP response. This response mirrors the request structure in some ways:

HTTP Version: The protocol version the server is using.
Status Code: A three-digit numerical code indicating the result of the request. We'll explore these codes further below.
Reason Phrase: A short, human-readable text description accompanying the status code. For example, OK for status code 200.
Headers: Key-value pairs providing metadata about the response. Common response headers include:
- Content-Type: Specifies the media type of the resource being sent in the response body. Example: Content-Type: text/html; charset=UTF-8
- Content-Length: Indicates the size of the response body in bytes. Example: Content-Length: 1234
- Set-Cookie: Instructs the client to store a cookie. Example: Set-Cookie: sessionID=xyz789; HttpOnly; Path=/
- Location: Used in redirection responses (like 301 or 302) to tell the client where to find the resource. Example: Location: https://www.newexample.com/
Body (Optional): Contains the actual resource requested (like HTML code, an image file, or JSON data) or details about an error if the request failed. Responses to HEAD requests or status codes like 204 No Content intentionally omit the body.

Here's a simplified example of a successful response to the previous GET request:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 150
Date: Tue, 29 Apr 2025 20:11:00 GMT

<html>
<head><title>Example Page</title></head>
<body>
<h1>Hello!</h1>
<p>This is a simple example page.</p>
</body>
</html>
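
On the wire, HTTP/1.1 messages are plain text: each line ends with a carriage return and line feed (CRLF), and an empty line separates the headers from the body. The sketch below builds a request and parses a response by hand to make that structure visible. build_request and parse_response are hypothetical helpers for illustration; real programs should use an HTTP library instead.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble a raw HTTP/1.1 request as bytes."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    # Every line ends with CRLF; an empty line separates headers from body.
    return "\r\n".join(lines).encode("ascii") + b"\r\n\r\n" + body


def parse_response(raw):
    """Split a raw HTTP/1.1 response into its parts."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    parts = status_line.split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return version, code, reason, headers, body


request = build_request("GET", "/index.html", "www.example.com",
                        {"Accept": "text/html"})
print(request.decode("ascii"))

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html; charset=UTF-8\r\n"
       b"Content-Length: 15\r\n"
       b"\r\n"
       b"<h1>Hello!</h1>")
version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"], len(body))
```

Content-Length counts the bytes of the body only, which is why it is 15 for the short body in this sketch.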

Status Codes

HTTP status codes help with understanding the outcome of a request. They are grouped into five categories based on their first digit:

1xx (Informational): The request was received, and the process is continuing. These are rarely encountered in typical web browsing or API interactions.
2xx (Success): The request was successfully received, understood, and accepted.
- 200 OK is the standard code for a successful request. The requested data is usually in the response body.
- 201 Created typically follows a POST request that successfully created a new resource on the server.
- 204 No Content signifies success, but there's no data to return in the body. This is often used for DELETE requests or PUT updates that don't need to send data back.
3xx (Redirection): Further action needs to be taken by the client to complete the request, usually involving navigating to a different URL.
- 301 Moved Permanently indicates the requested resource has permanently moved to a new URL, provided in the Location header. Clients and search engines should update their links.
- 302 Found (or 307 Temporary Redirect) means the resource is temporarily at a different URL, also given in the Location header. Clients should continue using the original URL for future requests.
- 304 Not Modified is used for caching. It tells the client that the resource hasn't changed since the last request, so the client can use its cached version.
4xx (Client Error): The request contains bad syntax or cannot be fulfilled, likely due to an error on the client's side.
- 400 Bad Request suggests the server could not understand the request due to malformed syntax (e.g., invalid characters, missing required parts).
- 401 Unauthorized means authentication is required, and the client either hasn't provided credentials or the provided credentials are invalid.
- 403 Forbidden indicates that the server understood the request but refuses to authorize it. Unlike 401, authentication won't necessarily help; the client simply doesn't have permission to access the resource.
- 404 Not Found is perhaps the most famous code; the server cannot find the requested resource at the specified URL.
5xx (Server Error): The server failed to fulfill an apparently valid request due to an internal issue.
- 500 Internal Server Error is a generic code indicating an unexpected condition on the server prevented it from fulfilling the request (e.g., a bug in the server-side code).
- 502 Bad Gateway usually occurs when a server acting as a gateway or proxy received an invalid response from an upstream server it needed to query.
- 503 Service Unavailable means the server is temporarily unable to handle the request, often because it's overloaded or down for maintenance.
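
Because the category is determined by the first digit, a script can decide how to react even to a code it has never seen. A brief sketch using Python's standard http.HTTPStatus enum follows; describe is a hypothetical helper name.

```python
from http import HTTPStatus

# Category by first digit, as listed above.
CLASSES = {1: "informational", 2: "success", 3: "redirection",
           4: "client error", 5: "server error"}


def describe(code: int) -> str:
    """Return the code with its standard reason phrase and category."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "unregistered code"   # not in Python's table of known codes
    return f"{code} {phrase} ({category})"


for code in (200, 204, 301, 404, 503):
    print(describe(code))
```

For example, describe(404) gives "404 Not Found (client error)", and an unregistered code such as 599 is still classified as a server error.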

Web APIs

Instead of responding with HTML meant for display, a web API endpoint might respond with structured data, commonly in JSON (JavaScript Object Notation) format, intended for consumption by another program. For example, a mobile weather app likely communicates with a weather service's API over HTTP. The app sends a GET request to an API endpoint like /api/weather?location=CollegeParkMD, and the server responds with JSON data like {"temperature": 75, "condition": "Sunny", "unit": "F"}. The app then parses this JSON data and displays it to the user.

Web APIs use the same HTTP principles: methods (GET, POST, PUT, DELETE), status codes, headers, and optionally request/response bodies. Understanding HTTP is therefore essential for working with or building systems that rely on APIs. Many modern web applications heavily use APIs internally, even for their own front end interfaces (built with frameworks like React, Angular, Vue), to dynamically load data and update the user interface without full page reloads.
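
On the client side, the weather example comes down to two steps: encoding the query parameters into the URL and decoding the JSON body. The following is a minimal sketch using only Python's standard library. The endpoint path and the JSON are the illustrative ones from the text above, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# Illustrative endpoint path from the example above; no request is sent.
url = "/api/weather?" + urlencode({"location": "CollegeParkMD"})
print(url)  # /api/weather?location=CollegeParkMD

# The JSON body the server is described as returning.
body = '{"temperature": 75, "condition": "Sunny", "unit": "F"}'
data = json.loads(body)          # JSON object -> Python dict
print(f"{data['temperature']} {data['unit']}, {data['condition']}")
# 75 F, Sunny
```

urlencode also takes care of escaping, so a value containing spaces or an ampersand would be encoded instead of breaking the query string.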

Browser Developer Tools

For anyone working with web technologies, the browser's built-in Developer Tools are indispensable. You can usually open them by pressing F12 or right-clicking on a webpage and selecting "Inspect" or "Inspect Element".

Within these tools, the Network Tab is particularly powerful for understanding HTTP communication. It records every network request initiated by the webpage as it loads and runs. This includes requests for the main HTML document, CSS stylesheets, JavaScript files, images, fonts, and data requests made via JavaScript.

For each recorded request, the Network Tab typically shows:

The requested URL.
The HTTP Method used (e.g., GET, POST).
The Status Code returned by the server (e.g., 200, 404). Errors (4xx, 5xx codes) are often highlighted, perhaps in red, making them easy to spot.
The Type of resource requested (e.g., document, script, stylesheet, xhr).
The Size of the response.
The Time taken for the request.

System Monitoring Dashboard Project

Overview

In this project, you will create a basic system monitoring tool via a Bash script that runs in the terminal. The tool will display real-time system statistics and allow for simple historical data analysis. This project will help you practice shell scripting, data collection, and basic data analysis using common Unix tools.

Description

You will write a Bash script that will support three "operating modes": collection, display, and query. Users will interact with your script in the following way:

./[script name].sh [operating mode flag] [additional options]

The table below details the possible operating mode flags and additional options.

Flag	Type	Description
-c	Operating Mode	Enables the `Collection` operating mode. Required.
-d	Operating Mode	Enables the `Display` operating mode. Required.
-q	Operating Mode	Enables the `Query` operating mode. Required.
--start	Configuration	Sets the start datetime for a query. Required in the `Query` operating mode.
--end	Configuration	Sets the end datetime for a query. Required in the `Query` operating mode.
--help	User Flag	Prints out a message on how to use this script. Required.

Sample code to parse the operating mode commands is provided, but it is up to you to integrate it into your script.

Requirements

0. Overall Requirements

The script must:

Be written as a Bash script
Make use of variables, conditionals, loops, functions, variable substitutions, and other shell syntax as demonstrated in other parts of the course
Be commented, and clearly written
Implement the --help flag

1. Collection Mode

Collection mode must:

Be enabled via the -c flag
Collect CPU and memory usage (the amount of RAM currently being used by the system)
Parse and format the data into a simple CSV format
Append the collected data to a log file with a timestamp
Use system utilities like top and free with appropriate command line flags to determine CPU/memory usage; capturing resources does not need to account for multiple CPUs, as it is intended as a system summary

Note: free isn't available on macOS, so we recommend parsing the output of the memory_pressure command instead for memory usage. free should still work on WSL/Linux machines. Technically, you can use the PhysMem row in the top output on macOS, but this is mildly harder to parse. For WSL/Linux systems, we recommend looking at the row of the free output that starts with Mem.

Store the results in a CSV file that is determined by the environment variable SYSTEM_STATSFILE; if this variable is not present, use the default /tmp/system_stats.log
Collection times can use any timestamp format but it is suggested that the output of date be used as this is standard and widely available

Running your script in this mode looks like:

./system_monitoring_tool.sh -c

The output should be of the format:

DATETIME1, CPU%, MEM%
DATETIME2, CPU%, MEM%
DATETIME3, CPU%, MEM%
...

After running the script several times, the default output file accumulates contents like the following:

>> cat /tmp/system_stats.log 
2025-01-08 15:43:12,3.0,55.5611
2025-01-08 15:43:32,84.4,55.3498
2025-01-08 15:43:33,84.5,55.4192
2025-01-08 16:00:15,80.6,57.004
2025-01-08 16:00:20,83.3,57.0953
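
The file-selection and append behavior described above can be sketched as follows. This is only one possible approach: append_row is a hypothetical function name, and the CPU and memory values are passed in as arguments because measuring them is left out of the sketch.

```shell
# Append one CSV row (timestamp,CPU%,MEM%) to the stats file.
# append_row is a hypothetical helper; $1 = CPU percent, $2 = memory percent.
append_row() {
    # Use SYSTEM_STATSFILE if it is set and non-empty, otherwise the default.
    stats_file="${SYSTEM_STATSFILE:-/tmp/system_stats.log}"

    # Same timestamp layout as the example log above.
    timestamp="$(date '+%Y-%m-%d %H:%M:%S')"

    # >> appends to the file instead of overwriting it.
    printf '%s,%s,%s\n' "$timestamp" "$1" "$2" >> "$stats_file"
}

# Example call (commented out so that pasting this block writes nothing):
# append_row 3.0 55.5611
```

The ${VAR:-default} expansion falls back to the default when the variable is unset or empty, which covers the "variable is not present" case without an explicit if statement.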

2. Display Mode

Display mode must:

Be enabled via the -d flag.
Display current CPU and memory usage as a percentage
Update the display every 5 seconds using a loop that clears the terminal, redraws the output, and pauses with the sleep command
Obtain CPU and memory usage using standard tools, as was done in collection mode
Show a visual representation of the usage as a horizontal “bar” graph, with the output lined up attractively

Suggested output is shown below:

>> ./system_monitoring_tool.sh -d
System Monitor Dashboard
========================
CPU Usage:     19.4% [###                 ]
Memory Usage:  57.7% [###########         ] 
========================
Press Ctrl+C to exit
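
One way to produce a bar like the ones in the suggested output is sketched below. The function name draw_bar and the default width of 20 characters are assumptions chosen to match the sample. awk does the arithmetic because shell arithmetic does not handle decimals.

```shell
# Print a fixed-width bar for a percentage.
# draw_bar is a hypothetical helper; $1 = percent, $2 = bar width (default 20).
draw_bar() {
    percent="$1"
    width="${2:-20}"

    # Number of filled cells, truncated to a whole number and clamped to 0..width.
    filled="$(awk -v p="$percent" -v w="$width" 'BEGIN {
        n = int(p * w / 100)
        if (n < 0) n = 0
        if (n > w) n = w
        print n
    }')"

    # Build the bar one character at a time: "#" for filled, space for empty.
    bar=""
    i=0
    while [ "$i" -lt "$width" ]; do
        if [ "$i" -lt "$filled" ]; then
            bar="${bar}#"
        else
            bar="${bar} "
        fi
        i=$((i + 1))
    done
    printf '[%s]\n' "$bar"
}

printf 'CPU Usage:     %s%% %s\n' 19.4 "$(draw_bar 19.4)"
printf 'Memory Usage:  %s%% %s\n' 57.7 "$(draw_bar 57.7)"
```

In a full display loop, a call like this would sit between clearing the screen and sleeping for 5 seconds.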

3. Query Mode

Query mode must:

Be enabled via the -q flag
Accept start and end date/time parameters via the --start and --end flags respectively
Filter and display data from the log file within the specified time range
Output the results in a readable CSV format
Honor the SYSTEM_STATSFILE variable, defaulting to /tmp/system_stats.log if that variable is not set
Implement error detection when the command line arguments do not specify start/end dates for the query
It is strongly suggested that UNIX text processing tools be used to complete the required functionality; the instructor solution relies heavily on awk due to ease of date comparisons.

Sample output is shown below:

# error cases for underspecifying the command line invocation
>> ./system_monitoring_tool.sh -q
ERROR: Both start and end dates must be specified
Usage: query_stats.sh --start 'YYYY-MM-DD HH:MM:SS' --end 'YYYY-MM-DD HH:MM:SS'
Displays all stats between the start and end times.

>> ./system_monitoring_tool.sh -q --start '2025-01-08 15:43:12'
ERROR: Both start and end dates must be specified
Usage: query_stats.sh --start 'YYYY-MM-DD HH:MM:SS' --end 'YYYY-MM-DD HH:MM:SS'
Displays all stats between the start and end times.

# contents of the log (default file)
>> cat /tmp/system_stats.log 
2025-01-08 15:43:12,3.0,55.5611
2025-01-08 15:43:32,84.4,55.3498
2025-01-08 15:43:33,84.5,55.4192
2025-01-08 16:00:15,80.6,57.004
2025-01-08 16:00:20,83.3,57.0953

# show a range of entries
>> ./system_monitoring_tool.sh -q --start '2025-01-08 15:43:12' --end '2025-01-08 15:43:33'
Timestamp,CPU%,Memory%
2025-01-08 15:43:12,3.00,55.56,
2025-01-08 15:43:32,84.40,55.35,
2025-01-08 15:43:33,84.50,55.42,

# show a different range of entries
>> ./system_monitoring_tool.sh -q --start '2025-01-08 15:43:15' --end '2025-01-08 16:16:00'
Timestamp,CPU%,Memory%
2025-01-08 15:43:32,84.40,55.35,
2025-01-08 15:43:33,84.50,55.42,
2025-01-08 16:00:15,80.60,57.00,
2025-01-08 16:00:20,83.30,57.10,
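
Because timestamps in the YYYY-MM-DD HH:MM:SS layout sort the same way as plain strings, awk can filter a range with ordinary string comparisons. The sketch below illustrates that idea only. It is not the instructor solution, query_range is a hypothetical name, and the exact output formatting is up to you.

```shell
# Print rows of a "timestamp,cpu,mem" CSV whose timestamp lies in [start, end].
# query_range is a hypothetical helper; $1 = start, $2 = end, $3 = log file.
query_range() {
    awk -F',' -v s="$1" -v e="$2" '
        BEGIN { print "Timestamp,CPU%,Memory%" }
        # String comparison works because the fields run from year down to second.
        $1 >= s && $1 <= e { printf "%s,%.2f,%.2f\n", $1, $2, $3 }
    ' "$3"
}

# Example call (assumes the log file already exists):
# query_range '2025-01-08 15:43:12' '2025-01-08 15:43:33' /tmp/system_stats.log
```

This approach only works if every row uses the same zero-padded timestamp layout. A format like the default output of date would need to be converted before comparing.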

Grading Criteria (60pts)

Script (40pts)

Clearly written, and commented code (10 points)
Meets specification laid out in the requirements section. (10 points per mode)

Video (20 pts)

Video must be under 6 minutes (20% penalty otherwise).
Screen recording showing the following:
- Go through the source code for each part of your script and explain its functionality (10 points)
  - It is easy for me to see what the code is doing; I also care about why the code is written the way it is.
- Run the script and demonstrate its functionality based on the project specification (10 points)

Submission Requirements

Submit a single Bash script, system_monitoring_tool.sh.
Submit a video showing the usage of your script.
Upload the required video and files to the assignment listed on Gradescope.

Late Policy

10% deduction per day late

Sample Code and Additional Info

Operating Mode Parsing

while [[ $# -gt 0 ]]; do
    case "$1" in
        -c)
            echo "Collect mode"
            ;;
        -d)
            echo "Display mode"
            ;;
        -q)
            echo "Query mode"
            ;;
        *)
            echo "Unknown parameter: $1"
            echo "Usage: $0 [-c | -d | -q]"
            exit 1
            ;;
    esac
    shift
done

For parsing the start and end dates, the $@ variable can be used to pass the command line arguments to a function. Note that calling shift discards $1 (the first argument) and shifts the remaining arguments down by one. You can then parse the arguments using a while loop very similar to the one above.
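
A minimal sketch of that idea is shown below, assuming the flag names from the table above. The function name parse_query_args and the variables start and end are made up for this example, and the error text mirrors the sample output.

```shell
# Parse --start and --end from the arguments passed to this function.
# parse_query_args is a hypothetical helper; it sets the global
# variables start and end, and returns non-zero on bad input.
parse_query_args() {
    start=""
    end=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --start)
                [ $# -ge 2 ] || { echo "ERROR: --start needs a value" >&2; return 1; }
                start="$2"
                shift   # drop the flag; the shift after esac drops its value
                ;;
            --end)
                [ $# -ge 2 ] || { echo "ERROR: --end needs a value" >&2; return 1; }
                end="$2"
                shift
                ;;
            *)
                echo "Unknown parameter: $1" >&2
                return 1
                ;;
        esac
        shift
    done

    # Both values are required in query mode.
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "ERROR: Both start and end dates must be specified" >&2
        return 1
    fi
}

# Typical use after the -q flag has been consumed:
# parse_query_args "$@" || exit 1
```

Quoting "$@" keeps each date/time argument intact even though it contains a space.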

WSL / VMs

WSL and VMs are known to report inaccurate or very low system performance metrics; this is normal and will not impact your grade. You may see output like:

%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

This is because WSL runs as a VM, so it only tracks the usage of WSL itself, which will likely be rather low.
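
A summary line like the one above can be turned into a single CPU usage number by subtracting the idle percentage from 100. The sketch below assumes the Linux top layout shown above. The function name cpu_usage_from_top_line is hypothetical, and other systems such as macOS print a different format.

```shell
# Convert a Linux top "%Cpu(s)" summary line into CPU usage (100 - idle).
# cpu_usage_from_top_line is a hypothetical helper; $1 = the summary line.
cpu_usage_from_top_line() {
    printf '%s\n' "$1" | awk -F',' '
        {
            for (i = 1; i <= NF; i++) {
                # The comma-separated field ending in "id" holds the idle percentage.
                if ($i ~ /id *$/) {
                    idle = $i
                    gsub(/[^0-9.]/, "", idle)   # keep only the number
                    printf "%.1f\n", 100 - idle
                }
            }
        }'
}

line='%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st'
cpu_usage_from_top_line "$line"
```

On Linux, a line like this can be captured non-interactively with top -b -n 1 piped through grep.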

System Monitoring Dashboard Project

Overview

In this group project, you will be working on a basic calculator app, and have to make some changes and work dynamically with the Git history. Then, you will combine all of your changes with a pull request to simulate an actual development environment.

Repository

See instructions here.

System Monitoring Dashboard Project

Overview

In this project, you will implement a CI pipeline via GitHub actions.

Repository

See instructions here.

Networking Project

Overview

In this project, imagine you are a software engineer / sysadmin at a company and you get 3 customer tickets triaged to you. It's your job to read the ticket and diagnose the issue by using common Linux networking tools. See the details on how you should reproduce the tickets in their respective folders.

See the tickets here.

You do not need to worry about the contents of the docker files and docker compose files.

Note: Ticket 2 Issue

You may see an issue related to undefined symbol: EVP_MD_CTX_get_size_ex on Ticket 2. This is related to some open bugs but you can reason about the bug by looking at the config file.

Requirements

You will need Docker installed on your machine for this project. Read more here.

Write Up (15 points each x 3 tickets)

For each ticket, you must write up a one-paragraph (8 sentences max) bug report that contains the following:

A brief description of the issue (2 points)
Steps you took to reproduce the issue (4 points)
What tools you used to diagnose the issue; to get points, you MUST include the specific commands you ran and why you ran them. (4 points)
What is the issue? (5 points)

You must use a tool that we talked about in class or within the networking notes or you will not be awarded points.

See more information in the respective ticket folders.

Application Day: ToDo List App

Overview

Create a command-line todo list manager that allows users to add, view, complete, and manage their tasks. The application should store todos persistently.

Requirements

Your todo manager must implement the following features:

Add Tasks: Allow users to add new todo items to their list
Display Tasks: Show all current todos with visual indicators for completion status
Mark Complete: Enable users to mark one or more tasks as completed
Remove Tasks: Allow deletion of specific tasks by their position/number
Clean Up: Provide functionality to remove all completed tasks at once
Persistent Storage: Store todos in a file that persists between sessions

Technical Specifications

File Storage

Use a simple text format with checkboxes: [ ] for incomplete, [x] for complete
Each todo should be on its own line

Command Interface

Implement a command-line interface that supports:

./todo.sh add <task description> - Add a new task
- Useful tools: echo
./todo.sh done <task_number(s)> - Mark task(s) as complete
- Useful tools: sed
./todo.sh rm <task_number> - Remove a specific task
- Useful tools: sed
./todo.sh clean - Remove all completed tasks
- Useful tools: grep, input output redirection via > and >>
./todo.sh (no arguments) - Display all current todos
- Useful tools: cat -n
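
As a hedged illustration of how sed can target a single line by number, the sketch below flips the checkbox on one line of a todo file. The function name mark_done is hypothetical. It writes through a temporary file rather than using sed -i, because the -i flag behaves differently on GNU and BSD/macOS sed.

```shell
# Mark line number $2 of todo file $1 as complete: "[ ] task" -> "[x] task".
# mark_done is a hypothetical helper, not part of the starter code.
mark_done() {
    file="$1"
    n="$2"

    # Accept only a plain positive integer as the line number.
    case "$n" in
        ''|*[!0-9]*) echo "Invalid task number: $n" >&2; return 1 ;;
    esac

    tmp="$(mktemp)"
    # "${n}s/.../.../" applies the substitution to line n only.
    sed "${n}s/^\[ \]/[x]/" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example call:
# mark_done "$HOME/.todos" 1
```

The same line-addressing idea works for deletion, since sed accepts a line number in front of its d command as well.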

Display Format

Show todos with line numbers
Use visual indicators for task status:
- Incomplete tasks: ⬜ (must use this emoji)
- Completed tasks: ✅ (must use this emoji)
Number each todo for easy reference

Implementation Guidelines

# Add some tasks
$ ./todo.sh add "Buy groceries"
Added: Buy groceries

$ ./todo.sh add "Finish homework"
Added: Finish homework

$ ./todo.sh add "Call dentist"
Added: Call dentist

# View current todos
$ ./todo.sh
     1  ⬜ Buy groceries
     2  ⬜ Finish homework
     3  ⬜ Call dentist

# Mark task as complete
$ ./todo.sh done 1
Completed task #1

# View updated list
$ ./todo.sh
     1  ✅ Buy groceries
     2  ⬜ Finish homework
     3  ⬜ Call dentist

# Remove a task
$ ./todo.sh rm 3
Removed task #3

# Clean up completed tasks
$ ./todo.sh clean
Cleaned up 1 completed tasks

Submission Requirements

Source code consisting of a runnable Bash script called todo.sh.

Evaluation Criteria (5 pts)

5 pts awarded per function.
We want to see a good-faith attempt on each part.

Starter Code

#!/bin/bash
# todo - Simple todo list manager

TODO_FILE="$HOME/.todos"

add_todo() {
    echo "IMPLEMENT ME!!"
}

# Write other functions here...

# Make sure the todo file exists (quoted in case the path contains spaces)
touch "$TODO_FILE"

case "$1" in
    add)    shift; add_todo "$@" ;;
    # add other cases here...
esac

Docker Application Days

Assignment repository: https://github.com/mdurrani808/DockerApplicationDay

See the relevant task descriptions under the part{1,2,3} folder. Submissions will be due on Gradescope.
