The Unix operating system was born at Bell Labs, the research laboratories of AT&T in the United States. Created in the late 1960s, it derives from Multics, a system that Bell Labs had been co-developing with MIT and General Electric since the mid-1960s. Unix spread rapidly because Bell Labs distributed its new system as freely modifiable source code. This led to the emergence of Unix families produced by the system's main users: research laboratories on one hand and major computer manufacturers on the other.
From the beginning, Unix development has been closely linked to scientific computing, which helps explain why this operating system is still widely used in many research fields today.
Today, Unix is a registered trademark of The Open Group, which publishes the Single UNIX Specification and certifies conforming systems. However, there is a broader definition that includes "Unix-like" systems such as GNU/Linux. Despite proclaiming in its name not to be Unix (GNU stands for "GNU's Not Unix"), this family of operating systems is so functionally similar to its ancestor that it is difficult to explain in what way it isn't Unix.
Nowadays, a Unix system can be installed on virtually any machine, from personal computers to large computing servers. Notably, for several years, Apple's standard operating system on Macintosh computers, macOS, has been a certified Unix system.
Unix is a multitasking and multi-user operating system. This means it can manage the simultaneous use of the same computer by multiple people, and for each person, it allows parallel execution of multiple programs. The multiplicity of users and running programs on the same machine requires particular resource management, involving restricted rights for each user so that one person's work doesn't interfere with another's.
User accounts are described in the `/etc/passwd` file, where each line corresponds to a user. Within a line, fields are separated by `:` characters. In order: login, encoded password, UID, group ID, full name, home directory, and default shell.
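As a minimal sketch of this format, the line below is a made-up entry (the account `alice` and all of its values are hypothetical, not taken from a real system). Splitting it on `:` recovers the seven fields in the order listed above.

```bash
# Hypothetical entry in the /etc/passwd format (not a real account)
entry='alice:x:1001:1001:Alice Martin:/home/alice:/bin/bash'

# Split the line on ':' into its seven fields
IFS=: read -r login password uid gid fullname home shell <<EOF
$entry
EOF

echo "login: $login"
echo "UID:   $uid"
echo "name:  $fullname"
echo "home:  $home"
echo "shell: $shell"
```

On most modern systems the password field only contains a placeholder such as `x`, the encoded password being kept in a separate, protected file.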
The file system of an operating system encompasses all mechanisms for managing storage space (hard drives) on the computer. Data and programs are stored in files. A file can be thought of as a small part of a hard drive dedicated to storing a set of data.
In a Unix system, a file name describes a path in a tree. A file name starts with a `/` character and consists of successive node labels describing the file's location in the name tree. Each label is separated from the preceding one by the `/` character.
For example, the file `/etc/passwd` indicates that this file is located at a node (directory) named `/etc`, which itself is located at the root of the file name tree `/`.
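The decomposition of a file name into a directory part and a final label can be checked from the shell. This short sketch uses the standard `dirname` and `basename` commands on the same example path; the variable names are only illustrative.

```bash
path=/etc/passwd

# Everything up to the last '/' is the directory containing the file
dir=$(dirname "$path")
# The last label is the name of the file inside that directory
name=$(basename "$path")

echo "directory: $dir"
echo "file:      $name"
```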
The concept of a link can be compared to a shortcut in other operating systems. A link is a special file that creates additional edges in the file name tree. From a computer science perspective, the tree then becomes a more general directed graph: a Directed Acyclic Graph (DAG) as long as no link points back to one of its own ancestors.
```{mermaid}
graph LR
root["/"]
usr["/usr"]
bin["/usr/bin"]
home["/home"]
alice["alice"]
programs["programs<br/>(link)"]
grep["grep"]
root --> usr
root --> home
usr --> bin
home --> alice
alice --> programs
bin --> grep
programs -.->|symbolic link| bin
style programs fill:#fff2cc
style grep fill:#e1f5ff
```
Creating a link in a Unix file system creates a synonym between the link name and the target file.
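The following sketch reproduces the idea of the diagram in a temporary directory, assuming the standard `ln -s` command to create a symbolic link. The directory and file names (`bin`, `programs`, `tool.txt`) are hypothetical and only serve the illustration.

```bash
# Work in a throwaway directory so nothing on the system is modified
workdir=$(mktemp -d)
mkdir "$workdir/bin"
echo "hello" > "$workdir/bin/tool.txt"

# Create a symbolic link named "programs" that points to "bin"
ln -s "$workdir/bin" "$workdir/programs"

# The file is now reachable through the link name as well
content=$(cat "$workdir/programs/tool.txt")
echo "$content"

# In the long listing, the link appears with an arrow to its target
ls -l "$workdir"

rm -rf "$workdir"
```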
### The `.` and `..` Directories
Unix uses links to facilitate navigation in the file name tree. When creating a directory node, the system automatically adds two links under this node: `.`, which refers to the directory itself, and `..`, which refers to its parent directory.
These links mean that for each file, there isn't just one name but an infinite number of possible names. The file `/home/alice/myfile` can also be named:
- `/home/alice/./myfile`
- `/home/alice/../../home/alice/myfile`
- `/home/alice/./././myfile`
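One way to check that such names are synonyms is the `-ef` test operator, which bash (and most other shells) provide to tell whether two names designate the same file. The sketch below rebuilds a `home/alice` hierarchy under a temporary directory rather than touching the real `/home`.

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/home/alice"
echo "data" > "$workdir/home/alice/myfile"

same=yes
for name in \
    "$workdir/home/alice/./myfile" \
    "$workdir/home/alice/../../home/alice/myfile" \
    "$workdir/home/alice/./././myfile"
do
    # -ef is true when both names lead to the very same file
    if [ "$name" -ef "$workdir/home/alice/myfile" ]; then
        echo "same file: $name"
    else
        same=no
    fi
done

rm -rf "$workdir"
```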
### Current Directory and Relative Paths
The hierarchical tree structure of Unix file names is powerful but often produces very long file names. To work around this problem, Unix offers the concepts of current directory and relative paths.
**Current Directory**: When working on a machine, you typically work on a set of files located in the same region of the name tree. The common part of all these names is stored in an environment variable called `PWD` (Present Working Directory).
By default, when you log into your Unix account, this variable is initialized with your home directory name. You can change this variable's value using the `cd` command.
**Relative Paths**: Relative file names are expressed relative to the current directory. To know the true name corresponding to a relative name, you concatenate the current directory name and the relative name.
Example:
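```bash
# If current directory is: /home/alice/experiment_1
# the relative name results.txt (an illustrative name) means /home/alice/experiment_1/results.txt
```
A relative name is recognized by the fact that it doesn't start with `/`. In contrast, complete file names are called absolute paths and always start with `/`.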
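The equivalence between a relative name and the absolute name built from `PWD` can be verified directly. This is a small sketch in a temporary directory; `experiment_1` and `result.txt` are hypothetical names.

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/experiment_1"
echo "42" > "$workdir/experiment_1/result.txt"

# Make experiment_1 the current directory
cd "$workdir/experiment_1"

relative=result.txt
# Absolute name = current directory + "/" + relative name
absolute="$PWD/$relative"
echo "$absolute"

# Both names give access to the same content
from_relative=$(cat "$relative")
from_absolute=$(cat "$absolute")
echo "$from_relative / $from_absolute"

cd /
rm -rf "$workdir"
```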
### Access Rights
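Unix is a multi-user system. To protect each user's data from others, each file belongs to a specific user (usually its creator) and a user group. Additionally, each file has access rights (read, write, and execute) defined separately for its owner, for the members of its group, and for all other users.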
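As an illustration, the sketch below uses the standard `chmod` command to set the rights of a temporary file (the name `notes.txt` is hypothetical), then reads them back from the first ten characters of the `ls -l` listing.

```bash
workdir=$(mktemp -d)
file="$workdir/notes.txt"
echo "private data" > "$file"

# 6 = read+write for the owner, 4 = read for the group, 0 = nothing for others
chmod 640 "$file"

# Keep only the type and permission characters of the long listing
mode=$(ls -l "$file" | cut -c1-10)
echo "$mode"    # -rw-r-----

rm -rf "$workdir"
```

The first character gives the file type (`-` for a regular file); the next nine are three `rwx` groups for the owner, the group, and the other users.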
A program corresponds to a sequence of calculation instructions that the computer must execute to perform a task. While it's important to store this instruction sequence for regular reuse, it's equally important to execute it. A process corresponds to the execution of a program.
Since Unix is multitasking and multi-user, the same program can be executed simultaneously by multiple processes. It's therefore important to distinguish between program and process.
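The distinction can be observed from the shell. In this sketch the same program, `sleep`, is launched twice in the background; each execution is a separate process with its own process identifier (PID), which the shell exposes through `$!`.

```bash
# Launch the same program twice, in the background
sleep 1 &
pid1=$!
sleep 1 &
pid2=$!

# $$ is the PID of the shell itself; the two sleeps have their own PIDs
echo "shell process: $$"
echo "first sleep:   $pid1"
echo "second sleep:  $pid2"

# Wait for both processes to finish
wait "$pid1" "$pid2"
```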
A process can be considered as part of the computer's memory dedicated to program execution. This memory chunk can be divided into three main parts: the environment, data area, and program area.
A process is an isolated memory area where a program executes. Isolation secures the computer by preventing a program from corrupting others' execution. However, during execution, a program must interact with the rest of the computer.
The process environment is dedicated to this interface task. It contains descriptions of system elements the process needs to know. Two main types of information are stored:
The Unix shell is the most important program for a Unix user. It's how they interact with their computer. There's a graphical window system under Unix similar to Windows or macOS, called X Window System (X11), which can operate in client/server mode across networks. However, we'll focus on interacting with Unix in "text" mode via the shell.
A shell command describes how to trigger program execution with all necessary information. In principle, every program installed on a Unix machine corresponds to a command usable from the shell and bearing the program's name. Conversely, every Unix command is the name of an installed program, with the exception of a few commands built into the shell itself, such as `cd`.
A Unix command is the name of a program installed on the machine. When you execute a command like `ls` or `grep`, you're actually launching execution of an eponymous program stored somewhere on your hard drives.
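To see where these programs live, the shell's `command -v` can be asked which file a command name resolves to. The exact locations printed depend on the machine, so none is shown here.

```bash
for cmd in ls grep; do
    # command -v prints the path of the program the shell would run
    location=$(command -v "$cmd")
    echo "$cmd is the program $location"
done
```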
Arguments indicate data the program should process, beyond data potentially transmitted through standard input. Depending on how it's programmed, a program can accept one or multiple arguments. Each argument may have a distinct role depending on the program.
To understand each argument's role, consult the program's manual page via the `man` command, or its built-in help, usually accessible with the `-h` or `--help` option.
Redirections, the fourth part of a Unix command line, are crucial: they allow you to specify how your program should configure its standard inputs/outputs. This is one of the most important things to understand to fully benefit from the Unix system.
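To illustrate how arguments reach a program, the sketch below uses a shell function, `count_args` (a hypothetical name), as a stand-in for a program. Functions receive their arguments the same way scripts do: `$#` holds their number and `"$@"` the list. The file names are only examples.

```bash
# Hypothetical helper that reports the arguments it receives
count_args() {
    echo "received $# argument(s)"
    for arg in "$@"; do
        echo "  -> $arg"
    done
}

# Two arguments, separated by spaces
count_args gis_vercors.dat gis_belledonne.dat

# Quotes keep a value containing a space as a single argument
count_args "two words" single
```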
## File Name Patterns with Wildcards
It's very common in a Unix command to need to specify multiple file names. When the number of files becomes large, typing these names one by one can be tedious, especially if all file names share common characteristics.
To address this, there's a series of "wildcard" characters to indicate the form of desired file names:
| Wildcard | Matches |
|----------|---------|
| `*` | Zero, one, or more characters |
| `?` | Exactly one character |
| `[...]` | One character from the list |
| `[!...]` or `[^...]` | One character NOT in the list (`[!...]` is the portable form) |
| `[a-z]` | One character in the range |
Each word of a Unix command line that uses these characters is replaced by the shell, before the command runs, with the list of existing file names matching the pattern.
These file name patterns are most often used with file manipulation commands like copying (`cp`), deletion (`rm`), or listing (`ls`). They're also frequently used in loops to launch the same command on an entire series of datasets.
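The sketch below shows a few of these patterns at work on a handful of empty files created in a temporary directory. All file names are hypothetical. `echo` is used simply to display what each pattern expands to.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch gis_vercors.dat gis_belledonne.dat gis_chartreuse.dat notes.txt run1.log run2.log

# '*' matches any sequence of characters
all_dat=$(echo *.dat)
# '?' matches exactly one character
runs=$(echo run?.log)
# '[bc]' matches one character among b and c
b_or_c=$(echo gis_[bc]*.dat)

echo "$all_dat"
echo "$runs"
echo "$b_or_c"

cd /
rm -rf "$workdir"
```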
The property that gives a Unix shell its full power is the standard input/output redirection system. Each process inherits three standard data streams from its parent:
- `stdin`: Standard input stream (file descriptor 0)
- `stdout`: Standard output stream (file descriptor 1)
- `stderr`: Standard error stream (file descriptor 2)
The `grep` command selects lines of text containing a pattern (`or` in this example) and copies them to standard output. Input redirection tells the process to read from `my_listing`, and output redirection saves results to `my_selection`.
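A minimal sketch of these redirections follows, using `<` for standard input, `>` for standard output, and `2>` for standard error. The content written to `my_listing` and the name `errors.log` are made up for the illustration.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Prepare a small listing, one name per line (hypothetical content)
printf '%s\n' bin cores home opt > my_listing

# Read from my_listing, write the selected lines to my_selection
grep or < my_listing > my_selection
selection=$(cat my_selection)
echo "$selection"

# Error messages travel on stderr and are redirected separately with 2>
ls "$workdir/does_not_exist" 2> errors.log || true
errors=$(cat errors.log)

cd /
rm -rf "$workdir"
```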
### Redirecting Output to Another Process (Pipes)
The most powerful redirection mode connects one process's standard output to another's standard input. The first program's results become the second's data. Data passes directly between processes without going through an intermediate file. This creates a "pipe" between processes.
```{mermaid}
flowchart LR
stdin1["stdin"] --> P1["ls /"]
P1 --> pipe["|<br/>pipe"]
pipe --> P2["grep or"]
P2 --> stdout2["stdout"]
P2 --> stderr2["stderr"]
stdout2 --> Screen[("Screen")]
stderr2 --> Screen
style pipe fill:#fff2cc
style stdout2 fill:#e1ffe1
style stderr2 fill:#ffe1e1
```
Syntactically, this is achieved by joining two or more commands with the `|` character, as in `ls / | grep or`.
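The pipeline of the diagram can be tried as is, although its result depends on what the root directory of the machine contains. The second part of this sketch therefore feeds a fixed, made-up list of names through a three-stage pipeline, with `tr` added as an extra stage for illustration.

```bash
# Names at the root of the file tree that contain "or"
# (grep exits with a non-zero status when nothing matches)
ls / | grep or || true

# Same idea on a fixed list, with a third process converting to upper case
matches=$(printf '%s\n' bin cores home opt | grep or | tr '[:lower:]' '[:upper:]')
echo "$matches"
```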
A computer's value lies in its ability to automatically perform repetitive calculation tasks. Users often find themselves needing to launch the same Unix command for calculations on multiple datasets. If each dataset is saved in a different file with coherent naming (e.g., `gis_vercors.dat`, `gis_belledonne.dat`, `gis_chartreuse.dat`), it's possible to leverage loop structures offered by Unix shells.
### Shell Variables
Working automatically and repetitively requires using variables to store useful, changing information at each iteration. For example, if your Unix command must read data from different files for each execution, you cannot write the file name in your command since it won't always be the same.
You already know environment variables, set up by the `export` command and used to store system configuration information. There are also simple variables, which let you store any information you deem necessary during your Unix session. They're set up with a simple assignment of the form `name=value`, with no spaces around the `=` sign.
To retrieve the value contained in a variable, precede its name with the `$` character.
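The sketch below sets a simple variable and an exported one, then reads them back with `$`. It also starts a child shell to show the practical difference between the two kinds: only the exported variable is passed on to other processes. The variable names and values are hypothetical.

```bash
# Simple variable: assignment without spaces around '='
dataset=gis_vercors.dat
echo "Processing $dataset"

plain_var=hidden              # simple variable, known to this shell only
export exported_var=visible   # environment variable, passed to child processes

# The single quotes delay expansion so that the child shell does it
child_view=$(sh -c 'echo "plain_var=[$plain_var] exported_var=[$exported_var]"')
echo "$child_view"
```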
### The `for` Loop
To solve our problem of repeating the same Unix command multiple times while working on different data files, we'll create a variable that takes each element of a list as its value in turn. In our case, this list will be a list of file names built with wildcard characters.
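A minimal sketch of such a loop follows. It recreates the three `gis_*.dat` files mentioned earlier in a temporary directory, with made-up content, and uses `head` as a placeholder for whatever command you actually need to run on each dataset.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Three small datasets with hypothetical content
echo "vercors data"    > gis_vercors.dat
echo "belledonne data" > gis_belledonne.dat
echo "chartreuse data" > gis_chartreuse.dat

processed=""
# The variable "file" takes each matching file name in turn
for file in gis_*.dat; do
    first_line=$(head -n 1 "$file")
    echo "$file starts with: $first_line"
    processed="$processed$file "
done

cd /
rm -rf "$workdir"
```

The pattern is expanded once, before the loop starts, so the files are visited in the sorted order produced by the shell.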
The commands presented here are a subset of all commands available by default on a Unix system. They're presented with a subset of their options. For a complete description of their functionality, refer to online help accessible via the `man` command.
Named after its authors (Aho, Weinberger, Kernighan), `awk` is a complete programming language. A full description is beyond this course's scope, but the language is thoroughly described in "The AWK Programming Language", written by its authors.