---
title: "Regular Expressions"
format:
  revealjs:
    theme: beige # slide theme
    transition: fade # transition effect between slides
---
## Regular Expressions
Pattern matching for text with variations
Example: `tot*o` matches:
- "to" + any number of "t" + "o"
- "too", "toto", "totto", "totttto", etc.
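
The pattern above can be tried out with a minimal sketch in Python, assuming the standard `re` module (the slides do not prescribe a language). The word list `candidates` is made up for illustration.

```python
import re

# Pattern from the slide: "to", then zero or more "t", then "o".
pattern = re.compile(r"tot*o")

# Hypothetical test words, chosen only for illustration.
candidates = ["too", "toto", "totto", "totttto", "tata"]

# fullmatch() succeeds only when the whole word fits the pattern.
matched = [word for word in candidates if pattern.fullmatch(word)]
print(matched)
```

Note that "too" is accepted as well, because `t*` also allows zero occurrences of "t".
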
---
## Basic Concepts
**Alphabet**: Set of allowed symbols
- DNA: {A, C, G, T}
- Text: {letters, digits, punctuation, ...}
**Text**: Sequence of symbols from alphabet
**Word**: Run of consecutive symbols within a text (a substring)
---
## Simple Regular Expression
`ATG` matches exactly "ATG"
```{mermaid}
graph LR
D((D)) -->|A| 1((1))
1 -->|T| 2((2))
2 -->|G| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
---
## Symbol Ambiguities
### Any Character: `.`
`.TG` matches:
- "ATG", "TTG", "GTG", "CTG", ...
```{mermaid}
graph LR
D((D)) -->|any| 1((1))
1 -->|T| 2((2))
2 -->|G| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
---
## Character Classes
`[ATG]TG` matches only:
- "ATG", "TTG", "GTG"
```{mermaid}
graph LR
D((D)) -->|A/T/G| 1((1))
1 -->|T| 2((2))
2 -->|G| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
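
To compare `.` with a character class, here is a small sketch using Python's `re` module (an assumption, as the slides name no language). The list `codons` is illustrative and includes "NTG" only to show a character outside the DNA alphabet.

```python
import re

# Illustrative three-letter words; "NTG" is not a real codon.
codons = ["ATG", "TTG", "GTG", "CTG", "NTG"]

any_first = re.compile(r".TG")        # any single character, then "TG"
class_first = re.compile(r"[ATG]TG")  # A, T or G, then "TG"

dot_hits = [c for c in codons if any_first.fullmatch(c)]
class_hits = [c for c in codons if class_first.fullmatch(c)]

print(dot_hits)    # every word in the list
print(class_hits)  # only the words starting with A, T or G
```
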
---
## Ranges and Negation
**Ranges**: `[A-Z]`, `[0-9]`, `[A-Za-z0-9]`
**Negation**: `[^A-Z]` (any single character except an uppercase letter)
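
A short sketch of ranges and negation, assuming Python's `re` module. The sample string `text` is invented for the example.

```python
import re

# Invented sample text mixing cases, digits and spaces.
text = "Gene BRCA2 maps to chr13"

uppercase = re.findall(r"[A-Z]", text)      # each uppercase letter
digits = re.findall(r"[0-9]", text)         # each digit
only_upper = re.sub(r"[^A-Z]", "", text)    # delete everything that is NOT uppercase

print(uppercase, digits, only_upper)
```
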
---
## Repetition: Zero or One
`ballons?` matches:
- "ballon" (singular)
- "ballons" (plural)
```{mermaid}
graph LR
D((D)) -->|b| 1((1))
1 -->|a| 2((2))
2 -->|l| 3((3))
3 -->|l| 4((4))
4 -->|o| 5((5))
5 -->|n| 6((6))
6 -->|s| F((F))
6 --> F
style D fill:#90EE90
style F fill:#FFC0CB
```
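
The optional final "s" can be checked with a sketch like the following, assuming Python's `re` module. The word list is hypothetical and includes two near-misses.

```python
import re

# "ballon" followed by an optional "s".
pattern = re.compile(r"ballons?")

# Hypothetical words: two valid forms and two that must be rejected.
words = ["ballon", "ballons", "ballonss", "ballosn"]

results = {word: bool(pattern.fullmatch(word)) for word in words}
print(results)
```
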
---
## Repetition: Zero or More
`TTA*TT` matches:
- "TTTT", "TTATT", "TTAAATT", ...
```{mermaid}
graph LR
D((D)) -->|T| 1((1))
1 -->|T| 2((2))
2 -->|A| 2
2 -->|T| 3((3))
3 -->|T| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
---
## Repetition: One or More
`TTA+TT` matches:
- "TTATT", "TTAATT", "TTAAATT", ...
- But NOT "TTTT"
```{mermaid}
graph LR
D((D)) -->|T| 1((1))
1 -->|T| 2((2))
2 -->|A| 3((3))
3 -->|A| 3
3 -->|T| 4((4))
4 -->|T| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
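
The difference between `*` (zero or more) and `+` (one or more) shows up on the word "TTTT". A minimal sketch, assuming Python's `re` module:

```python
import re

star = re.compile(r"TTA*TT")  # zero or more "A" between the two "TT"
plus = re.compile(r"TTA+TT")  # at least one "A" between the two "TT"

words = ["TTTT", "TTATT", "TTAAATT"]

star_hits = [w for w in words if star.fullmatch(w)]
plus_hits = [w for w in words if plus.fullmatch(w)]

print(star_hits)
print(plus_hits)
```
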
---
## Exact Repetition
`A{3,5}` matches:
- "AAA", "AAAA", "AAAAA"
`A{3}` matches exactly "AAA"
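
As an illustration, the sketch below tests runs of "A" of length 1 to 7 against both patterns. It assumes Python's `re` module; the variable names are arbitrary.

```python
import re

bounded = re.compile(r"A{3,5}")  # between 3 and 5 "A"
exact = re.compile(r"A{3}")      # exactly 3 "A"

# Runs "A", "AA", ..., "AAAAAAA"
runs = ["A" * n for n in range(1, 8)]

bounded_lengths = [len(r) for r in runs if bounded.fullmatch(r)]
exact_lengths = [len(r) for r in runs if exact.fullmatch(r)]

print(bounded_lengths, exact_lengths)
```
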
---
## Special Characters
- `^` - Start of line
- `$` - End of line
- `\n` - Newline
- `\t` - Tab
Examples:
- `^start` - "start" at beginning of line
- `end$` - "end" at end of line
- `^exact$` - "exact" as entire line
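
A sketch of the anchors in Python's `re` module (an assumed choice of tool). In Python, `^` and `$` match at every line boundary only when the `re.MULTILINE` flag is given; by default they anchor to the whole string. The sample `text` is invented.

```python
import re

# Four invented lines separated by newlines.
text = "start here\nnot the start\nthe end\nexact"

starts = re.findall(r"^start", text, flags=re.MULTILINE)
ends = re.findall(r"end$", text, flags=re.MULTILINE)
exact_lines = re.findall(r"^exact$", text, flags=re.MULTILINE)

# Without the flag, "$" only matches at the end of the whole string.
ends_no_flag = re.findall(r"end$", text)

print(starts, ends, exact_lines, ends_no_flag)
```
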
---
## Alternation
`papa|mama` matches either:
- "papa" OR "mama"
```{mermaid}
graph LR
D((D)) -->|p| 1((1))
1 -->|a| 2((2))
2 -->|p| 3((3))
3 -->|a| F1((F))
D -->|m| 4((4))
4 -->|a| 5((5))
5 -->|m| 6((6))
6 -->|a| F2((F))
style D fill:#90EE90
style F1 fill:#FFC0CB
style F2 fill:#FFC0CB
```
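
Alternation can be tried on running text with a sketch such as this one, assuming Python's `re` module and an invented sentence:

```python
import re

pattern = re.compile(r"papa|mama")

# Invented sentence; "grandpapa" also contains "papa".
text = "papa and mama met grandpapa"

# findall() returns every non-overlapping occurrence, left to right.
found = pattern.findall(text)
print(found)
```

Because the pattern is not anchored, it also finds "papa" inside "grandpapa".
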
---
## Grouping
`T(AA|AG|GA)` matches:
- "TAA", "TAG", "TGA"
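Without the parentheses, `TAA|AG|GA` is incorrect: it matches "TAA", "AG", or "GA", because `|` applies to everything on either side of it.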
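
The effect of the parentheses can be made visible with a small sketch, assuming Python's `re` module. The word list is illustrative.

```python
import re

grouped = re.compile(r"T(AA|AG|GA)")  # "T" followed by one of three endings
ungrouped = re.compile(r"TAA|AG|GA")  # three separate alternatives

words = ["TAA", "TAG", "TGA", "AG", "GA"]

grouped_hits = [w for w in words if grouped.fullmatch(w)]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]

print(grouped_hits)
print(ungrouped_hits)
```
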
---
## Backreferences
`([ACGT]{3})\1{9,}` matches:
- Any triplet repeated 10+ times
- Example: "CAGCAGCAGCAG..."
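
A hedged sketch of the backreference pattern, assuming Python's `re` module. Both sequences are synthetic: one has 12 copies of "CAG" with flanking bases, the other only 9 copies.

```python
import re

# One captured triplet, followed by at least 9 more copies of the same triplet.
repeat = re.compile(r"([ACGT]{3})\1{9,}")

# Synthetic sequences built for the example.
long_repeat = "TT" + "CAG" * 12 + "GGA"
short_repeat = "CAG" * 9

m = repeat.search(long_repeat)
unit = m.group(1)                # the repeated triplet
copies = len(m.group(0)) // 3    # total number of copies in the match
position = m.start()             # 0-based start of the repeat

too_short = repeat.search(short_repeat)  # fewer than 10 copies

print(unit, copies, position, too_short)
```
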
---
## Quick Reference
### Symbol Ambiguity
| Pattern | Matches |
|---------|---------|
| `.` | Any character |
| `[abc]` | a, b, or c |
| `[^abc]` | Not a, b, or c |
### Repetition
| Pattern | Matches |
|---------|---------|
| `?` | 0 or 1 |
| `*` | 0 or more |
| `+` | 1 or more |
| `{n,m}` | n to m times |
---
## Biological Application: Gene Finding
Find Coding Sequences (CDS) in bacterial DNA:
1. Start codon
2. Multiple non-stop codons
3. Stop codon
---
## Start Codons
Bacterial start: ATG, TTG, GTG
Pattern: `[ATG]TG`
```{mermaid}
graph LR
D((D)) -->|A/T/G| 1((1))
1 -->|T| 2((2))
2 -->|G| F((F))
style D fill:#90EE90
style F fill:#FFC0CB
```
---
## Stop Codons
Bacterial stop: TAA, TAG, TGA
Pattern: `T(A[AG]|GA)`
```{mermaid}
graph LR
D((D)) -->|T| 1((1))
1 -->|A| 2((2))
2 -->|A/G| F1((F))
1 -->|G| 3((3))
3 -->|A| F2((F))
style D fill:#90EE90
style F1 fill:#FFC0CB
style F2 fill:#FFC0CB
```
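
One way to check the stop-codon pattern is to run it against all 64 possible codons. This is a sketch assuming Python's `re` and `itertools` modules; the variable names are arbitrary.

```python
import re
from itertools import product

stop = re.compile(r"T(A[AG]|GA)")

# Enumerate all 4^3 = 64 codons over the DNA alphabet.
all_codons = ["".join(p) for p in product("ACGT", repeat=3)]

stops = sorted(c for c in all_codons if stop.fullmatch(c))
print(stops)
```
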
---
## Non-Stop Codons
61 codons that aren't stop codons
Pattern: `[ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT])`
```{mermaid}
graph LR
D((D)) -->|A/C/G| A1((1))
A1 -->|A/C/G/T| A2((2))
A2 -->|A/C/G/T| F1((F))
D -->|T| B1((3))
B1 -->|C/T| B2a((4))
B2a -->|A/C/G/T| F2((F))
B1 -->|G| B2b((5))
B2b -->|C/G/T| F3((F))
B1 -->|A| B2c((6))
B2c -->|C/T| F4((F))
style D fill:#90EE90
style F1 fill:#FFC0CB
style F2 fill:#FFC0CB
style F3 fill:#FFC0CB
style F4 fill:#FFC0CB
```
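
The count of 61 can be verified by brute force over all 64 codons. A minimal sketch, assuming Python's `re` and `itertools` modules:

```python
import re
from itertools import product

# Non-stop codon pattern copied from the slide.
non_stop = re.compile(r"[ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT])")

# Enumerate all 64 codons over the DNA alphabet.
all_codons = ["".join(p) for p in product("ACGT", repeat=3)]

accepted = [c for c in all_codons if non_stop.fullmatch(c)]
rejected = sorted(set(all_codons) - set(accepted))

print(len(accepted))  # number of codons the pattern accepts
print(rejected)       # codons the pattern refuses
```
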
---
## Complete CDS Pattern
`[ATG]TG([ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT]))+T(A[GA]|GA)`
- Start codon
- 1+ non-stop codons
- Stop codon
```{mermaid}
graph LR
D((D)) -->|A/T/G| 1((1))
1 -->|T| 2((2))
2 -->|G| 3((3))
3 --> 4((4))
4 -->|A/C/G| 5((5))
5 -->|A/C/G/T| 6((6))
6 -->|A/C/G/T| 7((7))
4 -->|T| 8((8))
8 -->|C/T| 9((9))
9 -->|A/C/G/T| 7
8 -->|G| 10((10))
10 -->|C/G/T| 7
8 -->|A| 11((11))
11 -->|C/T| 7
7 --> 4
7 -->|T| 12((12))
12 -->|A| 13((13))
13 -->|A/G| F((F))
12 -->|G| 14((14))
14 -->|A| F
style D fill:#90EE90
style F fill:#FFC0CB
```
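
The complete pattern can be applied to a sequence as sketched below, assuming Python's `re` module. The sequence `seq` is a toy example, and the search only covers the strand as given (not the reverse complement).

```python
import re

# Complete CDS pattern copied from the slide.
cds = re.compile(
    r"[ATG]TG([ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT]))+T(A[GA]|GA)"
)

# Toy sequence: 2 flanking bases, start, 2 non-stop codons, stop, 2 flanking bases.
seq = "CC" + "ATG" + "AAA" + "CCC" + "TAA" + "GG"

m = cds.search(seq)
match_text = m.group(0)
start = m.start()  # 0-based position of the start codon

# Split the match into codons, reading in frame from the start codon.
codons = [match_text[i:i + 3] for i in range(0, len(match_text), 3)]
print(start, codons)
```
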
---
## Minimum Length CDS
`[ATG]TG([ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT])){99,}T(A[GA]|GA)`
Requires at least 100 amino acids (start codon + 99 non-stop codons; the stop codon is not translated)
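
The length constraint can be illustrated with two synthetic sequences, one just long enough and one a codon too short. This is a sketch assuming Python's `re` module; "GCT" is an arbitrary non-stop codon chosen as filler.

```python
import re

# Minimum-length CDS pattern copied from the slide.
long_cds = re.compile(
    r"[ATG]TG([ACG][ACGT][ACGT]|T([CT][ACGT]|G[CGT]|A[CT])){99,}T(A[GA]|GA)"
)

# Synthetic sequences: start codon, N filler codons, stop codon.
long_enough = "ATG" + "GCT" * 99 + "TAA"
too_short = "ATG" + "GCT" * 98 + "TAA"

m = long_cds.fullmatch(long_enough)
# Codons before the stop codon, i.e. the number of encoded amino acids.
amino_acids = len(m.group(0)) // 3 - 1

rejected = long_cds.search(too_short)  # no match expected
print(amino_acids, rejected)
```
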
---
## Summary
- Regular expressions = text patterns
- Symbol ambiguity: `.`, `[]`, `[^]`
- Repetition: `?`, `*`, `+`, `{}`
- Special chars: `^`, `$`, `\n`, `\t`
- Powerful for biological sequence analysis