obitools4/autodoc/prompt_examples.md
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00


# Task
Given `autodoc/cmd/obi{xxx}.md`, produce:
1. `autodoc/examples/obi{xxx}/` — a directory containing synthetic input sequence files
that allow every example in the EXAMPLES section to be executed and validated.
2. An updated `autodoc/cmd/obi{xxx}.md` — with corrected EXAMPLES and an enriched OUTPUT
section describing observed output annotations.
---
## TOOL CALL FORMAT — enforce before every call
A tool call is exactly:
<function=tool_name>
{"param": "value"}
</function>
Rules (no exceptions):
- `<` is immediately followed by `f` — zero spaces, zero characters in between.
- Parameters are a **single JSON object** — no XML tags, no `<parameter=...>`, no `</parameter>`.
- No outer wrapper — never use `<tool_call>`, `<tool_use>`, or any other enclosing tag.
- Tool name is written exactly as declared, case included (e.g. `Read`, `Write`, `Bash`, `WebFetch`).
---
## HALLUCINATION GUARD
Every sequence designed in STATE 2 must be biologically valid for the command being
tested. Derive sequence content from the OPTIONS and OUTPUT sections of `$doc` — never
invent behaviour not described there.
**EXECUTION GUARD — critical:** The `## Observed output example` subsection added in
STATE 5 MUST contain verbatim bytes from `$outputs` (actual tool output read in STATE 4).
It MUST NOT be invented or approximated. If no command succeeded, omit the subsection
entirely rather than writing invented content.
---
## DOCUMENT PRESERVATION — critical
The output of STATE 5 is `$doc` with **surgical edits only**. The rules are:
- Copy the ENTIRE content of `$doc` verbatim into the new file.
- Apply ONLY the three modifications described in STATE 5 (EXAMPLES update,
prose corrections, OUTPUT subsection addition).
- Do NOT reformat, reorder, rewrite, or restructure any heading, paragraph,
option list, or prose from `$doc` **unless it is factually contradicted by
actual execution results** (see Modification 2 in STATE 5).
- Do NOT add new top-level sections (no ENVIRONMENT VARIABLES, no duplicate OUTPUT, etc.).
- Do NOT change section title casing, Markdown heading levels, or list syntax.
- If in doubt, leave the section exactly as it appears in `$doc`.
---
## FASTQ FORMAT — mandatory structure
A valid FASTQ record is **exactly 4 lines** in this order:
```
@<identifier> <optional description>
<nucleotide sequence> ← MUST be non-empty (≥ 10 characters, A/T/G/C only)
+
<quality string> ← MUST be the exact same length as the sequence line
```
Common mistakes that are **forbidden**:
- Writing `@header\n+\nquality` with the sequence line missing.
- Writing a quality string shorter or longer than the sequence.
- Mixing `>` (FASTA) and `@` (FASTQ) headers in the same file.
- Writing `~`-separated fields (e.g. `@seq002~description~here`) — use a space.
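
The rules above can be checked mechanically. The following is a minimal sketch in Python, not part of OBITools4; the function name `check_fastq_record` is hypothetical, and the `~` test is only an approximation of the forbidden `~`-separated header style.

```python
def check_fastq_record(lines):
    """Return a list of problems found in one 4-line FASTQ record."""
    if len(lines) != 4:
        return [f"expected 4 lines, got {len(lines)}"]
    header, seq, plus, qual = lines
    problems = []
    if not header.startswith("@") or len(header) < 2:
        problems.append("header must start with @ followed by an identifier")
    if "~" in header:
        problems.append("use a space, not ~, between identifier and description")
    if len(seq) < 10 or set(seq) - set("ATGC"):
        problems.append("sequence must be >= 10 characters, A/T/G/C only")
    if not plus.startswith("+"):
        problems.append("third line must start with +")
    if len(qual) != len(seq):
        problems.append("quality length differs from sequence length")
    return problems


good = ["@seq001 valid read", "ATGCATGCATGC", "+", "IIIIIIIIIIII"]
bad = ["@seq002~no~space", "ATGCATGCATGC", "+", "IIII"]
print(check_fastq_record(good))  # no problems expected
print(check_fastq_record(bad))   # header style and quality length are reported
```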
---
## OUTPUT FORMAT GUARD
OBITools4 determines the output format from the **data content and explicit flags**,
**not from the output filename extension**. A file named `out.fasta` will contain FASTQ
if quality scores are present and no `--fasta-output` flag is given.
Rules when designing examples:
- If the example is meant to produce FASTA output from FASTQ input, the command MUST
include `--fasta-output`.
- If the example is meant to produce FASTQ output from FASTA input, the command MUST
include `--fastq-output`.
- Never assume an output format from the filename alone.
- Verify the actual format of each output file in Step 3c of STATE 3 by checking its first
character (`>` = FASTA, `@` = FASTQ, `[` or `{` = JSON).
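
The first-character rule can be sketched in a few lines of Python. This is an illustration only; `detect_format` is a hypothetical helper name, and it assumes the output file content is already available as a string.

```python
def detect_format(text):
    """Classify output content by its first character, as in the rule above."""
    first = text[:1]
    if first == ">":
        return "FASTA"
    if first == "@":
        return "FASTQ"
    if first in ("[", "{"):
        return "JSON"
    return "UNKNOWN"


# A file named out.fasta may still hold FASTQ content:
print(detect_format("@seq001\nATGCATGCAT\n+\nIIIIIIIIII\n"))
```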
---
## OPTION VALIDATION GUARD
Before writing any example command in STATE 2, explicitly cross-check each option
against the OPTIONS section of `$doc`:
- Every flag used must appear in the OPTIONS section with the claimed semantics.
- Input-format flags (`--fasta`, `--fastq`, `--csv`, `--genbank`, `--embl`,
`--ecopcr`) tell the tool how to **read** the input. They do NOT affect the
output format.
- Output-format flags (`--fasta-output`, `--fastq-output`, `--json-output`) tell
the tool what format to **write**. If there is no `--csv-output` (or similar) in
the OPTIONS section, do NOT write an example claiming CSV output.
- If an option needed for a working example is absent from `$doc`, mark that example
as SKIP rather than inventing a flag.
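
As a rough aid for this cross-check, the sketch below lists long flags used in a command that never appear in the OPTIONS text. It is an assumption-laden illustration: `undocumented_flags` is a hypothetical name, the regular expression only recognises `--long-form` flags, and it checks presence only, not the claimed semantics.

```python
import re

# Matches long flags such as --fasta-output; short flags are ignored here.
FLAG = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def undocumented_flags(command, options_text):
    """Return flags used in `command` that do not occur in `options_text`."""
    documented = set(FLAG.findall(options_text))
    return [flag for flag in FLAG.findall(command) if flag not in documented]


options_text = "--fastq  read FASTQ\n--fasta-output  write FASTA\n--fastq-output  write FASTQ"
command = "obi{xxx} --fastq --csv-output reads.fastq"
print(undocumented_flags(command, options_text))  # flags that would force a SKIP
```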
---
## ANNOTATION RULES — CRITICAL
When creating FASTA/FASTQ files with annotations:
- Use **only** valid annotation attribute names: `taxid`, `scientific_name`, `rank`, `definition`, `sample`, `run_id`, `instrument`
- For taxonomy data: use `taxid` (NCBI Taxonomy ID) and `scientific_name` — never invent taxids
- Examples of valid taxonomy annotations:
  - `>seq001 {"taxid":2}`: Bacteria (valid NCBI taxid)
  - `>seq002 {"taxid":2157,"scientific_name":"Archaea"}`: Archaea (valid NCBI taxid)
  - `>seq003 {"taxid":2759,"scientific_name":"Eukaryota"}`: Eukaryota (valid NCBI taxid)
- NEVER use invented taxids
- **Map attributes** (JSON maps) must have names ending with `_merged` (e.g., `taxid_merged`, `sample_merged`)
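
The attribute-name rules can be checked with a small script. The sketch below is hypothetical (`check_header` is not an OBITools4 function) and assumes the JSON annotation immediately follows the identifier and a single space.

```python
import json

ALLOWED = {"taxid", "scientific_name", "rank", "definition",
           "sample", "run_id", "instrument"}


def check_header(header):
    """Return problems found in the JSON annotation of a FASTA/FASTQ header."""
    _, _, rest = header.partition(" ")
    if not rest.startswith("{"):
        return []  # no JSON annotation on this header
    # raw_decode tolerates free text after the JSON object
    annotations, _ = json.JSONDecoder().raw_decode(rest)
    problems = []
    for key, value in annotations.items():
        if isinstance(value, dict):
            if not key.endswith("_merged"):
                problems.append(f"map attribute {key!r} must end with _merged")
        elif key not in ALLOWED:
            problems.append(f"unknown attribute {key!r}")
    return problems


print(check_header('>seq002 {"taxid":2157,"scientific_name":"Archaea"}'))
print(check_header('>seq004 {"samples":{"A":1},"colour":"red"}'))
```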
---
## CSV FILES FOR JOINS
When creating CSV files for `obijoin`:
- Do NOT include a sequence ID column in the CSV; the join key (here `taxid`) is specified separately via `--by`
- The CSV format is auto-detected; do NOT use `--csv` flag
- Example CSV structure for taxid join:
```
taxid,scientific_name,phylum
2,Bacteria,Proteobacteria
2157,Archaea,Euryarchaeota
2759,Eukaryota,Arthropoda
```
- Example command: `obijoin --join-with taxonomy.csv --by taxid sequences.fasta`
---
## PIPELINE
Execute the five states below in order. Do not skip states. Do not merge states.
---
### STATE 1 — Read the documentation file and fetch pipeline command docs
**Input:** nothing.
**Action:**
Step 1a — read the autodoc file:
```
<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md"}
</function>
```
Step 1b — scan the EXAMPLES section of the file just read for any `obi*` commands
other than `obi{xxx}` itself that appear in pipeline examples (e.g. `obigrep`, `obiuniq`,
`obiclean`). For each such command found, emit a WebFetch call to retrieve its online
documentation. Because the command names come from the file read in Step 1a, emit these
calls immediately after that Read completes, all in a single parallel message:
```
<function=WebFetch>
{"url": "https://obitools4.metabarcoding.org/obitools/obi<other>/"}
</function>
```
If the page returns a 404 or error, store an empty string for that command.
**Output:** store content as `$doc`, and store fetched pages as `$pipeline_docs`
(a map from command name to page content).
**Stop.** Do not interpret or summarise. Proceed to STATE 2.
---
### STATE 2 — Analyse examples and design input files
**Input:** `$doc`.
**Action (no tool calls):**
1. Extract every example command from the EXAMPLES section of `$doc`.
- Identify every distinct input filename referenced (e.g. `sequences.fasta`,
`reads_R1.fastq`, …).
- Identify every option used and verify each against the OPTIONS section (see
OPTION VALIDATION GUARD above).
- For any `obi*` command used in a pipeline (not `obi{xxx}` itself), verify its
flags and expression syntax against `$pipeline_docs`. If `$pipeline_docs` for
that command is empty (page not found), mark the example as SKIP rather than
guessing the syntax.
- **Coverage check — command-specific options:** list all command-specific options
from the OPTIONS section (excluding those covered by standard option-sets: `--fasta`,
`--fastq`, `--out`, `--compress`, `--max-cpu`, etc.). Verify that every such option
appears in at least one non-skipped example. If any option is not covered, **add an
additional example** that exercises it before proceeding.
- **Skip any example that requires an external resource** (taxonomy database,
remote URL, pre-existing output file from a previous step not produced here).
Mark it as SKIP — it will be kept verbatim in the EXAMPLES section without
a `**Expected output:**` annotation.
- **`--paired-with` examples:** `--paired-with` requires `--out` (standard output
cannot be used). The command produces TWO output files named `<prefix>_R1.ext`
and `<prefix>_R2.ext` where `<prefix>` is the stem of the value given to `--out`
and `.ext` is the format extension. For example:
`obi{xxx} --paired-with reverse.fastq --out out_paired.fastq forward.fastq`
produces `out_paired_R1.fastq` and `out_paired_R2.fastq`.
Do NOT use `>` redirection for paired-with examples — use `--out` only.
In STATE 4, read both `_R1` and `_R2` output files.
2. For each distinct input filename, design synthetic sequence content that:
- Is **minimal** (≤ 20 sequences, each ≤ 300 bp).
- Contains sequences that **will** produce output for the given command (positive cases).
- Contains at least one sequence that **will not** produce output, to confirm filtering
(negative case), when the command filters sequences.
- Exercises every option combination present in the non-skipped examples.
- Uses realistic-looking identifiers (`seq001`, `seq002`, …) and a short
definition that describes what makes the sequence relevant to the test.
3. **File format rules (strictly enforced):**
**FASTA:** one `>id description` header line, then the sequence on one or more
lines (60 bp per line). Every sequence must be non-empty (≥ 10 bp, A/T/G/C only).
**FASTQ:** exactly 4 lines per record — see FASTQ FORMAT section above.
Before finalising the FASTQ content, mentally verify each record:
- Line 1 starts with `@`, has an identifier, optionally a space and description.
- Line 2 is the nucleotide sequence (non-empty, ≥ 10 characters).
- Line 3 is exactly `+` (nothing else required).
- Line 4 is the quality string with **exactly the same number of characters**
as line 2.
If any record fails this check, fix it before proceeding.
4. Rewrite every non-skipped example command into two forms:
- `$cmds_doc`: the bare command as it will appear in the documentation — references
only filenames present in `autodoc/examples/obi{xxx}/`, output redirected to a
descriptive filename (e.g. `out_default.fasta`). **No `cd` prefix.**
- `$cmds_run`: the same command prefixed with a `cd` into the examples directory so it can be executed:
`cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} &&`
**Output:** store the file designs as `$files` and the two command forms as `$cmds_doc` and `$cmds_run`.
**Stop.** Proceed to STATE 3.
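
The `--paired-with` naming rule described in step 1 can be sketched as follows. This is an illustration, not OBITools4 code: `paired_outputs` is a hypothetical helper, and it assumes the suffix of the `--out` value coincides with the format extension, as in the `out_paired.fastq` example.

```python
from pathlib import Path


def paired_outputs(out_value):
    """Derive the two expected output names from the value given to --out."""
    path = Path(out_value)
    r1 = path.with_name(f"{path.stem}_R1{path.suffix}")
    r2 = path.with_name(f"{path.stem}_R2{path.suffix}")
    return str(r1), str(r2)


print(paired_outputs("out_paired.fastq"))
```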
---
### STATE 3 — Write input files, validate them, and run examples
**Input:** `$files`, `$cmds_doc`, `$cmds_run`.
**Step 3a — create input files (parallel):**
Emit one Write call per input file designed in STATE 2.
```
<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/FILENAME", "content": "..."}
</function>
```
**Stop.** Wait for all writes to complete. Then proceed to Step 3b.
**Step 3b — validate input files:**
Before running any example, emit one Bash call that checks every written input file:
```
<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && python3 -c \"\nimport sys\n# Filenames are passed as arguments after the script (sys.argv[1:])\nfor fname in sys.argv[1:]:\n    lines = open(fname).read().splitlines()\n    assert lines, f'{fname}: file is empty'\n    if fname.endswith('.fastq'):\n        assert len(lines) % 4 == 0, f'{fname}: line count not multiple of 4'\n        for i in range(0, len(lines), 4):\n            hdr, seq, plus, qual = lines[i:i+4]\n            assert hdr.startswith('@'), f'{fname} record {i//4+1}: header must start with @'\n            assert plus.startswith('+'), f'{fname} record {i//4+1}: third line must start with +'\n            assert len(seq) >= 10, f'{fname} record {i//4+1}: sequence too short ({len(seq)})'\n            assert len(seq) == len(qual), f'{fname} record {i//4+1}: seq len {len(seq)} != qual len {len(qual)}'\n    elif fname.endswith(('.fasta', '.fa')):\n        assert lines[0].startswith('>'), f'{fname}: first line must start with >'\nprint('All input files valid')\n\" FILENAMES 2>&1; echo EXIT:$?"}
</function>
```
If validation fails (EXIT non-zero or output is not `All input files valid`): fix the
offending file(s) with new Write calls, then re-run validation. Do NOT proceed to
Step 3c until validation passes.
**Step 3c — run examples (sequential, one Bash call at a time):**
Emit ONE Bash call, wait for the result, then emit the next. Do NOT batch them.
```
<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && COMMAND 2>&1; echo EXIT:$?"}
</function>
```
After each successful run (EXIT:0), immediately verify the output file was actually
created and is non-empty with a second Bash call:
```
<function=Bash>
{"command": "ls -la /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE && head -c 100 /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>
```
Also verify the output format matches expectation using the first character rule
(see OUTPUT FORMAT GUARD): `>` = FASTA, `@` = FASTQ, `[`/`{` = JSON. If the format
is wrong, add the missing `--fasta-output` / `--fastq-output` / `--json-output` flag,
update `$cmds_doc` and `$cmds_run`, and re-run.
For each command, record in `$runs`:
- The `$cmds_doc` form (bare command for documentation).
- Exit code.
- The output filename(s).
- The confirmed output format (FASTA / FASTQ / JSON).
- The full stdout/stderr text.
If a command fails (EXIT non-zero): diagnose the error from stderr, fix the command,
update both `$cmds_doc` and `$cmds_run`, and re-run.
Do NOT proceed to STATE 4 until all non-skipped commands have EXIT:0 and verified
non-empty output files.
**Output:** store per-command results as `$runs`.
**Stop.** Proceed to STATE 4.
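
Every Bash call in Step 3c ends with `echo EXIT:$?`, so the exit code can be recovered from the captured text. The sketch below shows one way to separate the two; `split_exit` is a hypothetical helper, and it assumes the `EXIT:` marker is the last thing printed.

```python
import re


def split_exit(captured):
    """Split 'COMMAND 2>&1; echo EXIT:$?' output into (text, exit_code)."""
    match = re.search(r"EXIT:(\d+)\s*$", captured)
    if match is None:
        return captured, None  # marker missing: the call itself went wrong
    return captured[:match.start()], int(match.group(1))


text, code = split_exit("some log line\nEXIT:0\n")
print(repr(text), code)
```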
---
### STATE 4 — Read output files
**Input:** `$runs` (output file paths from STATE 3).
**Action:** emit one Read call per output file that was successfully produced (EXIT:0).
```
<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>
```
Emit all reads in a single parallel message.
**Output:** store contents as `$outputs`.
**Stop.** Proceed to STATE 5.
---
### STATE 5 — Update the documentation file
**Input:** `$doc`, `$runs`, `$outputs`, `$cmds_doc`.
Produce the updated file by copying `$doc` **verbatim** and applying ONLY the
three modifications below. Re-read the DOCUMENT PRESERVATION rules at the top before
writing.
#### Modification 1 — EXAMPLES section
For each non-skipped example:
- Replace the original command with the rewritten `$cmds_doc` form.
- Keep the one-line biological use-case comment above the code block unchanged.
- The `**Expected output:**` annotation goes on its own line **after** the closing
triple-backtick of the code block, never inside it:
````
```bash
obi{xxx} [options] input_file > out_name.fasta
```
**Expected output:** N sequences written to `out_name.fasta`.
````
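where N is the number of sequence records in the corresponding `$outputs` entry:
the count of lines starting with `>` for FASTA, or the line count divided by 4 for
FASTQ (a quality line may itself start with `@`).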
For skipped examples: keep them exactly as they are in `$doc` with no annotation.
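
Computing N deserves some care: in FASTQ, a quality line can itself begin with `@`, so counting every `@` line may overcount. The sketch below counts 4-line groups instead; `count_records` is a hypothetical helper and assumes FASTQ output is written as strict 4-line records.

```python
def count_records(text):
    """Count sequence records in FASTA or 4-line FASTQ content."""
    lines = text.splitlines()
    if text.startswith(">"):
        return sum(1 for line in lines if line.startswith(">"))
    if text.startswith("@"):
        return len(lines) // 4
    raise ValueError("content is neither FASTA nor FASTQ")


fastq = "@seq001\nATGCATGCAT\n+\n@IIIIIIIII\n"  # quality line starts with @
print(count_records(fastq))
```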
#### Modification 2 — Prose corrections (DESCRIPTION, OPTIONS, NOTES, …)
After completing all runs in STATE 3, compare `$runs` and `$outputs` against the
prose in `$doc` outside the EXAMPLES section. For each **factual contradiction**
found — where the documentation claims a behaviour that actual execution disproves —
apply a minimal correction:
- Fix only the specific sentence or phrase that is wrong. Do not rewrite the
surrounding paragraph.
- Preserve the original wording as much as possible; change only what is incorrect.
- Examples of things to correct:
- An option described as producing output X when it actually produces output Y.
- A default value stated incorrectly.
- An attribute name that differs from what appears in actual output.
- A claim about which sequences are selected/discarded that contradicts observed results.
- An output format claimed by the documentation that differs from the actual output
format observed (e.g. claiming CSV output when the tool produces FASTA).
- After each corrected passage, add an inline HTML comment documenting the fix:
`<!-- corrected: <brief reason, e.g. "actual output is FASTA not CSV"> -->`
- Do NOT "improve" text that is merely incomplete or imprecise — only fix outright
contradictions with observed behaviour.
#### Modification 3 — OUTPUT section
Find the existing `# OUTPUT` section in `$doc`. At the very end of that section
(before the next `---` or `#` heading), append a single new subsection:
````markdown
## Observed output example
```
<verbatim excerpt: first ≤ 10 sequences from the first successful $outputs entry>
```
````
Rules:
- The excerpt is copied byte-for-byte from `$outputs`. No editing, no truncation
within a sequence record.
- Do NOT duplicate the OUTPUT section. There must be exactly one `# OUTPUT` heading
in the resulting file.
- If no output was successfully produced, omit this subsection entirely.
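
Taking the first ≤ 10 records without cutting inside a record can be sketched as below. `first_records` is a hypothetical helper; it handles FASTA and 4-line FASTQ only and returns anything else (JSON, for instance) unchanged.

```python
def first_records(text, limit=10):
    """Return the first `limit` whole records of FASTA/FASTQ text, unmodified."""
    lines = text.splitlines(keepends=True)
    if text.startswith("@"):
        return "".join(lines[: 4 * limit])
    starts = [i for i, line in enumerate(lines) if line.startswith(">")]
    if len(starts) <= limit:
        return text  # already short enough, or not FASTA at all
    return "".join(lines[: starts[limit]])


fasta = ">a\nATGC\nATGC\n>b\nGGCC\n>c\nTTAA\n"
print(first_records(fasta, limit=2))
```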
#### Final write
```
<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md", "content": "..."}
</function>
```
**Stop. Do not emit any text after the Write call.**