Task
Given autodoc/cmd/obi{xxx}.md, produce:
- `autodoc/examples/obi{xxx}/`: a directory containing synthetic input sequence files that allow every example in the EXAMPLES section to be executed and validated.
- An updated `autodoc/cmd/obi{xxx}.md`, with corrected EXAMPLES and an enriched OUTPUT section describing observed output annotations.
TOOL CALL FORMAT — enforce before every call
A tool call is exactly:
<function=tool_name>
{"param": "value"}
</function>
Rules (no exceptions):
- `<` is immediately followed by `f`: zero spaces, zero characters in between.
- Parameters are a single JSON object: no XML tags, no `<parameter=...>`, no `</parameter>`.
- No outer wrapper: never use `<tool_call>`, `<tool_use>`, or any other enclosing tag.
- Tool name is lowercase with double underscores.
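The rules above can be checked mechanically before a call is emitted. The following is a minimal sketch, not part of the pipeline itself: `check_tool_call` is a hypothetical helper name, and it only covers the structural rules (opening tag, single JSON object, no wrapper tags). The tool-naming rule is left out because it depends on the tool set in use.

```python
import json
import re

# Hypothetical checker for the tool call format described above.
TOOL_CALL_RE = re.compile(r"\A<function=([^>\s]+)>\n(.*)\n</function>\Z", re.DOTALL)
FORBIDDEN_TAGS = ("<tool_call>", "<tool_use>", "<parameter=", "</parameter>")


def check_tool_call(text: str) -> list[str]:
    """Return a list of rule violations; an empty list means the call looks well-formed."""
    problems = []
    for tag in FORBIDDEN_TAGS:
        if tag in text:
            problems.append(f"forbidden tag present: {tag}")
    match = TOOL_CALL_RE.match(text.strip())
    if match is None:
        problems.append("expected <function=NAME>, a JSON object, and </function> on separate lines")
        return problems
    try:
        params = json.loads(match.group(2))
    except json.JSONDecodeError as exc:
        problems.append(f"parameters are not valid JSON: {exc.msg}")
    else:
        if not isinstance(params, dict):
            problems.append("parameters must be a single JSON object")
    return problems
```

An empty result means none of the structural rules were violated; it does not prove that the parameters are meaningful for the tool being called.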
HALLUCINATION GUARD
Every sequence written in STATE 2 must be biologically valid for the command being
tested. Derive sequence content from the OPTIONS and OUTPUT sections of $doc — never
invent behaviour not described there.
EXECUTION GUARD — critical: The ## Observed output example subsection added in
STATE 5 MUST contain verbatim bytes from $outputs (actual tool output read in STATE 4).
It MUST NOT be invented or approximated. If no command succeeded, omit the subsection
entirely rather than writing invented content.
DOCUMENT PRESERVATION — critical
The output of STATE 5 is $doc with surgical edits only. The rules are:
- Copy the ENTIRE content of `$doc` verbatim into the new file.
- Apply ONLY the three modifications described in STATE 5 (EXAMPLES update, prose corrections, OUTPUT subsection addition).
- Do NOT reformat, reorder, rewrite, or restructure any heading, paragraph, option list, or prose from `$doc` unless it is factually contradicted by actual execution results (see Modification 2 in STATE 5).
- Do NOT add new top-level sections (no ENVIRONMENT VARIABLES, no duplicate OUTPUT, etc.).
- Do NOT change section title casing, Markdown heading levels, or list syntax.
- If in doubt, leave the section exactly as it appears in `$doc`.
FASTQ FORMAT — mandatory structure
A valid FASTQ record is exactly 4 lines in this order:
@<identifier> <optional description>
<nucleotide sequence> ← MUST be non-empty (≥ 10 characters, A/T/G/C only)
+
<quality string> ← MUST be the exact same length as the sequence line
Common mistakes that are forbidden:
- Writing `@header\n+\nquality` with the sequence line missing.
- Writing a quality string shorter or longer than the sequence.
- Mixing `>` (FASTA) and `@` (FASTQ) headers in the same file.
- Writing `~`-separated fields (e.g. `@seq002~description~here`): use a space.
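One way to avoid these mistakes is to generate each record from its parts instead of typing the four lines by hand. The sketch below assumes a constant quality character (`I`, Phred 40 in the Sanger/Illumina 1.8+ encoding); `fastq_record` is a hypothetical helper name, not an OBITools4 function.

```python
def fastq_record(identifier: str, sequence: str, description: str = "",
                 quality_char: str = "I") -> str:
    """Build one 4-line FASTQ record whose quality string matches the sequence length."""
    if len(sequence) < 10 or set(sequence) - set("ATGC"):
        raise ValueError("sequence must be at least 10 characters, A/T/G/C only")
    if "~" in identifier or " " in identifier:
        raise ValueError("identifier must not contain '~' or spaces")
    # Identifier and description are separated by a single space.
    header = f"@{identifier} {description}".rstrip()
    quality = quality_char * len(sequence)
    return f"{header}\n{sequence}\n+\n{quality}\n"
```

Concatenating the strings returned for each read gives a file in which every record has exactly four lines and matching sequence and quality lengths.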
OUTPUT FORMAT GUARD
OBITools4 determines the output format from the data content and explicit flags,
not from the output filename extension. A file named out.fasta will contain FASTQ
if quality scores are present and no --fasta-output flag is given.
Rules when designing examples:
- If the example is meant to produce FASTA output from FASTQ input, the command MUST include `--fasta-output`.
- If the example is meant to produce FASTQ output from FASTA input, the command MUST include `--fastq-output`.
- Never assume an output format from the filename alone.
- Verify the actual format of each output file in STATE 3 (Step 3c) by checking its first character (`>` = FASTA, `@` = FASTQ, `[` or `{` = JSON).
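The first-character rule is simple enough to express as a lookup. This is a sketch only: `detect_format` is a hypothetical name, it expects the first bytes of an uncompressed output file, and it reports `UNKNOWN` for anything else (including compressed output).

```python
def detect_format(head: bytes) -> str:
    """Classify a sequence file from the first non-whitespace byte of its content."""
    first = head.lstrip()[:1]
    return {
        b">": "FASTA",
        b"@": "FASTQ",
        b"[": "JSON",
        b"{": "JSON",
    }.get(first, "UNKNOWN")
```

For example, `detect_format(open("out.fasta", "rb").read(64))` would return `FASTQ` when the file holds FASTQ records despite its `.fasta` extension.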
OPTION VALIDATION GUARD
Before writing any example command in STATE 2, explicitly cross-check each option
against the OPTIONS section of $doc:
- Every flag used must appear in the OPTIONS section with the claimed semantics.
- Input-format flags (`--fasta`, `--fastq`, `--csv`, `--genbank`, `--embl`, `--ecopcr`) tell the tool how to read the input. They do NOT affect the output format.
- Output-format flags (`--fasta-output`, `--fastq-output`, `--json-output`) tell the tool what format to write. If there is no `--csv-output` (or similar) in the OPTIONS section, do NOT write an example claiming CSV output.
- If an option needed for a working example is absent from `$doc`, mark that example as SKIP rather than inventing a flag.
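The presence half of this cross-check can be sketched in a few lines. The helper below is hypothetical (`undocumented_flags` is not part of any tool) and only tests whether each long flag of a command appears somewhere in the OPTIONS text; short flags and the claimed semantics still have to be verified by reading the section.

```python
import re
import shlex

# Long options such as --fasta-output; short options (-o) are ignored in this sketch.
FLAG_RE = re.compile(r"(?<![\w-])--[a-z0-9][a-z0-9-]*")


def undocumented_flags(command: str, options_section: str) -> list[str]:
    """Return the long flags used in `command` that never appear in `options_section`."""
    documented = set(FLAG_RE.findall(options_section))
    used = [token.split("=", 1)[0]
            for token in shlex.split(command)
            if token.startswith("--")]
    return sorted({flag for flag in used if flag not in documented})
```

A non-empty result means the example should be marked SKIP or rewritten without the missing flag.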
ANNOTATION RULES — CRITICAL
When creating FASTA/FASTQ files with annotations:
- Use only valid annotation attribute names: `taxid`, `scientific_name`, `rank`, `definition`, `sample`, `run_id`, `instrument`
- For taxonomy data: use `taxid` (NCBI Taxonomy ID) and `scientific_name`; never invent taxids
- Examples of valid taxonomy annotations:
  - `>seq001 {"taxid":2}`: Bacteria (valid NCBI taxid)
  - `>seq002 {"taxid":2157,"scientific_name":"Archaea"}`: Archaea (valid NCBI taxid)
  - `>seq003 {"taxid":2759,"scientific_name":"Eukaryota"}`: Eukaryota (valid NCBI taxid)
- NEVER use invented taxids
- Map attributes (JSON maps) must have names ending with `_merged` (e.g., `taxid_merged`, `sample_merged`)
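A header in this style can be produced with `json.dumps`, which keeps the annotation syntactically valid. The sketch assumes that a `_merged` attribute is acceptable when its base name is in the allowed list; that reading of the rules, and the helper name `fasta_header`, are assumptions. It cannot tell whether a taxid exists in NCBI Taxonomy.

```python
import json

ALLOWED_NAMES = {"taxid", "scientific_name", "rank", "definition",
                 "sample", "run_id", "instrument"}


def fasta_header(identifier: str, annotations: dict) -> str:
    """Return a FASTA header line with compact JSON annotations."""
    for name, value in annotations.items():
        is_map = isinstance(value, dict)
        # Assumption: taxid_merged, sample_merged, ... are the map forms of allowed names.
        base = name[: -len("_merged")] if name.endswith("_merged") else name
        if base not in ALLOWED_NAMES:
            raise ValueError(f"attribute name not allowed: {name}")
        if is_map and not name.endswith("_merged"):
            raise ValueError(f"map attribute must end with _merged: {name}")
    return f">{identifier} " + json.dumps(annotations, separators=(",", ":"))
```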
CSV FILES FOR JOINS
When creating CSV files for obijoin:
- Do NOT include the ID column in the CSV (the join key is specified separately via `--by`)
- The CSV format is auto-detected; do NOT use the `--csv` flag
- Example CSV structure for taxid join:

      taxid,scientific_name,phylum
      2,Bacteria,Proteobacteria
      2157,Archaea,Euryarchaeota
      2759,Eukaryota,Arthropoda

- Example command:

      obijoin --join-with taxonomy.csv --by taxid sequences.fasta
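If the CSV is generated rather than typed, the standard `csv` module keeps the header and rows consistent. This is an illustrative sketch: `taxonomy_csv` is a hypothetical helper, and the check for an `id` column is only one possible way of enforcing the first rule above.

```python
import csv
import io


def taxonomy_csv(rows: list[dict], key: str = "taxid") -> str:
    """Render rows as CSV text, requiring the join key and rejecting an 'id' column."""
    fieldnames = list(rows[0])
    if key not in fieldnames:
        raise ValueError(f"join key column missing: {key}")
    if "id" in fieldnames:
        raise ValueError("do not include the sequence ID column")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The returned text can be passed as the `content` of the Write call for `taxonomy.csv`.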
PIPELINE
Execute the five states below in order. Do not skip states. Do not merge states.
STATE 1 — Read the documentation file and fetch pipeline command docs
Input: nothing. Action:
Step 1a — read the autodoc file:
<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md"}
</function>
Step 1b — scan the EXAMPLES section of the file just read for any obi* commands
other than obi{xxx} itself that appear in pipeline examples (e.g. obigrep, obiuniq,
obiclean). For each such command found, emit a WebFetch call to retrieve its online
documentation (in the same parallel message as Step 1a if possible, otherwise
immediately after):
<function=WebFetch>
{"url": "https://obitools4.metabarcoding.org/obitools/obi<other>/"}
</function>
If the page returns a 404 or error, store an empty string for that command.
Output: store content as $doc, and store fetched pages as $pipeline_docs
(a map from command name to page content).
Stop. Do not interpret or summarise. Proceed to STATE 2.
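Step 1b amounts to collecting the other `obi*` command names and mapping each to its documentation URL. A rough sketch follows; it assumes commands appear at the start of a line or right after a pipe, which may miss commands in other positions, and `pipeline_doc_urls` is a hypothetical helper name.

```python
import re

DOC_URL = "https://obitools4.metabarcoding.org/obitools/{name}/"

# A command name at the start of a line or immediately after a pipe.
COMMAND_RE = re.compile(r"(?:^|\|)\s*(obi[a-z0-9]+)\b", re.MULTILINE)


def pipeline_doc_urls(examples_text: str, own_command: str) -> dict[str, str]:
    """Map every obi* command other than `own_command` to its documentation URL."""
    names = set(COMMAND_RE.findall(examples_text))
    names.discard(own_command)
    return {name: DOC_URL.format(name=name) for name in sorted(names)}
```

Each value in the returned map is the URL for one WebFetch call; a page that returns a 404 is then stored as an empty string, as described above.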
STATE 2 — Analyse examples and design input files
Input: $doc.
Action (no tool calls):
- Extract every example command from the EXAMPLES section of `$doc`.
  - Identify every distinct input filename referenced (e.g. `sequences.fasta`, `reads_R1.fastq`, …).
  - Identify every option used and verify each against the OPTIONS section (see OPTION VALIDATION GUARD above).
  - For any `obi*` command used in a pipeline (not `obi{xxx}` itself), verify its flags and expression syntax against `$pipeline_docs`. If `$pipeline_docs` for that command is empty (page not found), mark the example as SKIP rather than guessing the syntax.
  - Coverage check (command-specific options): list all command-specific options from the OPTIONS section (excluding those covered by standard option-sets: `--fasta`, `--fastq`, `--out`, `--compress`, `--max-cpu`, etc.). Verify that every such option appears in at least one non-skipped example. If any option is not covered, add an additional example that exercises it before proceeding.
  - Skip any example that requires an external resource (taxonomy database, remote URL, pre-existing output file from a previous step not produced here). Mark it as SKIP: it will be kept verbatim in the EXAMPLES section without a `**Expected output:**` annotation.
  - `--paired-with` examples: `--paired-with` requires `--out` (standard output cannot be used). The command produces TWO output files named `<prefix>_R1.ext` and `<prefix>_R2.ext`, where `<prefix>` is the stem of the value given to `--out` and `.ext` is the format extension. For example, `obi{xxx} --paired-with reverse.fastq --out out_paired.fastq forward.fastq` produces `out_paired_R1.fastq` and `out_paired_R2.fastq`. Do NOT use `>` redirection for paired-with examples; use `--out` only. In STATE 4, read both `_R1` and `_R2` output files.
- For each distinct input filename, design synthetic sequence content that:
  - Is minimal (≤ 20 sequences, each ≤ 300 bp).
  - Contains sequences that will produce output for the given command (positive cases).
  - Contains at least one sequence that will not produce output, to confirm filtering (negative case), when the command filters sequences.
  - Exercises every option combination present in the non-skipped examples.
  - Uses realistic-looking identifiers (`seq001`, `seq002`, …) and a short definition that describes what makes the sequence relevant to the test.
- File format rules (strictly enforced):

  FASTA: one `>id description` header line, then the sequence on one or more lines (60 bp per line). Every sequence must be non-empty (≥ 10 bp, A/T/G/C only).

  FASTQ: exactly 4 lines per record (see FASTQ FORMAT section above). Before finalising the FASTQ content, mentally verify each record:
  - Line 1 starts with `@`, has an identifier, optionally a space and description.
  - Line 2 is the nucleotide sequence (non-empty, ≥ 10 characters).
  - Line 3 is exactly `+` (nothing else required).
  - Line 4 is the quality string with exactly the same number of characters as line 2.

  If any record fails this check, fix it before proceeding.
- Rewrite every non-skipped example command into two forms:
  - `$cmds_doc`: the bare command as it will appear in the documentation. It references only filenames present in `autodoc/examples/obi{xxx}/`, with output redirected to a descriptive filename (e.g. `out_default.fasta`). No `cd` prefix.
  - `$cmds_run`: the same command prefixed with the `cd` so it can be executed: `cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} &&`
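The two command forms and the paired output names described in this state can be sketched as follows. Both helpers are hypothetical names. The naming of paired files follows the rule stated above (stem of the `--out` value plus `_R1`/`_R2` and the extension); how the tool treats compound extensions such as `.fastq.gz` is not covered here.

```python
from pathlib import PurePosixPath

EXAMPLE_DIR = "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}"


def to_run_form(cmd_doc: str, example_dir: str = EXAMPLE_DIR) -> str:
    """Turn a documentation command ($cmds_doc) into its executable form ($cmds_run)."""
    return f"cd {example_dir} && {cmd_doc}"


def paired_output_names(out_value: str) -> tuple[str, str]:
    """Derive the two filenames expected from --paired-with for a given --out value."""
    path = PurePosixPath(out_value)
    stem, ext = path.stem, path.suffix
    return (str(path.with_name(f"{stem}_R1{ext}")),
            str(path.with_name(f"{stem}_R2{ext}")))
```

With `--out out_paired.fastq`, the second helper returns the `out_paired_R1.fastq` and `out_paired_R2.fastq` names that STATE 4 must read.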
Output: store file designs as $files, $cmds_doc, and $cmds_run.
Stop. Proceed to STATE 3.
STATE 3 — Write input files, validate them, and run examples
Input: $files, $cmds_doc, $cmds_run.
Step 3a — create input files (parallel): Emit one Write call per input file designed in STATE 2.
<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/FILENAME", "content": "..."}
</function>
Stop. Wait for all writes to complete. Then proceed to Step 3b.
Step 3b — validate input files: Before running any example, emit one Bash call that checks every written input file:
<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && python3 -c \"\nimport sys\n# Input filenames are passed as arguments after the script (replace FILENAMES).\nfor fname in sys.argv[1:]:\n    lines = open(fname).readlines()\n    assert lines, f'{fname}: file is empty'\n    if fname.endswith('.fastq'):\n        assert len(lines) % 4 == 0, f'{fname}: line count not multiple of 4'\n        for i in range(0, len(lines), 4):\n            hdr, seq, plus, qual = lines[i:i+4]\n            assert hdr.startswith('@'), f'{fname} record {i//4+1}: header must start with @'\n            assert plus.startswith('+'), f'{fname} record {i//4+1}: third line must start with +'\n            seq = seq.rstrip(); qual = qual.rstrip()\n            assert len(seq) >= 10, f'{fname} record {i//4+1}: sequence too short ({len(seq)})'\n            assert len(seq) == len(qual), f'{fname} record {i//4+1}: seq len {len(seq)} != qual len {len(qual)}'\n    elif fname.endswith('.fasta') or fname.endswith('.fa'):\n        assert lines[0].startswith('>'), f'{fname}: first line must start with >'\nprint('All input files valid')\n\" FILENAMES 2>&1; echo EXIT:$?"}
</function>
If validation fails (EXIT non-zero or output is not All input files valid): fix the
offending file(s) with new Write calls, then re-run validation. Do NOT proceed to
Step 3c until validation passes.
Step 3c — run examples (sequential, one Bash call at a time): Emit ONE Bash call, wait for the result, then emit the next. Do NOT batch them.
<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && COMMAND 2>&1; echo EXIT:$?"}
</function>
After each successful run (EXIT:0), immediately verify the output file was actually created and is non-empty with a second Bash call:
<function=Bash>
{"command": "ls -la /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE && head -c 100 /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>
Also verify the output format matches expectation using the first character rule
(see OUTPUT FORMAT GUARD): > = FASTA, @ = FASTQ, [/{ = JSON. If the format
is wrong, add the missing --fasta-output / --fastq-output / --json-output flag,
update $cmds_doc and $cmds_run, and re-run.
For each command, record in $runs:
- The `$cmds_doc` form (bare command for documentation).
- Exit code.
- The output filename(s).
- The confirmed output format (FASTA / FASTQ / JSON).
- The full stdout/stderr text.
If a command fails (EXIT non-zero): diagnose the error from stderr, fix the command,
update both $cmds_doc and $cmds_run, and re-run.
Do NOT proceed to STATE 4 until all non-skipped commands have EXIT:0 and verified
non-empty output files.
Output: store per-command results as $runs.
Stop. Proceed to STATE 4.
STATE 4 — Read output files
Input: $runs (output file paths from STATE 3).
Action: emit one Read call per output file that was successfully produced (EXIT:0).
<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>
Emit all reads in a single parallel message.
Output: store contents as $outputs.
Stop. Proceed to STATE 5.
STATE 5 — Update the documentation file
Input: $doc, $runs, $outputs, $cmds_doc.
Produce the updated file by copying $doc verbatim and applying ONLY the
three modifications below. Re-read the DOCUMENT PRESERVATION rules at the top before
writing.
Modification 1 — EXAMPLES section
For each non-skipped example:
- Replace the original command with the rewritten `$cmds_doc` form.
- Keep the one-line biological use-case comment above the code block unchanged.
- The `**Expected output:**` annotation goes on its own line after the closing triple-backtick of the code block, never inside it:

  ```bash
  obi{xxx} [options] input_file > out_name.fasta
  ```

  **Expected output:** N sequences written to `out_name.fasta`.

  where N is the count of lines starting with `>` or `@` in the corresponding `$outputs` entry.
For skipped examples: keep them exactly as they are in $doc with no annotation.
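Computing N deserves some care for FASTQ output: a quality string may itself begin with `@` (Phred 31 in the Sanger encoding), so counting every line that starts with `@` can overcount. The sketch below counts FASTQ records in blocks of four lines instead, assuming single-line sequences; `count_records` is a hypothetical helper name.

```python
def count_records(output_text: str) -> int:
    """Count sequence records in FASTA or 4-line FASTQ text."""
    lines = output_text.splitlines()
    if not lines:
        return 0
    if lines[0].startswith("@"):
        # FASTQ: one record per 4 lines; avoids miscounting quality lines starting with '@'.
        return len(lines) // 4
    # FASTA: one record per header line.
    return sum(1 for line in lines if line.startswith(">"))
```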
Modification 2 — Prose corrections (DESCRIPTION, OPTIONS, NOTES, …)
After completing all runs in STATE 3, compare $runs and $outputs against the
prose in $doc outside the EXAMPLES section. For each factual contradiction
found — where the documentation claims a behaviour that actual execution disproves —
apply a minimal correction:
- Fix only the specific sentence or phrase that is wrong. Do not rewrite the surrounding paragraph.
- Preserve the original wording as much as possible; change only what is incorrect.
- Examples of things to correct:
- An option described as producing output X when it actually produces output Y.
- A default value stated incorrectly.
- An attribute name that differs from what appears in actual output.
- A claim about which sequences are selected/discarded that contradicts observed results.
- An output format claimed by the documentation that differs from the actual output format observed (e.g. claiming CSV output when the tool produces FASTA).
- After each corrected passage, add an inline HTML comment documenting the fix: `<!-- corrected: <brief reason, e.g. "actual output is FASTA not CSV"> -->`
- Do NOT "improve" text that is merely incomplete or imprecise; only fix outright contradictions with observed behaviour.
Modification 3 — OUTPUT section
Find the existing # OUTPUT section in $doc. At the very end of that section
(before the next --- or # heading), append a single new subsection:
## Observed output example
<verbatim excerpt — first ≤ 10 sequences from the first successful $outputs entry>
Rules:
- The excerpt is copied byte-for-byte from `$outputs`. No editing, no truncation within a sequence record.
- Do NOT duplicate the OUTPUT section. There must be exactly one `# OUTPUT` heading in the resulting file.
- If no output was successfully produced, omit this subsection entirely.
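Selecting the excerpt without cutting a record in half can be done by slicing on record boundaries. This is a sketch under the assumption of FASTA or 4-line FASTQ text; JSON output is returned unchanged and would need separate handling. `first_records` is a hypothetical helper name.

```python
def first_records(output_text: str, limit: int = 10) -> str:
    """Return the first `limit` records of FASTA/FASTQ text, unmodified."""
    lines = output_text.splitlines(keepends=True)
    if not lines:
        return ""
    if lines[0].startswith("@"):
        # FASTQ: each record is exactly 4 lines.
        return "".join(lines[: 4 * limit])
    kept, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > limit:
                break
        kept.append(line)
    return "".join(kept)
```

Because the function only slices whole lines, the excerpt is always a prefix of the original output, which keeps it verbatim as the rules require.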
Final write
<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md", "content": "..."}
</function>
Stop. Do not emit any text after the Write call.