Files
obitools4/autodoc/prompt_examples.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

17 KiB

Task

Given autodoc/cmd/obi{xxx}.md, produce:

  1. autodoc/examples/obi{xxx}/ — a directory containing synthetic input sequence files that allow every example in the EXAMPLES section to be executed and validated.
  2. An updated autodoc/cmd/obi{xxx}.md — with corrected EXAMPLES and an enriched OUTPUT section describing observed output annotations.

TOOL CALL FORMAT — enforce before every call

A tool call is exactly:

<function=tool_name>
{"param": "value"}
</function>

Rules (no exceptions):

  • < is immediately followed by f — zero spaces, zero characters in between.
  • Parameters are a single JSON object — no XML tags, no <parameter=...>, no </parameter>.
  • No outer wrapper — never use <tool_call>, <tool_use>, or any other enclosing tag.
  • Tool name is lowercase with double underscores.

HALLUCINATION GUARD

Every sequence written in STATE 2 must be biologically valid for the command being tested. Derive sequence content from the OPTIONS and OUTPUT sections of $doc — never invent behaviour not described there.

EXECUTION GUARD — critical: The ## Observed output example subsection added in STATE 5 MUST contain verbatim bytes from $outputs (actual tool output read in STATE 4). It MUST NOT be invented or approximated. If no command succeeded, omit the subsection entirely rather than writing invented content.


DOCUMENT PRESERVATION — critical

The output of STATE 5 is $doc with surgical edits only. The rules are:

  • Copy the ENTIRE content of $doc verbatim into the new file.
  • Apply ONLY the three modifications described in STATE 5 (EXAMPLES update, prose corrections, OUTPUT subsection addition).
  • Do NOT reformat, reorder, rewrite, or restructure any heading, paragraph, option list, or prose from $doc unless it is factually contradicted by actual execution results (see Modification 2 in STATE 5).
  • Do NOT add new top-level sections (no ENVIRONMENT VARIABLES, no duplicate OUTPUT, etc.).
  • Do NOT change section title casing, Markdown heading levels, or list syntax.
  • If in doubt, leave the section exactly as it appears in $doc.

FASTQ FORMAT — mandatory structure

A valid FASTQ record is exactly 4 lines in this order:

@<identifier> <optional description>
<nucleotide sequence>          ← MUST be non-empty (≥ 10 characters, A/T/G/C only)
+
<quality string>               ← MUST be the exact same length as the sequence line

Common mistakes that are forbidden:

  • Writing @header\n+\nquality with the sequence line missing.
  • Writing a quality string shorter or longer than the sequence.
  • Mixing > (FASTA) and @ (FASTQ) headers in the same file.
  • Writing ~-separated fields (e.g. @seq002~description~here) — use a space.

OUTPUT FORMAT GUARD

OBITools4 determines the output format from the data content and explicit flags, not from the output filename extension. A file named out.fasta will contain FASTQ if quality scores are present and no --fasta-output flag is given.

Rules when designing examples:

  • If the example is meant to produce FASTA output from FASTQ input, the command MUST include --fasta-output.
  • If the example is meant to produce FASTQ output from FASTA input, the command MUST include --fastq-output.
  • Never assume an output format from the filename alone.
  • Verify the actual format of each output file in STATE 3b by checking its first character (> = FASTA, @ = FASTQ, [ or { = JSON).

OPTION VALIDATION GUARD

Before writing any example command in STATE 2, explicitly cross-check each option against the OPTIONS section of $doc:

  • Every flag used must appear in the OPTIONS section with the claimed semantics.
  • Input-format flags (--fasta, --fastq, --csv, --genbank, --embl, --ecopcr) tell the tool how to read the input. They do NOT affect the output format.
  • Output-format flags (--fasta-output, --fastq-output, --json-output) tell the tool what format to write. If there is no --csv-output (or similar) in the OPTIONS section, do NOT write an example claiming CSV output.
  • If an option needed for a working example is absent from $doc, mark that example as SKIP rather than inventing a flag.

ANNOTATION RULES — CRITICAL

When creating FASTA/FASTQ files with annotations:

  • Use only valid annotation attribute names: taxid, scientific_name, rank, definition, sample, run_id, instrument
  • For taxonomy data: use taxid (NCBI Taxonomy ID) and scientific_name — never invent taxids
  • Examples of valid taxonomy annotations:
    • >seq001 {"taxid":2} — Bacteria (valid NCBI taxid)
    • >seq002 {"taxid":2157,"scientific_name":"Archaea"} — Archaea (valid NCBI taxid)
    • >seq003 {"taxid":2759,"scientific_name":"Eukaryota"} — Eukaryota (valid NCBI taxid)
  • NEVER use invented taxids
  • Map attributes (JSON maps) must have names ending with _merged (e.g., taxid_merged, sample_merged)

CSV FILES FOR JOINS

When creating CSV files for obijoin:

  • Do NOT include the ID column in the CSV (the join key is specified separately via --by)
  • The CSV format is auto-detected; do NOT use --csv flag
  • Example CSV structure for taxid join:
    taxid,scientific_name,phylum
    2,Bacteria,Proteobacteria
    2157,Archaea,Euryarchaeota
    2759,Eukaryota,Arthropoda
    
  • Example command: obijoin --join-with taxonomy.csv --by taxid sequences.fasta

PIPELINE

Execute the five states below in order. Do not skip states. Do not merge states.


STATE 1 — Read the documentation file and fetch pipeline command docs

Input: nothing. Action:

Step 1a — read the autodoc file:

<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md"}
</function>

Step 1b — scan the EXAMPLES section of the file just read for any obi* commands other than obi{xxx} itself that appear in pipeline examples (e.g. obigrep, obiuniq, obiclean). For each such command found, emit a WebFetch call to retrieve its online documentation (in the same parallel message as Step 1a if possible, otherwise immediately after):

<function=WebFetch>
{"url": "https://obitools4.metabarcoding.org/obitools/obi<other>/"}
</function>

If the page returns a 404 or error, store an empty string for that command.

Output: store content as $doc, and store fetched pages as $pipeline_docs (a map from command name to page content). Stop. Do not interpret or summarise. Proceed to STATE 2.


STATE 2 — Analyse examples and design input files

Input: $doc. Action (no tool calls):

  1. Extract every example command from the EXAMPLES section of $doc.

    • Identify every distinct input filename referenced (e.g. sequences.fasta, reads_R1.fastq, …).
    • Identify every option used and verify each against the OPTIONS section (see OPTION VALIDATION GUARD above).
    • For any obi* command used in a pipeline (not obi{xxx} itself), verify its flags and expression syntax against $pipeline_docs. If $pipeline_docs for that command is empty (page not found), mark the example as SKIP rather than guessing the syntax.
    • Coverage check — command-specific options: list all command-specific options from the OPTIONS section (excluding those covered by standard option-sets: --fasta, --fastq, --out, --compress, --max-cpu, etc.). Verify that every such option appears in at least one non-skipped example. If any option is not covered, add an additional example that exercises it before proceeding.
    • Skip any example that requires an external resource (taxonomy database, remote URL, pre-existing output file from a previous step not produced here). Mark it as SKIP — it will be kept verbatim in the EXAMPLES section without a **Expected output:** annotation.
    • --paired-with examples: --paired-with requires --out (standard output cannot be used). The command produces TWO output files named <prefix>_R1.ext and <prefix>_R2.ext where <prefix> is the stem of the value given to --out and .ext is the format extension. For example: obi{xxx} --paired-with reverse.fastq --out out_paired.fastq forward.fastq produces out_paired_R1.fastq and out_paired_R2.fastq. Do NOT use > redirection for paired-with examples — use --out only. In STATE 4, read both _R1 and _R2 output files.
  2. For each distinct input filename, design synthetic sequence content that:

    • Is minimal (≤ 20 sequences, each ≤ 300 bp).
    • Contains sequences that will produce output for the given command (positive cases).
    • Contains at least one sequence that will not produce output, to confirm filtering (negative case), when the command filters sequences.
    • Exercises every option combination present in the non-skipped examples.
    • Uses realistic-looking identifiers (seq001, seq002, …) and a short definition that describes what makes the sequence relevant to the test.
  3. File format rules (strictly enforced):

    FASTA: one >id description header line, then the sequence on one or more lines (60 bp per line). Every sequence must be non-empty (≥ 10 bp, A/T/G/C only).

    FASTQ: exactly 4 lines per record — see FASTQ FORMAT section above. Before finalising the FASTQ content, mentally verify each record:

    • Line 1 starts with @, has an identifier, optionally a space and description.
    • Line 2 is the nucleotide sequence (non-empty, ≥ 10 characters).
    • Line 3 is exactly + (nothing else required).
    • Line 4 is the quality string with exactly the same number of characters as line 2. If any record fails this check, fix it before proceeding.
  4. Rewrite every non-skipped example command into two forms:

    • $cmds_doc: the bare command as it will appear in the documentation — references only filenames present in autodoc/examples/obi{xxx}/, output redirected to a descriptive filename (e.g. out_default.fasta). No cd prefix.
    • $cmds_run: the same command prefixed with the cd so it can be executed: cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} &&

Output: store file designs as $files, $cmds_doc, and $cmds_run. Stop. Proceed to STATE 3.


STATE 3 — Write input files, validate them, and run examples

Input: $files, $cmds_doc, $cmds_run.

Step 3a — create input files (parallel): Emit one Write call per input file designed in STATE 2.

<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/FILENAME", "content": "..."}
</function>

Stop. Wait for all writes to complete. Then proceed to Step 3b.

Step 3b — validate input files: Before running any example, emit one Bash call that checks every written input file:

<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && python3 -c \"\nimport sys\nfor fname in $(echo FILENAMES):\n    lines = open(fname).readlines()\n    if fname.endswith('.fastq'):\n        assert len(lines) % 4 == 0, f'{fname}: line count not multiple of 4'\n        for i in range(0, len(lines), 4):\n            hdr, seq, plus, qual = lines[i:i+4]\n            assert hdr.startswith('@'), f'{fname} record {i//4+1}: header must start with @'\n            seq = seq.rstrip(); qual = qual.rstrip()\n            assert len(seq) >= 10, f'{fname} record {i//4+1}: sequence too short ({len(seq)})'\n            assert len(seq) == len(qual), f'{fname} record {i//4+1}: seq len {len(seq)} != qual len {len(qual)}'\n    elif fname.endswith('.fasta') or fname.endswith('.fa'):\n        assert lines[0].startswith('>'), f'{fname}: first line must start with >'\nprint('All input files valid')\n\" 2>&1; echo EXIT:$?"}
</function>

If validation fails (EXIT non-zero or output is not All input files valid): fix the offending file(s) with new Write calls, then re-run validation. Do NOT proceed to Step 3c until validation passes.

Step 3c — run examples (sequential, one Bash call at a time): Emit ONE Bash call, wait for the result, then emit the next. Do NOT batch them.

<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && COMMAND 2>&1; echo EXIT:$?"}
</function>

After each successful run (EXIT:0), immediately verify the output file was actually created and is non-empty with a second Bash call:

<function=Bash>
{"command": "ls -la /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE && head -c 100 /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>

Also verify the output format matches expectation using the first character rule (see OUTPUT FORMAT GUARD): > = FASTA, @ = FASTQ, [/{ = JSON. If the format is wrong, add the missing --fasta-output / --fastq-output / --json-output flag, update $cmds_doc and $cmds_run, and re-run.

For each command, record in $runs:

  • The $cmds_doc form (bare command for documentation).
  • Exit code.
  • The output filename(s).
  • The confirmed output format (FASTA / FASTQ / JSON).
  • The full stdout/stderr text.

If a command fails (EXIT non-zero): diagnose the error from stderr, fix the command, update both $cmds_doc and $cmds_run, and re-run. Do NOT proceed to STATE 4 until all non-skipped commands have EXIT:0 and verified non-empty output files.

Output: store per-command results as $runs. Stop. Proceed to STATE 4.


STATE 4 — Read output files

Input: $runs (output file paths from STATE 3). Action: emit one Read call per output file that was successfully produced (EXIT:0).

<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>

Emit all reads in a single parallel message.

Output: store contents as $outputs. Stop. Proceed to STATE 5.


STATE 5 — Update the documentation file

Input: $doc, $runs, $outputs, $cmds_doc.

Produce the updated file by copying $doc verbatim and applying ONLY the three modifications below. Re-read the DOCUMENT PRESERVATION rules at the top before writing.

Modification 1 — EXAMPLES section

For each non-skipped example:

  • Replace the original command with the rewritten $cmds_doc form.

  • Keep the one-line biological use-case comment above the code block unchanged.

  • The **Expected output:** annotation goes on its own line after the closing triple-backtick of the code block, never inside it:

    ```bash
    obi{xxx} [options] input_file > out_name.fasta
    

    Expected output: N sequences written to out_name.fasta.

    
    where N is the count of lines starting with `>` or `@` in the corresponding
    `$outputs` entry.
    
    

For skipped examples: keep them exactly as they are in $doc with no annotation.

Modification 2 — Prose corrections (DESCRIPTION, OPTIONS, NOTES, …)

After completing all runs in STATE 3, compare $runs and $outputs against the prose in $doc outside the EXAMPLES section. For each factual contradiction found — where the documentation claims a behaviour that actual execution disproves — apply a minimal correction:

  • Fix only the specific sentence or phrase that is wrong. Do not rewrite the surrounding paragraph.
  • Preserve the original wording as much as possible; change only what is incorrect.
  • Examples of things to correct:
    • An option described as producing output X when it actually produces output Y.
    • A default value stated incorrectly.
    • An attribute name that differs from what appears in actual output.
    • A claim about which sequences are selected/discarded that contradicts observed results.
    • An output format claimed by the documentation that differs from the actual output format observed (e.g. claiming CSV output when the tool produces FASTA).
  • After each corrected passage, add an inline HTML comment documenting the fix: <!-- corrected: <brief reason, e.g. "actual output is FASTA not CSV"> -->
  • Do NOT "improve" text that is merely incomplete or imprecise — only fix outright contradictions with observed behaviour.

Modification 3 — OUTPUT section

Find the existing # OUTPUT section in $doc. At the very end of that section (before the next --- or # heading), append a single new subsection:

## Observed output example

<verbatim excerpt — first ≤ 10 sequences from the first successful $outputs entry>

Rules:

  • The excerpt is copied byte-for-byte from $outputs. No editing, no truncation within a sequence record.
  • Do NOT duplicate the OUTPUT section. There must be exactly one # OUTPUT heading in the resulting file.
  • If no output was successfully produced, omit this subsection entirely.

Final write

<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md", "content": "..."}
</function>

Stop. Do not emit any text after the Write call.