mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 03:50:39 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00

Task

Given autodoc/cmd/obi{xxx}.md, produce:

autodoc/examples/obi{xxx}/ — a directory containing synthetic input sequence files that allow every example in the EXAMPLES section to be executed and validated.
An updated autodoc/cmd/obi{xxx}.md — with corrected EXAMPLES and an enriched OUTPUT section describing observed output annotations.

TOOL CALL FORMAT — enforce before every call

A tool call is exactly:

<function=tool_name>
{"param": "value"}
</function>

Rules (no exceptions):

< is immediately followed by f — zero spaces, zero characters in between.
Parameters are a single JSON object — no XML tags, no <parameter=...>, no </parameter>.
No outer wrapper — never use <tool_call>, <tool_use>, or any other enclosing tag.
Tool name is lowercase with double underscores.
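The format rules above can be checked mechanically. The following is a minimal sketch in Python, assuming a hypothetical helper named `parse_tool_call` (it is not part of any tool runtime). It checks only the envelope and the JSON object, not the casing of the tool name.

```python
import json
import re

# Envelope: "<function=NAME>", a newline, the JSON body, a newline, "</function>".
# Nothing may precede "<function=" or follow "</function>" (no outer wrapper).
TOOL_CALL = re.compile(r"\A<function=([A-Za-z_]+)>\n(.*)\n</function>\Z", re.DOTALL)


def parse_tool_call(text: str) -> tuple[str, dict]:
    """Return (tool_name, params) or raise ValueError if the call is malformed."""
    match = TOOL_CALL.match(text)
    if match is None:
        raise ValueError("not a bare <function=...> ... </function> call")
    # json.JSONDecodeError is a subclass of ValueError, so XML-style
    # parameters such as <parameter=...> are rejected here as well.
    params = json.loads(match.group(2))
    if not isinstance(params, dict):
        raise ValueError("parameters must be a single JSON object")
    return match.group(1), params
```

A call wrapped in an enclosing tag, or one with a space after `<`, fails the envelope match and is rejected before the JSON is parsed.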

HALLUCINATION GUARD

Every sequence written in STATE 2 must be biologically valid for the command being tested. Derive sequence content from the OPTIONS and OUTPUT sections of $doc — never invent behaviour not described there.

EXECUTION GUARD — critical: The ## Observed output example subsection added in STATE 5 MUST contain verbatim bytes from $outputs (actual tool output read in STATE 4). It MUST NOT be invented or approximated. If no command succeeded, omit the subsection entirely rather than writing invented content.

DOCUMENT PRESERVATION — critical

The output of STATE 5 is $doc with surgical edits only. The rules are:

Copy the ENTIRE content of $doc verbatim into the new file.
Apply ONLY the three modifications described in STATE 5 (EXAMPLES update, prose corrections, OUTPUT subsection addition).
Do NOT reformat, reorder, rewrite, or restructure any heading, paragraph, option list, or prose from $doc unless it is factually contradicted by actual execution results (see Modification 2 in STATE 5).
Do NOT add new top-level sections (no ENVIRONMENT VARIABLES, no duplicate OUTPUT, etc.).
Do NOT change section title casing, Markdown heading levels, or list syntax.
If in doubt, leave the section exactly as it appears in $doc.

FASTQ FORMAT — mandatory structure

A valid FASTQ record is exactly 4 lines in this order:

@<identifier> <optional description>
<nucleotide sequence>          ← MUST be non-empty (≥ 10 characters, A/T/G/C only)
+
<quality string>               ← MUST be the exact same length as the sequence line

Common mistakes that are forbidden:

Writing @header\n+\nquality with the sequence line missing.
Writing a quality string shorter or longer than the sequence.
Mixing > (FASTA) and @ (FASTQ) headers in the same file.
Writing ~-separated fields (e.g. @seq002~description~here) — use a space.
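One way to avoid the mistakes listed above is to build each record programmatically, so the quality string has the right length by construction. This is a minimal sketch; the helper name `fastq_record` and the constant quality character `I` are illustrative assumptions, not OBITools4 conventions.

```python
def fastq_record(identifier: str, sequence: str,
                 description: str = "", qual_char: str = "I") -> str:
    """Return one 4-line FASTQ record with a quality string as long as the sequence."""
    if len(sequence) < 10 or set(sequence) - set("ATGC"):
        raise ValueError("sequence must be >= 10 characters, A/T/G/C only")
    # Identifier and description are separated by a space, never by '~'.
    header = f"@{identifier} {description}".rstrip()
    quality = qual_char * len(sequence)  # same length as the sequence line
    return "\n".join([header, sequence, "+", quality]) + "\n"


print(fastq_record("seq001", "ACGTACGTACGT", "positive case"), end="")
```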

OUTPUT FORMAT GUARD

OBITools4 determines the output format from the data content and explicit flags, not from the output filename extension. A file named out.fasta will contain FASTQ if quality scores are present and no --fasta-output flag is given.

Rules when designing examples:

If the example is meant to produce FASTA output from FASTQ input, the command MUST include --fasta-output.
If the example is meant to produce FASTQ output from FASTA input, the command MUST include --fastq-output.
Never assume an output format from the filename alone.
Verify the actual format of each output file in STATE 3 (Step 3c) by checking its first character (> = FASTA, @ = FASTQ, [ or { = JSON).
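The first-character rule can be sketched as a small Python function. The name `detect_format` and the `UNKNOWN` fallback are assumptions made for illustration.

```python
def detect_format(content: str) -> str:
    """Classify output content by its first character, ignoring the filename."""
    first = content[:1]
    return {">": "FASTA", "@": "FASTQ", "[": "JSON", "{": "JSON"}.get(first, "UNKNOWN")


# A file named out.fasta that actually holds FASTQ records is reported as FASTQ.
print(detect_format("@seq001 positive case\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"))
```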

OPTION VALIDATION GUARD

Before writing any example command in STATE 2, explicitly cross-check each option against the OPTIONS section of $doc:

Every flag used must appear in the OPTIONS section with the claimed semantics.
Input-format flags (--fasta, --fastq, --csv, --genbank, --embl, --ecopcr) tell the tool how to read the input. They do NOT affect the output format.
Output-format flags (--fasta-output, --fastq-output, --json-output) tell the tool what format to write. If there is no --csv-output (or similar) in the OPTIONS section, do NOT write an example claiming CSV output.
If an option needed for a working example is absent from $doc, mark that example as SKIP rather than inventing a flag.
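The presence half of this cross-check can be roughed out as follows. The function name `undocumented_flags` is hypothetical, and the sketch only tests whether each long flag occurs in the OPTIONS text; it cannot confirm that the flag has the claimed semantics.

```python
import re

# A long option: "--" not preceded by a word character or hyphen, then a name.
FLAG = re.compile(r"(?<![\w-])--[a-z][a-z0-9-]*")


def undocumented_flags(command: str, options_section: str) -> set[str]:
    """Return the long flags used in `command` that never occur in `options_section`."""
    documented = set(FLAG.findall(options_section))
    return set(FLAG.findall(command)) - documented
```

An example whose result set is non-empty would be marked SKIP rather than rewritten with a guessed flag.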

ANNOTATION RULES — CRITICAL

When creating FASTA/FASTQ files with annotations:

Use only valid annotation attribute names: taxid, scientific_name, rank, definition, sample, run_id, instrument
For taxonomy data: use taxid (NCBI Taxonomy ID) and scientific_name — never invent taxids
Examples of valid taxonomy annotations:
- >seq001 {"taxid":2} — Bacteria (valid NCBI taxid)
- >seq002 {"taxid":2157,"scientific_name":"Archaea"} — Archaea (valid NCBI taxid)
- >seq003 {"taxid":2759,"scientific_name":"Eukaryota"} — Eukaryota (valid NCBI taxid)
NEVER use invented taxids
Map attributes (JSON maps) must have names ending with _merged (e.g., taxid_merged, sample_merged)
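As an illustration of these rules, a header builder can refuse attribute names outside the allowed list and map attributes without the `_merged` suffix. This is a sketch only: `fasta_header` is a hypothetical helper, and it does not check that a taxid exists in NCBI Taxonomy.

```python
import json

ALLOWED = {"taxid", "scientific_name", "rank", "definition",
           "sample", "run_id", "instrument"}


def fasta_header(identifier: str, annotations: dict) -> str:
    """Return '>id {json}' after checking attribute names against the rules above."""
    for name, value in annotations.items():
        if isinstance(value, dict):
            # Map attributes must carry the _merged suffix.
            if not name.endswith("_merged"):
                raise ValueError(f"map attribute {name!r} must end with _merged")
        elif name not in ALLOWED:
            raise ValueError(f"attribute {name!r} is not in the allowed list")
    return f">{identifier} " + json.dumps(annotations, separators=(",", ":"))


print(fasta_header("seq002", {"taxid": 2157, "scientific_name": "Archaea"}))
```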

CSV FILES FOR JOINS

When creating CSV files for obijoin:

Do NOT include the ID column in the CSV (the join key is specified separately via --by)
The CSV format is auto-detected; do NOT use --csv flag

Example CSV structure for taxid join:

taxid,scientific_name,phylum
2,Bacteria,Proteobacteria
2157,Archaea,Euryarchaeota
2759,Eukaryota,Arthropoda

Example command: obijoin --join-with taxonomy.csv --by taxid sequences.fasta

PIPELINE

Execute the five states below in order. Do not skip states. Do not merge states.

STATE 1 — Read the documentation file and fetch pipeline command docs

Input: nothing. Action:

Step 1a — read the autodoc file:

<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md"}
</function>

Step 1b — scan the EXAMPLES section of the file just read for any obi* commands other than obi{xxx} itself that appear in pipeline examples (e.g. obigrep, obiuniq, obiclean). For each such command found, emit a WebFetch call to retrieve its online documentation (in the same parallel message as Step 1a if possible, otherwise immediately after):

<function=WebFetch>
{"url": "https://obitools4.metabarcoding.org/obitools/obi<other>/"}
</function>

If the page returns a 404 or error, store an empty string for that command.
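The scan in Step 1b amounts to collecting command names and mapping them to documentation URLs. A minimal sketch, assuming hypothetical helper names and a simple `obi` + lowercase-alphanumerics naming pattern (the package name `obitools4` itself is filtered out):

```python
import re

DOC_URL = "https://obitools4.metabarcoding.org/obitools/{name}/"


def other_obi_commands(examples_text: str, current: str) -> list[str]:
    """List obi* command names in the examples, excluding the documented command."""
    names = set(re.findall(r"\bobi[a-z0-9]+\b", examples_text))
    return sorted(n for n in names
                  if n != current and not n.startswith("obitools"))


def doc_urls(examples_text: str, current: str) -> dict[str, str]:
    """Map each other command to the URL that Step 1b would fetch."""
    return {n: DOC_URL.format(name=n)
            for n in other_obi_commands(examples_text, current)}
```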

Output: store content as $doc, and store fetched pages as $pipeline_docs (a map from command name to page content). Stop. Do not interpret or summarise. Proceed to STATE 2.

STATE 2 — Analyse examples and design input files

Input: $doc. Action (no tool calls):

Extract every example command from the EXAMPLES section of $doc.
- Identify every distinct input filename referenced (e.g. sequences.fasta, reads_R1.fastq, …).
- Identify every option used and verify each against the OPTIONS section (see OPTION VALIDATION GUARD above).
- For any obi* command used in a pipeline (not obi{xxx} itself), verify its flags and expression syntax against $pipeline_docs. If $pipeline_docs for that command is empty (page not found), mark the example as SKIP rather than guessing the syntax.
- Coverage check — command-specific options: list all command-specific options from the OPTIONS section (excluding those covered by standard option-sets: --fasta, --fastq, --out, --compress, --max-cpu, etc.). Verify that every such option appears in at least one non-skipped example. If any option is not covered, add an additional example that exercises it before proceeding.
- Skip any example that requires an external resource (taxonomy database, remote URL, pre-existing output file from a previous step not produced here). Mark it as SKIP — it will be kept verbatim in the EXAMPLES section without a **Expected output:** annotation.
- --paired-with examples: --paired-with requires --out (standard output cannot be used). The command produces TWO output files named <prefix>_R1.ext and <prefix>_R2.ext where <prefix> is the stem of the value given to --out and .ext is the format extension. For example: obi{xxx} --paired-with reverse.fastq --out out_paired.fastq forward.fastq produces out_paired_R1.fastq and out_paired_R2.fastq. Do NOT use > redirection for paired-with examples — use --out only. In STATE 4, read both _R1 and _R2 output files.
For each distinct input filename, design synthetic sequence content that:
- Is minimal (≤ 20 sequences, each ≤ 300 bp).
- Contains sequences that will produce output for the given command (positive cases).
- Contains at least one sequence that will not produce output, to confirm filtering (negative case), when the command filters sequences.
- Exercises every option combination present in the non-skipped examples.
- Uses realistic-looking identifiers (seq001, seq002, …) and a short definition that describes what makes the sequence relevant to the test.
File format rules (strictly enforced):

FASTA: one >id description header line, then the sequence on one or more lines (60 bp per line). Every sequence must be non-empty (≥ 10 bp, A/T/G/C only).

FASTQ: exactly 4 lines per record — see FASTQ FORMAT section above. Before finalising the FASTQ content, mentally verify each record:
- Line 1 starts with @, has an identifier, optionally a space and description.
- Line 2 is the nucleotide sequence (non-empty, ≥ 10 characters).
- Line 3 is exactly + (nothing else required).
- Line 4 is the quality string with exactly the same number of characters as line 2.
If any record fails this check, fix it before proceeding.
Rewrite every non-skipped example command into two forms:
- $cmds_doc: the bare command as it will appear in the documentation — references only filenames present in autodoc/examples/obi{xxx}/, output redirected to a descriptive filename (e.g. out_default.fasta). No cd prefix.
- $cmds_run: the same command prefixed with the cd so it can be executed: cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} &&
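The --paired-with naming rule described in the list above can be sketched in Python to predict which files STATE 4 must read. The helper name `paired_outputs` is hypothetical, and the sketch assumes the extension is taken from the value given to --out, which matches the documented example but is not otherwise confirmed.

```python
from pathlib import Path


def paired_outputs(out_value: str) -> tuple[str, str]:
    """Return the expected <prefix>_R1.ext and <prefix>_R2.ext names for --out."""
    path = Path(out_value)
    # Assumption: .ext is the suffix of the --out value.
    r1 = path.with_name(f"{path.stem}_R1{path.suffix}")
    r2 = path.with_name(f"{path.stem}_R2{path.suffix}")
    return str(r1), str(r2)


print(paired_outputs("out_paired.fastq"))
```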

Output: store file designs as $files, $cmds_doc, and $cmds_run. Stop. Proceed to STATE 3.

STATE 3 — Write input files, validate them, and run examples

Input: $files, $cmds_doc, $cmds_run.

Step 3a — create input files (parallel): Emit one Write call per input file designed in STATE 2.

<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/FILENAME", "content": "..."}
</function>

Stop. Wait for all writes to complete. Then proceed to Step 3b.

Step 3b — validate input files: Before running any example, emit one Bash call that checks every written input file:

<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && python3 -c \"\nimport sys\nfor fname in sys.argv[1:]:\n    lines = open(fname).readlines()\n    if fname.endswith('.fastq'):\n        assert len(lines) % 4 == 0, f'{fname}: line count not multiple of 4'\n        for i in range(0, len(lines), 4):\n            hdr, seq, plus, qual = lines[i:i+4]\n            assert hdr.startswith('@'), f'{fname} record {i//4+1}: header must start with @'\n            assert plus.startswith('+'), f'{fname} record {i//4+1}: line 3 must start with +'\n            seq = seq.rstrip(); qual = qual.rstrip()\n            assert len(seq) >= 10, f'{fname} record {i//4+1}: sequence too short ({len(seq)})'\n            assert len(seq) == len(qual), f'{fname} record {i//4+1}: seq len {len(seq)} != qual len {len(qual)}'\n    elif fname.endswith('.fasta') or fname.endswith('.fa'):\n        assert lines and lines[0].startswith('>'), f'{fname}: first line must start with >'\nprint('All input files valid')\n\" FILENAMES 2>&1; echo EXIT:$?"}
</function>

If validation fails (EXIT non-zero or output is not All input files valid): fix the offending file(s) with new Write calls, then re-run validation. Do NOT proceed to Step 3c until validation passes.

Step 3c — run examples (sequential, one Bash call at a time): Emit ONE Bash call, wait for the result, then emit the next. Do NOT batch them.

<function=Bash>
{"command": "cd /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx} && COMMAND 2>&1; echo EXIT:$?"}
</function>

After each successful run (EXIT:0), immediately verify the output file was actually created and is non-empty with a second Bash call:

<function=Bash>
{"command": "ls -la /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE && head -c 100 /Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>

Also verify the output format matches expectation using the first character rule (see OUTPUT FORMAT GUARD): > = FASTA, @ = FASTQ, [/{ = JSON. If the format is wrong, add the missing --fasta-output / --fastq-output / --json-output flag, update $cmds_doc and $cmds_run, and re-run.

For each command, record in $runs:

The $cmds_doc form (bare command for documentation).
Exit code.
The output filename(s).
The confirmed output format (FASTA / FASTQ / JSON).
The full stdout/stderr text.
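The per-command record can be pictured as a small data structure. This is only a sketch: the class and field names below are illustrative assumptions, and `$runs` would then be a list of such records.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    cmd_doc: str             # bare command as it will appear in the documentation
    exit_code: int           # value printed after EXIT:
    output_files: list[str]  # one file, or the _R1/_R2 pair for --paired-with
    output_format: str       # "FASTA", "FASTQ" or "JSON", confirmed from content
    log: str                 # full stdout/stderr text

    @property
    def succeeded(self) -> bool:
        """True when the command exited with 0 and named at least one output file."""
        return self.exit_code == 0 and bool(self.output_files)


run = RunRecord("obi{xxx} input.fasta > out_default.fasta", 0,
                ["out_default.fasta"], "FASTA", "")
print(run.succeeded)
```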

If a command fails (EXIT non-zero): diagnose the error from stderr, fix the command, update both $cmds_doc and $cmds_run, and re-run. Do NOT proceed to STATE 4 until all non-skipped commands have EXIT:0 and verified non-empty output files.

Output: store per-command results as $runs. Stop. Proceed to STATE 4.

STATE 4 — Read output files

Input: $runs (output file paths from STATE 3). Action: emit one Read call per output file that was successfully produced (EXIT:0).

<function=Read>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/examples/obi{xxx}/OUTPUT_FILE"}
</function>

Emit all reads in a single parallel message.

Output: store contents as $outputs. Stop. Proceed to STATE 5.

STATE 5 — Update the documentation file

Input: $doc, $runs, $outputs, $cmds_doc.

Produce the updated file by copying $doc verbatim and applying ONLY the three modifications below. Re-read the DOCUMENT PRESERVATION rules at the top before writing.

Modification 1 — EXAMPLES section

For each non-skipped example:

Replace the original command with the rewritten $cmds_doc form.
Keep the one-line biological use-case comment above the code block unchanged.
The **Expected output:** annotation goes on its own line after the closing triple-backtick of the code block, never inside it:
````
```bash
obi{xxx} [options] input_file > out_name.fasta
```
**Expected output:** N sequences written to out_name.fasta.
````
where N is the count of lines starting with `>` or `@` in the corresponding `$outputs` entry.

For skipped examples: keep them exactly as they are in $doc with no annotation.
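When computing N for FASTQ output, note that a quality line may itself begin with `@`, so counting every line that starts with `@` can over-count. The sketch below is one hedged way to compute N; `count_records` is a hypothetical helper, and it assumes FASTQ records are exactly four lines each, as required in the FASTQ FORMAT section.

```python
def count_records(content: str) -> int:
    """Count sequence records in FASTA or 4-line FASTQ content."""
    lines = content.splitlines()
    if content.startswith("@"):
        # FASTQ: headers sit on every fourth line; quality lines that
        # happen to start with '@' are skipped by the stride.
        return sum(1 for line in lines[0::4] if line.startswith("@"))
    # FASTA: every header line starts with '>'.
    return sum(1 for line in lines if line.startswith(">"))
```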

Modification 2 — Prose corrections (DESCRIPTION, OPTIONS, NOTES, …)

After completing all runs in STATE 3, compare $runs and $outputs against the prose in $doc outside the EXAMPLES section. For each factual contradiction found — where the documentation claims a behaviour that actual execution disproves — apply a minimal correction:

Fix only the specific sentence or phrase that is wrong. Do not rewrite the surrounding paragraph.
Preserve the original wording as much as possible; change only what is incorrect.
Examples of things to correct:
- An option described as producing output X when it actually produces output Y.
- A default value stated incorrectly.
- An attribute name that differs from what appears in actual output.
- A claim about which sequences are selected/discarded that contradicts observed results.
- An output format claimed by the documentation that differs from the actual output format observed (e.g. claiming CSV output when the tool produces FASTA).
After each corrected passage, add an inline HTML comment documenting the fix: 
Do NOT "improve" text that is merely incomplete or imprecise — only fix outright contradictions with observed behaviour.

Modification 3 — OUTPUT section

Find the existing # OUTPUT section in $doc. At the very end of that section (before the next --- or # heading), append a single new subsection:

## Observed output example

Rules:

The excerpt is copied byte-for-byte from $outputs. No editing, no truncation within a sequence record.
Do NOT duplicate the OUTPUT section. There must be exactly one # OUTPUT heading in the resulting file.
If no output was successfully produced, omit this subsection entirely.

Final write

<function=Write>
{"file_path": "/Users/coissac/Sync/travail/__MOI__/GO/obitools4/autodoc/cmd/obi{xxx}.md", "content": "..."}
</function>

Stop. Do not emit any text after the Write call.
