mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
8c7017a99d
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
40 lines
1.7 KiB
Markdown
40 lines
1.7 KiB
Markdown
# Mann-Whitney U Distribution Implementation in `obistats`
|
||
|
||
The `obistats` package provides efficient computation of the **Mann-Whitney U distribution**, used in nonparametric hypothesis testing to compare two independent samples.
|
||
|
||
## Core Types
|
||
|
||
- **`UDist`**: Represents the discrete probability distribution of the U statistic for sample sizes `N1`, `N2`. It optionally handles **ties** via a tie-count vector `T`.
|
||
|
||
## Key Features
|
||
|
||
- ✅ **Exact distribution computation**, both with and without ties.
|
||
- *No ties*: Uses dynamic programming (Mann–Whitney recurrence) in `O(N1·N2·U)` time.
|
||
- *With ties*: Implements the linked-list-based algorithm from Cheung & Klotz (1997) via memoization (`makeUmemo`).
|
||
|
||
- ✅ **PMF & CDF evaluation**:
|
||
- `PMF(U)` returns the probability mass at U.
|
||
- `CDF(U)` computes cumulative probabilities using symmetry to minimize computation.
|
||
|
||
- ✅ **Support for tied ranks**:
|
||
- `T` encodes tie multiplicities per rank; if nil, no ties are assumed.
|
||
|
||
- ✅ **Optimized recurrence**:
|
||
- Exploits symmetry (`p_{n,m} = p_{m,n}`) and incremental DP to reduce memory/time.
|
||
|
||
- ✅ **Boundary handling**:
|
||
- `Bounds()` returns support `[0, N1·N2]`.
|
||
- `Step() = 0.5`, reflecting U’s discrete unit in tied cases.
|
||
|
||
## Algorithm Notes
|
||
|
||
- `p(U)` uses a 2D DP table (rows = *n*, columns = U), computing only necessary states.
|
||
- `makeUmemo` builds a 3D memoization table (`k`, `n1`, `2U`) for tied distributions.
|
||
- Performance bottlenecks noted in comments (e.g., map overhead) suggest future optimization paths.
|
||
|
||
## Use Case
|
||
|
||
Enables exact *p*-value calculation for the **Mann-Whitney U test**, especially valuable when:
|
||
- Sample sizes are small-to-moderate (exact methods needed).
|
||
- Data contain ties.
|