mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 12:00:39 +00:00
40 lines
1.7 KiB
Markdown
40 lines
1.7 KiB
Markdown
|
|
# Mann-Whitney U Distribution Implementation in `obistats`
|
|||
|
|
|
|||
|
|
The `obistats` package provides efficient computation of the **Mann-Whitney U distribution**, used in nonparametric hypothesis testing to compare two independent samples.
|
|||
|
|
|
|||
|
|
## Core Types
|
|||
|
|
|
|||
|
|
- **`UDist`**: Represents the discrete probability distribution of the U statistic for sample sizes `N1`, `N2`. It optionally handles **ties** via a tie-count vector `T`.
|
|||
|
|
|
|||
|
|
## Key Features
|
|||
|
|
|
|||
|
|
- ✅ **Exact distribution computation**, both with and without ties.
|
|||
|
|
- *No ties*: Uses dynamic programming (Mann–Whitney recurrence) in `O(N1·N2·U)` time.
|
|||
|
|
- *With ties*: Implements the linked-list-based algorithm from Cheung & Klotz (1997) via memoization (`makeUmemo`).
|
|||
|
|
|
|||
|
|
- ✅ **PMF & CDF evaluation**:
|
|||
|
|
- `PMF(U)` returns the probability mass at U.
|
|||
|
|
- `CDF(U)` computes cumulative probabilities using symmetry to minimize computation.
|
|||
|
|
|
|||
|
|
- ✅ **Support for tied ranks**:
|
|||
|
|
- `T` encodes tie multiplicities per rank; if nil, no ties are assumed.
|
|||
|
|
|
|||
|
|
- ✅ **Optimized recurrence**:
|
|||
|
|
- Exploits symmetry (`p_{n,m} = p_{m,n}`) and incremental DP to reduce memory/time.
|
|||
|
|
|
|||
|
|
- ✅ **Boundary handling**:
|
|||
|
|
- `Bounds()` returns support `[0, N1·N2]`.
|
|||
|
|
- `Step() = 0.5`, reflecting U’s discrete unit in tied cases.
|
|||
|
|
|
|||
|
|
## Algorithm Notes
|
|||
|
|
|
|||
|
|
- `p(U)` uses a 2D DP table (rows = *n*, columns = U), computing only necessary states.
|
|||
|
|
- `makeUmemo` builds a 3D memoization table (`k`, `n1`, `2U`) for tied distributions.
|
|||
|
|
- Performance bottlenecks noted in comments (e.g., map overhead) suggest future optimization paths.
|
|||
|
|
|
|||
|
|
## Use Case
|
|||
|
|
|
|||
|
|
Enables exact *p*-value calculation for the **Mann-Whitney U test**, especially valuable when:
|
|||
|
|
- Sample sizes are small-to-moderate (exact methods needed).
|
|||
|
|
- Data contain ties.
|