OBIJupyterHub/jupyterhub_volumes/course/biodiversity_metrics_exercises.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "90c78109",
   "metadata": {},
   "source": [
    "# Biodiversity Metrics — Hands-on Practice\n",
    "\n",
    "Follow this guided lab to reproduce the workflow presented in the *Biodiversity metrics & metabarcoding* lecture. Each exercise mirrors a concept from the slides so that you can redo the analyses step by step. Work in order, execute your own solution in the placeholder cell, and reveal the provided answer only after you tried.\n",
    "\n",
    "**How to use this notebook**\n",
    "\n",
    "- Run the notebook with an R kernel.\n",
    "- Complete each code task in the `# your work here` cell.\n",
    "- Compare with the hidden solution when you are confident in your attempt.\n",
    "- Keep notes about questions or observations so that we can review them during lab time."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c4e6e7a",
   "metadata": {},
   "source": [
    "## Series 1 – Preparing the Environment\n",
    "Focus: installing and loading every package used throughout the lecture (devtools, tidyverse, vegan, latex2exp, kableExtra, MetabarSchool)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10e073bb",
   "metadata": {},
   "source": [
    "### Exercise 1.1 – Install the required packages\n",
    "\n",
    "Install *devtools*, *tidyverse*, *vegan*, *latex2exp*, and *kableExtra* from CRAN if they are not already available. Then install **MetabarSchool** from the Git repository shown in the lecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58323009",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58073d0f",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "if (!requireNamespace('devtools', quietly = TRUE)) install.packages('devtools', dependencies = TRUE)\n",
    "if (!requireNamespace('tidyverse', quietly = TRUE)) install.packages('tidyverse', dependencies = TRUE)\n",
    "if (!requireNamespace('vegan', quietly = TRUE)) install.packages('vegan', dependencies = TRUE)\n",
    "if (!requireNamespace('latex2exp', quietly = TRUE)) install.packages('latex2exp', dependencies = TRUE)\n",
    "if (!requireNamespace('kableExtra', quietly = TRUE)) install.packages('kableExtra', dependencies = TRUE)\n",
    "if (!requireNamespace('MetabarSchool', quietly = TRUE)) {\n",
    "  devtools::install_git('https://git.metabarcoding.org/MetabarcodingSchool/biodiversity-metrics.git')\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "698da8cf",
   "metadata": {},
   "source": [
    "### Exercise 1.2 – Load every library\n",
    "\n",
    "Attach tidyverse, vegan, latex2exp, kableExtra, and MetabarSchool so that their functions are available for the rest of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2719c43e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01a16266",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "library(tidyverse)\n",
    "library(vegan)\n",
    "library(latex2exp)\n",
    "library(kableExtra)\n",
    "library(MetabarSchool)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38972d40",
   "metadata": {},
   "source": [
    "## Series 2 – Exploring the mock community dataset\n",
    "Focus: loading the `positive.*` objects that describe the mock community and understanding their structure."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9d6eb01",
   "metadata": {},
   "source": [
    "### Exercise 2.1 – Load the mock community data\n",
    "\n",
    "Load `positive.count`, `positive.samples`, and `positive.motus` from the MetabarSchool package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45c6e967",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd1f426f",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "data('positive.count')\n",
    "data('positive.samples')\n",
    "data('positive.motus')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5782ad96",
   "metadata": {},
   "source": [
    "### Exercise 2.2 – Inspect matrices and metadata\n",
    "\n",
    "Display the dimensions of `positive.count`, the first rows of `positive.samples`, and the first rows of `positive.motus` to replicate the introductory slides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a1470bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f150200",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "dim(positive.count)\n",
    "head(positive.samples, n = 3)\n",
    "head(positive.motus, n = 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed7b50cc",
   "metadata": {},
   "source": [
    "## Series 3 – Early data curation\n",
    "Focus: removing singleton MOTUs and visualizing the uneven sequencing depth highlighted in the lecture."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "566f88ae",
   "metadata": {},
   "source": [
    "### Exercise 3.1 – Remove singleton MOTUs\n",
    "\n",
    "Count how many MOTUs occur once across the full dataset and filter them out of both `positive.count` and `positive.motus`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ac5a29b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39a96b44",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "singleton.mask <- colSums(positive.count) > 1\n",
    "table(singleton.mask)\n",
    "positive.count <- positive.count[, singleton.mask]\n",
    "positive.motus <- positive.motus[singleton.mask, ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "433e47fe",
   "metadata": {},
   "source": [
    "### Exercise 3.2 – Examine read depth per PCR\n",
    "\n",
    "Plot the distribution of total reads per PCR (log scale optional) to confirm that sequencing depth is heterogeneous."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e72af69",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a418577",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "par(bg = NA)\n",
    "hist(rowSums(positive.count), breaks = 15,\n",
    "     xlab = 'Read counts', main = 'Number of reads per PCR')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b1d2501",
   "metadata": {},
   "source": [
    "### Exercise 3.3 – Test the influence of dilution\n",
    "\n",
    "Reproduce the boxplot of read counts across dilution levels and compute the ANOVA that quantifies the fraction of variance explained by dilution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a27498aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3e44168",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "par(bg = NA)\n",
    "boxplot(rowSums(positive.count) ~ positive.samples$dilution, log = 'y',\n",
    "        xlab = 'Dilution', ylab = 'Total reads')\n",
    "aov_table <- summary(aov(rowSums(positive.count) ~ positive.samples$dilution))[[1]]\n",
    "round((aov_table['positive.samples$dilution', 'Sum Sq'] / sum(aov_table[, 'Sum Sq'])) * 100, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d26b97bb",
   "metadata": {},
   "source": [
    "## Series 4 – Rarefaction workflow\n",
    "Focus: rarefying all PCRs to the same depth and documenting which MOTUs are retained."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7215a84e",
   "metadata": {},
   "source": [
    "### Exercise 4.1 – Rarefy down to 2,000 reads\n",
    "\n",
    "Find the minimum sequencing depth across PCRs and run `rrarefy()` so that every PCR contains exactly 2,000 reads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5b630ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9396d525",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "min_reads <- min(rowSums(positive.count))\n",
    "min_reads\n",
    "positive.count.rarefied <- rrarefy(positive.count, sample = 2000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9ffa199",
   "metadata": {},
   "source": [
    "### Exercise 4.2 – Track retained MOTUs\n",
    "\n",
    "Identify which MOTUs have non-zero counts after rarefaction. Keep those columns in the rarefied count matrix as well as in `positive.motus`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7ca495f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d64a8f53",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "are.still.present <- colSums(positive.count.rarefied) > 0\n",
    "table(are.still.present)\n",
    "positive.count.rarefied <- positive.count.rarefied[, are.still.present]\n",
    "positive.motus.rare <- positive.motus[are.still.present, ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91c3d626",
   "metadata": {},
   "source": [
    "### Exercise 4.3 – Compare distributions before and after\n",
    "\n",
    "Use histograms or boxplots to compare the MOTU read-count distribution before and after rarefaction, as shown on the slides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58aa7960",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3acddc91",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "par(mfrow = c(1, 2), bg = NA)\n",
    "hist(log10(colSums(positive.count) + 1), breaks = 30,\n",
    "     xlab = expression(log[10](reads~per~MOTU)), main = 'Not rarefied',\n",
    "     xlim = c(0, 6), ylim = c(0, 2300))\n",
    "hist(log10(colSums(positive.count.rarefied) + 1), breaks = 30,\n",
    "     xlab = expression(log[10](reads~per~MOTU)), main = 'Rarefied data',\n",
    "     xlim = c(0, 6), ylim = c(0, 2300))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8356f83d",
   "metadata": {},
   "source": [
    "## Series 5 – Relative frequencies\n",
    "Focus: transforming counts to relative abundances so that all PCRs sum to one."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "093a4bdc",
   "metadata": {},
   "source": [
    "### Exercise 5.1 – Compute relative frequencies\n",
    "\n",
    "Apply `decostand(..., method = \"total\")` to convert the raw `positive.count` matrix into `positive.count.relfreq`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d04a1702",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84051b2e",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "positive.count.relfreq <- decostand(positive.count, method = 'total')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac101584",
   "metadata": {},
   "source": [
    "### Exercise 5.2 – Check for empty MOTUs\n",
    "\n",
    "Verify that no MOTU sums to zero after the relative-frequency transformation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6dc22c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "432d3736",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "table(colSums(positive.count.relfreq) == 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c98dd7aa",
   "metadata": {},
   "source": [
    "## Series 6 – Alpha-diversity on toy environments\n",
    "Focus: rehearse richness, Gini–Simpson, Shannon entropy, and Hill numbers exactly as on the lecture example."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5619e9b0",
   "metadata": {},
   "source": [
    "### Exercise 6.1 – Encode the two environments and compute richness\n",
    "\n",
    "Create the two probability vectors from the slides (E1 and E2), bind them into a matrix named `environments`, and compute richness `S` for each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1745870a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ae56e8d",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "E1 <- c(A = 0.25, B = 0.25, C = 0.25, D = 0.25, E = 0, F = 0, G = 0)\n",
    "E2 <- c(A = 0.55, B = 0.07, C = 0.02, D = 0.17, E = 0.07, F = 0.07, G = 0.03)\n",
    "environments <- rbind(`Environment 1` = E1, `Environment 2` = E2)\n",
    "S <- rowSums(environments > 0)\n",
    "S"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8957a397",
   "metadata": {},
   "source": [
    "### Exercise 6.2 – Compute Gini–Simpson and Shannon indices\n",
    "\n",
    "Use the formulas from the slides to compute Gini–Simpson (`1 - sum(p^2)`) and Shannon entropy (`-sum(p * log(p))`) for each environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf9e5cb8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e001766e",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "GiniSimpson <- 1 - rowSums(environments^2)\n",
    "Shannon <- -rowSums(environments * log(environments), na.rm = TRUE)\n",
    "data.frame(`Gini-Simpson` = GiniSimpson, `Shannon` = Shannon)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0b46c2c",
   "metadata": {},
   "source": [
    "### Exercise 6.3 – Convert Shannon entropy to Hill numbers\n",
    "\n",
    "Follow the lecture by transforming Shannon entropy to Hill numbers via `exp(H)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "597ffb15",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48b56af5",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "Hill <- exp(-rowSums(environments * log(environments), na.rm = TRUE))\n",
    "Hill"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f83fda9d",
   "metadata": {},
   "source": [
    "## Series 7 – Generalized entropy and biodiversity spectra\n",
    "Focus: implementing the `log_q`, `exp_q`, `H_q`, and `D_q` helpers and generating diversity spectra just like on the slides."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4827495",
   "metadata": {},
   "source": [
    "### Exercise 7.1 – Implement the generalized log and exponential\n",
    "\n",
    "Write the `log_q()` and `exp_q()` functions introduced in the lecture. They should fall back to the natural log/exp when `q = 1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91876c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8dc09ddc",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "log_q <- function(x, q = 1) {\n",
    "  if (q == 1) {\n",
    "    log(x)\n",
    "  } else {\n",
    "    (x^(1 - q) - 1) / (1 - q)\n",
    "  }\n",
    "}\n",
    "exp_q <- function(x, q = 1) {\n",
    "  if (q == 1) {\n",
    "    exp(x)\n",
    "  } else {\n",
    "    (1 + (1 - q) * x)^(1 / (1 - q))\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0842dfd5",
   "metadata": {},
   "source": [
    "### Exercise 7.2 – Build entropy and diversity spectra\n",
    "\n",
    "Define `H_q()`, `D_q()`, `H_spectrum()`, and `D_spectrum()`, then compute the spectra for the toy environments over `q = seq(0, 3, by = 0.1)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e4262b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5addf0f",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "H_q <- function(x, q = 1) {\n",
    "  sum(x * log_q(1 / x, q), na.rm = TRUE)\n",
    "}\n",
    "D_q <- function(x, q = 1) {\n",
    "  exp_q(H_q(x, q), q)\n",
    "}\n",
    "H_spectrum <- function(x, q = 1) {\n",
    "  sapply(q, function(Q) H_q(x, Q))\n",
    "}\n",
    "D_spectrum <- function(x, q = 1) {\n",
    "  sapply(q, function(Q) D_q(x, Q))\n",
    "}\n",
    "qs <- seq(0, 3, by = 0.1)\n",
    "environments.H <- apply(environments, MARGIN = 1, H_spectrum, q = qs)\n",
    "environments.D <- apply(environments, MARGIN = 1, D_spectrum, q = qs)\n",
    "list(H = environments.H, D = environments.D)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c294ed9e",
   "metadata": {},
   "source": [
    "### Exercise 7.3 – Compare spectra for the mock community\n",
    "\n",
    "Use the `plants.16` dataset and the mock-community relative frequencies to compute `H.mock`, `D.mock`, and the distributions of `positive.H` and `positive.D` shown on the slides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e86de7d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "accb758e",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "data('plants.16')\n",
    "positive.count.relfreq <- decostand(positive.count, method = 'total')\n",
    "H.mock <- H_spectrum(plants.16$dilution, q = qs)\n",
    "D.mock <- D_spectrum(plants.16$dilution, q = qs)\n",
    "positive.H <- apply(positive.count.relfreq, MARGIN = 1, H_spectrum, q = qs)\n",
    "positive.D <- apply(positive.count.relfreq, MARGIN = 1, D_spectrum, q = qs)\n",
    "list(H.mock = H.mock, D.mock = D.mock, positive.H = positive.H, positive.D = positive.D)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54760c35",
   "metadata": {},
   "source": [
    "## Series 8 – Impact of extra cleaning on α-diversity\n",
    "Focus: repeating the spectral computation on the cleaned dataset (`positive.clean.count`)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed7211a9",
   "metadata": {},
   "source": [
    "### Exercise 8.1 – Load the cleaned counts and normalize\n",
    "\n",
    "Load `positive.clean.count`, convert it to relative frequencies, and store the result in `positive.clean.count.relfreq`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05cf6f64",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46de8a61",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "data('positive.clean.count')\n",
    "positive.clean.count.relfreq <- decostand(positive.clean.count, method = 'total')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15087b1c",
   "metadata": {},
   "source": [
    "### Exercise 8.2 – Compute entropy and diversity spectra\n",
    "\n",
    "Apply `H_spectrum()` and `D_spectrum()` to every cleaned PCR (using the same `qs`) and summarize the distributions (e.g., via `rowMeans` or boxplots)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccca62a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "805ebb6f",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "positive.clean.H <- apply(positive.clean.count.relfreq, MARGIN = 1, H_spectrum, q = qs)\n",
    "positive.clean.D <- apply(positive.clean.count.relfreq, MARGIN = 1, D_spectrum, q = qs)\n",
    "positive.clean.H.means <- rowMeans(positive.clean.H)\n",
    "positive.clean.D.means <- rowMeans(positive.clean.D)\n",
    "list(H = positive.clean.H, D = positive.clean.D,\n",
    "     H.means = positive.clean.H.means,\n",
    "     D.means = positive.clean.D.means)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ada1e0c9",
   "metadata": {},
   "source": [
    "## Series 9 – β-diversity: Guiana rainforest case study\n",
    "Focus: cleaning replicated PCRs, averaging them, and building several distance matrices on the Guiana dataset exactly like on the slides."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9799240",
   "metadata": {},
   "source": [
    "### Exercise 9.1 – Load Guiana data and flag bad PCRs\n",
    "\n",
    "Load `guiana.count`, `guiana.samples`, and `guiana.motus`. Then repeat the `tag_bad_pcr()` filtering three times so that only reliable PCRs remain in `guiana.count.clean` and `guiana.samples.clean`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6a20527",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2665255a",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "data('guiana.count')\n",
    "data('guiana.samples')\n",
    "data('guiana.motus')\n",
    "guiana.count.clean <- guiana.count\n",
    "guiana.samples.clean <- guiana.samples\n",
    "for (cycle in 1:3) {\n",
    "  s <- tag_bad_pcr(guiana.samples.clean$sample, guiana.count.clean)\n",
    "  guiana.count.clean <- guiana.count.clean[s$keep, ]\n",
    "  guiana.samples.clean <- guiana.samples.clean[s$keep, ]\n",
    "}\n",
    "table(s$keep)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "883353c4",
   "metadata": {},
   "source": [
    "### Exercise 9.2 – Average PCR replicates and retain real samples\n",
    "\n",
    "Bind the `tag_bad_pcr` summary to `guiana.samples.clean`, average counts by biological sample, keep one metadata line per sample, and drop rows without a site identifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "494cedda",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e3a5435",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "guiana.samples.clean <- cbind(guiana.samples.clean, s[rownames(guiana.samples.clean), ])\n",
    "guiana.count.mean <- aggregate(decostand(guiana.count.clean, method = 'total'),\n",
    "                              by = list(guiana.samples.clean$sample), FUN = mean)\n",
    "rownames(guiana.count.mean) <- guiana.count.mean[, 1]\n",
    "guiana.count.mean <- as.matrix(guiana.count.mean[, -1])\n",
    "guiana.samples.mean <- aggregate(guiana.samples.clean,\n",
    "                                by = list(guiana.samples.clean$sample),\n",
    "                                FUN = function(i) i[1])\n",
    "rownames(guiana.samples.mean) <- guiana.samples.mean[, 1]\n",
    "guiana.samples.mean <- guiana.samples.mean[, -1]\n",
    "guiana.samples.final <- guiana.samples.mean[!is.na(guiana.samples.mean$site_id), ]\n",
    "guiana.count.final <- guiana.count.mean[rownames(guiana.samples.final), ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7daae74c",
   "metadata": {},
   "source": [
    "### Exercise 9.3 – Create transformed matrices\n",
    "\n",
    "Compute the Hellinger transformation, relative frequencies, and presence/absence thresholds (0.1%, 1%, and 5%) used later in the slides."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78512528",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e3c43e4",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "guiana.hellinger.final <- decostand(guiana.count.final, method = 'hellinger')\n",
    "guiana.relfreq.final <- decostand(guiana.count.final, method = 'total')\n",
    "guiana.presence.1.final <- guiana.relfreq.final > 0.001\n",
    "guiana.presence.10.final <- guiana.relfreq.final > 0.01\n",
    "guiana.presence.50.final <- guiana.relfreq.final > 0.05"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e36cfd3",
   "metadata": {},
   "source": [
    "### Exercise 9.4 – Compute distance matrices\n",
    "\n",
    "Using `vegdist()`, reproduce the Bray–Curtis, Euclidean (on Hellinger), and Jaccard (three thresholds) distances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a17fc33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "952cf1b7",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "guiana.bc.dist <- vegdist(guiana.relfreq.final, method = 'bray')\n",
    "guiana.euc.dist <- vegdist(guiana.hellinger.final, method = 'euclidean')\n",
    "guiana.jac.1.dist <- vegdist(guiana.presence.1.final, method = 'jaccard')\n",
    "guiana.jac.10.dist <- vegdist(guiana.presence.10.final, method = 'jaccard')\n",
    "guiana.jac.50.dist <- vegdist(guiana.presence.50.final, method = 'jaccard')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dce2f2e",
   "metadata": {},
   "source": [
    "## Series 10 – Ordination and diversity comparisons\n",
    "Focus: recreating the ordination figures and the final comparison of α-diversity spectra across sites/materials."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37a99ef4",
   "metadata": {},
   "source": [
    "### Exercise 10.1 – Run principal coordinate analyses\n",
    "\n",
    "Apply `cmdscale()` (with `k = 3`) to every distance matrix and store the resulting objects for plotting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ceafe4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c457d01",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "guiana.bc.pcoa <- cmdscale(guiana.bc.dist, k = 3, eig = TRUE)\n",
    "guiana.euc.pcoa <- cmdscale(guiana.euc.dist, k = 3, eig = TRUE)\n",
    "guiana.jac.1.pcoa <- cmdscale(guiana.jac.1.dist, k = 3, eig = TRUE)\n",
    "guiana.jac.10.pcoa <- cmdscale(guiana.jac.10.dist, k = 3, eig = TRUE)\n",
    "guiana.jac.50.pcoa <- cmdscale(guiana.jac.50.dist, k = 3, eig = TRUE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4becf772",
   "metadata": {},
   "source": [
    "### Exercise 10.2 – Compare Euclidean PCoA and PCA\n",
    "\n",
    "Create the `samples.type` factor (Material × Site), plot the Euclidean PCoA (first two axes), and run a PCA on the Hellinger-transformed data to mirror the lecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "078deace",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76d6a234",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "samples.type <- interaction(guiana.samples.final$Material,\n",
    "                                 guiana.samples.final$Site, drop = FALSE)\n",
    "par(mfrow = c(1, 2), bg = NA)\n",
    "plot(guiana.euc.pcoa$points[, 1:2], col = samples.type, asp = 1,\n",
    "     xlab = 'Axis 1', ylab = 'Axis 2', main = 'Euclidean on Hellinger')\n",
    "guiana.hellinger.pca <- prcomp(guiana.hellinger.final, center = TRUE, scale. = FALSE)\n",
    "plot(guiana.hellinger.pca$x[, 1:2], col = samples.type, asp = 1,\n",
    "     xlab = 'Axis 1', ylab = 'Axis 2', main = 'PCA on Hellinger data')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "422dcd81",
   "metadata": {},
   "source": [
    "### Exercise 10.3 – Compare diversity spectra by habitat\n",
    "\n",
    "Apply `H_spectrum()` to each averaged Guiana sample (using the same `qs`) and reproduce the boxplots comparing Mana vs Petit Plateau and Litter vs Soil."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "176f0cb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your work here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d26e3cd4",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true,
     "source_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "guiana.relfreq.spectra <- apply(guiana.relfreq.final, MARGIN = 1, H_spectrum, q = qs)\n",
    "par(mfrow = c(2, 2), bg = NA)\n",
    "boxplot(t(guiana.relfreq.spectra[, samples.type == 'litter.Mana']), log = 'y',\n",
    "        names = qs, las = 2, main = 'Mana litter', ylab = expression(''^q * H))\n",
    "boxplot(t(guiana.relfreq.spectra[, samples.type == 'soil.Mana']), log = 'y',\n",
    "        names = qs, las = 2, main = 'Mana soil', ylab = expression(''^q * H))\n",
    "boxplot(t(guiana.relfreq.spectra[, samples.type == 'litter.Petit Plateau']), log = 'y',\n",
    "        names = qs, las = 2, main = 'Petit Plateau litter', ylab = expression(''^q * H))\n",
    "boxplot(t(guiana.relfreq.spectra[, samples.type == 'soil.Petit Plateau']), log = 'y',\n",
    "        names = qs, las = 2, main = 'Petit Plateau soil', ylab = expression(''^q * H))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "name": "R"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}