{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Factorial DOE: oil-company experiment\n",
    "\n",
    "Industrial designed experiment from an oil company. Four blend\n",
    "components are varied across 19 runs to improve volumetric heat\n",
    "capacity. Three of the components (A, B, D) are continuous and were\n",
    "varied in coded units; the fourth (C) is a binary factor: present or\n",
    "absent. The aim is to identify which factors and interactions actually\n",
    "drive the response so that future experimentation can focus there.\n",
    "\n",
    "**Data.** `oil-company-doe.csv` from\n",
    "[openmv.net](https://openmv.net/info/oil-company-doe). 19 rows, five\n",
    "columns: `A`, `B`, `C`, `D`, `y`.\n",
    "\n",
    "**What we do.** Run the package's `analyze_experiment` with an\n",
    "interactions model, read the coefficient table and ANOVA, build a\n",
    "Pareto plot of standardised effects, and check the residual\n",
    "diagnostics.\n",
    "\n",
    "**Adapted from** the *Design and analysis of experiments* chapter of\n",
    "the [Process Improvement using Data](https://learnche.org/pid) book\n",
    "(CC BY-SA 4.0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from process_improve.experiments.analysis import analyze_experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and inspect"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://openmv.net/file/oil-company-doe.csv\"\n",
    "doe = pd.read_csv(url, index_col=0)\n",
    "if doe.shape[1] < 5:\n",
    "    doe = pd.read_csv(url)\n",
    "doe.columns = [c.strip() for c in doe.columns]\n",
    "print(f\"Shape: {doe.shape}\")\n",
    "print(f\"Columns: {list(doe.columns)}\")\n",
    "doe.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doe.describe(include=\"all\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit the interactions model\n",
    "\n",
    "`analyze_experiment` accepts a design DataFrame and the name of the\n",
    "response column, builds a formula, fits it via statsmodels, and\n",
    "returns a dictionary keyed by the requested analysis types. With\n",
    "`model=\"interactions\"` we ask for main effects plus all two-factor\n",
    "interactions; that is 4 + 6 = 10 terms, which fits comfortably in 19\n",
    "runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = analyze_experiment(\n",
    "    doe,\n",
    "    response_column=\"y\",\n",
    "    model=\"interactions\",\n",
    "    analysis_type=[\n",
    "        \"coefficients\",\n",
    "        \"anova\",\n",
    "        \"effects\",\n",
    "        \"significance\",\n",
    "        \"residual_diagnostics\",\n",
    "    ],\n",
    ")\n",
    "ms = result[\"model_summary\"]\n",
    "print(f\"formula     : {ms['formula']}\")\n",
    "print(f\"R2          : {ms['r_squared']:.3f}\")\n",
    "print(f\"R2 (adj)    : {ms['r_squared_adj']:.3f}\")\n",
    "print(f\"R2 (pred)   : {ms['r_squared_pred']:.3f}\")\n",
    "print(f\"obs / df_resid: {ms['n_obs']} / {ms['df_residual']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Coefficient table\n",
    "\n",
    "Sort by absolute t-value to put the most impactful terms at the top.\n",
    "Coefficients with a small p-value are the ones that the data picks out\n",
    "as important; coefficients whose confidence interval straddles zero\n",
    "are not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coef_df = pd.DataFrame(result[\"coefficients\"]).set_index(\"term\")\n",
    "coef_df = coef_df.reindex(coef_df[\"t_value\"].abs().sort_values(ascending=False).index)\n",
    "coef_df.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ANOVA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "anova = pd.DataFrame(result[\"anova_table\"]).set_index(\"source\")\n",
    "anova.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pareto plot of standardised effects\n",
    "\n",
    "Dividing each effect estimate by its standard error gives a unitless\n",
    "comparison of how strong each term is relative to the noise. Plotting\n",
    "the absolute values as a horizontal bar chart from largest to smallest\n",
    "is the classical Pareto plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "effects = pd.Series(result[\"effects\"], dtype=float)\n",
    "errors = pd.Series(result[\"effect_std_errors\"], dtype=float).reindex(effects.index)\n",
    "standardised = (effects / errors).abs().sort_values()\n",
    "\n",
    "_fig, ax = plt.subplots(figsize=(7, 0.35 * max(4, len(standardised)) + 1.0))\n",
    "ax.barh(standardised.index, standardised.values, color=\"#1f77b4\")\n",
    "ax.axvline(2, color=\"r\", linestyle=\"--\", linewidth=1, label=\"|t| = 2 (rough significance)\")\n",
    "ax.set_xlabel(\"|effect / standard error|\")\n",
    "ax.set_title(\"Pareto of standardised effects\")\n",
    "ax.legend(loc=\"best\")\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Residual diagnostics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diag = result[\"residual_diagnostics\"]\n",
    "fitted = np.asarray(diag[\"fitted_values\"], dtype=float)\n",
    "resid = np.asarray(diag[\"residuals\"], dtype=float)\n",
    "\n",
    "_fig, axes = plt.subplots(1, 2, figsize=(9, 3.2))\n",
    "axes[0].scatter(fitted, resid, s=30, color=\"#1f77b4\")\n",
    "axes[0].axhline(0, color=\"k\", linewidth=0.7)\n",
    "axes[0].set_xlabel(\"Fitted\")\n",
    "axes[0].set_ylabel(\"Residual\")\n",
    "axes[0].set_title(\"Residuals vs fitted\")\n",
    "\n",
    "axes[1].hist(resid, bins=8, color=\"#1f77b4\", edgecolor=\"k\")\n",
    "axes[1].set_xlabel(\"Residual\")\n",
    "axes[1].set_title(\"Residual distribution\")\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"Shapiro-Wilk p-value : {diag['shapiro_wilk']['p_value']:.3f}\")\n",
    "print(f\"Durbin-Watson        : {diag['durbin_watson']:.2f}\")\n",
    "print(f\"Breusch-Pagan p-value: {diag['breusch_pagan']['p_value']:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What to try next\n",
    "\n",
    "- Drop the terms below the `|t| = 2` line and re-fit. The model R\n",
    "  squared will go down a touch but the adjusted R squared usually goes\n",
    "  up because the noise terms penalise the unadjusted statistic.\n",
    "- Pass `analysis_type=\"lack_of_fit\"` to ask whether the residual\n",
    "  variation is consistent with pure replication error, which tells you\n",
    "  whether a curvature term is missing.\n",
    "- For follow-up experimentation, the\n",
    "  :func:`process_improve.experiments.strategy.recommend_strategy`\n",
    "  function will propose where to run the next batch of experiments\n",
    "  given the coefficients you have already estimated."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}