{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "25057a9b",
   "metadata": {},
   "source": [
    "# Payday Borrower Pulse 2026 — Replication Notebook\n",
    "\n",
    "This notebook re-derives every headline statistic from the\n",
    "Methodology Report (`borrower-pulse-2026-methodology.pdf`) using only the\n",
    "public data artifacts. No proprietary dependencies.\n",
    "\n",
    "**Inputs:**\n",
    "- `borrower-pulse-2026.csv` — aggregate state-level cuts (~124 KB)\n",
    "- `borrower-pulse-2026.json` — individual-response, anonymized (~4.8 MB)\n",
    "\n",
    "**Dependencies:** `pandas>=2.0`, `numpy>=1.24` (standard library otherwise).\n",
    "\n",
    "**License:** CC-BY-4.0 — free to adapt with attribution to\n",
    "Big Daddy Loans Research, LLC.\n",
    "\n",
    "**Citation:**\n",
    "> Big Daddy Loans Research, LLC. (2026). Payday Borrower Pulse Survey 2026\n",
    "> [Data set]. https://bigdaddy-loans.com/research/borrower-pulse-2026/\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e4a818f",
   "metadata": {},
   "source": [
    "## 1. Imports and configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04f220c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Config\n",
    "CSV_PATH = \"borrower-pulse-2026.csv\"\n",
    "JSON_PATH = \"borrower-pulse-2026.json\"\n",
    "\n",
    "# Anchors from the Methodology Report Section 1 (Executive summary)\n",
    "ANCHORS = {\n",
    "    \"would_do_again_yes_pct\": 36,\n",
    "    \"would_do_again_no_pct\":  48,\n",
    "    \"would_do_again_unsure_pct\": 16,\n",
    "    \"wished_known_pal_pct\":   62,\n",
    "    \"wished_known_epp_pct\":   54,\n",
    "    \"wished_known_ewa_pct\":   47,\n",
    "}\n",
    "\n",
    "pd.set_option(\"display.max_columns\", 50)\n",
    "pd.set_option(\"display.width\", 200)\n",
    "print(\"Setup complete.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c5bec52",
   "metadata": {},
   "source": [
    "## 2. Load aggregate CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fd5db44",
   "metadata": {},
   "outputs": [],
   "source": [
    "agg = pd.read_csv(CSV_PATH)\n",
    "print(f\"Shape: {agg.shape}\")\n",
    "print(f\"Cut dimensions: {sorted(agg.cut_dimension.unique())}\")\n",
    "print(f\"Row counts per cut dimension:\")\n",
    "print(agg.cut_dimension.value_counts().sort_index().to_string())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c746d3e5",
   "metadata": {},
   "source": [
    "## 3. Load individual-response JSON\\n\\nLarger file — first read happens in 1-2 seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8654738",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(JSON_PATH) as f:\n",
    "    payload = json.load(f)\n",
    "\n",
    "ind = pd.DataFrame(payload[\"responses\"])\n",
    "print(f\"Total respondents: {len(ind):,}\")\n",
    "print(f\"Fields per record: {ind.shape[1]}\")\n",
    "print()\n",
    "print(\"Field completeness (% non-null):\")\n",
    "print((ind.notna().mean() * 100).round(1).to_string())\n",
    "print()\n",
    "print(\"Dataset metadata:\")\n",
    "print(json.dumps(payload[\"dataset\"][\"fields\"], indent=2)[:600] + \"...\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4dbefd6",
   "metadata": {},
   "source": [
    "## 4. Compute weighted national headline statistics\\n\\nReproduce the 36 / 48 / 16 would-do-again anchor and the 62 / 54 / 47 wished-known anchors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e9b70d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "W = ind[\"weight_factor\"].sum()\n",
    "\n",
    "def weighted_pct(condition):\n",
    "    return 100 * (condition * ind[\"weight_factor\"]).sum() / W\n",
    "\n",
    "print(\"=== Headline figures (weighted, all respondents) ===\")\n",
    "print()\n",
    "for cat in (\"yes\", \"no\", \"unsure\"):\n",
    "    pct = weighted_pct(ind[\"would_do_again\"] == cat)\n",
    "    anchor = ANCHORS[f\"would_do_again_{cat}_pct\"]\n",
    "    print(f\"would_do_again_{cat:6}: {pct:5.1f}% (anchor: {anchor}%, drift: {pct - anchor:+.1f})\")\n",
    "\n",
    "print()\n",
    "for k in (\"pal\", \"epp\", \"ewa\"):\n",
    "    pct = weighted_pct(ind[f\"wished_known_{k}\"])\n",
    "    anchor = ANCHORS[f\"wished_known_{k}_pct\"]\n",
    "    print(f\"wished_known_{k:3}: {pct:5.1f}% (anchor: {anchor}%, drift: {pct - anchor:+.1f})\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1c65589",
   "metadata": {},
   "source": [
    "## 5. Crosstab: would-do-again by rollover status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45acd1e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rollover_band(n):\n",
    "    if n == 0: return \"avoided_rollover\"\n",
    "    if n <= 3: return \"rolled_1_to_3\"\n",
    "    return \"rolled_4_or_more\"\n",
    "\n",
    "ind[\"_rb\"] = ind[\"num_rollovers\"].apply(rollover_band)\n",
    "\n",
    "# Weighted crosstab\n",
    "def weighted_xtab(group_col, value_col):\n",
    "    out = (ind.assign(_w=ind[\"weight_factor\"])\n",
    "           .groupby([group_col, value_col])[\"_w\"].sum()\n",
    "           .unstack(fill_value=0))\n",
    "    return (out.div(out.sum(axis=1), axis=0) * 100).round(1)\n",
    "\n",
    "xt = weighted_xtab(\"_rb\", \"would_do_again\")\n",
    "print(\"Would-do-again % by rollover-status (weighted):\")\n",
    "print(xt.to_string())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "501b3a6b",
   "metadata": {},
   "source": [
    "## 6. Crosstab: would-do-again by state-legality status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1dec2c41",
   "metadata": {},
   "outputs": [],
   "source": [
    "BANNED = {\"CT\", \"DC\", \"GA\", \"MD\", \"MA\", \"NJ\", \"NY\", \"NC\", \"PA\", \"VT\", \"WV\"}\n",
    "CAPPED = {\"AZ\", \"CO\", \"IL\", \"MN\", \"MT\", \"OR\", \"SD\", \"VA\"}\n",
    "\n",
    "def legality(state):\n",
    "    if state in BANNED: return \"banned\"\n",
    "    if state in CAPPED: return \"capped_36apr\"\n",
    "    return \"legal_unrestricted\"\n",
    "\n",
    "ind[\"_legality\"] = ind[\"state\"].apply(legality)\n",
    "xt = weighted_xtab(\"_legality\", \"would_do_again\")\n",
    "print(\"Would-do-again % by state legality status (weighted):\")\n",
    "print(xt.to_string())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1e2b503",
   "metadata": {},
   "source": [
    "## 7. Wished-known PAL/EPP/EWA — broken out by rollover-experience"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb6b30a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "for alt in (\"pal\", \"epp\", \"ewa\"):\n",
    "    print(f\"=== wished_known_{alt} by rollover status ===\")\n",
    "    out = (ind.assign(_w=ind[\"weight_factor\"])\n",
    "           .groupby(\"_rb\")\n",
    "           .apply(lambda g: 100 * (g[f\"wished_known_{alt}\"] * g[\"_w\"]).sum() / g[\"_w\"].sum())\n",
    "           .round(1))\n",
    "    print(out.to_string())\n",
    "    print()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "778f8f92",
   "metadata": {},
   "source": [
    "## 8. State-level reconstruction\n",
    "\n",
    "Aggregate the individual-response JSON manually, then compare against the\n",
    "published aggregate CSV. They should agree to within rounding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3585faf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def state_summary(df):\n",
    "    w = df[\"weight_factor\"]\n",
    "    W = w.sum()\n",
    "    return pd.Series({\n",
    "        \"n\": len(df),\n",
    "        \"would_do_again_yes_pct\": round(100 * ((df[\"would_do_again\"] == \"yes\") * w).sum() / W, 1),\n",
    "        \"fell_into_rollover_pct\": round(100 * ((df[\"num_rollovers\"] >= 1) * w).sum() / W, 1),\n",
    "        \"median_loan_amount_usd\": int(df[\"loan_amount_usd\"].median()),\n",
    "        \"wished_known_pal_pct\": round(100 * (df[\"wished_known_pal\"] * w).sum() / W, 1),\n",
    "    })\n",
    "\n",
    "reconstructed = ind.groupby(\"state\").apply(state_summary).reset_index()\n",
    "print(\"Reconstructed top 10 states by N:\")\n",
    "print(reconstructed.nlargest(10, \"n\").to_string(index=False))\n",
    "\n",
    "# Compare against aggregate CSV overall rows\n",
    "overall = agg.query(\"cut_dimension == 'overall'\").set_index(\"state\")\n",
    "joined = reconstructed.merge(\n",
    "    overall[[\"n\", \"would_do_again_yes_pct\", \"fell_into_rollover_pct\"]],\n",
    "    on=\"state\", suffixes=(\"_calc\", \"_csv\"))\n",
    "print()\n",
    "print(\"Cross-validation (max drift across all states):\")\n",
    "print(f\"  n diff: {(joined['n_calc'] - joined['n_csv']).abs().max()}\")\n",
    "print(f\"  would_do_again_yes drift: {(joined['would_do_again_yes_pct_calc'] - joined['would_do_again_yes_pct_csv']).abs().max():.2f} pp\")\n",
    "print(f\"  fell_into_rollover drift: {(joined['fell_into_rollover_pct_calc'] - joined['fell_into_rollover_pct_csv']).abs().max():.2f} pp\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82be7aef",
   "metadata": {},
   "source": [
    "## 9. Margin-of-error (linearization)\\n\\nNational and per-state MoE on binary outcomes, 95% CI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1b64b03",
   "metadata": {},
   "outputs": [],
   "source": [
    "DEFF = 1.04  # design effect after weight trimming (see methodology Section 6)\n",
    "Z95 = 1.96\n",
    "\n",
    "def moe(p_pct, n):\n",
    "    p = p_pct / 100\n",
    "    se = np.sqrt(p * (1 - p) / (n / DEFF))\n",
    "    return round(Z95 * se * 100, 2)\n",
    "\n",
    "# National\n",
    "n_total = len(ind)\n",
    "p_yes = weighted_pct(ind[\"would_do_again\"] == \"yes\")\n",
    "print(f\"National would_do_again_yes: {p_yes:.1f}% ± {moe(p_yes, n_total)} pp (95% CI)\")\n",
    "\n",
    "# Per-state — top 5 + bottom 5 by N\n",
    "state_n = ind.groupby(\"state\").size().sort_values(ascending=False)\n",
    "print()\n",
    "print(\"Per-state MoE on would_do_again_yes (top 5 + bottom 5 by N):\")\n",
    "for st in list(state_n.head(5).index) + list(state_n.tail(5).index):\n",
    "    n = state_n[st]\n",
    "    p = weighted_pct((ind[\"state\"] == st) & (ind[\"would_do_again\"] == \"yes\")) * len(ind) / max(1, n)\n",
    "    print(f\"  {st}: N={n:5d}  ±{moe(p, n):.2f} pp\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "905fa50a",
   "metadata": {},
   "source": [
    "## 10. Extending the analysis — suggested next-step queries for researchers\n",
    "\n",
    "Some queries this notebook deliberately does NOT pre-compute, but which are\n",
    "straightforward extensions for researchers:\n",
    "\n",
    "**A. Lifetime-cost analysis** — multiply `fees_paid_usd` × `weight_factor`,\n",
    "   then sum by `state` and divide by state population to get\n",
    "   per-capita fee burden by state. Compare against state-level APR caps.\n",
    "\n",
    "**B. Wished-known propensity model** — logistic regression of `wished_known_pal`\n",
    "   on demographics + rollover history. The hypothesis: lower-income borrowers\n",
    "   are MORE likely to wish they had known, because they had less margin to\n",
    "   absorb the cost.\n",
    "\n",
    "**C. Channel-channel comparison** — most statistics in the public release are\n",
    "   weighted national/state aggregates. Subgroup the individual JSON by\n",
    "   `recruitment_source` and compare answer distributions. The nonprofit-channel\n",
    "   subset is the most distress-skewed; the intercept-storefront subset is the\n",
    "   most loyal-customer-skewed. The differences between them carry information.\n",
    "\n",
    "**D. Survey-of-surveys reconciliation** — for any statistic also published\n",
    "   by Pew, FDIC, or CFPB, compute the Pulse 2026 estimate and the prior\n",
    "   estimate side-by-side with their respective MoEs. The differences\n",
    "   (a) flag potential bias in either source, and (b) document drift over\n",
    "   the years between fieldings.\n",
    "\n",
    "**E. Disclosure-controls validation** — the public CSV suppresses cells with\n",
    "   N < 30. Verify this from the individual JSON and confirm that no\n",
    "   reconstructable suppressed values can be derived from the published\n",
    "   marginals (k-anonymity check).\n",
    "\n",
    "For methodology questions or replication issues, contact\n",
    "`research@bigdaddy-loans.com`.\n",
    "\n",
    "---\n",
    "\n",
    "*End of replication notebook. Last updated 22 May 2026.*\n"
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "Big Daddy Loans Research, LLC"
   }
  ],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10+"
  },
  "title": "Payday Borrower Pulse 2026 — Replication Notebook"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}