diff --git a/docs/source/changelog.rst b/docs/source/changelog.rst
index 29ef137..56839b7 100644
--- a/docs/source/changelog.rst
+++ b/docs/source/changelog.rst
@@ -12,6 +12,73 @@ Version 4.3.2
 
 **ADDITIONS**
 
+    - New function `scqubits.recommend_parallelization(...)`: a
+      workload-aware heuristic that picks `num_cpus` and a per-worker
+      BLAS-thread cap from the Hilbert-space dimension, grid size,
+      eigenvalue count, and sparse-vs-dense regime. It applies the
+      choice live (no kernel restart) and starts no worker processes,
+      so it is safe to call from Jupyter and from plain scripts.
+      Sweep/spectrum methods accept `num_cpus="auto"` to tune
+      themselves before running, and `scqubits.settings.AUTO_PARALLEL`
+      (default `False`) makes calls that omit `num_cpus` do the same.
+      See the :ref:`settings guide <guide-settings>`.
+
+    - New function `scqubits.calibrate_parallelization()`: a one-time
+      measurement that times a short battery of sweeps in isolated
+      subprocesses and records this machine's per-task overhead,
+      pool-startup cost, and per-point diagonalization cost to
+      `~/.scqubits/parallel_calibration.json` (override with
+      `scqubits.settings.PARALLEL_CALIBRATION_PATH`). When present, the
+      recommendation uses this measured break-even instead of the
+      built-in defaults.
+
+    - Parallel sweeps now use the ``spawn`` process start method on
+      macOS (and Windows), and ``fork`` on Linux. Fork is unsafe on
+      macOS -- Apple's Accelerate/GCD and the Objective-C runtime are
+      not fork-safe, so forking a worker pool after the numerics have
+      started threads can crash or hang (CPython itself defaults macOS
+      to ``spawn`` since 3.8; this affects both Intel and Apple
+      Silicon). With ``spawn``, a plain script that uses ``num_cpus >
+      1`` must guard its entry point with ``if __name__ ==
+      "__main__":`` (Jupyter/IPython are unaffected; a one-time
+      reminder is emitted otherwise). The worker pool is cached and
+      reused, so the ``spawn`` startup cost is paid once per
+      session, not per sweep.
+
+    - New setting `scqubits.settings.MULTIPROC_BLAS_THREADS`
+      (`"auto"`, a positive int, or `None`; default `"auto"`): caps the
+      number of BLAS/OpenMP threads per worker process during parallel
+      sweeps (`NUM_CPUS` > 1) to avoid core oversubscription. The
+      default `"auto"` caps each worker to `max(1, cores // num_cpus)`,
+      so parallel sweeps no longer oversubscribe the cores out of the box;
+      a positive int sets a fixed cap, and `None` leaves threading
+      untouched. The cap is applied only while the worker pool is
+      created and the parent environment is restored afterwards
+      (serial work is unaffected). It reaches spawn-based workers
+      (macOS, Windows) via the thread-count environment variables; for
+      fork-based workers (Linux) it uses `threadpoolctl` (now a
+      scqubits dependency). A one-time warning is emitted when the cap
+      cannot take effect. See the :ref:`settings guide <guide-settings>`.
+    - `ParameterSweep` now reuses a single worker pool across the
+      per-subsystem and dressed sweeps within one run (cached in
+      `scqubits.settings.POOL` and shut down automatically at
+      interpreter exit), instead of starting a fresh pool for each,
+      and ships only the per-grid-point bare eigensystem to each
+      worker, reducing inter-process serialization on large sweeps.
+
+    - Automatic sparse diagonalization: when `esys_method` /
+      `evals_method` are left at their default (`None`), `scqubits`
+      now uses sparse `scipy` `eigsh` instead of dense diagonalization
+      for a large Hamiltonian of which only a small fraction of the
+      spectrum is requested -- the dressed-spectrum regime of composite
+      `HilbertSpace` objects -- where it is dramatically faster and
+      avoids forming the full dense matrix (which may not even fit in
+      memory). Controlled by `scqubits.settings.AUTO_SPARSE_DIAG`
+      (default `True`; thresholds `SPARSE_DIAG_MIN_DIM` and
+      `SPARSE_DIAG_MAX_EVALS_FRAC`); it falls back to the dense solver
+      if the sparse solver raises or its result fails a residual check.
+      Set `AUTO_SPARSE_DIAG = False` to always use the dense path.
+
     - Named constructors for `Circuit` from a YAML description:
       `Circuit.from_yaml_file(path, ...)` (path on disk) and
       `Circuit.from_yaml_string(yaml_text, ...)` (inline YAML). These
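
Review note on the `MULTIPROC_BLAS_THREADS` entry above: the `"auto"` rule and the "apply while the pool is created, then restore" behavior can be illustrated with a small standard-library sketch. This is not the scqubits implementation. The helper names `auto_blas_cap` and `capped_thread_env` are hypothetical, and the list of environment variables is taken from the code cell that this diff removes from `parallel.ipynb`.

```python
import os
from contextlib import contextmanager

# Thread-count variables, as listed in the notebook cell removed by this diff.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def auto_blas_cap(num_cpus, cores=None):
    """Hypothetical helper: the "auto" rule max(1, cores // num_cpus)."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores // num_cpus)


@contextmanager
def capped_thread_env(cap):
    """Hypothetical helper: set the variables to `cap`, then restore the old values."""
    saved = {name: os.environ.get(name) for name in THREAD_ENV_VARS}
    try:
        for name in THREAD_ENV_VARS:
            os.environ[name] = str(cap)
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable was unset before
            else:
                os.environ[name] = old


cap = auto_blas_cap(num_cpus=4, cores=8)
with capped_thread_env(cap):
    # A spawn-based worker pool created here would see the capped values.
    inside = os.environ["OMP_NUM_THREADS"]
print(cap, inside)
```

The `max(1, ...)` guard matters when `num_cpus` exceeds the core count, where plain integer division would give a cap of zero.
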
diff --git a/docs/source/guide/settings/guide-settings.rst b/docs/source/guide/settings/guide-settings.rst
index 1a96115..7290ba9 100644
--- a/docs/source/guide/settings/guide-settings.rst
+++ b/docs/source/guide/settings/guide-settings.rst
@@ -34,6 +34,29 @@ scqubits has a few internal parameters that can be changed by the user:
 +------------------------------+------------------------------+-------------------------------------------------------------------+
 | ``NUM_CPUS``                 | int                          | Number of cores to be used in parallelization (default: 1)        |
 +------------------------------+------------------------------+-------------------------------------------------------------------+
+| ``MULTIPROC_BLAS_THREADS``   | "auto", int, or None         | Cap BLAS/OpenMP threads per worker during parallel sweeps         |
+|                              | (default: "auto")            | (``NUM_CPUS`` > 1). Default "auto" caps each worker to            |
+|                              |                              | max(1, cores // num_cpus) so workers never oversubscribe; an int  |
+|                              |                              | sets a fixed cap; None leaves threading untouched. Uses           |
+|                              |                              | ``threadpoolctl`` (a dependency) for fork-based (Linux)           |
+|                              |                              | workers; no effect when numpy BLAS exposes no thread              |
+|                              |                              | control (e.g. Apple Accelerate).                                  |
++------------------------------+------------------------------+-------------------------------------------------------------------+
+| ``AUTO_PARALLEL``            | True / False (default: False)| When True, sweeps called without an explicit ``num_cpus`` use the |
+|                              |                              | parallelization heuristic (``recommend_parallelization``) to pick |
+|                              |                              | ``num_cpus`` and a BLAS-thread cap. Per-call opt-in is also       |
+|                              |                              | available via ``num_cpus="auto"``.                                |
++------------------------------+------------------------------+-------------------------------------------------------------------+
+| ``PARALLEL_CALIBRATION_PATH``| str or None (default: None)  | Location of the one-time machine calibration written by           |
+|                              |                              | ``calibrate_parallelization``. None uses                          |
+|                              |                              | ``~/.scqubits/parallel_calibration.json``.                        |
++------------------------------+------------------------------+-------------------------------------------------------------------+
+| ``AUTO_SPARSE_DIAG``         | True / False (default: True) | When True, default diagonalization (esys_method/evals_method =    |
+|                              |                              | None) uses sparse scipy eigsh for large matrices where only a few |
+|                              |                              | eigenvalues are needed, with automatic dense fallback (thresholds |
+|                              |                              | SPARSE_DIAG_MIN_DIM, SPARSE_DIAG_MAX_EVALS_FRAC). See the         |
+|                              |                              | diagonalization guide.                                            |
++------------------------------+------------------------------+-------------------------------------------------------------------+
 | ``FUZZY_SLICING``            | True / False (default: False)| Whether to enable approximate value-based slicing                 |
 +------------------------------+------------------------------+-------------------------------------------------------------------+
 | ``FUZZY_WARNING``            | True / False (default: True) | Whether to warn user about use of approximate values in slicing   |
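
Review note on the `AUTO_SPARSE_DIAG` row above: the selection rule it summarizes (spelled out with its default thresholds in the `custom_diagonalization.ipynb` change later in this diff) amounts to a two-condition predicate. The sketch below only restates that documented rule. The function `would_use_sparse` is hypothetical and is not part of scqubits, and the two constants are stand-ins for the documented defaults.

```python
# Stand-ins for the documented defaults of settings.SPARSE_DIAG_MIN_DIM
# and settings.SPARSE_DIAG_MAX_EVALS_FRAC.
SPARSE_DIAG_MIN_DIM = 1000
SPARSE_DIAG_MAX_EVALS_FRAC = 0.1


def would_use_sparse(dim, evals_count, auto_sparse_diag=True, explicit_method=None):
    """Hypothetical predicate mirroring the documented selection rule.

    Sparse diagonalization is chosen only if the feature is enabled, no
    explicit esys_method/evals_method is set, the dimension is large enough,
    and only a small fraction of the spectrum is requested.
    """
    if not auto_sparse_diag or explicit_method is not None:
        return False
    large_enough = dim >= SPARSE_DIAG_MIN_DIM
    few_evals = evals_count <= SPARSE_DIAG_MAX_EVALS_FRAC * dim
    return large_enough and few_evals


for dim, evals_count in [(2000, 20), (2000, 500), (500, 5)]:
    print(dim, evals_count, would_use_sparse(dim, evals_count))
```

The documented dense fallback (on a solver error or a failed residual check) is deliberately left out of this sketch.
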
diff --git a/docs/source/guide/settings/ipynb/custom_diagonalization.ipynb b/docs/source/guide/settings/ipynb/custom_diagonalization.ipynb
index f580e1d..d75b270 100644
--- a/docs/source/guide/settings/ipynb/custom_diagonalization.ipynb
+++ b/docs/source/guide/settings/ipynb/custom_diagonalization.ipynb
@@ -51,6 +51,24 @@
     "\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Automatic sparse diagonalization\n",
+    "\n",
+    "When `esys_method` and `evals_method` are left at their default (`None`), `scqubits` does not always use the dense solver. For a **large** Hamiltonian of which only a **small fraction** of the spectrum is requested (the typical situation for the dressed spectrum of a composite `HilbertSpace`), it automatically switches to sparse `scipy` `eigsh`, which is dramatically faster than dense diagonalization in this regime and avoids forming the full dense matrix, which may not even fit in memory.\n",
+    "\n",
+    "This is controlled by `scqubits.settings.AUTO_SPARSE_DIAG` (default `True`). Sparse diagonalization is selected only when\n",
+    "\n",
+    "- the Hilbert-space dimension is at least `settings.SPARSE_DIAG_MIN_DIM` (default `1000`), **and**\n",
+    "- the number of requested eigenvalues is at most `settings.SPARSE_DIAG_MAX_EVALS_FRAC` times the dimension (default `0.1`).\n",
+    "\n",
+    "Otherwise — and whenever an explicit `esys_method`/`evals_method` is set — the behavior is unchanged. If the sparse solver raises, or its result fails a cheap residual check (a safeguard against the rare case where `eigsh` returns an inaccurate subspace without raising), `scqubits` automatically falls back to the dense solver.\n",
+    "\n",
+    "To disable automatic sparse diagonalization and always use the dense path, set `scqubits.settings.AUTO_SPARSE_DIAG = False`."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/docs/source/guide/settings/ipynb/parallel.ipynb b/docs/source/guide/settings/ipynb/parallel.ipynb
index 2310c55..321b9e0 100644
--- a/docs/source/guide/settings/ipynb/parallel.ipynb
+++ b/docs/source/guide/settings/ipynb/parallel.ipynb
@@ -8,41 +8,9 @@
     "\n",
     "Some of the computational tasks performed in scqubits can benefit significantly from parallelization. The scqubits package leverages parallel-processing capabilities provided by the Python Standard Library `multiprocessing` module. For better pickling support, scqubits further supports use of `pathos` and `dill`.\n",
     "\n",
-    "One important consideration for parallelization of tasks like parameter sweeps is the fact that Numpy and Scipy tend to make use of multi-threading internally. (Details of that will depend on how they were built on the machine in question.) This will generally lead to competition between multi-threading on the Numpy/Scipy level and parallelization of `map` methods via `multiprocessing` or `pathos`.\n",
+    "One important consideration for parallelization of tasks like parameter sweeps is the fact that Numpy and Scipy tend to make use of multi-threading internally (through their BLAS backend). With several worker processes each spawning a full BLAS thread pool, the cores become oversubscribed and a sweep can run *slower* than with fewer threads per worker.\n",
     "\n",
-    "In many cases, best performance is obtained by limiting the number of threads used by Numpy to \"a few\". (Precise numbers will be machine dependent and need to be determined on a case by case basis.) Limiting this thread number can be achieved from within a Python script or Jupyter and is accomplished by setting environment variables. \n",
-    "\n",
-    ".. note::\n",
-    "    Limiting the number of threads will only be effective if environment variables are set before the first import of\n",
-    "    Numpy. \n",
-    "\n",
-    "\n",
-    "Several environment variables can play a role, and which one is needed may again be machine-dependent. \n",
-    "We thus simply set them all:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import os\n",
-    "\n",
-    "NUM_THREADS = \"1\"\n",
-    "\n",
-    "os.environ[\"OMP_NUM_THREADS\"] = NUM_THREADS\n",
-    "os.environ[\"OPENBLAS_NUM_THREADS\"] = NUM_THREADS\n",
-    "os.environ[\"MKL_NUM_THREADS\"] = NUM_THREADS\n",
-    "os.environ[\"VECLIB_MAXIMUM_THREADS\"] = NUM_THREADS\n",
-    "os.environ[\"NUMEXPR_NUM_THREADS\"] = NUM_THREADS"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "At this point, Numpy import and import of scqubits can proceed."
+    "scqubits handles this for you: when `num_cpus > 1`, it caps each worker's BLAS threads automatically (via the `MULTIPROC_BLAS_THREADS` setting, default `\"auto\"` -- see *Limiting BLAS threads per worker* below), so in the common case there is nothing to configure. The sections below explain how to enable parallelization, when it actually helps, and the knobs available if you want to tune things by hand."
    ]
   },
   {
@@ -67,6 +35,26 @@
     "from scqubits import HilbertSpace, InteractionTerm, ParameterSweep"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Quick start\n",
+    "\n",
+    "Parallelization is opt-in, and scqubits can size it to the job for you. The shortest paths:\n",
+    "\n",
+    "| If you want… | Do this |\n",
+    "|---|---|\n",
+    "| scqubits to choose the settings | pass `num_cpus=\"auto\"` — it sizes the job, picks the workers, and stays serial when parallel wouldn't help |\n",
+    "| the best choices for *your* machine | run `scqubits.calibrate_parallelization()` once — it times your machine and saves the numbers `\"auto\"` then uses (re-running overwrites them, so you can recalibrate anytime) |\n",
+    "| every sweep auto-tuned, without a kwarg | set `scqubits.settings.AUTO_PARALLEL = True` |\n",
+    "| a fixed number of workers | pass `num_cpus=N` |\n",
+    "| to run a plain `.py` script with parallelism | guard the entry point with `if __name__ == \"__main__\":` |\n",
+    "\n",
+    "Each row is explained in the sections below; the difference between `\"auto\"` and\n",
+    "`AUTO_PARALLEL` is covered under *Letting scqubits choose the settings*."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -145,6 +133,106 @@
     "Once `num_cpus` exceeds the value 1 when passed, scqubits starts a parallel processing pool of the desired number of processes."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When does `num_cpus > 1` actually help?\n",
+    "\n",
+    "Parallelization is **not free**, and for many sweeps it gives no speedup -- or even a\n",
+    "slowdown. Each grid point is shipped to a worker process (pickling + dispatch), and that\n",
+    "fixed overhead is only worth paying when there is enough work to amortize it:\n",
+    "\n",
+    "> `num_cpus > 1` helps only when **(number of grid points) × (cost per point)&nbsp;≫&nbsp;the per-task overhead**.\n",
+    "\n",
+    "What to expect, therefore:\n",
+    "\n",
+    "- **Small grids, or cheap-per-point systems** (small Hilbert spaces, few eigenstates):\n",
+    "  `num_cpus > 1` gives little or no benefit, and is frequently *slower* than serial. This\n",
+    "  is the common case and is entirely normal -- keep the default `num_cpus = 1`.\n",
+    "- **Large grids of expensive points** (large composite Hilbert spaces, many grid points):\n",
+    "  parallel workers pay off.\n",
+    "\n",
+    "If a `num_cpus` comparison looks 'inconclusive' or backwards (e.g. `num_cpus=2` slower\n",
+    "than `num_cpus=1`), the sweep is most likely below this break-even. (Oversubscription of\n",
+    "the cores by BLAS threads, historically the other common cause, is avoided by default --\n",
+    "see `MULTIPROC_BLAS_THREADS` below -- unless you have explicitly set it to `None`.) For a\n",
+    "hands-on demonstration of both regimes, see the `demo_multiprocessing` notebook in the\n",
+    "[scqubits-examples](https://github.com/scqubits/scqubits-examples) repository. Rather than\n",
+    "judging the break-even by hand, you can let scqubits pick `num_cpus` for you -- see\n",
+    "*Letting scqubits choose the settings* just below.\n",
+    "\n",
+    "For large composite systems the per-point **diagonalization method** is often a bigger\n",
+    "lever than parallelism: sparse diagonalization (the default for large Hamiltonians when few eigenvalues are requested; see\n",
+    "`AUTO_SPARSE_DIAG`) can be far faster per point, and once each point is cheap, `num_cpus >\n",
+    "1` helps even less. Try sparse first; parallelize second."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Letting scqubits choose the settings\n",
+    "\n",
+    "**A 30-second mental model.** There are two *separate* pieces, and they work together:\n",
+    "\n",
+    "- **The switch — *whether* to parallelize.** That is `num_cpus`: leave it at the default (serial), set a number yourself, or say `\"auto\"` to let scqubits decide.\n",
+    "- **The map — *how well* `\"auto\"` decides.** `\"auto\"` always works, using built-in rules of thumb. Running `calibrate_parallelization()` once replaces those rules of thumb with real measurements of *your* machine, so `\"auto\"` decides better.\n",
+    "\n",
+    "The calibration is **just data**: it does nothing by itself and only sharpens the choices `\"auto\"` makes. For the best results, do **both**: turn on `\"auto\"` *and* calibrate once. Calibration is optional, though: `\"auto\"` works without it, using generic rather than machine-specific numbers.\n",
+    "\n",
+    "### Asking for a recommendation\n",
+    "\n",
+    "`scqubits.recommend_parallelization` starts no worker processes, so it is safe to call anywhere. It reads the Hilbert-space dimension, the number of grid points, the eigenvalue count, and whether sparse diagonalization applies, and returns a recommendation:\n",
+    "\n",
+    "```python\n",
+    "cfg = scqubits.recommend_parallelization(hilbertspace=hs, num_points=384, evals_count=20)\n",
+    "print(cfg.num_cpus, cfg.blas_threads, cfg.reason)\n",
+    "sweep = scqubits.ParameterSweep(..., num_cpus=cfg.num_cpus)\n",
+    "```\n",
+    "\n",
+    "Because a `ParameterSweep` runs as soon as it is constructed, call `recommend_parallelization` *before* building the sweep. More conveniently, pass the sentinel `num_cpus=\"auto\"`, which makes the sweep tune itself **before** it runs:\n",
+    "\n",
+    "```python\n",
+    "sweep = scqubits.ParameterSweep(..., num_cpus=\"auto\")\n",
+    "```\n",
+    "\n",
+    "### `num_cpus=\"auto\"` vs `settings.AUTO_PARALLEL = True` — same engine, different reach\n",
+    "\n",
+    "Both trigger the *exact same* auto-tuner; the only difference is *when* it kicks in:\n",
+    "\n",
+    "- **`num_cpus=\"auto\"` — per call.** You opt in for one specific sweep; nothing else is affected.\n",
+    "- **`settings.AUTO_PARALLEL = True` — global default.** Now *any* sweep where you do **not** pass `num_cpus` behaves as if you had written `\"auto\"`. Set it once and forget it.\n",
+    "\n",
+    "An explicit number always wins:\n",
+    "\n",
+    "```text\n",
+    "num_cpus=4        ->  exactly 4 workers      (you decide; auto-tuner not consulted)\n",
+    "num_cpus=\"auto\"   ->  auto-tuner decides     (always; calibrated if you ran calibrate_parallelization())\n",
+    "num_cpus omitted  ->  auto-tuner if AUTO_PARALLEL=True, otherwise serial (the default)\n",
+    "```\n",
+    "\n",
+    "The recommendation applies its choice live (no kernel restart) and works the same in Jupyter and in a plain script; only a sweep that the heuristic decides to parallelize starts workers, which in a plain script needs the `__main__` guard described below.\n",
+    "\n",
+    "### Calibrating to your machine (recommended, run once)\n",
+    "\n",
+    "`calibrate_parallelization()` times your hardware — how long it takes to start worker processes, and how expensive a grid point is to diagonalize (dense and sparse) — and saves the result to a small JSON file (`~/.scqubits/parallel_calibration.json` by default; change the location with `settings.PARALLEL_CALIBRATION_PATH`). From then on, every `num_cpus=\"auto\"` decision reads that file and tailors its choice to *your* machine instead of using generic defaults.\n",
+    "\n",
+    "```python\n",
+    "scqubits.calibrate_parallelization()   # ~1 minute; measures this machine and writes the file\n",
+    "```\n",
+    "\n",
+    "**Calibrate under realistic, steady conditions — and redo it freely.** The measurement is only as good as the machine state while it runs, and **re-running simply overwrites the previous file**, so recalibrating is cheap and safe. Redo it whenever:\n",
+    "\n",
+    "- you accidentally calibrated while the machine was **busy** with other work (the measured costs come out inflated), or\n",
+    "- a **laptop was on battery / CPU-throttled** at calibration time — many laptops clock down sharply when unplugged, so the calibration over-estimates every cost and `\"auto\"` then plays it too safe (parallelizes less than it should), or\n",
+    "- you **changed hardware**.\n",
+    "\n",
+    "For the most representative numbers, calibrate on an otherwise-idle machine, plugged into wall power. The calibration runs its measurements as `python -m` subprocesses, so the call itself needs no `__main__` guard.\n",
+    "\n",
+    "> Tip: if `\"auto\"` ever seems oddly conservative, your calibration may have been taken under load or on battery — just run `calibrate_parallelization()` again on a quiet, plugged-in machine to refresh it."
+   ],
+   "metadata": {}
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -168,6 +256,106 @@
     "scqubits.settings.NUM_CPUS = 6"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Limiting BLAS threads per worker (`MULTIPROC_BLAS_THREADS`)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As noted at the top of this page, Numpy/Scipy internally multi-thread their linear algebra (through the BLAS backend), which competes with process-level parallelization: with `num_cpus` worker processes each spawning a full BLAS thread pool, the cores become oversubscribed and a sweep can run *slower* than with fewer threads per worker.\n",
+    "\n",
+    "scqubits caps the per-worker BLAS threads automatically while the worker pool is created. The cap is controlled by `MULTIPROC_BLAS_THREADS`, which defaults to `\"auto\"`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# \"auto\" (default): cap each worker to max(1, cores // num_cpus) -- no oversubscription\n",
+    "# a positive int : a fixed per-worker cap, e.g. 1 for many small diagonalizations\n",
+    "# None           : opt out and leave the thread environment untouched\n",
+    "scqubits.settings.MULTIPROC_BLAS_THREADS = \"auto\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "With the default `\"auto\"`, each worker is capped to `max(1, cores // num_cpus)`, so the workers together use about one thread per core and never oversubscribe. A positive integer sets a fixed per-worker cap (e.g. `1` for many small diagonalizations), and `None` opts out, leaving the thread environment untouched. The cap affects only the worker pool; the parent process's environment and BLAS thread count are restored once the pool has been built, so serial work (`num_cpus = 1`) is never affected.\n",
+    "\n",
+    "How the cap reaches the workers depends on the platform:\n",
+    "\n",
+    "- **Spawn-based workers** (macOS and Windows) re-read the environment when they re-import Numpy/Scipy, so the cap applies directly.\n",
+    "- **Fork-based workers** (Linux) inherit the parent's already-initialized BLAS pool and ignore the environment variables. For these, scqubits uses [`threadpoolctl`](https://github.com/joblib/threadpoolctl) (a scqubits dependency) to reduce the parent's BLAS thread count for the duration of pool creation, so the forked workers inherit it.\n",
+    "- It has **no effect** when Numpy's BLAS exposes no thread control, as with Apple Accelerate on Apple Silicon. Note, however, that Scipy ships its own OpenBLAS there, so the cap still limits the threads used by Scipy's eigensolvers -- which is what most scqubits diagonalization relies on.\n",
+    "\n",
+    "If the cap cannot take effect on your platform, scqubits emits a one-time warning; in that case you can fall back to exporting `OMP_NUM_THREADS`/`OPENBLAS_NUM_THREADS` (etc.) in the shell *before* starting Python, so that they are already set when Numpy is first imported."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Worker-pool reuse\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Within a single computation that issues several parallel `map` calls — for example a `ParameterSweep`, which sweeps each bare subsystem and then the dressed system — scqubits caches the worker pool in `scqubits.settings.POOL` and reuses it whenever the requested core count and backend match, instead of starting a fresh pool each time. The cached pool is shut down automatically at interpreter exit. This is transparent and requires no user action.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Process start method (`fork` vs `spawn`)\n\nHow worker processes are created — the *start method* — is determined by your platform.\nThere is exactly one safe choice per platform, so scqubits selects it automatically; it is\nnot a user setting:\n\n| platform | start method | why |\n|---|---|---|\n| Linux | `fork` | fast, and fork is safe |\n| macOS | `spawn` | fork-after-threads is **unsafe** on macOS — Apple's Accelerate/GCD and the Objective-C runtime are not fork-safe, so forking a worker pool after the numerics have started threads can crash, deadlock, or hang. CPython itself defaults macOS to `spawn` since 3.8. This applies to **both Intel and Apple Silicon** Macs. |\n| Windows | `spawn` | the only option |\n\nThe only consequence you need to be aware of is the `__main__` guard, below."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The `__main__` guard (`spawn`/`forkserver` only)\n",
+    "\n",
+    "With `spawn` (and `forkserver`), each worker process **re-imports your program's entry\n",
+    "module**. A **plain script** that triggers `num_cpus > 1` must therefore guard its entry\n",
+    "point; otherwise each worker re-runs the script on import and Python raises a `RuntimeError`:\n",
+    "\n",
+    "```python\n",
+    "import scqubits as scq\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "    sweep = scq.ParameterSweep(..., num_cpus=4)\n",
+    "```\n",
+    "\n",
+    "**Jupyter/IPython need no guard.** scqubits emits a one-time warning the first time it\n",
+    "starts a `spawn` pool outside IPython, reminding you of this requirement.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Cost\n",
+    "\n",
+    "`spawn` workers re-import numpy/scipy/scqubits, so the **first** parallel sweep of a\n",
+    "session pays a one-time startup of roughly a second. Because the pool is cached and\n",
+    "reused (see *Worker-pool reuse* above), **every subsequent sweep is as fast as fork** — the\n",
+    "cost is paid once per session, not per sweep. For the heavy sweeps where `num_cpus > 1` is\n",
+    "worthwhile, this is negligible.\n",
+    "\n",
+    "> **Note:** unlike fork children, `spawn` workers are not automatically reaped if the\n",
+    "> parent process is killed with `SIGKILL` mid-run, and may linger. A normal exit (or a\n",
+    "> `ParameterSweep.run()` completing) cleans them up.\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -222,4 +410,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
\ No newline at end of file
+}
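
Review note on the break-even rule in `parallel.ipynb` ("`num_cpus > 1` helps only when (number of grid points) × (cost per point) ≫ the per-task overhead"): a toy cost model makes the trade-off concrete. Everything below is an illustrative assumption, not the heuristic inside `recommend_parallelization`. The function names, the model itself (dispatch overhead paid per point and not divided among workers), and the default numbers are hypothetical; only the "roughly a second" pool startup is taken from the notebook text.

```python
def estimate_time(n_points, cost_per_point, num_cpus,
                  per_task_overhead, pool_startup):
    """Toy wall-time estimate in seconds (hypothetical model)."""
    if num_cpus == 1:
        # Serial: no pool, no dispatch overhead.
        return n_points * cost_per_point
    # Parallel: one-time pool startup, per-point dispatch overhead,
    # and the diagonalization work split evenly across workers.
    return pool_startup + n_points * (per_task_overhead + cost_per_point / num_cpus)


def pick_num_cpus(n_points, cost_per_point, max_cpus,
                  per_task_overhead=2e-3, pool_startup=1.0):
    """Return the worker count with the lowest estimated time.

    min() returns the first minimum, so ties go to the smaller count
    and the function stays serial when parallel would not help.
    """
    return min(
        range(1, max_cpus + 1),
        key=lambda p: estimate_time(
            n_points, cost_per_point, p, per_task_overhead, pool_startup
        ),
    )


# A small, cheap sweep stays serial; a large, expensive one uses all workers.
print(pick_num_cpus(n_points=20, cost_per_point=1e-3, max_cpus=8))
print(pick_num_cpus(n_points=400, cost_per_point=0.5, max_cpus=8))
```

In this model the first sweep costs 0.02 s serially, far below the assumed one-second pool startup, which is the "below break-even" regime the notebook describes.
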