Commit d96e6b2: Update README.md

1 parent 00f954c commit d96e6b2

1 file changed: logs/run_1/README.md (18 additions & 10 deletions)
@@ -6,9 +6,15 @@
**Total Duration:** ~58 Hours

## Scientific Summary
-This run represents the alpha test of the Neuromodulatory Control Network (NCN) architecture. The model was trained for **1 epoch** on the **TinyStories** dataset (mapped to binary format) to verify the hypothesis that neuromodulatory gating can improve sample efficiency on narrative data.
+This run serves as the alpha validation of the **Neuromodulatory Control Network (NCN)** architecture on the TinyStories binary dataset.

-The model achieved a final validation perplexity of **4.5184**. This is a significant result for an 18M parameter model trained for only one pass over the data, suggesting the NCN hypernetwork effectively regulates the plasticity of the main transformer backbone.
+Unlike standard Transformer training, this experiment tests the hypothesis that a parallel hypernetwork can **implicitly learn an optimal processing strategy** (Section 2.1 of the paper) by modulating the main network's gain, precision, and gating dynamics. The goal was to observe whether the NCN could stabilize without "Entropy Shock" and achieve competitive perplexity through dynamic resource allocation rather than static weight optimization.
+
+## Theoretical Hypotheses Tested
+This run specifically targets three biological mechanisms proposed in the NCN paper:
+1. **Thermodynamic Regulation (Exploration vs. Exploitation):** Can the `precision` signal ($\beta$) dynamically regulate the entropy of the attention mechanism, mimicking the signal-to-noise ratio modulation of norepinephrine?
+2. **Gradient Shielding:** Does the multiplicative `gain` ($g$) allow the model to selectively down-regulate layers in specific contexts, theoretically shielding specialized weights from catastrophic interference (the Plasticity-Stability Dilemma)?
+3. **Metabolic Efficiency:** Does **Homeostatic Regularization** ($\mathcal{L}_{reg}$) prevent the control manifold from collapsing into a rigid state or exploding?
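To make these three signals concrete, here is a minimal PyTorch-style sketch of how they could act on one attention layer. It is written for this log rather than taken from the repository: the names `modulated_attention` and `homeostatic_reg`, the broadcast shapes, and the exact form of $\mathcal{L}_{reg}$ (an L2 pull toward a neutral set point) are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def modulated_attention(q, k, v, gain, precision):
    """One attention step with NCN-style modulation (illustrative only).

    q, k, v:   (batch, heads, seq, head_dim) projections from the backbone layer.
    gain:      broadcastable scale g on the layer output (hypothesis 2:
               down-regulating g shields this layer's contribution).
    precision: broadcastable inverse temperature beta on the logits
               (hypothesis 1: larger beta -> sharper, lower-entropy attention).
    """
    d = q.size(-1)
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5   # standard scaled dot-product scores
    attn = F.softmax(precision * logits, dim=-1)    # beta rescales scores before softmax
    return gain * (attn @ v)                        # g multiplicatively scales the output

def homeostatic_reg(gain, precision, ffn_gate, weight=1e-3):
    """Hypothesis 3 in an assumed L2 form: pull each control signal toward a
    neutral set point (1.0) so the control manifold neither collapses nor explodes."""
    penalty = (gain - 1.0).pow(2) + (precision - 1.0).pow(2) + (ffn_gate - 1.0).pow(2)
    return weight * penalty.mean()

# Tiny usage example with batch=2, heads=8, seq=16, head_dim=32.
q = k = v = torch.randn(2, 8, 16, 32)
g = torch.ones(2, 1, 1, 1)                     # neutral gain
beta = torch.full((2, 1, 1, 1), 2.0)           # sharper-than-default attention
out = modulated_attention(q, k, v, g, beta)    # -> shape (2, 8, 16, 32)
```

In this picture, hypothesis 1 is the `precision * logits` rescaling, hypothesis 2 is the final multiplication by `gain`, and hypothesis 3 is the regularizer keeping both near their set points.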

## Final Metrics
| Metric | Value |
@@ -26,29 +32,31 @@ The model achieved a final validation perplexity of **4.5184**. This is a signif
* **Heads:** 8
* **Feedforward Dim:** 1024
* **Dropout:** 0.1
-* **Act Function:** GELU (Standard)
+* **Act Function:** GELU
* **Total Params:** 18.01M

## NCN (Neuromodulatory) Hyperparameters
-* **Role:** Hypernetwork / Meta-controller
+* **Role:** Meta-controller / Hypernetwork
* **Input Dimension:** 256 (Tied to d_model)
* **Hidden Dimension:** 64
* **NCN Heads:** 2
* **Activation Function:** Tanh
* **Modulation Signals:**
-  1. `gain` (Layer scaling)
-  2. `precision` (Attention sharpening)
-  3. `ffn_gate` (Feedforward throttling)
+  1. `gain` ($g$): Signal-to-noise / Layer integration rate.
+  2. `precision` ($\beta$): Inverse temperature / Attention entropy control.
+  3. `ffn_gate` ($\gamma$): Metabolic gating of FFN blocks.
* **Parameter Overhead:** 281.30K (1.56% of total)
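For scale, a controller with the dimensions listed above can be sketched as follows. The 256 → 64 → Tanh shape comes from this section; the output squashing (softplus/sigmoid), the specific bias values, and the single-scalar outputs are assumptions made for illustration. The real NCN reports 2 heads and a 281.30K parameter overhead, so it presumably emits per-layer signals rather than one scalar triple as this toy does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCNControllerSketch(nn.Module):
    """Toy hypernetwork emitting gain, precision and ffn_gate per token.

    The input/hidden sizes and Tanh body follow the README; the output
    squashing and the bias initialization are assumed for illustration.
    """
    def __init__(self, d_model=256, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, d_hidden), nn.Tanh())
        self.to_gain = nn.Linear(d_hidden, 1)
        self.to_precision = nn.Linear(d_hidden, 1)
        self.to_ffn_gate = nn.Linear(d_hidden, 1)
        # Assumed reading of the "Bias Initialization Strategy": bias the output heads
        # so the signals start near a neutral/open state and early training is not
        # throttled ("Metabolic Throttling").
        nn.init.constant_(self.to_gain.bias, 0.55)       # softplus(0.55) ~ 1.0
        nn.init.constant_(self.to_precision.bias, 0.55)  # softplus(0.55) ~ 1.0
        nn.init.constant_(self.to_ffn_gate.bias, 3.0)    # sigmoid(3.0) ~ 0.95

    def forward(self, h):                                # h: (batch, seq, d_model)
        z = self.body(h)
        gain = F.softplus(self.to_gain(z))               # g     >= 0
        precision = F.softplus(self.to_precision(z))     # beta  >= 0
        ffn_gate = torch.sigmoid(self.to_ffn_gate(z))    # gamma in (0, 1)
        return gain, precision, ffn_gate

ncn = NCNControllerSketch()
g, beta, gamma = ncn(torch.randn(2, 16, 256))            # each: (2, 16, 1)
```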

## Training Configuration
* **Dataset:** TinyStories (Binary)
* **Learning Rate:** 6e-4 (Linear Decay with 100 warmup steps)
* **Batch Size:** 64 (16 per device * 4 gradient accumulation steps)
* **Optimizer:** AdamW (Weight Decay 0.1)
-* **Precision:** Mixed Precision (AMP) Enabled
+* **Initialization:** Bias Initialization Strategy (Section 4.1.4 of the paper) applied to prevent "Metabolic Throttling."
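Read as a plain PyTorch loop, this configuration corresponds roughly to the sketch below. The decay target (zero), the stand-in model and data, and the step count are assumptions; the gradient clip of 1.0 is carried over from the earlier revision of this file.

```python
import torch
import torch.nn as nn

def lr_at(step, base_lr=6e-4, warmup=100, max_steps=5_000):
    """Linear warmup for 100 steps, then linear decay; decaying to 0 is an assumption."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    frac = (step - warmup) / max(1, max_steps - warmup)
    return base_lr * max(0.0, 1.0 - frac)

model = nn.Linear(8, 8)                      # stand-in for the 18M-parameter backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
accum_steps = 4                              # 16 sequences per device * 4 accumulation steps = 64
max_steps = 200                              # placeholder; the run's true step count is not listed here

for step in range(max_steps):
    for group in optimizer.param_groups:     # apply the schedule by hand
        group["lr"] = lr_at(step, max_steps=max_steps)
    for _ in range(accum_steps):
        x = torch.randn(16, 8)               # placeholder batch; the real run streams TinyStories tokens
        loss = model(x).pow(2).mean() / accum_steps
        loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # grad_clip 1.0, per the earlier revision
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```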
+
+## Training Dynamics & Observations
+The log confirms the efficacy of the **Bias Initialization Strategy** described in Section 4.1.4. The model avoided the "Entropy Shock" typical of hypernetworks; the loss curve shows an immediate, stable descent from step 0.

-## Training Dynamics
-The training was stable with no loss spikes. The `grad_clip` of 1.0 was rarely triggered after the warmup phase. The NCN parameters introduced a computational overhead of less than 2% compared to a vanilla forward pass.
+The validation perplexity of **4.51** on a small-scale model (18M parameters) suggests that the NCN is successfully compressing the loss manifold by dynamically altering the effective depth and sharpness of the network per token, rather than treating all tokens with uniform computational intensity.
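As a quick check on that headline number: perplexity is the exponentiated mean cross-entropy, $\mathrm{PPL} = e^{\mathcal{L}_{CE}}$, so a validation perplexity of 4.5184 corresponds to a loss of $\ln(4.5184) \approx 1.51$ nats (about 2.18 bits) per token.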

**Log file:** `training.log` (Attached in this directory)
