README.md (4 additions & 1 deletion)
@@ -26,7 +26,7 @@ Goals:
* torch- and python-idiomatic
* hackable
* few external dependencies (currently only torch and torchvision)
- * ~world-record single-GPU training time (this repo holds the current world record at ~<7 (!!!) seconds on an A100, down from ~18.1 seconds originally).
+ * ~world-record single-GPU training time (this repo holds the current world record at ~<6.3 (!!!) seconds on an A100, down from ~18.1 seconds originally).
* <2 seconds training time in <2 years (yep!)

This is a neural network implementation of a very speedily-training network that originally started as a painstaking reproduction of [David Page's original ultra-fast CIFAR-10 implementation on a single GPU](https://myrtle.ai/learn/how-to-train-your-resnet/), but written nearly from the ground up to be extremely rapid-experimentation-friendly. Part of the benefit of this is that we now hold the world record for single-GPU training speeds on CIFAR10, for example.
@@ -39,6 +39,9 @@ What we've added:
* dirac initializations on non-depth-transitional layers (information passthrough on init)
* and more!

+ What we've removed:
+ * explicit residual layers. yep.
+

This code, in comparison to David's original code, is in a single file and extremely flat, but is not as durable for long-term production-level bug maintenance. You're meant to check out a fresh repo whenever you have a new idea. It is excellent for rapid idea exploration -- almost everywhere in the pipeline is exposed and built to be user-friendly. I truly enjoy personally using this code, and hope you do as well! :D Please let me know if you have any feedback. I hope to continue publishing updates to this in the future, so your support is encouraged. Share this repo with someone you know who might like it!

Feel free to check out my [Patreon](https://www.patreon.com/user/posts?u=83632131) if you like what I'm doing here and want more! Additionally, if you want me to work up to a part-time amount of hours with you, feel free to reach out to me at hire.tysam@gmail.com. I'd love to hear from you.
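The "dirac initializations on non-depth-transitional layers" bullet above is easier to see in code than in prose. Below is a minimal sketch, assuming a plain PyTorch convolution whose input and output depths match; the layer sizes and variable names are illustrative, not the repo's actual ones.

```python
import torch
import torch.nn as nn

# Hypothetical layer: input and output depths match (a "non-depth-transitional" layer),
# so a Dirac/identity initialization makes it a pure passthrough at init.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(conv.weight)  # each output channel starts as a delta that copies its matching input channel

x = torch.randn(2, 64, 32, 32)
with torch.no_grad():
    y = conv(x)
print(torch.allclose(x, y))  # True: information passes straight through until training moves the weights
```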
- # To replicate the ~95.78%-accuracy-in-113-seconds runs, you can change the base_depth from 64->128, train_epochs from 12.1->85, ['ema'] epochs 10->75, cutmix_size 3->9, and cutmix_epochs 6->75
+ bias_scaler = 64
+ # To replicate the ~95.79%-accuracy-in-110-seconds runs, you can change the base_depth from 64->128, train_epochs from 12.1->90, ['ema'] epochs 10->80, cutmix_size 3->10, and cutmix_epochs 6->80
hyp = {
    'opt': {
-         'bias_lr': 1.64 * bias_scaler/512, # TODO: Is there maybe a better way to express the bias and batchnorm scaling? :'))))
-         'non_bias_lr': 1.64 / 512,
-         'bias_decay': 1.08 * 6.45e-4 * batchsize/bias_scaler,
-         'non_bias_decay': 1.08 * 6.45e-4 * batchsize,
+         'bias_lr': 1.525 * bias_scaler/512, # TODO: Is there maybe a better way to express the bias and batchnorm scaling? :'))))
+         'non_bias_lr': 1.525 / 512,
+         'bias_decay': 6.687e-4 * batchsize/bias_scaler,
+         'non_bias_decay': 6.687e-4 * batchsize,
        'scaling_factor': 1./9,
        'percent_start': .23,
-         'loss_scale_scaler': 1./128, # * Regularizer inside the loss summing (range: ~1/512 - 16+). FP8 should help with this somewhat too, whenever it comes out. :)
+         'loss_scale_scaler': 1./32, # * Regularizer inside the loss summing (range: ~1/512 - 16+). FP8 should help with this somewhat too, whenever it comes out. :)
    },
    'net': {
        'whitening': {
            'kernel_size': 2,
            'num_examples': 50000,
        },
-         'batch_norm_momentum': .5, # * Don't forget momentum is 1 - momentum here (due to a quirk in the original paper... >:( )
-         'conv_norm_pow': 2.6,
+         'batch_norm_momentum': .4, # * Don't forget momentum is 1 - momentum here (due to a quirk in the original paper... >:( )
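For context on how `bias_scaler` and the paired learning rates/decays above are typically wired up, here is a hedged sketch (not the repo's exact code): biases and batchnorm parameters go into one optimizer parameter group with the scaled-up learning rate and scaled-down weight decay, and everything else into another. The tiny model, the batch size, and the SGD momentum/nesterov settings are illustrative assumptions; the hyperparameter values simply mirror the diff. The BatchNorm2d comment also spells out the `batch_norm_momentum` quirk flagged above.

```python
import torch
import torch.nn as nn

# Values mirrored from the diff; batchsize here is an assumption about the repo's default.
batchsize = 1024
bias_scaler = 64
hyp_opt = {
    'bias_lr':        1.525 * bias_scaler / 512,
    'non_bias_lr':    1.525 / 512,
    'bias_decay':     6.687e-4 * batchsize / bias_scaler,
    'non_bias_decay': 6.687e-4 * batchsize,
}

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    # PyTorch's BatchNorm momentum is the fraction of the *new* batch statistic mixed in each step
    # (running = (1 - momentum) * running + momentum * batch_stat), i.e. "1 - momentum" relative to
    # the original paper's convention -- the quirk the batch_norm_momentum comment warns about.
    nn.BatchNorm2d(64, momentum=0.4),
    nn.GELU(),
)

# Biases and batchnorm affine parameters are 1-D; conv/linear weights are higher-rank.
bias_like = [p for p in net.parameters() if p.ndim <= 1]
non_bias  = [p for p in net.parameters() if p.ndim > 1]

optimizer = torch.optim.SGD([
    {'params': non_bias,  'lr': hyp_opt['non_bias_lr'], 'weight_decay': hyp_opt['non_bias_decay']},
    {'params': bias_like, 'lr': hyp_opt['bias_lr'],     'weight_decay': hyp_opt['bias_decay']},
], momentum=0.85, nesterov=True)  # momentum/nesterov values are placeholders, not the repo's
```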
# i believe the eigenvalues and eigenvectors come out in float32 for this because we implicitly cast it to float32 in the patches function (for numerical stability)
- conv_layer.weight.data[-eigenvectors.shape[0]:, :, :, :] = (eigenvectors/torch.sqrt(eigenvalues+eps))[-shape[0]:, :, :, :] # set the first n filters of the weight data to the top n significant (sorted by importance) filters from the eigenvectors
+ eigenvectors_sliced = (eigenvectors/torch.sqrt(eigenvalues+eps))[-shape[0]:, :, :, :] # the top n significant (sorted by importance) filters from the eigenvectors, used to set the first n filters of the weight data
'project': Conv(whiten_conv_depth, depths['init'], kernel_size=1, norm=2.2), # the norm argument is the power of the norm under which we renormalize this layer's weights to length 1 each step
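To make the whitening-initialization and `norm=2.2` comments above concrete, here is a hedged, self-contained sketch of (a) building whitening filters from the eigendecomposition of the input-patch covariance and copying the most significant ones into a conv layer, and (b) renormalizing a layer's filters to unit length under a fractional p-norm each step. The function names (`get_whitening_filters`, `init_whitening_conv_`, `renorm_weights_`) and defaults are illustrative, not the repo's.

```python
import torch
import torch.nn as nn

def get_whitening_filters(images, kernel_size=2, eps=1e-2):
    # Unfold images into (num_patches, C*k*k) patches and eigendecompose their covariance.
    patches = nn.functional.unfold(images, kernel_size).transpose(1, 2).reshape(-1, images.shape[1] * kernel_size**2)
    patches = patches.float()  # keep the eigendecomposition in float32 for numerical stability
    cov = torch.cov(patches.T)
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)  # ascending order: most significant last
    # Scale each eigenvector by 1/sqrt(eigenvalue) so the conv whitens its input,
    # then reshape into conv-filter form: (num_filters, C, k, k).
    filters = eigenvectors.T / torch.sqrt(eigenvalues + eps)[:, None]
    return filters.reshape(-1, images.shape[1], kernel_size, kernel_size)

def init_whitening_conv_(conv_layer, images, eps=1e-2):
    # Assumes the layer has at most C*k*k output filters.
    filters = get_whitening_filters(images, conv_layer.kernel_size[0], eps)
    n = conv_layer.weight.shape[0]
    # Set the last n filters of the weight data to the n most significant whitening filters.
    conv_layer.weight.data[-n:] = filters[-n:].to(conv_layer.weight)

def renorm_weights_(conv_layer, power=2.2):
    # Renormalize each output filter to unit length under the given p-norm (intended to run each step).
    w = conv_layer.weight.data
    norms = w.flatten(1).norm(p=power, dim=1).clamp_min(1e-12)
    w.div_(norms[:, None, None, None])
```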