RT-Dino #33

# Training script: DinoV2-with-registers backbone, transformer decoder, and classifier-free-guidance FiLM layers

☘️ Shoot an email to sebastian@mbodi.ai if you'd like to tackle this issue and I'll help as often as I can. I can provide A100 access once the script is ready.

- [Starter Code](https://github.com/MbodiAI/mbodied-agents/blob/rt_dino/src/mbodied_agents/scripts/rt_dino.py)
- [Example Doing Identical task but with MaxViT](https://github.com/kyegomez/RT-X/commit/5258e9b8f09b930e6a7918ab2dce44024cc93503)


### Resources
- [Highly-Recommended Guide to Follow](https://karpathy.github.io/2019/04/25/recipe/?fbclid=IwAR0HcxdKV2lPGYGKTZI7x1wkNAnxfp7hvNHAHPJl2CtxAK26w44d76n37ig)
- [Transformer Head Code](https://github.com/lucidrains/x-transformers)
- [DinoV2 Source Code](https://github.com/facebookresearch/dinov2)
- [Text Guidance with FiLM](https://github.com/lucidrains/classifier-free-guidance-pytorch)
- RT-1: Robotics Transformer paper

### Tokenize Actions (x, y, z, roll, pitch, yaw, grasp)

Transform pattern: (b f a) -> (b f a bins), bins=255, where b = batch, f = frames, a = action dimensions

This is just simple classification over bins for each action dimension, not sequence-to-sequence modeling.
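To make the classification framing concrete, here is a minimal sketch of a per-slot loss over bins. It assumes PyTorch and uses plain cross-entropy as a stand-in; the function name `action_bin_loss` is hypothetical, and the asymmetric loss listed under Details could be swapped in later.

```python
import torch
import torch.nn.functional as F


def action_bin_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over bins for every (batch, frame, action dim) slot.

    logits:  (b, f, a, bins) float scores, one per bin
    targets: (b, f, a) int64 bin indices
    """
    bins = logits.shape[-1]
    # Flatten every slot into one row so each is an independent classification.
    return F.cross_entropy(logits.reshape(-1, bins), targets.reshape(-1))


# Usage: uniform logits give a loss of log(bins).
logits = torch.zeros(2, 3, 7, 255)
targets = torch.randint(0, 255, (2, 3, 7))
loss = action_bin_loss(logits, targets)
```

Each of the seven action dimensions is predicted independently per frame, so no autoregressive decoding over action tokens is needed.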

1. Apply [MinMax Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#minmaxscaler)

2. Apply [kbins](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)
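The two steps above can be sketched with scikit-learn as follows. The helper names (`fit_action_tokenizer`, `tokenize_actions`) are hypothetical, and the choice of `strategy="uniform"` with ordinal output is an assumption; the issue only specifies the scaler, the discretizer, and 255 bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

NUM_BINS = 255  # bin count from the transform pattern above


def fit_action_tokenizer(actions: np.ndarray):
    """Fit the scaler and discretizer on actions of shape (b, f, a)."""
    flat = actions.reshape(-1, actions.shape[-1])  # (b * f, a)
    scaler = MinMaxScaler().fit(flat)  # step 1: scale each dim to [0, 1]
    # Step 2: equal-width bins; "ordinal" returns the bin index per value.
    binner = KBinsDiscretizer(n_bins=NUM_BINS, encode="ordinal", strategy="uniform")
    binner.fit(scaler.transform(flat))
    return scaler, binner


def tokenize_actions(actions: np.ndarray, scaler, binner) -> np.ndarray:
    """Map continuous actions (b, f, a) to integer bin indices (b, f, a)."""
    b, f, a = actions.shape
    tokens = binner.transform(scaler.transform(actions.reshape(-1, a)))
    return tokens.astype(np.int64).reshape(b, f, a)


# Usage with random stand-in data: 4 episodes, 6 frames, 7 action dims.
actions = np.random.default_rng(0).normal(size=(4, 6, 7))
scaler, binner = fit_action_tokenizer(actions)
tokens = tokenize_actions(actions, scaler, binner)
```

The integer indices are what a cross-entropy style loss expects as targets. If an explicit `(b f a bins)` one-hot tensor is needed, `np.eye(NUM_BINS)[tokens]` produces it.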

### Apply film layers from classifier-free-guidance

Inference pattern: (b f c h w), str -> (b f a bins)

[Example Doing Identical task but with MaxViT](https://github.com/kyegomez/RT-X/commit/5258e9b8f09b930e6a7918ab2dce44024cc93503)
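For orientation, a FiLM layer scales and shifts visual features using values predicted from a text embedding. The sketch below is a generic, hand-written PyTorch module, not the API of the classifier-free-guidance library linked above; the class name `FiLM`, the zero initialization, and the assumption that the instruction string has already been encoded into a `(b, text_dim)` embedding are all illustrative choices.

```python
import torch
from torch import nn


class FiLM(nn.Module):
    """Feature-wise linear modulation of tokens by a text embedding."""

    def __init__(self, text_dim: int, feature_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(text_dim, feature_dim * 2)
        # Zero init (an assumption): the layer starts as an identity, so a
        # pretrained backbone is not disturbed at the start of training.
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, tokens: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
        # tokens: (b, n, d) patch tokens; text_embed: (b, text_dim)
        scale, shift = self.to_scale_shift(text_embed).chunk(2, dim=-1)
        # Broadcast the per-sample scale and shift over the token axis.
        return tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Usage: modulate 5 tokens of width 32 with a 16-dim text embedding.
film = FiLM(text_dim=16, feature_dim=32)
out = film(torch.randn(2, 5, 32), torch.randn(2, 16))
```

Frames can be folded into the batch axis before the backbone, so the same text embedding is repeated once per frame.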

### Details
- [ ] Use PyTorch Lightning, Hugging Face transformers, or fastai (transformers preferred but fastai likely easiest)
- [ ] Use a **pretrained** DinoV2 ViT with registers, small (ViT-S/14) or large (ViT-L/14)
- [ ] Start with basic encoder-decoder pattern (see the [starter code script](https://github.com/MbodiAI/mbodied-agents/blob/rt_dino/src/mbodied_agents/scripts/rt_dino.py))
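As a rough illustration of the pattern in the checklist above (this is not the starter script), the sketch below wires a per-frame image backbone to a small transformer over frames and a linear head that emits `(b f a bins)` logits. The class name `RTDinoSketch`, the layer sizes, and the use of `nn.TransformerEncoder` as a stand-in for the x-transformers head are assumptions. The `dinov2_vits14_reg` entry point is taken from the DinoV2 repository's torch.hub models; it downloads weights, so it is wrapped in a function and not called here.

```python
import torch
from torch import nn


def load_dinov2_with_registers(name: str = "dinov2_vits14_reg") -> nn.Module:
    # Fetches pretrained weights over the network via torch.hub.
    return torch.hub.load("facebookresearch/dinov2", name)


class RTDinoSketch(nn.Module):
    """Per-frame backbone -> transformer over frames -> per-bin logits."""

    def __init__(self, backbone: nn.Module, embed_dim: int,
                 num_actions: int = 7, num_bins: int = 255,
                 depth: int = 2, heads: int = 4):
        super().__init__()
        self.backbone = backbone  # maps (n, c, h, w) -> (n, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(embed_dim, num_actions * num_bins)
        self.num_actions, self.num_bins = num_actions, num_bins

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = video.shape
        # Fold frames into the batch so the backbone sees single images.
        feats = self.backbone(video.reshape(b * f, c, h, w)).reshape(b, f, -1)
        feats = self.temporal(feats)  # mix information across frames
        logits = self.to_logits(feats)
        return logits.reshape(b, f, self.num_actions, self.num_bins)


# Usage with a tiny stand-in backbone (hypothetical) instead of DinoV2.
dummy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 384))
model = RTDinoSketch(dummy_backbone, embed_dim=384)
logits = model(torch.randn(2, 3, 3, 28, 28))  # (2, 3, 7, 255)
```

With the real backbone, input height and width need to be multiples of the 14-pixel patch size, and `embed_dim` should match the chosen model (384 for the small variant).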

Use the following losses and augmentations:
- [ ] Asymmetric loss: https://timm.fast.ai/asymmetric_loss
- [ ] CutMix / Mixup: https://timm.fast.ai/random_resized_crop
- [ ] Standard Image Augmentations: https://timm.fast.ai/random_resized_crop
- [ ] Repeat with FiLM layers added to ViT blocks and no caption input using: [classifier-free-guidance](https://github.com/lucidrains/classifier-free-guidance-pytorch) (again refer to the MaxViT example above).

### Follow-On Work
- [ ] Ablations with early, middle, late fusion
- [ ] Ablations with DinoV2 frozen, DinoV2 without registers, smaller or larger DinoV2
- [ ] Whiten image inputs with [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)
- [ ] AutoAugment with timm
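For the PCA whitening item above, a minimal sketch using the linked scikit-learn class might look like this. The helper names and the component count are hypothetical. Note that the output lives in PCA component space rather than image layout, so feeding it to a ViT would need an extra step (for example a ZCA-style rotation back to pixel space) that is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_whitener(images: np.ndarray, n_components: int) -> PCA:
    """Fit PCA with whitening on images of shape (n, c, h, w)."""
    flat = images.reshape(images.shape[0], -1)  # (n, c * h * w)
    return PCA(n_components=n_components, whiten=True).fit(flat)


def whiten(images: np.ndarray, pca: PCA) -> np.ndarray:
    """Project images onto decorrelated, unit-variance components."""
    return pca.transform(images.reshape(images.shape[0], -1))


# Usage with random stand-in images: 50 images of shape 3 x 8 x 8.
images = np.random.default_rng(0).normal(size=(50, 3, 8, 8))
pca = fit_whitener(images, n_components=10)
whitened = whiten(images, pca)  # shape (50, 10)
```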
