Vibe coding some different ideas on vit param sharing/recursion/latents#2656
Draft
Some ideas I was exploring over a few days: different ViT recursion / latent approaches. Created some new 'Recursive Supervision' ViT variants and revived the Perceiver arch.
Possibly some ideas worth exploring more...
In this PR there are 3 variants of "Recursive Supervision" ViT: RSViT, RSPViT (latent handling inspired by Perceiver), and RSTViT (update patterns inspired by TinyRecursiveModels).
These models roughly have x (patches), z (latent/hidden state), and y (output representation/proxy, similar to a class token).
There is an inner loop that iterates over the blocks and updates z, with some input/cross-attention from x and the previous y (or initial y), and an outer supervision loop that usually updates y at the end (sometimes y is updated in the inner loop too). At each supervision step the y output is passed through the head and returned. The models differ in how these updates are performed: cross attentions, concatenations, etc.
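The inner/outer loop pattern above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual implementation: the module name, the simple linear blocks, and the additive mixing of x and y (in place of real cross-attention) are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class RecursiveSupervisionSketch(nn.Module):
    """Hypothetical sketch of the x (patches) / z (latent) / y (output proxy) recursion."""

    def __init__(self, dim=64, num_blocks=2, sup_steps=3, num_latents=16, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        self.z_init = nn.Parameter(torch.zeros(1, num_latents, dim))  # learned initial latent
        self.y_init = nn.Parameter(torch.zeros(1, 1, dim))            # learned initial y
        self.y_update = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)
        self.sup_steps = sup_steps

    def forward(self, x):
        # x: (B, num_patches, dim)
        B = x.shape[0]
        z = self.z_init.expand(B, -1, -1)
        y = self.y_init.expand(B, -1, -1)
        outputs = []
        for _ in range(self.sup_steps):       # outer supervision loop
            for blk in self.blocks:           # inner loop over blocks updates z
                # additive mix of x and previous y, standing in for cross-attention
                z = blk(z + x.mean(1, keepdim=True) + y)
            # y updated at the end of each outer step from the latent state
            y = y + self.y_update(z.mean(1, keepdim=True))
            # each supervision step passes y through the head and collects the output
            outputs.append(self.head(y.squeeze(1)))
        return outputs
```

Returning the per-step outputs as a list is what lets a task wrapper apply supervision at every step rather than only on the final one.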
The RecursiveTask calculates the loss with weighting across each step. There's also a 'halting' signal/loss in an attempt to provide a signal for early iteration stopping, though this part is rather half baked and untested (and needs a batching solution)...
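A step-weighted loss of that shape might look like the sketch below. The linear weight ramp (later steps weighted higher) is an assumption for illustration, not the PR's actual weighting scheme, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F


def recursive_step_loss(step_logits, target, weights=None):
    """Weighted sum of per-step classification losses.

    step_logits: list of (B, num_classes) tensors, one per supervision step.
    weights: optional per-step weights; defaults to a linear ramp (assumed, not
    necessarily what the PR's RecursiveTask uses).
    """
    n = len(step_logits)
    if weights is None:
        weights = [(i + 1) / n for i in range(n)]  # assumed: later steps count more
    total = sum(w * F.cross_entropy(logits, target)
                for w, logits in zip(weights, step_logits))
    return total / sum(weights)  # normalize so the scale doesn't depend on step count
```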
Also included in this PR is some fiddling with my original Perceiver arch from a few years ago. It's updated with a Fourier embedding closer to the original, an alternative RoPE that applies to the pixels (or patchified input), and cross inputs to the cross attention.
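For reference, a Perceiver-style Fourier position embedding concatenates sin/cos features over a bank of frequencies with the raw coordinates. A minimal sketch, assuming linearly spaced frequencies and coordinates normalized to [-1, 1] (the PR's exact parameterization may differ):

```python
import torch


def fourier_position_encoding(coords, num_bands=4, max_freq=10.0):
    """Perceiver-style Fourier features.

    coords: (..., d) positions in [-1, 1].
    Returns (..., d * (2 * num_bands + 1)): raw coords plus sin/cos per band.
    """
    # assumed: linear frequency spacing up to the Nyquist rate, as in Perceiver
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)
    scaled = coords.unsqueeze(-1) * freqs * torch.pi          # (..., d, num_bands)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)     # (..., d, 2 * num_bands)
    # concat the raw coordinate alongside its Fourier features, then flatten per-dim
    return torch.cat([coords.unsqueeze(-1), enc], dim=-1).flatten(-2)
```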