feat: implement RAE autoencoder #13046
Conversation
@bytetriper could you take a look?
Nice work @Ando233, checking.
Off the bat, let's sort out these things and then re-look.
Agree with @kashif. Also, if possible, we can bake all the params into the config so we can enable `.from_pretrained()`, which is more elegant and aligns with diffusers usage. I can help convert our released checkpoints to the Hugging Face format afterwards.
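To illustrate why baking every constructor parameter into the config matters: the saved config alone then suffices to rebuild the model, which is what `.from_pretrained()`-style loading relies on. Below is a minimal, dependency-free sketch of that round trip; the class and parameter names are hypothetical, not the actual diffusers `ConfigMixin` implementation.

```python
import json


class ConfigSketch:
    """Minimal stand-in for a config-registering mixin."""

    def register_to_config(self, **kwargs):
        # Record every init parameter so the model can be rebuilt later.
        self._config = dict(kwargs)

    def save_config(self):
        return json.dumps(self._config)

    @classmethod
    def from_config(cls, config_json):
        # The config alone is enough to reconstruct the instance.
        return cls(**json.loads(config_json))


class AutoencoderRAESketch(ConfigSketch):
    # Hypothetical parameters; the real AutoencoderRAE config differs.
    def __init__(self, encoder_type="dinov2", decoder_depth=12):
        self.register_to_config(encoder_type=encoder_type, decoder_depth=decoder_depth)
        self.encoder_type = encoder_type
        self.decoder_depth = decoder_depth


model = AutoencoderRAESketch(encoder_type="siglip2", decoder_depth=8)
rebuilt = AutoencoderRAESketch.from_config(model.save_config())
```

If any parameter were left out of the registered config, the round trip would silently fall back to defaults, which is exactly the failure mode baking all params into the config avoids.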
@Ando233 we're happy to provide assistance if needed.
@Ando233 the one remaining thing is the use of the
@bytetriper could you kindly try to run the conversion scripts and upload the diffusers-style weights to your Hugging Face Hub for the checkpoints you have?
@bytetriper I sent you some fixes to the weights; could you kindly merge them?
@kashif Merged!
sayakpaul left a comment:
Left some comments. Let me know if this makes sense. @bytetriper it would be great if you could also test the diffusers counterparts of RAE and let us know your thoughts.
# AutoencoderRAE
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
sayakpaul left a comment:
Left a major comment regarding the presence of the encoder-specific classes now. LMK your thoughts.
self.model.layernorm.weight = None
self.model.layernorm.bias = None
We're already stripping the layernorms in the conversion. Seems like it's not needed anymore?
self.model.vision_model.post_layernorm.weight = None
self.model.vision_model.post_layernorm.bias = None
logger = logging.get_logger(__name__)

class Dinov2Encoder(nn.Module):
Now, I am a bit confused.
The layernorm-related modifications seem to be the only stuff for which we require these separate encoder classes.
Now that we're doing the layernorm related modifications in the conversion script, do we still need them?
So we still have that since the weights on the Hub have not been updated yet. Once they are updated, then yes, we could use the transformers models directly, but the different models have different forward logic, so we would still need a per-encoder forward dispatch.
> but the different models have different forward logic, so we would still need a per-encoder forward dispatch.
Yeah that is fine. We maintain standalone functions and dispatch accordingly. WDYT?
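The "standalone functions and dispatch accordingly" approach discussed above can be sketched in plain Python. The function and key names here are hypothetical, not the actual diffusers implementation; each per-encoder function would hold that backbone's forward logic.

```python
# Per-encoder forward functions; each would contain the backbone-specific
# logic (e.g. how tokens/pooled features are extracted).
def dinov2_forward(encoder, pixel_values):
    return ("dinov2", pixel_values)


def siglip2_forward(encoder, pixel_values):
    return ("siglip2", pixel_values)


def vit_mae_forward(encoder, pixel_values):
    return ("vit_mae", pixel_values)


# Dispatch table keyed by encoder type, replacing per-encoder wrapper classes.
_ENCODER_FORWARDS = {
    "dinov2": dinov2_forward,
    "siglip2": siglip2_forward,
    "vit_mae": vit_mae_forward,
}


def encode(encoder_type, encoder, pixel_values):
    # Look up and call the matching forward function.
    try:
        forward_fn = _ENCODER_FORWARDS[encoder_type]
    except KeyError:
        raise ValueError(f"Unsupported encoder type: {encoder_type!r}")
    return forward_fn(encoder, pixel_values)
```

This keeps a single model class and isolates the only part that genuinely differs between backbones, which is the design direction suggested in the review.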
What does this PR do?
This PR adds a new representation autoencoder implementation, AutoencoderRAE, to diffusers.
Implements diffusers.models.autoencoders.autoencoder_rae.AutoencoderRAE with a frozen pretrained vision encoder (DINOv2 / SigLIP2 / ViT-MAE) and a ViT-MAE style decoder.
The decoder implementation is aligned with the RAE-main GeneralDecoder parameter structure, enabling loading of existing trained decoder checkpoints (e.g. model.pt) without key mismatches when encoder/decoder settings are consistent.
Adds unit/integration tests under diffusers/tests/models/autoencoders/test_models_autoencoder_rae.py.
Registers exports so users can import directly via from diffusers import AutoencoderRAE.
Fixes #13000
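The claim above that aligned parameter structure lets existing decoder checkpoints load "without key mismatches" boils down to strict state-dict key matching. A minimal sketch, with hypothetical key names, of what strict loading checks:

```python
def load_state_dict_strict(model_keys, checkpoint):
    """Return checkpoint entries for model_keys, failing on any mismatch."""
    missing = sorted(set(model_keys) - set(checkpoint))
    unexpected = sorted(set(checkpoint) - set(model_keys))
    if missing or unexpected:
        # Mirrors the strict-loading failure mode the PR description avoids.
        raise KeyError(f"missing={missing}, unexpected={unexpected}")
    return {k: checkpoint[k] for k in model_keys}


# Hypothetical parameter names, for illustration only.
model_keys = ["decoder.blocks.0.attn.qkv.weight", "decoder.norm.weight"]
ckpt = {"decoder.blocks.0.attn.qkv.weight": 1, "decoder.norm.weight": 2}
loaded = load_state_dict_strict(model_keys, ckpt)
```

Because the diffusers decoder mirrors the RAE-main `GeneralDecoder` parameter structure, the two key sets coincide and strict loading of a released `model.pt` succeeds when encoder/decoder settings are consistent.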
Usage
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.