Hacker Remix

Janus: Decoupling visual encoding for multimodal understanding and generation

35 points by jinqueeny 3 days ago | 3 comments

josh-sematic 2 days ago

Interesting! It seems to me that there would be a tradeoff between specialist subsystems (which allow you to excel at the specialized tasks but can't handle things outside the specialization well) and generalized subsystems (which allow you to integrate information across multiple specializations but may not be great at any of them). Ultimately you likely need a mix of both, but it's not obvious to me how you would identify when it will be beneficial to "hard code" separations between subsystems (as is done here for image generation & encoding) vs. when the model should be left to "figure it out" during training and implicitly develop the appropriate subsystems within the network.

wiz21c 2 days ago

The online demo returns "Error" :-( My input was a picture, and the question was "What is written on that screenshot?"

jadbox 2 days ago

Does anyone know how Janus compares with the rhymes-ai Aria model?