© 2025 Recursal AI
Thanks Eugene, appreciate the comment. This paper covers initialisation and simplification and may be of interest: https://arxiv.org/abs/2311.01906 That's easier said than done, though, because it's clearly risky to just change the architecture.
Regarding longer range, I see there's a linear drop-off in the information with context (which is then placed in an exponent), but I guess the later layers have non-linearities and are able to compensate for that decline and somehow keep important information in "memory"?
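To make that concrete, here's a minimal sketch (my own illustration, not a formulation from the paper) of the point about the exponent: if the decay accumulates linearly with distance inside an exponent, the contribution of older tokens falls off exponentially overall.

```python
import math

def retention(t: int, w: float = 0.1) -> float:
    """Contribution of a token t steps back in the context, assuming a
    hypothetical per-step decay rate w that accumulates linearly with
    distance inside the exponent: exp(-w * t)."""
    return math.exp(-w * t)

# Linear growth of w*t inside the exponent means exponential decay overall,
# so older tokens contribute much less than recent ones.
recent, middle, far = retention(1), retention(10), retention(100)
```

So even a modest per-step decay compounds quickly over long ranges, which is why I'd expect the later layers to need some mechanism to re-amplify or re-inject whatever information still matters.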