1 Comment

Thanks Eugene, appreciate the comment. This paper covers initialisation and simplification and may be of interest: https://arxiv.org/abs/2311.01906. Easier said than done, though, since it's clearly risky to just change the architecture.

Regarding longer range, I see there's a linear drop-off in the information with context (which is then placed in an exponent), but I guess the later layers have non-linearities and are able to compensate for that decline, somehow keeping important information in "memory"?
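
To make the "linear, then placed in an exponent" point concrete, here is a rough sketch. It assumes the drop-off is a distance penalty subtracted from the attention logits before the softmax (in the style of ALiBi), which the thread does not state explicitly. The function name and the `slope` value are hypothetical, chosen only for illustration. With all content scores set equal, the penalty alone makes the weight shrink by a constant factor per step of distance, i.e. exponentially.

```python
import math

def distance_penalised_weights(seq_len, slope=0.5):
    """Softmax weights for the last query position, assuming equal content
    scores and a linear distance penalty (hypothetical slope) on the logits."""
    # Distance from the last token to each key position j.
    distances = [seq_len - 1 - j for j in range(seq_len)]
    # Linear drop-off in the logit...
    logits = [-slope * d for d in distances]
    # ...which the softmax exponentiates (max subtracted for stability).
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

weights = distance_penalised_weights(8)
# Ratio between neighbouring weights is constant: exp(-slope).
ratios = [weights[j] / weights[j + 1] for j in range(len(weights) - 1)]
```

In a real model the content scores are not equal, so a sufficiently large query-key score can outweigh the penalty at long range. That would be one way for distant information to survive, separate from anything the later layers do.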
