This is huge: the implication is that transformers implement cascades of "if-then-else" decisions, with the actions of these decisions being the optimal tokens for the task solved by the next layer. If this is what they actually do, it is fair to ask what optimality means in the context of specific tasks (the preprint partially addresses this) and whether this optimal embedding can be computed with more efficient approaches.
arxiv.org/abs/2308.16898

#transformers #supportvectorregressors #svms

Last updated 1 year ago