xLSTM: The inventors of LSTM are presenting a Transformer contender.

Has Sepp Hochreiter done it again? After months of announcements, the team around the inventor of the LSTM has finally published a paper presenting xLSTM to the world.

Until the appearance of the Transformer in 2017, LSTM had been the go-to technology for a wide variety of sequence-related tasks, including text generation. Three limitations

  1. the inability to revise storage decisions,
  2. limited storage capacity, and
  3. the need for sequential rather than parallel processing,

relegated LSTMs to second place behind Transformers.

The group proposes two new types of LSTM memory cells, dubbed sLSTM and mLSTM (see the figure from the original paper below). Both sLSTM and mLSTM are placed in residual blocks (adding skip connections, similar to Transformers). These blocks can be stacked in various combinations and thus constitute the complete xLSTM architecture; a minimal sketch of such a stack follows below.
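
To make the block structure concrete, here is a minimal PyTorch-style sketch of a pre-LayerNorm residual block wrapping either cell type, and of stacking such blocks in a chosen pattern. The module names, factory functions, and block pattern are illustrative assumptions; the paper's actual blocks include additional projections and details that are omitted here.

```python
import torch
import torch.nn as nn

class xLSTMBlock(nn.Module):
    """Residual (skip-connection) block around an sLSTM or mLSTM layer."""

    def __init__(self, d_model: int, cell: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cell = cell  # any module mapping (batch, time, d_model) -> same shape

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual connection, similar to Transformer blocks.
        return x + self.cell(self.norm(x))

def build_xlstm_stack(d_model: int, pattern: str, make_slstm, make_mlstm) -> nn.Sequential:
    # A pattern such as "msmm" picks an mLSTM or sLSTM cell per block position,
    # so the two block types can be combined in arbitrary ratios.
    blocks = [
        xLSTMBlock(d_model, make_mlstm(d_model) if c == "m" else make_slstm(d_model))
        for c in pattern
    ]
    return nn.Sequential(*blocks)
```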

[Figure from the original paper.]
  • The input and forget gates in the sLSTM cell become exponential, equipping it with the ability to revise storage decisions (a simplified sketch of this gating follows after this list). As in normal LSTMs, an sLSTM layer can have multiple memory cells, allowing memory mixing. sLSTM, however, can also have multiple heads, again a Transformer idea injected into LSTMs. Memory mixing across heads is not possible.
  • Instead of a scalar, the mLSTM memory cell is a matrix, for enhanced storage capacity. For retrieval, mLSTM adopts the query, key, and value vector concept from Transformers (see the matrix-memory sketch below). Consequently, there is no memory mixing, but multiple heads are possible here as well.
  • The memory mixing in sLSTM requires sequential calculations, ruling out parallelization. Sepp Hochreiter's team does propose a fast CUDA kernel, but the speed handicap remains.
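
The exponential gating can be illustrated in a few lines. Below is a minimal, single-cell sketch under simplifying assumptions (exponential forget gate, gate pre-activations taken as given, no recurrent memory mixing or multiple heads); the stabilizer state m keeps the exponentials numerically bounded. It follows the spirit of the paper's formulation, not its reference code.

```python
import torch

def slstm_step(i_tilde, f_tilde, z_tilde, o_tilde, state):
    """One sLSTM time step; all gate pre-activations and states have shape (batch, d)."""
    c, n, m = state  # cell state, normalizer state, stabilizer state

    # Stabilize the exponential gates by tracking the running maximum in log space.
    m_new = torch.maximum(f_tilde + m, i_tilde)
    i_gate = torch.exp(i_tilde - m_new)        # exponential input gate (stabilized)
    f_gate = torch.exp(f_tilde + m - m_new)    # exponential forget gate (stabilized)

    c_new = f_gate * c + i_gate * torch.tanh(z_tilde)  # update cell state
    n_new = f_gate * n + i_gate                        # update normalizer
    h = torch.sigmoid(o_tilde) * (c_new / n_new)       # normalized, output-gated hidden state
    return h, (c_new, n_new, m_new)
```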

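A similarly minimal, single-head sketch of the mLSTM update: the cell state is a d-by-d matrix written via an outer product of value and key and read out with a query. Scalar gates are taken as given and the stabilizer is omitted for brevity; this illustrates the idea rather than the paper's implementation.

```python
import torch

def mlstm_step(q, k, v, i_gate, f_gate, o_gate, state):
    """One mLSTM time step.
    q, k, v: (batch, d); i_gate, f_gate: (batch, 1); o_gate: (batch, d);
    state: (C, n) with C of shape (batch, d, d) and n of shape (batch, d).
    """
    C, n = state
    k = k / k.shape[-1] ** 0.5                       # scale keys, as in attention

    # Write: decay the matrix memory and add the value-key outer product.
    C_new = f_gate.unsqueeze(-1) * C + i_gate.unsqueeze(-1) * torch.einsum("bd,be->bde", v, k)
    n_new = f_gate * n + i_gate * k                  # normalizer vector

    # Read: query the matrix memory and normalize the result.
    num = torch.einsum("bde,be->bd", C_new, q)
    den = torch.einsum("bd,bd->b", n_new, q).abs().clamp(min=1.0).unsqueeze(-1)
    h = o_gate * num / den                           # gated retrieval
    return h, (C_new, n_new)
```
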
In the experimental section, xLSTM is pitched against other methods, most notably Transformers. Overall, xLSTM compares favorably on many tasks, including large language modeling. Ablation studies show that both components, sLSTM and mLSTM, contribute to the improvement over the regular LSTM. Another important observation is that xLSTM scales well, so its performance is not limited to smaller datasets. Speed benchmarks are not reported.

It will be interesting to see to what extent xLSTM gains traction, and what, in business speak, its key success factors will be.

