MultiHub Forum

Memory-efficient long-document transformers: architecture choices and libraries
I'm implementing a transformer-based model for a custom NLP task involving long documents, and I'm running into memory limits because standard self-attention scales quadratically with sequence length. I've read about efficient variants such as Longformer and BigBird, but I'm unsure how to adapt those architectures, or whether a simpler approach like hierarchical attention would be more practical for my dataset. For those who have worked with long-sequence transformers: which architectural modifications or libraries did you find most effective for reducing memory and compute without sacrificing too much performance on context-dependent tasks?
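To put rough numbers on the trade-off, here is a back-of-the-envelope sketch in plain Python. It compares the size of the attention score matrix for full attention against a sliding-window pattern of the kind Longformer uses, and shows one simple way to cut a document into overlapping chunks for a hierarchical setup. The function names, the 12 heads, float32 scores, the 512-token window, and the chunk sizes are all illustrative assumptions, not values from any particular model or library, and the estimate ignores global-attention tokens, activations, and fused kernels that never materialize the full matrix.

```python
def attention_score_bytes(seq_len, num_heads=12, bytes_per_value=4, window=None):
    """Bytes for one layer's attention score matrix, for a single example.

    window=None models full self-attention: seq_len x seq_len scores per head.
    An integer window models sliding-window attention, where each token
    attends to at most `window` tokens (clamped to seq_len).
    All defaults are assumptions chosen for illustration.
    """
    keys_per_query = seq_len if window is None else min(window, seq_len)
    return num_heads * seq_len * keys_per_query * bytes_per_value


def split_into_chunks(token_ids, chunk_len, stride):
    """Split a token list into overlapping chunks (hypothetical helper).

    In a hierarchical setup each chunk would be encoded separately and the
    per-chunk representations combined by a second, much shorter model.
    Consecutive chunks overlap by chunk_len - stride tokens.
    """
    if chunk_len <= 0 or stride <= 0:
        raise ValueError("chunk_len and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + chunk_len])
        if start + chunk_len >= len(token_ids):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


if __name__ == "__main__":
    mib = 1024 * 1024
    full = attention_score_bytes(4096)
    windowed = attention_score_bytes(4096, window=512)
    print(f"full attention:   {full / mib:.0f} MiB per layer")      # 768 MiB
    print(f"window of 512:    {windowed / mib:.0f} MiB per layer")  # 96 MiB
    print(len(split_into_chunks(list(range(4096)), chunk_len=512, stride=384)))
```

Under these assumptions, doubling the sequence length quadruples the full-attention figure but only doubles the windowed one, which is the main reason the sparse variants help. The chunking route avoids the long sequence entirely, at the cost of losing direct token-to-token attention across chunk boundaries.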