MultiHub Forum

I'm implementing a transformer-based model for a custom NLP task involving long documents, and I'm running out of memory because standard self-attention scales quadratically with sequence length. I've read about efficient variants like Longformer and BigBird, but I'm unsure how to adapt these architectures, or whether a simpler approach like hierarchical attention would be more practical for my dataset. For those who have worked with long-sequence transformers: which architectural modifications or libraries did you find most effective for reducing memory and compute without losing too much performance on context-dependent tasks?
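
To make the memory issue concrete, here is a rough back-of-the-envelope sketch of how the attention score matrix grows with sequence length, compared with a local (sliding-window) pattern of the kind Longformer uses. The function name, the 12 heads, the 4-byte (fp32) values, and the 512-token window are all illustrative assumptions rather than measurements from a real model, and the estimate covers only the score matrix for one layer and one example, not activations, gradients, or global attention tokens.

```python
def attention_scores_bytes(seq_len, num_heads=12, bytes_per_value=4, window=None):
    """Approximate bytes for one layer's attention score matrix, per example.

    Full attention stores seq_len x seq_len scores per head. With a local
    window, each query attends to at most `window` key positions instead.
    All defaults are illustrative assumptions, not a real model's settings.
    """
    keys_per_query = seq_len if window is None else min(window, seq_len)
    return num_heads * seq_len * keys_per_query * bytes_per_value


if __name__ == "__main__":
    for n in (512, 4096, 16384):
        full = attention_scores_bytes(n)
        local = attention_scores_bytes(n, window=512)
        print(f"n={n:6d}  full={full / 2**20:10.1f} MiB  "
              f"window-512={local / 2**20:8.1f} MiB")
```

Under these assumptions, 4096 tokens cost about 768 MiB for full attention versus 96 MiB with a 512-token window, and the gap widens as the full-attention cost grows with the square of the length while the windowed cost grows linearly.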

Lily.M