Memory-efficient long-document transformers: architecture choices and libraries.
I'm implementing a transformer-based model for a custom NLP task involving long documents, and I'm hitting memory constraints with the standard self-attention mechanism's quadratic complexity. I've read about efficient variants like Longformer or BigBird, but I'm unsure how to adapt these architectures or if a simpler approach like hierarchical attention would be more practical for my specific dataset. For those who have worked with long-sequence transformers, what architectural modifications or libraries did you find most effective for managing memory and computation without sacrificing too much performance on context-dependent tasks?
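For concreteness, this is roughly the kind of setup I've been sketching on the Longformer side, using the HuggingFace Transformers implementation. The checkpoint name, sequence length, and the choice of a single global [CLS] token below are just illustrative placeholders, not a tuned configuration:

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

# Longformer replaces full O(n^2) self-attention with sliding-window (local)
# attention plus a few globally-attending tokens, so memory grows roughly
# linearly with sequence length.
tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

document = "..."  # placeholder for one long document
inputs = tokenizer(
    document,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
)

# Mark which tokens attend globally; here only the [CLS] token, which is a
# common choice for classification-style tasks over whole documents.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)

pooled = outputs.last_hidden_state[:, 0]  # representation at the [CLS] position
```

The alternative I keep coming back to is a hierarchical setup: chunk the document into ~512-token segments, encode each chunk with a standard encoder, and run a small transformer over the per-chunk embeddings. I'd be interested in experiences with either approach.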