12-25-2025, 09:46 AM
I'm a machine learning engineer working on a document summarization project, and I'm trying to decide between fine-tuning a pre-trained transformer like BART or T5 and building a custom architecture from scratch. Our dataset is domain-specific and relatively small. For those who have implemented transformer models for similar NLP tasks, what factors led you to choose one approach over the other? I'm particularly concerned about the trade-off between the computational cost of fine-tuning a large model and the performance limitations of a smaller custom transformer. I'd also like to know whether techniques like knowledge distillation or parameter-efficient fine-tuning are viable for production systems where inference speed is critical.