
From the paper: "Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B)."
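
For anyone wondering how a 269B-parameter sparse model can cost about as much to run as a 32B dense one: in a Mixture-of-Experts layer only the router-selected expert(s) run for each token, so total parameters grow with the number of experts while per-token FLOPs stay roughly constant. Here's a minimal top-1 routing sketch in numpy; all sizes and names are illustrative, not taken from the paper:

    # Illustrative sketch (not from the paper): top-1 expert routing.
    # Parameters scale with num_experts, but each token only passes
    # through one expert's FFN, so per-token compute stays ~constant.
    import numpy as np

    rng = np.random.default_rng(0)

    d_model, d_ff, num_experts = 64, 256, 8       # hypothetical sizes
    experts_w_in  = rng.normal(size=(num_experts, d_model, d_ff))
    experts_w_out = rng.normal(size=(num_experts, d_ff, d_model))
    router_w      = rng.normal(size=(d_model, num_experts))

    def moe_layer(tokens):
        """Route each token to its top-1 expert; only that expert's FFN runs."""
        logits = tokens @ router_w                # (n_tokens, num_experts)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        chosen = probs.argmax(-1)                 # top-1 expert per token
        out = np.empty_like(tokens)
        for e in range(num_experts):
            idx = np.where(chosen == e)[0]
            if idx.size == 0:
                continue
            h = np.maximum(tokens[idx] @ experts_w_in[e], 0.0)  # ReLU FFN
            out[idx] = (h @ experts_w_out[e]) * probs[idx, e:e + 1]
        return out

    tokens = rng.normal(size=(16, d_model))
    print(moe_layer(tokens).shape)                # (16, 64)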

