Microsoft introduces the 2.7B-parameter language model Phi-2


Jan 1, 2024 - 12:50

Microsoft introduces Phi-2, a groundbreaking 2.7-billion-parameter model that redefines the standards for reasoning and language understanding. This model sets a new performance benchmark among base language models with fewer than 13 billion parameters, surpassing even larger models.

Building on the success of its predecessors, Phi-1 and Phi-1.5, Phi-2 achieves a remarkable feat by matching or outperforming models up to 25 times its size. This achievement is attributed to innovations in model scaling and meticulous training data curation.

Phi-2's compact size positions it as an ideal playground for researchers, enabling exploration in mechanistic interpretability, safety enhancements, and fine-tuning experiments across diverse tasks.
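
For a sense of how accessible that playground is, here is a minimal sketch of loading the model from the Hugging Face Hub, where it is published as microsoft/phi-2. The exact arguments may vary across transformers versions, and device_map="auto" additionally requires the accelerate package.

```python
# Minimal sketch: loading Phi-2 for experimentation via Hugging Face transformers.
# Exact flags (dtype handling, device placement) may vary with library version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,  # 2.7B parameters fit on a single consumer GPU in fp16
    device_map="auto",          # requires the accelerate package
)

prompt = "Alice has 3 apples and buys 2 more. How many apples does she have?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```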

The success of Phi-2 is rooted in two crucial elements.

Training Data Quality: Microsoft underscores the importance of high-quality training data in shaping model performance. Phi-2 leverages "textbook-quality" data, incorporating synthetic datasets designed to instill common-sense reasoning and general knowledge. The training corpus is enriched with carefully selected web data, filtered for educational value and content quality (a hypothetical filtering sketch follows the second element below).

Innovative Scaling Techniques: Microsoft scales Phi-2 up from its predecessor, Phi-1.5, through knowledge transfer: reusing what the 1.3-billion-parameter model has learned expedites training convergence and yields a substantial improvement in benchmark scores.
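
Microsoft has not released its curation pipeline, so the following is only a hypothetical sketch of the filtering idea: score each web document for educational value and keep those above a threshold. The educational_score heuristic and the threshold are illustrative assumptions, not Microsoft's actual tooling.

```python
# Hypothetical sketch of educational-value filtering for web data.
# The scorer and threshold are illustrative assumptions; Microsoft has
# not published its actual curation pipeline.
from typing import Iterable, Iterator

def educational_score(text: str) -> float:
    """Placeholder quality scorer in [0, 1]. In practice this would be a
    trained classifier rating how 'textbook-like' a document is."""
    # Toy heuristic: reward longer, prose-like documents.
    words = text.split()
    return min(len(words) / 500.0, 1.0)

def filter_corpus(docs: Iterable[str], threshold: float = 0.8) -> Iterator[str]:
    """Keep only documents whose quality score clears the threshold."""
    for doc in docs:
        if educational_score(doc) >= threshold:
            yield doc
```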
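
As for the knowledge transfer mentioned in the second element, Microsoft does not detail the mechanism. One common way to warm-start a larger model from a smaller checkpoint is to copy each overlapping weight slice from the small model's state dict into the large one before training; the sketch below is purely an assumption about how such a transfer could look, not Phi-2's actual procedure.

```python
# Hypothetical sketch of warm-starting a larger model from a smaller one.
# Copying the overlapping slice of each matching weight tensor is just one
# plausible scheme; Microsoft has not described its transfer method.
import torch

def warm_start(large_state: dict, small_state: dict) -> dict:
    """Copy each small-model tensor into the leading slice of the
    corresponding large-model tensor, leaving the rest at init values."""
    for name, small_w in small_state.items():
        if name not in large_state:
            continue
        large_w = large_state[name]
        if large_w.dim() != small_w.dim():
            continue
        # Slice covering the overlapping region in every dimension.
        idx = tuple(slice(0, min(l, s)) for l, s in zip(large_w.shape, small_w.shape))
        large_w[idx] = small_w[idx]
    return large_state
```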

Phi-2 undergoes rigorous evaluation across various benchmarks, including BIG-Bench Hard, commonsense reasoning, language understanding, math, and coding. Despite having only 2.7 billion parameters, Phi-2 outshines larger models, including Mistral and Llama-2, and matches or exceeds the performance of Google's recently announced Gemini Nano 2.

As a Transformer-based model with a next-word prediction objective, Phi-2 is trained on 1.4 trillion tokens from synthetic and web datasets. Training took 14 days on 96 A100 GPUs. Although Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback, Microsoft claims it shows better behavior with respect to toxicity and bias than existing open-source models.
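
The next-word prediction objective referenced above is the standard causal language-modeling loss: the logits at position t are scored against the token at position t+1, and the cross-entropy is minimized. A minimal, model-agnostic sketch:

```python
# Minimal sketch of the next-word prediction (causal LM) objective:
# predictions at position t are scored against the token at position t+1.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); input_ids: (batch, seq_len)."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy usage with random tensors standing in for a real model's output.
logits = torch.randn(2, 16, 51200)     # vocab size of ~51.2k, as in Phi-2's config
ids = torch.randint(0, 51200, (2, 16))
print(causal_lm_loss(logits, ids))
```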

With the unveiling of Phi-2, Microsoft continues to push the boundaries of what smaller base language models can achieve, showcasing unparalleled advancements in reasoning and language understanding.