New SAGE Benchmark Reveals Gaps in Semantic Understanding
A new benchmark, SAGE, has been introduced to comprehensively evaluate both embedding models and classical similarity metrics on semantic understanding, encouraging a more balanced approach to assessing these technologies.
SAGE evaluates models under realistic, adversarial conditions, applying noisy transformations and nuanced human judgments across more than 30 datasets. This departs from previous evaluations, which often relied on idealized conditions, meaning published scores are best treated as upper bounds on real-world performance.
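The article does not spell out SAGE's exact corruption pipeline, but the kind of noisy transformation described might look like the following sketch, which injects random character-level typos into a sentence before any similarity score is computed. The function and parameters are illustrative, not taken from the benchmark itself.

```python
import random

def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt a sentence by randomly swapping adjacent letters.

    A toy stand-in for the noisy transformations a robustness benchmark
    might apply before measuring similarity.
    """
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

clean = "The quick brown fox jumps over the lazy dog."
noisy = add_typos(clean, rate=0.15)
print(noisy)  # e.g. "The quikc brwon fox jumps voer the lazy dog."
```

A robust model should assign the clean and noisy variants nearly the same similarity to a reference sentence; the benchmark's finding is that this often does not hold.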
The study found that while embedding models generally outperform classical similarity metrics on tasks requiring deep semantic understanding, no single method excels across all dimensions. For instance, OpenAI's text-embedding-3-large achieved the highest overall SAGE score, but even it failed over 60% of the time under noisy conditions. Meanwhile, classical metrics like Jaccard similarity showed strengths in specific areas, notably information sensitivity tasks.
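For context, Jaccard similarity is a purely lexical measure: it compares token sets, so paraphrases with little word overlap score low even when their meaning matches, while embedding models compare dense vectors. A minimal sketch of the contrast follows; the embedding vectors are placeholders, not output from any model in the study.

```python
import numpy as np

def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = "a cat sat on the mat"
s2 = "a feline rested on the rug"
print(jaccard(s1, s2))  # ~0.33: modest token overlap despite similar meaning

# Placeholder vectors standing in for model embeddings of s1 and s2.
e1, e2 = np.array([0.8, 0.1, 0.6]), np.array([0.7, 0.2, 0.65])
print(cosine(e1, e2))   # high by construction; a good model captures the paraphrase
```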
The research also highlighted trade-offs. One embedding model, EmbeddingGemma, noted for its efficiency and support for up to 100 languages, did not top the rankings. However, its small footprint, requiring less than 200 MB of memory, could make it well suited to many tasks, particularly on-device applications.
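In practice, an on-device workflow with a small embedding model can be only a few lines; the sketch below assumes the sentence-transformers library and a model identifier of the form shown, neither of which is a detail reported in the study.

```python
# Minimal on-device embedding sketch, assuming sentence-transformers is
# installed and the model ID below is available on the Hugging Face Hub.
from sentence_transformers import SentenceTransformer

MODEL_ID = "google/embeddinggemma-300m"  # assumed identifier for a small model
model = SentenceTransformer(MODEL_ID)

sentences = ["turn on the kitchen lights", "switch the kitchen lamp on"]
embeddings = model.encode(sentences, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
score = float(embeddings[0] @ embeddings[1])
print(f"similarity: {score:.3f}")
```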
The SAGE benchmark reveals significant performance gaps in current approaches to semantic understanding. It underscores the need for future evaluations to incorporate a wider range of real-world corruptions, greater data diversity, and practical constraints. This will help drive the development of more robust and effective semantic understanding technologies.