Evaluating the performance of AI systems and foundation models—including large language models (LLMs), vision-language models (VLMs), and domain-specific models—requires a comprehensive set of metrics. These metrics span technical accuracy, computational efficiency, robustness, alignment, and application-specific outcomes. Drawing on resources and papers from the "Papers - Computation & Investing" library, this report outlines the key metrics and benchmarks used to assess AI systems and modern foundation models, with a focus on both general AI evaluation and specialized domains such as finance, investing, and computational science.

1. Technical Accuracy and Task Performance

1.1. Standard Accuracy Metrics

For classification, regression, and ranking tasks, AI models are commonly evaluated using metrics such as:

- Classification: accuracy, precision, recall, F1 score, and AUC-ROC.
- Regression: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R².
- Ranking: normalized discounted cumulative gain (NDCG) and mean reciprocal rank (MRR).
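To make these concrete, the sketch below computes several of the classification and regression metrics with scikit-learn. The label and prediction arrays are illustrative placeholders, not results from any benchmark or paper in the library.

```python
# Sketch: standard classification and regression metrics via scikit-learn.
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, mean_squared_error, mean_absolute_error, r2_score,
)

# Classification: true labels, hard predictions, and positive-class probabilities.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))  # uses probabilities, not hard labels

# Regression: continuous targets and predictions.
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
print("MAE :", mean_absolute_error(y_true_reg, y_pred_reg))
print("R²  :", r2_score(y_true_reg, y_pred_reg))
```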

1.2. Specialized Benchmarks

Foundation models are evaluated on established benchmarks that test a variety of capabilities:

- MMLU: multiple-choice questions spanning dozens of academic and professional subjects, testing breadth of knowledge.
- GSM8K and MATH: grade-school and competition-level mathematical reasoning.
- HumanEval and MBPP: functional correctness of generated code.
- HellaSwag and ARC: commonsense and scientific reasoning.
- TruthfulQA: resistance to reproducing common misconceptions.
- MMMU and similar multimodal suites: joint image-and-text understanding for vision-language models.
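Most of these benchmarks reduce to exact-match accuracy over a set of items. The sketch below shows that scoring loop in its simplest form; `model_choice` is a hypothetical stand-in for whatever inference call a given stack provides, and the two items are made-up examples rather than entries from any real benchmark.

```python
# Sketch: exact-match accuracy for a multiple-choice benchmark.
from typing import Callable

def benchmark_accuracy(
    items: list[dict],
    model_choice: Callable[[str, list[str]], str],  # hypothetical inference call
) -> float:
    """Fraction of items where the model's chosen option matches the answer key."""
    correct = 0
    for item in items:
        prediction = model_choice(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(items)

if __name__ == "__main__":
    items = [
        {"question": "2 + 2 = ?", "options": ["3", "4", "5"], "answer": "4"},
        {"question": "Capital of France?", "options": ["Lyon", "Paris"], "answer": "Paris"},
    ]
    # Trivial baseline that always picks the first option.
    baseline = lambda question, options: options[0]
    print(f"baseline accuracy: {benchmark_accuracy(items, baseline):.2f}")
```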

1.3. LLM-Specific Output Metrics

For generative models, especially LLMs, new evaluation frameworks and metrics have emerged:

- Perplexity: the exponentiated average negative log-likelihood of held-out text, measuring how well the model predicts it.
- Reference-based scores: n-gram overlap metrics such as BLEU and ROUGE, and embedding-based metrics such as BERTScore, for comparing generations against reference texts.
- LLM-as-a-judge protocols, in which a strong model grades or pairwise-ranks outputs; aggregated win rates and Elo-style ratings (as in Chatbot Arena) summarize the comparisons.
- Hallucination and faithfulness rates, measuring how often generated claims are unsupported by the prompt, source documents, or ground truth.
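Perplexity is the most mechanical of these to compute. It is defined as PPL = exp(-(1/N) Σ log p_i) over the per-token probabilities the model assigns; the sketch below derives it from a list of token log-probabilities, which most inference APIs can return (the sample values here are made up for illustration).

```python
# Sketch: perplexity from per-token log-probabilities.
# PPL = exp(-(1/N) * sum(log p_i)); lower means the model predicts the text better.
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Exponentiated average negative log-likelihood of a token sequence."""
    n = len(token_logprobs)
    avg_neg_log_likelihood = -sum(token_logprobs) / n
    return math.exp(avg_neg_log_likelihood)

# Illustrative (made-up) natural-log probabilities for a 5-token sequence.
logprobs = [-0.3, -1.2, -0.8, -2.1, -0.5]
print(f"perplexity: {perplexity(logprobs):.2f}")  # exp(0.98) ≈ 2.66
```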

2. Computational Efficiency and Cost