Why MLPerf

Harvard MLPerf Research


Deep learning is transforming machine learning (ML) from theory to practice. It has also sparked a renaissance in computer system design. Both academia and industry are scrambling to integrate ML-centric designs into their products. Yet despite the breakneck pace of innovation, a crucial issue affects the research and industry communities at large: how to enable fair and useful benchmarking of ML software frameworks, ML hardware accelerators, and ML systems. The ML field requires responsible and ethical benchmarking standards that are both representative of real-world use cases and useful for making fair comparisons across different software and hardware platforms.

Born in part at Harvard, MLPerf answers the call.

The goals of Harvard MLPerf Research are as follows:

  • Investigate new opportunities to guide researchers and industry at large on open-source benchmarking of ML systems.
  • Research ML software frameworks and ML hardware accelerator designs for IoT, edge, and cloud computing systems.
  • Develop new ML datasets that accurately capture real industry workloads and improve the ML experience.
  • Deploy responsible AI systems by understanding fairness and security issues when bridging systems and policy.


ML Benchmarks


Benchmarks are at the heart of building fair and useful systems. SPEC and TPC-C are long-standing examples of good benchmark suites that are widely adopted for research in both academia and industry. The launch and adoption of those benchmarks coincided with the golden age of innovation in microprocessor performance and transaction processing, respectively. In that vein, to usher in a new era of ML hardware accelerators and software systems, the ML systems-building community needs new benchmarks, such as MLPerf. Drawing upon the lessons of past benchmarks, we strive to develop benchmarks that achieve the following goals:

  • Accelerate progress in ML via fair and empirically sound measurement
  • Serve both the commercial and academic communities
  • Keep benchmarking affordable to foster participation
  • Encourage reproducibility to ensure reliable results

ML System Research Challenges


Machine learning is a complicated field, and studying machine learning systems raises issues of its own. Performance, fairness, quality, and other metrics all matter, and the art of studying ML systems carefully is itself an important topic. Consider, for example, the following three challenges that researchers face when studying machine learning systems.

  1. There are many ML models in production use; which do we pick as representative? ML use cases are diverse, and so are their users. A single use case, such as object detection, can be realized with many different models, datasets, and quality targets. So a key challenge for MLPerf is to capture these diverse characteristics while remaining easy to use.
  2. Fair performance metrics for ML systems are convoluted, so how do we study them? Implementation choices such as batch size, optimizer hyperparameters, and numerical precision interact with the inherently stochastic nature of machine learning, and all of them affect the end goal of good performance. Therefore, we need to understand this complex design space and learn how to study systems carefully (a measurement sketch follows this list).
  3. There are over a dozen ML frameworks. Are they all the same? Can a model imported from one framework into another be identical, so that benchmarking any framework amounts to the same measurement? No! Implementation differences across frameworks and inference engines make results hard to compare, and there is no consensus on how to implement critical functionality. One example is quantization, which comes in different forms: uniform affine versus symmetric quantization, post-training versus retraining quantization, probabilistic quantization, and so on. Each can have a different performance impact, so understanding these nuances is essential to determine whether claimed hardware improvements come from hardware innovation or from clever software tricks (a quantization sketch also follows this list). Hence, we need standards for issues such as quantization and sparsity optimizations to streamline benchmarking of ML systems.
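
To make the measurement point in challenge 2 concrete, here is a minimal, hypothetical Python sketch (not MLPerf code; train_to_target, the toy learning curve, and the 0.75 quality target are illustrative assumptions). It shows why a benchmark measures time-to-target-quality over several seeded runs and reports an aggregate, rather than trusting a single run of an inherently stochastic process.

    # Hypothetical harness: measure time-to-target-quality across seeds,
    # because any single training run is a noisy sample.
    import random
    import statistics
    import time

    TARGET_ACCURACY = 0.75  # quality target fixed by the benchmark, not the submitter

    def train_to_target(seed: int, batch_size: int) -> float:
        """Stand-in for a real training job: seconds to reach the quality target."""
        rng = random.Random(seed)
        accuracy = 0.0
        start = time.perf_counter()
        while accuracy < TARGET_ACCURACY:
            # Toy learning curve: each epoch improves accuracy by a random,
            # seed-dependent amount; larger batches improve slightly less per epoch.
            accuracy += rng.uniform(0.05, 0.15) * (32 / batch_size) ** 0.1
            time.sleep(0.001)  # pretend each epoch costs wall-clock time
        return time.perf_counter() - start

    def benchmark(batch_size: int, seeds=range(5)) -> float:
        """Report the median time-to-target across seeds, not a single run."""
        times = [train_to_target(seed, batch_size) for seed in seeds]
        return statistics.median(times)

    if __name__ == "__main__":
        for bs in (32, 128, 512):
            print(f"batch_size={bs:4d}  median time-to-target={benchmark(bs):.3f}s")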
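
Challenge 3 notes that even a single technique such as quantization admits many implementations. As one concrete illustration, the sketch below shows post-training uniform affine quantization of a weight tensor in plain Python; the function names and the 8-bit choice are assumptions for illustration, not any particular framework's API.

    # Post-training uniform affine (asymmetric) quantization to num_bits integers.
    # A symmetric scheme would instead fix zero_point = 0 and derive the scale from
    # max(|w|), so the same weights would round differently; such small differences
    # compound across layers and make cross-framework comparisons tricky.

    def quantize_affine(weights, num_bits=8):
        """Map floats to integers: q = clamp(round(w / scale) + zero_point)."""
        qmin, qmax = 0, 2 ** num_bits - 1
        w_min, w_max = min(weights), max(weights)
        scale = (w_max - w_min) / (qmax - qmin) or 1.0  # guard against a constant tensor
        zero_point = round(qmin - w_min / scale)
        q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        """Recover approximate floats: w is roughly (q - zero_point) * scale."""
        return [(qi - zero_point) * scale for qi in q]

    if __name__ == "__main__":
        weights = [-1.2, -0.3, 0.0, 0.4, 2.1]
        q, scale, zp = quantize_affine(weights)
        print("quantized codes:", q)
        print("round-trip weights:", [round(w, 3) for w in dequantize(q, scale, zp)])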

Beyond ML Systems Research


Deploying ML at scale requires us to think carefully about engineering AI systems that are not just “smart” but also responsible. Hence, Harvard MLPerf Research at CRCS looks beyond benchmarking models, software frameworks, and architectures to the challenges of deploying ML responsibly. For instance, we aim to foster research and innovation in the development of future datasets that can help make ML systems unbiased and fair. Along these lines, one goal of Harvard MLPerf Research is to work closely with the Kennedy and Law schools to ensure that systems are built responsibly.