Achieving 10,000x training data reduction with high-fidelity labels

Understanding the Impact of Curation on Model Performance

In machine learning, the quality of training data strongly influences model performance. This blog post describes experiments with two differently sized Large Language Models (LLMs) and examines how curated datasets affect their effectiveness on tasks of varying complexity.

Experiment Overview

We aimed to understand how models of different sizes and tasks of different complexity benefit from a meticulous curation process. For our experiments, we fine-tuned two LLMs: Gemini Nano-1, with 1.8 billion parameters, and Nano-2, with 3.25 billion parameters. Both models were tested on tasks of varying complexity, each supported by crowdsourced labels.

Dataset Characteristics

Each dataset consisted of approximately 100,000 annotations and exhibited considerable class imbalance: about 95% of the labels were benign. This imbalance mirrors real-world conditions, where most data points are benign, and let us evaluate the models under a realistic class distribution.
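To make the class imbalance concrete, here is a minimal sketch of how such a label distribution could be inspected. The label names ("benign" and "unsafe") and the plain-list format are illustrative assumptions; the actual datasets and schemas are not described in code form in this post.

```python
# Minimal sketch: inspect the class balance of a crowdsourced label set.
# The "benign"/"unsafe" labels are assumptions for illustration only.
from collections import Counter

def label_distribution(labels):
    """Return each label's share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Example: a heavily imbalanced set, roughly mirroring the ~95% benign split.
labels = ["benign"] * 95 + ["unsafe"] * 5
print(label_distribution(labels))  # {'benign': 0.95, 'unsafe': 0.05}
```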

Fine-Tuning Process

We compared four baseline conditions against curated conditions in which each model underwent multiple rounds of fine-tuning. At each iteration, the curation process selected a tailored set of examples for model evaluation.
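The post does not give the curation loop in code, but its overall shape is roughly the sketch below. Everything here is an assumption for illustration: fine_tune, evaluate_alignment, and select_examples are hypothetical callables standing in for the actual training, expert review, and example-selection steps.

```python
# A rough sketch of an iterative fine-tune-and-curate loop (illustrative only).
# fine_tune, evaluate_alignment, and select_examples are hypothetical callables.

def curation_loop(model, pool, fine_tune, evaluate_alignment, select_examples,
                  max_iterations=6):
    """Alternate fine-tuning on curated batches with alignment evaluation."""
    history = []
    for _ in range(max_iterations):
        # Curate a tailored batch of examples from the candidate pool.
        batch = select_examples(model, pool)

        # Fine-tune the model on the freshly curated batch.
        model = fine_tune(model, batch)

        # Measure alignment with expert labels (e.g. Cohen's Kappa).
        history.append(evaluate_alignment(model))
    return model, history
```

In practice such a loop would also stop early once alignment stops improving between rounds, which is the plateau behavior discussed next.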

Model Convergence

Both models eventually plateaued without reaching the experts' alignment levels. We ran six iterations (with around 400 fine-tuning and 250 evaluation samples) for the lower complexity task, while the higher complexity task required only five iterations (about 250 fine-tuning and 150 evaluation samples). Notably, the lower complexity task had a more diverse set of examples, which may explain the extra iteration needed for convergence.
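As a toy illustration of what "plateauing" means here, the following sketch flags the iteration after which per-round gains in alignment (for example, Cohen's Kappa against experts) become negligible. The threshold and the example scores are invented for illustration.

```python
# Toy illustration: detect the iteration where alignment gains flatten out.
# The min_gain threshold and the score sequence below are assumptions.

def plateau_iteration(scores, min_gain=0.01):
    """Return the first iteration index after which gains fall below min_gain."""
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < min_gain:
            return i
    return len(scores) - 1

# Example: alignment improves quickly, then flattens after a few rounds.
kappa_by_iteration = [0.42, 0.55, 0.63, 0.67, 0.675, 0.676]
print(plateau_iteration(kappa_by_iteration))  # 4
```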

Data Quality Assessment

To summarize the quality of the data generated during these experiments: our expert reviewers achieved an average pairwise Cohen's Kappa of 0.81 for the lower complexity task and 0.78 for the higher complexity task. These figures serve as the benchmark for optimal model performance.

To gauge the quality of crowdsourced data, we calculated Kappa alignment between the crowdsourced annotations and expert evaluations, resulting in scores of 0.59 for lower complexity and 0.41 for higher complexity tasks.
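For readers who want to reproduce this kind of measurement, here is a minimal sketch using scikit-learn's cohen_kappa_score. The annotator labels below are invented toy data; only the metric itself matches what the post reports.

```python
# Minimal sketch: pairwise Cohen's Kappa among experts, and crowd-vs-expert
# alignment, using scikit-learn. All label vectors below are toy examples.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations):
    """Average Cohen's Kappa over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical expert annotators labeling the same ten items.
experts = [
    ["benign", "benign", "unsafe", "benign", "unsafe",
     "benign", "benign", "unsafe", "benign", "benign"],
    ["benign", "benign", "unsafe", "benign", "unsafe",
     "benign", "unsafe", "unsafe", "benign", "benign"],
    ["benign", "benign", "unsafe", "benign", "benign",
     "benign", "benign", "unsafe", "benign", "benign"],
]
print(f"Expert pairwise Kappa: {average_pairwise_kappa(experts):.2f}")

# Alignment of a hypothetical crowdsourced label set against one expert.
crowd = ["benign", "benign", "benign", "benign", "unsafe",
         "benign", "benign", "benign", "benign", "unsafe"]
print(f"Crowd vs. expert Kappa: {cohen_kappa_score(crowd, experts[0]):.2f}")
```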

Conclusion

Our experiments underscore the importance of data curation in improving model performance and aligning machine learning outputs with expert standards. By continuously refining the datasets used for training, we can significantly improve the effectiveness of machine learning models across tasks of varying complexity.


