Machine Learning Best Practices

This Week: Andrew Ng and the five "Need to Know's" for ML Projects

Dear Reader…

Spend some time researching the topic of Artificial Intelligence Engineering and you will come across Andrew Ng, Ph.D. His multifaceted contributions to data and AI engineering have significantly advanced the field and made AI education more accessible to a global audience. His work spans academic research, industry applications, and educational initiatives like coursera.org. He has authored and co-authored more than 200 research papers on Machine Learning, AI, and Computer Science.

Professor Ng is best known for his pioneering work in online education, leading the way with deeplearning.ai, an essential resource for practitioners wanting to embrace AI engineering and Data Science. He also authored Machine Learning Yearning, a practical guide, which we will unpack this week as part of the “Need to Know’s” of Modern Data Management.

The 5 “Need to Know’s” For Data Engineers

Andrew Ng’s Machine Learning Class on Coursera has had more than half a million students and is one of the most popular courses on the platform. The purpose of Machine Learning Yearning is to equip Data Analysts, Engineers and Scientists with what they need to know not only to build an application, but also to set the direction of a team charged with an ML project. Here are the core principles of the book that can help you align a team building an ML-based application.

1. Importance of Rapid Iteration

Agility is crucial for success: ML projects rarely land on the right approach at the first attempt, so teams need to test hypotheses, experiment with different approaches, and improve models efficiently. Approaching the project this way allows teams to develop fit-for-purpose outcomes without betting the house on a single design, which can in turn significantly reduce time-to-market and increase the overall productivity of a team.

For Data Engineers this means:

  • Designing and implementing data pipelines that can handle frequent updates and modifications. Use technologies like Apache Airflow, Luigi, or Prefect for workflow management. These tools allow for easy scheduling, monitoring, and modification of data workflows (a minimal DAG sketch follows this list).

  • Developing comprehensive automated testing for data pipelines and preprocessing steps. Tools like Great Expectations can be used to ensure data quality and consistency. This helps catch issues early and prevents data-related bugs from propagating to the model training stage (see the test sketch after this list).

  • Implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines specifically for data and model deployment. Use tools like Jenkins, GitLab CI, or GitHub Actions to automate the testing and deployment process. This ensures that new data processing code or model versions can be quickly and safely deployed to production.

  • Adopting a modular code design approach, where you develop reusable code components for common data processing tasks and create a library of functions that can be easily integrated into various projects. This reduces redundancy and allows for quicker implementation of new data processing pipelines.
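
To make the workflow-management point concrete, here is a minimal sketch of a modular pipeline expressed as an Airflow DAG. It is only a sketch, assuming Airflow 2.4 or later; the DAG name, task names, and placeholder records are hypothetical stand-ins for real pipeline logic.

```python
# A minimal, modular Airflow DAG: each step is a small, swappable task,
# which makes iterating on individual stages cheap.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_feature_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder for pulling raw records from a source system.
        return [{"user_id": 1, "clicks": 12}, {"user_id": 2, "clicks": 7}]

    @task
    def transform(records: list[dict]) -> list[dict]:
        # Placeholder feature logic; swap in real transformations here.
        return [{**r, "high_engagement": r["clicks"] > 10} for r in records]

    transform(extract())


example_feature_pipeline()
```

Because each task is an isolated function, a change to the transform step can be tested and redeployed without touching extraction, which is exactly the kind of rapid iteration Ng advocates.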
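
For the automated-testing point, a framework like Great Expectations provides much richer checks, but the underlying habit can be sketched with plain pytest; clean_clicks here is a hypothetical preprocessing function.

```python
# test_preprocessing.py -- run with: pytest test_preprocessing.py
import pandas as pd


def clean_clicks(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical preprocessing step: drop rows with no user and
    # clip negative click counts to zero.
    out = df.dropna(subset=["user_id"]).copy()
    out["clicks"] = out["clicks"].clip(lower=0)
    return out


def test_no_null_user_ids():
    raw = pd.DataFrame({"user_id": [1, None], "clicks": [3, 5]})
    assert clean_clicks(raw)["user_id"].notna().all()


def test_clicks_never_negative():
    raw = pd.DataFrame({"user_id": [1, 2], "clicks": [-4, 5]})
    assert (clean_clicks(raw)["clicks"] >= 0).all()
```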

2. Proper Data Distribution and Splitting

Ng stresses the importance of correctly handling data distribution and splitting, something that is critical to developing robust ML models and falls squarely within the data engineer's domain. Proper data handling ensures that models are trained and evaluated on truly representative data, leading to more accurate performance estimates and better generalisation to real-world scenarios.

For Data Engineers this means:

  • Implementing deterministic data splitting methods that ensure consistency across different runs and experiments, using tools like scikit-learn's train_test_split with fixed random states. This allows for reproducible results and fair comparisons between different model versions (see the first sketch after this list).

  • Distribution matching: building tools to analyse dev and test sets and ensure they accurately reflect the target distribution of the data. This might involve implementing custom sampling methods or using stratified sampling techniques to maintain the proportion of different classes or subgroups in the data.

  • Creating systems for monitoring and detecting data distribution shifts or drift over time. Consider using tools like Evidently AI or AWS SageMaker Model Monitor for data drift detection. This helps identify when models need to be retrained due to changes in the underlying data distribution (the second sketch after this list shows the core idea).

  • Implementing robust versioning for datasets and documenting the characteristics of each split. This allows teams to track changes in data distribution over time and understand how these changes might affect model performance.
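
The first sketch shows deterministic, stratified splitting with scikit-learn; the DataFrame and its 80/20 class balance are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# A fixed random_state makes the split reproducible across runs, and
# stratify preserves the 80/20 class ratio in every subset.
train, temp = train_test_split(df, test_size=0.4, random_state=42, stratify=df["label"])
dev, test = train_test_split(temp, test_size=0.5, random_state=42, stratify=temp["label"])

print(train["label"].mean(), dev["label"].mean(), test["label"].mean())  # all ~0.2
```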
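
For drift detection, purpose-built tools like Evidently AI are far more thorough, but the core idea can be sketched as a two-sample Kolmogorov-Smirnov test with SciPy; the synthetic "training" and "production" samples below are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference window
production_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)  # shifted live data

# A small p-value means the live feature no longer matches the training
# distribution -- a signal that retraining may be warranted.
stat, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Possible drift: KS statistic={stat:.3f}, p={p_value:.2e}")
```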

3. Single-Number Evaluation Metrics

Ng recommends a single-number evaluation metric for comparing different algorithms and models, which makes it central to the design and implementation of evaluation systems. A single metric provides a clear, quantifiable way to compare model performance, simplifying decision-making and enabling easy tracking of improvements over time.

For Data Engineers this means:

  • Clear metric design and implementation: designing metrics that accurately reflect project goals, and implementing them in a computationally efficient manner so they can be calculated in real time or near real time (a minimal sketch follows this list).

  • Setting up robust monitoring systems that track key metrics over time. Tools like Prometheus and Grafana can be invaluable for this purpose. These systems should allow for easy visualisation of metric trends and quick identification of performance changes.

  • Developing automated reporting systems that generate regular updates on model performance metrics. Consider using tools like Apache Superset, Tableau, or custom dashboards built with frameworks like Dash by Plotly for data visualisation.

  • Designing and creating infrastructure to support A/B testing of different models or algorithms based on these metrics. This allows for data-driven decision-making when comparing different approaches or model versions.
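
As a concrete illustration of a single-number metric, the F1 score folds precision and recall into one figure that scikit-learn computes directly; the labels and predictions below are hypothetical.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

# One number balancing precision and recall, so two model versions
# can be ranked without juggling several separate metrics.
print(f"F1: {f1_score(y_true, y_pred):.3f}")
```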

4. Error Analysis and Optimisation

Ng emphasises the importance of error analysis and working towards optimal error rates. This process is crucial for improving model performance and requires significant support from data engineering. Systematic error analysis helps identify areas where models are underperforming and guides efforts to improve model accuracy. It is key to implementing an ongoing, iterative improvement process in machine learning.

For Data Engineers this means:

  • Developing interactive dashboards for visualising model errors. Tools like Plotly or D3.js can be used to create insightful, interactive visualisations. These should allow for drilling down into specific error types or data subsets.

  • Implementing comprehensive logging systems that capture detailed information about model predictions and errors. Consider using the ELK stack (Elasticsearch, Logstash, Kibana) for log management and analysis. This allows for in-depth analysis of error patterns.

  • Creating user-friendly feedback interfaces for domain experts to review and annotate errors. Tools like Label Studio can be adapted for this purpose. This facilitates collaboration between data scientists and domain experts in understanding and addressing model errors.

  • Implementing systems to automatically categorise and cluster types of errors, making the analysis process more efficient. This might involve using unsupervised learning techniques to group similar error cases (see the sketch after this list).

  • Developing systems for performance monitoring that track error rates over time, both overall and for specific categories or data subsets, and creating alert systems that notify the team when models approach or surpass predefined error-rate thresholds.
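
A minimal sketch of the clustering idea above: keep only the misclassified examples and group their feature vectors with k-means, so analysts review a handful of clusters rather than thousands of individual rows. The synthetic data and the choice of three clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1_000, 8))      # feature vectors for all predictions
y_true = rng.integers(0, 2, size=1_000)     # ground-truth labels
y_pred = rng.integers(0, 2, size=1_000)     # stand-in model predictions

errors = features[y_true != y_pred]         # keep only the misclassified cases

# Group similar error cases; each cluster is a candidate error category.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(errors)
for cluster_id, count in zip(*np.unique(km.labels_, return_counts=True)):
    print(f"error cluster {cluster_id}: {count} cases")
```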

5. Flexibility in Learning Approaches

Ng suggests that, depending on the application, different learning approaches may work better, including end-to-end learning and multi-component pipeline approaches. By designing systems that can accommodate different approaches, you can compare the effectiveness of one versus the other. Flexibility in implementation allows teams to experiment with various approaches and choose the most effective one for each situation.

For Data Engineers this means:

  • Designing data pipelines that can easily switch between traditional feature engineering approaches and end-to-end learning setups. This might involve creating modular pipeline components that can be easily reconfigured (see the sketch after this list).

  • Implementing systems for comparing the performance of traditional pipeline approaches versus end-to-end learning approaches. This should include not just accuracy metrics but also considerations like computational resources required and inference time.

  • Utilising tools for analysing the impact of different learning approaches on model interpretability. This might involve implementing techniques like SHAP (SHapley Additive exPlanations) values or LIME (Local Interpretable Model-agnostic Explanations).

  • Setting up systems to monitor computational resource usage for different learning approaches, as end-to-end learning often requires more resources. This helps in making informed decisions about the trade-offs between model performance and computational cost. It also includes creating infrastructure to support efficient fine-tuning of pre-trained models on new datasets, along with systems for managing and versioning pre-trained models and their metadata.
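
To illustrate the switchable design above, here is a minimal scikit-learn sketch: a factory builds either a pipeline with explicit feature engineering or a model fed raw inputs, so both can be compared on the same data for accuracy and training cost. All names and the synthetic data are hypothetical, and a real end-to-end approach would usually be a neural network rather than this stand-in.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def build_model(approach: str) -> Pipeline:
    if approach == "feature_engineering":
        # Hand-designed features feeding a simple, interpretable model.
        return Pipeline([
            ("poly", PolynomialFeatures(degree=2)),
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(max_iter=1000)),
        ])
    # "End-to-end" stand-in: the model consumes raw inputs directly.
    return Pipeline([("clf", GradientBoostingClassifier())])


X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for approach in ("feature_engineering", "end_to_end"):
    model = build_model(approach)
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    # Report accuracy *and* training cost, per the trade-offs above.
    print(f"{approach}: accuracy={model.score(X_te, y_te):.3f}, fit={elapsed:.2f}s")
```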

Overall, Ng’s guide to ML development represents a strong foundation from which to experiment, iterate, and build MLOps into your business.

Like this content? Join the conversation at the Data Innovators Exchange.

Thank you
That’s a wrap for this week.