June 2025

Finished Chapters 3 & 4 of DMLS Book

Completed reading Chapter 3 (Data Engineering Fundamentals) & Chapter 4 (Training Data) of the DMLS book.

Chapter 3: Data Engineering Fundamentals

  • Data sources: User input data (requires careful validation) / system-generated data (logs & system outputs, can grow to large volume) / third-party data (privacy concerns).
  • Data formats: Row-major vs column-major, each optimized for different use cases. Row-major suits accessing whole records and frequent writes; column-major suits reading specific columns, which is especially useful when training ML models on a subset of features.
  • Data models: Relational model / Document model (NoSQL; responsibility for handling the structure of the data shifts from the database to the application)
  • Modes of Dataflow: through databases, services (request-driven, data is passed via API calls between microservices), real-time transport (event-driven, message broker like Apache Kafka)
  • Batch Processing & Stream Processing: Batch processes large amounts of historical data at once. Stream processes data in real time as it arrives. Batch processing can be seen as a special case of stream processing.
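
To make the row-major vs column-major note concrete for myself, here is a minimal sketch in plain Python. The records and field names (`user_id`, `age`, `clicks`) are made up for illustration, and real formats such as CSV or Parquet involve far more than this; the point is only which access pattern each layout makes cheap.

```python
# Hypothetical toy records; the field names are made up for illustration.
rows = [
    {"user_id": 1, "age": 34, "clicks": 10},
    {"user_id": 2, "age": 27, "clicks": 3},
    {"user_id": 3, "age": 45, "clicks": 7},
]


def to_columns(records):
    """Convert row-major records into a column-major dict of lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-major: a whole record is one lookup, and adding a record is one append.
second_row = rows[1]

# Column-major: a single feature is one lookup; other fields are never touched.
ages = columns["age"]
mean_age = sum(ages) / len(ages)
```

Reading `ages` from `rows` would mean visiting every record, which is the cost that column-major layouts avoid when only a few features are needed.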

Chapter 4: Training Data

  • Sampling: It is often infeasible to handle all possible data. Sampling methods include non-probability sampling (based on convenience) and probability sampling: simple random, stratified (to ensure every class is represented), and weighted sampling.
  • Labeling: Hand labels, natural labels (highly valuable but often have a delayed feedback loop)
  • Lack of labels: Weak supervision (labeling functions), semi-supervision (e.g., using the model's confident predictions as extra labels for future training), transfer learning (starting from a model pretrained on a large, different dataset), active learning (requesting labels for the samples the model is most uncertain about)
  • Class imbalance: upsampling / downsampling, more weight to minority classes in loss function
  • Data Augmentation: Generates more training data. Perturbation adds noise to the data so that the trained model is more robust to adversarial attacks.
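
A minimal sketch of stratified sampling from the notes above, using only the standard library. The function name `stratified_sample`, the `fraction` parameter, and the toy labels are my own assumptions for illustration, not something taken from the book.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Sample `fraction` of the indices of each class, keeping at least one per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical imbalanced dataset: 90 negatives and 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
sample = stratified_sample(labels, fraction=0.2)
sampled_labels = [labels[i] for i in sample]
```

A simple random sample of the same size could easily contain very few positives; sampling within each class keeps the minority class at its original proportion.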

Note - Gemini-CLI helped in recollecting the chapters.

Started OverTheWire Hacking Challenges

Started tinkering with hacking challenges on OverTheWire.

I have stumbled upon OTW multiple times but never managed to solve more than 2-3 challenges. I'm hoping this becomes a hobby I stick with. I am starting with the Leviathan game, which is beginner-friendly.

Finished Chapters 5 & 6 of DMLS Book

Completed reading Chapter 5 (Feature Engineering) & Chapter 6 (Model Development) of the DMLS book.

Chapter 5: Feature Engineering

  • Learned vs Engineered Features: Deep learning models can learn features automatically
  • Feature Engineering Operations: Handling missing feature values (very interesting examples), scaling, discretization, encoding categorical features (with the hashing trick), feature crossing, etc.
  • Data Leakage: How data leakage can produce optimistic evaluation numbers for a model that then fails in production (e.g., statistics pre-computed across both the training and test sets)
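
A small sketch of the leakage example from the last bullet, assuming standardization as the pre-computed step. The helper names `fit_scaler` and `transform` and the toy values are hypothetical; the idea is just that scaling statistics should come from the training split alone.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute standardization statistics from the given values only."""
    return mean(values), pstdev(values)


def transform(values, stats):
    """Standardize values with previously computed (mean, std) statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values; split first, then compute statistics.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

leaky_stats = fit_scaler(train + test)  # leak: test values shape the statistics
clean_stats = fit_scaler(train)         # statistics from the training split only

train_scaled = transform(train, clean_stats)
test_scaled = transform(test, clean_stats)  # reuse the train statistics at test time
```

With `leaky_stats`, information about the test distribution has already reached the training pipeline, so the evaluation numbers no longer reflect truly unseen data.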

Chapter 6: Model Development & Offline evaluation

  • Model Selection: Start with simple models and then evaluate tradeoffs (performance, latency, cost, interpretability)
  • Ensembles (dominate most Kaggle competitions): Bagging, Boosting & Stacking
  • Debugging ML models example: Try to overfit a single batch of data to ensure that the model is capable of learning
  • Evaluations: Keep a baseline to compare against, evaluate on different slices of the data (Simpson's paradox), perform perturbation & invariance tests
  • Four Phases of ML Development: Before ML, Simplest ML models, Optimizing simple models, Complex models
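
To see why slice-level evaluation matters, here is a minimal sketch of Simpson's paradox. The models, slice names, and counts are illustrative assumptions chosen so that the effect shows up; they are not results from any real evaluation.

```python
# Illustrative (correct, total) counts per slice for two hypothetical models.
results = {
    "model_a": {"slice_1": (81, 87), "slice_2": (192, 263)},
    "model_b": {"slice_1": (234, 270), "slice_2": (55, 80)},
}


def slice_accuracies(per_slice):
    """Accuracy computed separately for each slice."""
    return {name: correct / total for name, (correct, total) in per_slice.items()}


def overall_accuracy(per_slice):
    """Accuracy over all slices pooled together."""
    correct = sum(c for c, _ in per_slice.values())
    total = sum(t for _, t in per_slice.values())
    return correct / total


for model, per_slice in results.items():
    by_slice = slice_accuracies(per_slice)
    summary = ", ".join(f"{name}={acc:.3f}" for name, acc in by_slice.items())
    print(f"{model}: overall={overall_accuracy(per_slice):.3f}, {summary}")
```

In this toy data `model_a` is better on every slice, yet `model_b` looks better overall because the two models see very different slice sizes. An aggregate metric alone would pick the wrong model here.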

Note - Gemini-CLI helped in recollecting the chapters.

MLOps Learners Book Club

Found a very valuable book club YouTube playlist that will help me read the DMLS book.

Interestingly, Chip (author of the book) makes regular appearances in the book club.

Enjoying the book 'Designing Machine Learning Systems'

I'm enjoying reading Designing Machine Learning Systems by Chip Huyen.

I plan to read this book a bit out of order as I am not new to this subject. I will pick chapters in the order of my interest.

I am finding this book a great starting point for building expertise in ML systems.