Completed reading Chapter 4 (Training Data) & Chapter 3 (Data Engineering Fundamentals) of the DMLS book.
Chapter 3: Data Engineering Fundamentals
- Data sources: user input data (requires careful validation) / system-generated data (logs & system outputs; can grow to very large volumes) / third-party data (privacy concerns).
- Data formats: row-major (e.g., CSV) vs column-major (e.g., Parquet), each optimized for different access patterns. Column-major formats are faster when reading specific columns, which is common when training ML models (see the Parquet sketch after this list).
- Data models: relational model / document model (NoSQL; the responsibility for handling the structure of the data shifts from the database to the application).
- Modes of dataflow: through databases / through services (request-driven: data is passed via API calls between microservices) / through real-time transport (event-driven: a message broker such as Apache Kafka; see the broker sketch below).
- Batch processing & stream processing: batch processes large amounts of bounded, historical data at once; stream processes data in real time as it arrives. Batch processing can be seen as a special case of stream processing (a running-mean sketch follows).
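A minimal sketch of the row-major vs column-major tradeoff, using pandas with CSV (row-major) and Parquet (column-major). The file names, columns, and values are invented for illustration, and Parquet support assumes a pyarrow or fastparquet install:

```python
import pandas as pd

# Hypothetical dataset; imagine many more columns in practice.
df = pd.DataFrame({"user_id": [1, 2, 3],
                   "age": [23, 35, 41],
                   "country": ["IN", "US", "DE"]})
df.to_csv("data.csv", index=False)   # row-major text format
df.to_parquet("data.parquet")        # column-major binary format

# CSV: every row must be parsed even if we only want two columns.
subset_csv = pd.read_csv("data.csv", usecols=["user_id", "age"])

# Parquet: the reader can jump straight to the requested columns,
# which is why column-major formats suit feature-wise ML reads.
subset_parquet = pd.read_parquet("data.parquet", columns=["user_id", "age"])
```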
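A toy sketch of event-driven dataflow. The `queue.Queue` here is only an in-process stand-in for a real message broker like Kafka; the point is that producer and consumer are decoupled and neither calls the other directly:

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a topic on a broker like Kafka

def producer():
    # Emits events without knowing who (if anyone) consumes them.
    for i in range(3):
        broker.put({"event": "click", "id": i})
    broker.put(None)  # sentinel to signal end of stream

def consumer():
    # Reacts to events as they arrive, independent of the producer.
    while (event := broker.get()) is not None:
        print("consumed:", event)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```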
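And a tiny illustration (my own, not from the book) of why batch can be viewed as a special case of stream: the same incremental update gives the batch answer once the bounded "stream" ends:

```python
def running_mean():
    # Streaming: update the statistic one record at a time.
    count, total = 0, 0.0
    while True:
        x = yield (total / count if count else None)
        count += 1
        total += x

stream = running_mean()
next(stream)                      # prime the generator
for x in [4.0, 8.0, 6.0]:         # events arriving one by one
    mean_so_far = stream.send(x)
print(mean_so_far)                # 6.0

# Batch: the same data processed all at once gives the same answer;
# a bounded batch is just a stream that happens to end.
print(sum([4.0, 8.0, 6.0]) / 3)  # 6.0
```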
Chapter 4: Training Data
- Sampling: it is often infeasible to work with all available data. Methods include non-probability sampling (e.g., sampling based on convenience) and probability sampling: simple random, stratified (to ensure every class is represented), and weighted sampling (see the sampling sketch after this list).
- Labeling: hand labels vs natural labels (labels the system can infer from user behavior, e.g., a click on a recommendation; highly valuable but often subject to a delayed feedback loop).
- Lack of labels: weak supervision (programmatic labeling functions; see the sketch below), semi-supervision (e.g., self-training: the model's confident predictions are used as labels to train the next iteration of the model), transfer learning (a model pretrained on a large, different dataset is used as a starting point), and active learning (requesting labels for the samples the model is most uncertain about).
- Class imbalance: resample (upsample the minority class / downsample the majority class) or give minority classes more weight in the loss function (see the weighted-loss sketch below).
- Data augmentation: generate more training data from existing data. Perturbation: add small amounts of noise to inputs, which also makes the model more robust, including to adversarial attacks (a noise sketch follows).
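A minimal sampling sketch with numpy; the population and class sizes are synthetic, purely for illustration. It puts simple random, weighted, and stratified sampling side by side:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1] * 10)   # imbalanced toy population
indices = np.arange(len(labels))

# Simple random sampling: every item has equal probability.
simple = rng.choice(indices, size=10, replace=False)

# Weighted sampling: rare-class items get a higher selection probability.
weights = np.where(labels == 1, 9.0, 1.0)
weighted = rng.choice(indices, size=10, replace=False, p=weights / weights.sum())

# Stratified sampling: sample within each class so both are represented.
stratified = np.concatenate([
    rng.choice(indices[labels == c], size=5, replace=False)
    for c in (0, 1)
])
```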
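A bare-bones weak-supervision sketch in plain Python, loosely in the spirit of tools like Snorkel; the heuristics and label names are invented. Several noisy labeling functions vote on each example instead of relying on hand labels:

```python
ABSTAIN, SPAM, HAM = -1, 1, 0

# Labeling functions: cheap heuristics instead of hand labels.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) < 4 else ABSTAIN

def weak_label(text, lfs=(lf_contains_link, lf_mentions_prize, lf_short_reply)):
    # Majority vote over non-abstaining functions; None if all abstain.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

print(weak_label("Claim your prize at https://example.com"))  # 1 (SPAM)
print(weak_label("ok sounds good"))                           # 0 (HAM)
```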
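A minimal PyTorch sketch of weighting minority classes in the loss; the class counts and inverse-frequency heuristic are assumptions for illustration. `nn.CrossEntropyLoss` accepts per-class weights, so errors on the rare class cost more:

```python
import torch
import torch.nn as nn

# Hypothetical class frequencies: class 1 is rare.
class_counts = torch.tensor([900.0, 100.0])

# A common heuristic: weight each class inversely to its frequency.
weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # fake model outputs
targets = torch.randint(0, 2, (8,))   # fake labels
loss = loss_fn(logits, targets)       # minority-class errors weigh more
```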
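And a tiny perturbation sketch with numpy; the feature matrix is synthetic and the noise scale is chosen arbitrarily. Adding Gaussian noise to inputs creates extra, slightly different training examples:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))   # original feature matrix (synthetic)

def perturb(X, noise_std=0.05, copies=2):
    # Each copy is the original data plus small Gaussian noise.
    noisy = [X + rng.normal(scale=noise_std, size=X.shape) for _ in range(copies)]
    return np.concatenate([X, *noisy], axis=0)

X_augmented = perturb(X)         # 3x as many training rows
```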
Note: Gemini-CLI helped me recollect the contents of these chapters.