| Preface | ix |
| 1. Introduction to Building AI Applications with Foundation Models | 1 |
| The Rise of AI Engineering | 2 |
| - From Language Models to Large Language Models | 2 |
| - From Large Language Models to Foundation Models | 8 |
| - From Foundation Models to AI Engineering | 12 |
| Foundation Model Use Cases | 16 |
| - Coding | 20 |
| - Image and Video Production | 22 |
| - Writing | 22 |
| - Education | 24 |
| - Conversational Bots | 26 |
| - Information Aggregation | 26 |
| - Data Organization | 27 |
| - Workflow Automation | 28 |
| Planning AI Applications | 28 |
| - Use Case Evaluation | 29 |
| - Setting Expectations | 32 |
| - Milestone Planning | 33 |
| - Maintenance | 34 |
| The AI Engineering Stack | 35 |
| - Three Layers of the AI Stack | 37 |
| - AI Engineering Versus ML Engineering | 39 |
| - AI Engineering Versus Full-Stack Engineering | 46 |
| Summary | 47 |
| 2. Understanding Foundation Models | 49 |
| Training Data | 50 |
| - Multilingual Models | 51 |
| - Domain-Specific Models | 56 |
| Modeling | 58 |
| - Model Architecture | 58 |
| - Model Size | 67 |
| Post-Training | 78 |
| - Supervised Finetuning | 80 |
| - Preference Finetuning | 83 |
| Sampling | 88 |
| - Sampling Fundamentals | 88 |
| - Sampling Strategies | 90 |
| - Test Time Compute | 96 |
| - Structured Outputs | 99 |
| - The Probabilistic Nature of AI | 105 |
| Summary | 111 |
| 3. Evaluation Methodology | 113 |
| Challenges of Evaluating Foundation Models | 114 |
| Understanding Language Modeling Metrics | 118 |
| - Entropy | 119 |
| - Cross Entropy | 120 |
| - Bits-per-Character and Bits-per-Byte | 121 |
| - Perplexity | 121 |
| - Perplexity Interpretation and Use Cases | 122 |
| Exact Evaluation | 125 |
| - Functional Correctness | 126 |
| - Similarity Measurements Against Reference Data | 127 |
| - Introduction to Embedding | 134 |
| AI as a Judge | 136 |
| - Why AI as a Judge? | 137 |
| - How to Use AI as a Judge | 138 |
| - Limitations of AI as a Judge | 141 |
| - What Models Can Act as Judges? | 145 |
| Ranking Models with Comparative Evaluation | 148 |
| - Challenges of Comparative Evaluation | 152 |
| - The Future of Comparative Evaluation | 155 |
| Summary | 156 |
| 4. Evaluate AI Systems | 159 |
| Evaluation Criteria | 160 |
| - Domain-Specific Capability | 161 |
| - Generation Capability | 163 |
| - Instruction-Following Capability | 172 |
| - Cost and Latency | 177 |
| Model Selection | 179 |
| - Model Selection Workflow | 179 |
| - Model Build Versus Buy | 181 |
| - Navigate Public Benchmarks | 191 |
| Design Your Evaluation Pipeline | 200 |
| - Step 1. Evaluate All Components in a System | 200 |
| - Step 2. Create an Evaluation Guideline | 202 |
| - Step 3. Define Evaluation Methods and Data | 204 |
| Summary | 208 |
| 5. Prompt Engineering | 211 |
| Introduction to Prompting | 212 |
| - In-Context Learning: Zero-Shot and Few-Shot | 213 |
| - System Prompt and User Prompt | 215 |
| - Context Length and Context Efficiency | 218 |
| Prompt Engineering Best Practices | 220 |
| - Write Clear and Explicit Instructions | 220 |
| - Provide Sufficient Context | 223 |
| - Break Complex Tasks into Simpler Subtasks | 224 |
| - Give the Model Time to Think | 227 |
| - Iterate on Your Prompts | 229 |
| - Evaluate Prompt Engineering Tools | 230 |
| - Organize and Version Prompts | 233 |
| Defensive Prompt Engineering | 235 |
| - Proprietary Prompts and Reverse Prompt Engineering | 236 |
| - Jailbreaking and Prompt Injection | 238 |
| - Information Extraction | 243 |
| - Defenses Against Prompt Attacks | 248 |
| Summary | 251 |
| 6. RAG and Agents | 253 |
| RAG | 253 |
| - RAG Architecture | 256 |
| - Retrieval Algorithms | 257 |
| - Retrieval Optimization | 268 |
| - RAG Beyond Texts | 273 |
| Agents | 275 |
| - Agent Overview | 276 |
| - Tools | 278 |
| - Planning | 281 |
| - Agent Failure Modes and Evaluation | 298 |
| Memory | 300 |
| Summary | 305 |
| 7. Finetuning | 307 |
| Finetuning Overview | 308 |
| When to Finetune | 311 |
| - Reasons to Finetune | 311 |
| - Reasons Not to Finetune | 312 |
| - Finetuning and RAG | 316 |
| Memory Bottlenecks | 319 |
| - Backpropagation and Trainable Parameters | 320 |
| - Memory Math | 322 |
| - Numerical Representations | 325 |
| - Quantization | 328 |
| Finetuning Techniques | 332 |
| - Parameter-Efficient Finetuning | 333 |
| - Model Merging and Multi-Task Finetuning | 347 |
| - Finetuning Tactics | 357 |
| Summary | 361 |
| 8. Dataset Engineering | 363 |
| Data Curation | 365 |
| - Data Quality | 368 |
| - Data Coverage | 370 |
| - Data Quantity | 372 |
| - Data Acquisition and Annotation | 377 |
| Data Augmentation and Synthesis | 380 |
| - Why Data Synthesis | 381 |
| - Traditional Data Synthesis Techniques | 383 |
| - AI-Powered Data Synthesis | 386 |
| - Model Distillation | 395 |
| Data Processing | 396 |
| - Inspect Data | 397 |
| - Deduplicate Data | 399 |
| - Clean and Filter Data | 401 |
| - Format Data | 401 |
| Summary | 403 |
| 9. Inference Optimization | 405 |
| Understanding Inference Optimization | 406 |
| - Inference Overview | 406 |
| - Inference Performance Metrics | 412 |
| - AI Accelerators | 419 |
| Inference Optimization | 426 |
| - Model Optimization | 426 |
| - Inference Service Optimization | 440 |
| Summary | 447 |
| 10. AI Engineering Architecture and User Feedback | 449 |
| AI Engineering Architecture | 449 |
| - Step 1. Enhance Context | 450 |
| - Step 2. Put in Guardrails | 451 |
| - Step 3. Add Model Router and Gateway | 456 |
| - Step 4. Reduce Latency with Caches | 460 |
| - Step 5. Add Agent Patterns | 463 |
| - Monitoring and Observability | 465 |
| - AI Pipeline Orchestration | 472 |
| User Feedback | 474 |
| - Extracting Conversational Feedback | 475 |
| - Feedback Design | 480 |
| - Feedback Limitations | 490 |
| Summary | 492 |
| Epilogue | 495 |
| Index | 497 |
