LoRA: Low-Rank Adaptation Made Simple
Large language models are huge — billions of parameters, often stored as massive square weight matrices like 4096 × 4096. Fine-tuning all of those parameters for a new task is…
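The core idea can be sketched in a few lines of NumPy: freeze the big weight matrix and learn only a low-rank update ΔW = B·A. The dimensions below (4096 × 4096, rank 8) and the function names are illustrative, not from any particular library.

```python
import numpy as np

# Illustrative sizes: a 4096 x 4096 weight matrix adapted at rank 8.
d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)  # frozen pretrained weight

# LoRA: learn a low-rank update delta_W = B @ A instead of updating W.
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # trainable, shape (r, d)
B = np.zeros((d, r), dtype=np.float32)  # trainable, shape (d, r); zero init so delta_W starts at 0

def lora_forward(x):
    # Effective weight is W + B @ A, but we never materialize the full
    # delta_W; the low-rank path is applied as two small matmuls.
    return x @ W.T + (x @ A.T) @ B.T

full_params = d * d          # parameters touched by full fine-tuning
lora_params = d * r + r * d  # trainable parameters under LoRA
print(full_params, lora_params)  # 16777216 vs 65536 -> a 256x reduction
```

With rank 8, the trainable parameter count drops from ~16.8M to 65,536 for this one matrix, which is why LoRA makes fine-tuning feasible on modest hardware.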
Training today’s deep learning models is resource-hungry. Models have billions of parameters, and every step requires trillions of floating-point operations. To make training feasible, researchers and engineers rely on mixed…
Modern deep learning wouldn’t be possible without floating-point numbers. They’re the backbone of every matrix multiplication, activation, and gradient update. But as models grow larger and GPUs become more specialized,…
If you’ve ever written code in Python, CUDA, or TensorFlow, you’ve probably seen terms like float16, float32, or float64. They map directly to the IEEE-754 floating-point standard: But what do…
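One quick way to see that mapping is to ask NumPy for each format's layout. The loop below (a minimal sketch) prints the total bit width, mantissa bits, and exponent bits, which match IEEE-754 binary16, binary32, and binary64.

```python
import numpy as np

# float16 / float32 / float64 correspond to IEEE-754 binary16, binary32,
# and binary64: one sign bit plus exponent and fraction (mantissa) bits.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bits = total width, nmant = mantissa bits, iexp = exponent bits
    print(dtype.__name__, info.bits, info.nmant, info.iexp)
```

For example, float16 has 10 mantissa bits and 5 exponent bits (1 + 5 + 10 = 16), while float32 has 23 and 8.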
Natural language generation has rapidly evolved with the rise of large language models, but one common point of confusion is distinguishing between causal language models (CLMs) and conditional generation models.…
Artificial Intelligence (AI) has revolutionized how we interact with technology, from chatbots that answer questions to AI models that generate lifelike images and translate languages instantly. But behind many of…
When evaluating machine learning models, accuracy is one of the most commonly used metrics for classification tasks. In this blog post, we’ll dive into the accuracy_score function provided by Scikit-Learn’s…
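Under its default arguments, accuracy is just the fraction of predictions that match the true labels. A minimal pure-Python version (the function name here is our own, not scikit-learn's API) makes the computation concrete:

```python
# Accuracy = (number of correct predictions) / (total predictions).
# A hypothetical minimal reimplementation of what
# sklearn.metrics.accuracy_score returns with default arguments.
def accuracy(y_true, y_pred):
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```

Calling `sklearn.metrics.accuracy_score(y_true, y_pred)` on the same lists yields the same value.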
In the world of machine learning, pretrained models are like finding a treasure chest of knowledge. They save us hours, days, or even weeks of training time, allowing us to…
When training or deploying deep learning models, precision isn’t just about getting accurate predictions—it’s also about finding the right balance between performance, memory usage, and speed. Choosing the optimal precision…
Have you ever tried running a colossal language model on a GPU that feels more like a toaster than a supercomputer? Enter LoRA and QLoRA—two magical spells for squeezing every…
Running a CUDA Docker image on an AWS Ubuntu instance enables you to leverage GPU-accelerated computations directly within Docker containers. In this guide, we’ll walk through the process of installing…
Installing the NVIDIA driver on an AWS EC2 instance running Ubuntu 24.04 can sometimes be challenging due to AWS’s custom environment and kernel. Although the ubuntu-drivers tool is the recommended…
Fine-tuning large language models (LLMs) can be a challenging process due to the variety of parameters and configurations involved. In this blog, we’ll break down key parameters used to fine-tune…
If you’re just starting out with Python and have heard of NumPy, you probably know it’s a fantastic library for handling numbers, arrays, and matrices. So, why would PyTorch, a…
In the world of deep learning, images are a critical form of data. Whether you’re building a computer vision model, training on image datasets, or working on image processing tasks,…