Ethan Goldfarb
ethankgoldfarb@gmail.com

This is not as complete as it used to be, partially due to time constraints and partially because a lot of my reading is now tied up in forthcoming research. Going forward, this will be more of an indicator of where my current thoughts are than a complete record of my research absorption, though it will tend towards that in the long run (the older the month, the more likely it has everything I read in it).

Artifact containing an initial attempt to quantify SAE quality (section 6): Steering CLIP's ViT

Here's a list of the papers I've been reading recently:

†:
what "read" in this instance constitutes ranges from "read relevant sections" to "fully annotated". Most are fully read and synthesized but not annotated. Ask me if curious about a particular paper.
* indicates I either liked a paper for its contents or found its results notable. Lack of a * does not indicate otherwise.
Papers were read during or before the listed month (some were read years ago)
This list is not comprehensive, just what has been recorded on a subset of my devices (many imply familiarity with papers not listed)


Favorites:
* Learning Transferable Visual Models From Natural Language Supervision
* Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
* Formal Algorithms for Transformers
* Improving Language Understanding by Generative Pre-Training
* Toy Models of Superposition
* In-Context Learning and Induction Heads
* Training Compute-Optimal Large Language Models

2025:
September:
Circuit Tracing: Revealing Computational Graphs in Language Models
Defeating Nondeterminism in LLM Inference
Superposition, Memorization, and Double Descent

August:
* Stochastic Parameter Decomposition
Inception v1

May:
Effect of the initial configuration of weights on the training and function of artificial neural networks
Flow Matching for Generative Modeling
Towards Automated Circuit Discovery for Mechanistic Interpretability
Monolith: Real Time Recommendation System With Collisionless Embedding Table
Studying Large Language Model Generalization with Influence Functions

April:
* Structure and Content-Guided Video Synthesis with Diffusion Models
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

March:
Muon

January:
Matryoshka Representation Learning
(GPT-3) Language Models are Few-Shot Learners
Formal Algorithms for Transformers
* Measuring the Intrinsic Dimension of Objective Landscapes
Learning Transferable Visual Models From Natural Language Supervision
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
The Llama 3 Herd of Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
CTRL: A Conditional Transformer Language Model for Controllable Generation
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Analysing the Generalisation and Reliability of Steering Vectors
Visual Instruction Tuning
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
DeepSeek-V3 Technical Report

2024:
November:
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
* Denoising Diffusion Probabilistic Models
Decoding Vision features into Language
A Holistic Approach to Unifying Automatic Concept Extraction and Concept Importance Estimation
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks

August:
Improving Language Understanding by Generative Pre-Training

July:
* In-Context Learning and Induction Heads
* Toy Models of Superposition
A Mathematical Framework for Transformer Circuits
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

May:
Mamba: Linear-Time Sequence Modeling with Selective State Spaces

April:
Elucidating the Design Space of Diffusion-Based Generative Models
Fast Inference from Transformers via Speculative Decoding
AWQ: Activation-Aware Weight Quantization for On-Device LLM Compression and Acceleration

February:
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

January:
QLoRA: Efficient Finetuning of Quantized LLMs
Analyzing and Improving the Training Dynamics of Diffusion Models
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

2023:
December:
Mixtral of Experts
Mistral 7B

September:
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

August:
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
BatchTopK Sparse Autoencoders
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

July:
Llama 2: Open Foundation and Fine-Tuned Chat Models

April:
When Deep Learning Met Code Search
Text and Code Embeddings by Contrastive Pre-Training
Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds
Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus
Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation

March:
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
Unsupervised Deep Embedding for Clustering Analysis
Hyena Hierarchy: Towards Larger Convolutional Language Models
Emergent Abilities of Large Language Models
GPT-4 Technical Report
* GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers
LoRA: Low-Rank Adaptation of Large Language Models
Adding Conditional Control to Text-to-Image Diffusion Models
8-Bit Approximations for Parallelism in Deep Learning
LLaMA: Open and Efficient Foundation Language Models
Deep Clustering with Convolutional Autoencoders
Effective Dimensionality Reduction for Word Embeddings
All-but-the-Top: Simple and Effective Postprocessing for Word Representations
Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation
Efficient Training of Audio Transformers with Patchout

February:
MusicLM: Generating Music From Text
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Feature-informed Embedding Space Regularization For Audio Classification
Wav2vec: Unsupervised Pre-Training for Speech Recognition
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Large Transformer Model Inference Optimization
Mixed Precision Training With 8-bit Floating Point
Benchmarking Detection Transfer Learning with Vision Transformers
DETReg: Unsupervised Pretraining with Region Priors for Object Detection
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Deformable Convolutional Networks
End-to-End Object Detection with Transformers
A Survey on Transformers in Reinforcement Learning
A Review of the Decision Transformer Architecture

January:
BEiT: BERT Pre-Training of Image Transformers
HetSeq: Distributed GPU Training on Heterogeneous Infrastructure
PixelCNN++: Improving the PixelCNN With Discretized Logistic Mixture Likelihood and Other Modifications
Zero-Shot Text-to-Image Generation
Segmenter: Transformer for Semantic Segmentation
* Masked Autoencoders Are Scalable Vision Learners
* A Unified View of Masked Image Modeling
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Online Decision Transformer
Flow-guided Semi-supervised Video Object Segmentation
Learning Good Features to Transfer Across Tasks and Domains
Tighter Bounds on the Expressivity of Transformer Encoders
Out of Distribution Performance of State of Art Vision Model
Image Super-Resolution using Efficient Striped Window Transformer
Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters
BOA: Batch Orchestration Algorithm for Straggler Mitigation of Distributed DL Training in Heterogeneous GPU Cluster
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
High-Resolution Image Synthesis with Latent Diffusion Models
Diffusion Models Beat GANs on Image Synthesis
Measuring the Intrinsic Dimension of Objective Landscapes
How Much Over-Parameterization is Sufficient to Learn Deep ReLU Networks?
Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
A Theoretical Study of Inductive Biases in Contrastive Learning
Insights into Pre-training via Simpler Synthetic Tasks
Language to Logical Form with Neural Attention
Learning Tensor Representations for Meta-Learning
MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
LAION-5B: An open large-scale dataset for training next generation image-text models
* Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Offline Reinforcement Learning as One Big Sequence Modeling Problem
Hierarchical Text-Conditional Image Generation with CLIP Latents


2022:
December:
The Benefits of Implicit Regularization from SGD in Least Squares Problems
* Decision Transformer: Reinforcement Learning via Sequence Modeling
Pretrained Transformers as Universal Computation Engines
CyCLIP: Cyclic Contrastive Language-Image Pretraining
VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
! Not ML ! Phase Coexistence Implications of Violating Newton's Third Law
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Staged Training for Transformer Language Models
On-Demand Sampling: Learning Optimally from Multiple Distributions
Masked Autoencoding for Scalable and Generalizable Decision Making
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Diffusion-LM Improves Controllable Text Generation
Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning
Deep Bidirectional Language-Knowledge Graph Pretraining
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Improving Self-Supervised Learning by Characterizing Idealized Representations
On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies
Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning
LightSeq2: Accelerated Training for Transformer-based Models on GPUs
Improving Neural Network Learning Through Dual Variable Learning Rates
NS3: Neuro-Symbolic Semantic Code Search
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

November:
November was a month of travel and reflection

October:
Using Language to Extend to Unseen Domains
Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity
* YOLOv3: An Incremental Improvement
* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
CoCa: Contrastive Captioners are Image-Text Foundation Models
Scaling Vision Transformers
* CoAtNet: Marrying Convolution and Attention for All Data Sizes
PaLI: A Jointly-Scaled Multilingual Language-Image Model
* Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Self-Attention Does Not Need O(n^2) Memory
* ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
* Learning Transferable Visual Models From Natural Language Supervision
Automatic Differentiation in Machine Learning: a Survey

September:
* Formal Algorithms for Transformers
TransGAN: Two Transformers Can Make One Strong GAN
* BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression
1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed
* 1-Bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed
Instance Normalization: The Missing Ingredient for Fast Stylization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
* Adam: A Method for Stochastic Optimization
Decoupled Weight Decay Regularization
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
EfficientNetV2: Smaller Models and Faster Training
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
* Deep Residual Learning for Image Recognition
Unified Scaling Laws for Routed Language Models
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
HellaSwag: Can a Machine Really Finish Your Sentence?
Proximal Policy Optimization Algorithms
Continuous Control With Deep Reinforcement Learning
Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis
Practical Deep Reinforcement Learning Approach for Stock Trading
* Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
Learning Compositional Neural Programs with Recursive Tree Search and Planning
On Pseudo-Absence Generation and Machine Learning for Locust Breeding Ground Prediction in Africa
Multi-Objective Quality Diversity Optimization
Squeeze-and-Excitation Networks
Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks
MnasNet: Platform-Aware Neural Architecture Search for Mobile
* Training Compute-Optimal Large Language Models
Scaling Local Self-Attention for Parameter Efficient Visual Backbones
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations
Improving Language Models by Retrieving From Trillions of Tokens
Program Synthesis with Large Language Models
Conformer: Convolution-Augmented Transformer for Speech Recognition
* Attention Is All You Need
Pointwise Convolutional Neural Networks
GLU Variants Improve Transformer
Self-Attention with Relative Position Representations
Stand-Alone Self-Attention in Vision Models
Image Transformer
Mesh-TensorFlow: Deep Learning for Supercomputers
On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Learning both Weights and Connections for Efficient Neural Networks
Deep Compression: Compressing Deep Neural Networks With Pruning, Trained Quantization and Huffman Coding
Wav2CLIP: Learning Robust Audio Representations From CLIP
* Contrastive Learning with Large Memory Bank and Negative Embedding Subtraction for Accurate Copy Detection
ObamaNet: Photo-Realistic Lip-Sync From Text
The State of Sparse Training in Deep Reinforcement Learning
Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization
Chunked Autoregressive GAN for Conditional Waveform Synthesis
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model