The Understanding Series
MCP is just a fancy API
Understand MCP as an Engineer. Build and debug one step-by-step in Python using FastMCP.
Jun 7 • Alex Razvant
Neural Bits
How to add structure to your LLM Applications using SGLang
Unpacking SGLang's technical details: RadixAttention and fast decoding for Structured Output.
Apr 3 • Alex Razvant
How does vLLM serve LLMs at scale?
The Online/Offline API modes, PagedAttention and distributed inference with Ray.
Mar 27 • Alex Razvant
Unpacking NVIDIA Dynamo LLM Inference Framework
Everything you need to know about Dynamo. Code, components, concepts with diagrams and details.
Mar 20 • Alex Razvant
Understanding LLM Optimization Techniques
Weight quantization using GPTQ and BitsAndBytes. Parallelism techniques, KV caching, Flash Attention and Speculative Decoding.
Mar 1 • Alex Razvant
Understanding LLM Inference
Explaining LLM pre-fill and generation phases, unpacking model configuration files from HuggingFace.
Feb 20 • Alex Razvant
The AI/ML Engineer's starter guide to GPU Programming
#1 Programming on GPUs from scratch by implementing CUDA Kernels in C++, CuPy Python and OpenAI Triton.
Jan 30 • Alex Razvant
Guide to understanding Concurrency & Parallelism in Python
A practical deep dive into which concurrency method to use for each type of workload in ML scenarios.
Oct 5, 2024 • Alex Razvant
Stop using Python Dataclasses - start using Pydantic Models
Add data schemas and sanity to your data models. See how easily Pydantic can streamline your data validation and serialization workflows.
Sep 17, 2024 • Alex Razvant
Let's build Andrej Karpathy's BPETokenizer in Rust and use it from Python
Learn how to build a custom Rust library and generate Python Bindings. See how popular frameworks like Bytewax and Polars use the same workflow.
Sep 3, 2024 • Alex Razvant
Python flexibility and C++ performance in one language — Mojo
The new Mojo Programming Language. LLVM and MLIR as core compiler frameworks. How to test Llama 2 in pure Mojo.
Aug 27, 2024 • Alex Razvant