## The Mission
Build deadlinks, a CLI tool that crawls websites, extracts every link, and checks them all for broken status.
Captain’s brief: handle edge cases, support multiple output formats, and make it actually work on real websites.
## What I Built
A Python CLI with concurrent link checking via ThreadPoolExecutor. It’s fast, configurable, and handles the messy realities of the web.
## Core Features
- Crawls any URL and extracts all `href` and `src` attributes
- Checks links concurrently (configurable worker count)
- Three output formats: terminal, JSON, markdown
- Depth-limited crawling (`--depth N`), same-domain only
- `--fix` flag for URL correction suggestions
- Per-host rate limiting to be polite
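A minimal sketch of the extraction step using only the stdlib `html.parser`; `LinkExtractor` is an illustrative name, and the real tool may well use a different parser:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects every href and src attribute value seen on the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Because `handle_starttag` also fires for self-closing tags like `<img …/>`, one handler covers both attribute kinds.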
## Edge Cases Handled
| Case | How |
|---|---|
| Anchor links (`#id`) | Skipped, not broken |
| `mailto:` / `tel:` | Skipped |
| HEAD not supported (405) | Falls back to GET |
| Timeouts | Reported as broken |
| SSL failures | Reported as broken |
| DNS failures | Reported as broken |
| 429 rate-limited | Reported with note |
| Already-checked URLs | Cached, no re-fetching |
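A sketch of how the skip rules and the HEAD-to-GET fallback could fit together, built on stdlib `urllib`; `should_skip`, `check_url`, and the injectable `opener` parameter are illustrative, not necessarily the tool's real API:

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def should_skip(url):
    """Anchors, mailto:, and tel: links are skipped, not reported broken."""
    return url.startswith("#") or url.split(":", 1)[0] in ("mailto", "tel")


def check_url(url, timeout=10, opener=urlopen):
    """Return (ok, note). Tries HEAD first, falling back to GET on 405."""
    for method in ("HEAD", "GET"):
        try:
            with opener(Request(url, method=method), timeout=timeout) as resp:
                return True, f"{resp.status} via {method}"
        except HTTPError as e:
            if e.code == 405 and method == "HEAD":
                continue  # server rejects HEAD; retry once with GET
            note = "429 rate-limited" if e.code == 429 else f"HTTP {e.code}"
            return False, note
        except (URLError, socket.timeout) as e:
            # DNS, SSL, and timeout failures all surface here
            return False, f"unreachable: {e}"
    return False, "unreachable"
```

Injecting the opener keeps the fallback logic testable without touching the network.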
## The Architecture
```
DeadLinkChecker
├── check_link(url)       # Thread-safe, cached
├── _fetch(url)           # HEAD → GET fallback
├── extract_links(page)   # href + src attributes
└── crawl(start, depth)   # BFS with same-domain filter
```
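The crawl stage can be sketched as a plain BFS with a same-domain filter. Here `fetch` and `extract_links` are injected so the sketch stays network-free; the real method signatures may differ:

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start, fetch, extract_links, depth=1):
    """BFS from start up to `depth` hops, following same-domain links only.

    Every discovered link (on- or off-domain) is returned for checking,
    but only same-domain pages within the depth limit are fetched.
    """
    host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        url, d = queue.popleft()
        for link in extract_links(fetch(url)):
            absolute = urljoin(url, link)  # resolve relative links
            found.append(absolute)
            same_domain = urlparse(absolute).netloc == host
            if same_domain and d + 1 <= depth and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, d + 1))
    return found
```

Off-domain links are still collected for checking; they are just never crawled further.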
Concurrent link checking via ThreadPoolExecutor: 10 workers by default, configurable up to whatever your target server can handle.
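A network-free sketch of that concurrency layer, assuming some per-URL check callable; `check_all` and its parameters are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(urls, check, workers=10):
    """Run `check` over every URL concurrently; returns {url: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(check, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads suit this workload because each check spends most of its time blocked on network I/O, where the GIL is released.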