This doesn't happen often, and when it does it is usually a personal project site: operating an SMB WordPress site generally requires enterprise-level considerations, like backups. Obviously.
We came across a situation recently where a local historian’s WordPress site was completely lost and no backups were retained.
What can you do in this situation? If the website is relatively straightforward in terms of sitemap organization, and if it has been publicly available for a period of time, then the Wayback Machine can help!
But what if your site has thousands of posts? You'll end up staring at the Internet Archive's Wayback Machine, clicking through half-intact snapshots and manually copying content. Doing this by hand is slow, messy, and only partially reliable. And what do you do with the data you extract? The Wayback Machine crawls the internet capturing snapshots of websites over time, so everything it serves back is static HTML.
We thought "there has to be a better way!", so we built an open source tool, WaybackPress, to make that process sane.
WaybackPress is a Python CLI that discovers your archived WordPress posts and pages in the Wayback Machine, validates what’s worth recovering, fetches media when possible, then exports a clean WXR file you can import back into WordPress. It’s designed for people who actually need to restore content, not just browse the past.
Check out the Git repository and full docs here
Below is a practical walkthrough of the problem it solves, how it works under the hood, and where it helps the most.
The problem: recovering from the Wayback Machine in a predictable way
WordPress sites are especially fragile when recovering from Wayback because content lives in the database while media lives on disk. Themes and plugins change HTML structure constantly so when you try to “just save pages” from the Wayback Machine, you end up with a pile of HTML and broken asset links. Doing this at scale for hundreds (or in our case: thousands) of posts is a non-starter.
The Wayback Machine does provide programmatic access through APIs like CDX and Memento, but you still need logic to decide which URLs are real posts, pick the best snapshots, pull assets, and output something WordPress will import without drama.
That is the gap WaybackPress fills.
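To make the CDX side of this concrete, here is a minimal sketch of querying the Wayback CDX API for a domain's successful HTML captures. The endpoint and parameters are from the Internet Archive's CDX documentation; the helper names are illustrative, not WaybackPress internals.

```python
# Build a CDX query for a domain's captures and parse the JSON rows it returns.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(domain: str) -> str:
    """Build a CDX query URL listing successful captures under a domain."""
    params = {
        "url": f"{domain}/*",                  # match every path under the domain
        "output": "json",                      # JSON rows instead of plain text
        "filter": "statuscode:200",            # only captures that returned 200 OK
        "collapse": "urlkey",                  # one row per unique URL
        "fl": "original,timestamp,mimetype",   # fields to return per capture
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_rows(rows):
    """CDX JSON output: first row is the header, the rest are captures."""
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]
```

Fetching `build_cdx_query("example.com")` and feeding the decoded JSON to `parse_cdx_rows` yields one dict per unique URL, which is the raw material a discovery stage filters down to candidate posts.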
How WaybackPress works
WaybackPress runs as a staged pipeline so you can stop, inspect results, and continue without losing progress:
- Discover URLs from Wayback.
- Validate which URLs are actual posts or pages with enough content to matter.
- Fetch media through multi-pass attempts.
- Export a WXR 1.2 file that WordPress imports cleanly.
You can run the whole thing in one command or step through each stage.
Quick start
# Full pipeline
waybackpress run example.com
# Stage-by-stage
waybackpress discover example.com
waybackpress validate --output wayback-data/example.com
waybackpress fetch-media --output wayback-data/example.com
waybackpress export --output wayback-data/example.com
WaybackPress writes a working folder like:
wayback-data/
└── example.com/
├── discovered_urls.tsv
├── valid_posts.tsv
├── validation_report.csv
├── media_report.csv
├── wordpress-export.xml
└── html/ … media/ …
You then import wordpress-export.xml with the standard WordPress Importer (Tools → Import → WordPress). WXR is the native WordPress export format; it carries posts, pages, terms, comments, and media references, and it’s designed to round-trip into WordPress and WP-CLI.
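To show the shape of what the importer expects, here is a minimal sketch of a single WXR `<item>` built with the standard library. WaybackPress's actual exporter emits a complete WXR 1.2 document; this only illustrates the per-post structure and namespaces.

```python
# Build one WXR-style <item> element using Python's stdlib XML tools.
import xml.etree.ElementTree as ET

NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

def build_item(title: str, body_html: str, post_type: str = "post") -> ET.Element:
    """One recovered post as a WXR <item>: title, HTML body, type, status."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, f"{{{NS['content']}}}encoded").text = body_html
    ET.SubElement(item, f"{{{NS['wp']}}}post_type").text = post_type
    ET.SubElement(item, f"{{{NS['wp']}}}status").text = "publish"
    return item

item = build_item("Recovered post", "<p>Body recovered from a snapshot.</p>")
print(ET.tostring(item, encoding="unicode"))
```

Because the output is plain XML, you can diff it, hand-edit a title, or post-process it with any XML tooling before handing it to the importer.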
Why heuristics beat theme-specific scraping
Traditional “site rippers” depend on CSS selectors that match a specific theme. That works for one site and breaks on the next. WaybackPress uses content heuristics and generalized extraction so it can handle many WordPress themes and customizations without special rules. It targets the common structure of a WordPress post instead of a particular CSS class name and filters out archives, tag pages, and thin content during validation. This is the only way to scale across years of theme changes and partial snapshots.
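As a sketch of what "content heuristics" means in practice, the function below scores a candidate URL and its extracted text instead of matching theme-specific selectors. The thresholds and patterns here are illustrative assumptions, not WaybackPress's exact rules.

```python
# Heuristic post detection: URL shape plus content length, no CSS selectors.
import re

ARCHIVE_PATTERNS = (r"/tag/", r"/category/", r"/author/", r"/page/\d+")
POST_PATTERN = re.compile(r"/\d{4}/\d{2}/[^/]+/?$")  # e.g. /2016/05/my-post/

def looks_like_post(url: str, text: str, min_chars: int = 400) -> bool:
    """Reject archive-style URLs and thin content; accept date-slug
    permalinks, or any URL whose body is substantially long."""
    if any(re.search(p, url) for p in ARCHIVE_PATTERNS):
        return False                       # tag/category/author/pagination pages
    if len(text.strip()) < min_chars:
        return False                       # thin or truncated content
    return bool(POST_PATTERN.search(url)) or len(text) > 2 * min_chars
```

Checks like these survive theme redesigns because permalink structures and body length change far less often than front-end class names do.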
Built for real archives, not ideal ones
Wayback archives are incomplete by nature. A snapshot might have the HTML but not the hero image. Another snapshot might have the image but a trimmed body. WaybackPress deals with that reality by:
- Using the CDX index to discover many captures per URL, not just the first one you see in the calendar.
- Attempting media recovery across multiple snapshots in passes, with retries and throttling.
- Logging successes and failures so you can run another pass later with different limits or timing.
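The multi-pass idea above can be sketched as a small loop: try every known capture of each asset, throttle between attempts, and return the failures so a later pass can retry them. `fetch` is a stand-in for a real HTTP call; the function names are illustrative, not WaybackPress internals.

```python
# Multi-pass media recovery: exhaust each asset's captures, record failures.
import time

def fetch_media(urls_to_captures, fetch, delay: float = 5.0):
    """urls_to_captures maps a media URL to its candidate capture timestamps.
    Returns (recovered, failed) so a second pass can retry only the failures."""
    recovered, failed = {}, []
    for url, captures in urls_to_captures.items():
        for ts in captures:                # try each snapshot of this asset
            data = fetch(url, ts)
            if data is not None:
                recovered[url] = (ts, data)
                break
            time.sleep(delay)              # throttle between attempts
        else:
            failed.append(url)             # exhausted every known capture
    return recovered, failed
```

Persisting the `failed` list between runs is what makes a second pass cheap: it only revisits assets that every earlier capture failed to yield.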
Key features at a glance
- Theme-agnostic discovery and validation. No per-theme config. It works across many WordPress layouts because it looks for content patterns, not front-end class names.
- Respectful rate limiting. Default 5-second delay and low concurrency to play nicely with archive.org. You can tune it, but the defaults are intentionally conservative.
- Resumable by design. Each stage writes state and reports, so you can pause and continue later without restarting from zero.
- Multi-pass media fetching. Try again using alternate captures when the first attempt fails. This measurably improves media recovery rates on older sites.
- Clean WXR export. Produces a WordPress import file that works with the dashboard importer or wp import. Media files are organized for bulk upload or regeneration.
A practical recovery example
Say you need to recover oldsite.example after the hosting expired last year.
Install from source
git clone https://github.com/stardothosting/shift8-waybackpress.git
cd shift8-waybackpress
pip install -r requirements.txt
pip install -e .
# 1) Discover captures for the domain
waybackpress discover oldsite.example
# 2) Validate and download HTML for each candidate post
waybackpress validate --output wayback-data/oldsite.example
# 3) Fetch media in two passes for better coverage
waybackpress fetch-media --output wayback-data/oldsite.example --pass 1
waybackpress fetch-media --output wayback-data/oldsite.example --pass 2
# 4) Export a WXR for import
waybackpress export --output wayback-data/oldsite.example
Import wordpress-export.xml via Tools → Import. Upload the media/ folder to wp-content/uploads/ and run a thumbnail regen if needed. Expect most posts to import cleanly. Expect some media to be missing on very old snapshots. Inspect media_report.csv for what came through and what did not. That’s the honest reality of working with archival data.
If you prefer to inspect stage outputs, run each stage separately as shown earlier. The README in the repository covers options like --delay, --concurrency, --skip-media, and the export flags for title, URL, and author metadata.
For context on the Wayback APIs and why we query CDX the way we do, the Internet Archive’s developer pages are useful references.
Technical architecture: reliability first
Selector brittleness vs heuristic extraction. CSS selectors that target .entry-content or .post-title are brittle across themes and time. WaybackPress instead identifies posts using a mix of signal checks: minimum content length, date extraction, URL patterns, duplicate detection, and common WordPress structures. That allows it to survive theme redesigns and partial markup.
Multiple data paths. WaybackPress builds its candidate list from the Wayback index and evaluates more than one snapshot per URL. That matters when the HTML exists in May 2016 but the image only exists in October 2016. Cycling different captures is often the difference between a blank thumbnail and a recovered hero image. The behavior aligns with how CDX is intended to be used for capture discovery and filtering.
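Picking among several snapshots of the same URL can be sketched as a simple scoring rule: prefer the capture with the largest response (a fuller body), breaking ties toward the newest timestamp. This scoring is an assumption for illustration, not WaybackPress's exact selection logic.

```python
# Pick a "best" capture per URL from CDX-style rows: size first, then recency.
def best_capture(captures):
    """captures: list of dicts with 'timestamp' (YYYYMMDD...) and 'length'
    (response size in bytes, as CDX reports it)."""
    return max(captures, key=lambda c: (int(c["length"]), c["timestamp"]))
```

Because CDX timestamps sort lexicographically, comparing them as strings is enough for the tie-break; the real win is simply evaluating more than one candidate per URL at all.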
WXR as the output contract. WordPress imports WXR predictably and at scale. Leaning on WXR means we are compatible with the built-in importer and WP-CLI, rather than inventing a custom restore format. It also means devs can diff or post-process the export if they want to reorder categories or authors before the final import.
Operational hygiene. The tool keeps request concurrency low by default and includes a delay between requests. You can tune both with flags if you understand the trade-offs. Respectful defaults are important when working with public infrastructure like the Internet Archive.
What you can expect (and what you can’t)
- Posts and pages: Usually recover well, especially text.
- Images: Expect partial recovery on older or sparsely archived sites. Running a second or third media pass helps. Review the media report to target manual fixes for high-value assets.
- Attachments and downloads: Mixed results. If the file wasn’t captured, it cannot be recovered.
- Perfect pixel parity: Not the goal. The goal is clean content you can re-publish and re-theme quickly.
Legal and ethical use
WaybackPress is for recovering your own content or content you have rights to use. It respects archive.org through conservative defaults and clear user-agent identification. It is not designed for bulk scraping or republishing other people’s material. Review the Internet Archive’s public API notes and use policies before large recoveries, and keep your usage reasonable.
Final Thoughts
WaybackPress won’t rebuild your exact theme, and it cannot create files that were never archived. It does give you a reliable, repeatable way to pull real content out of the Wayback Machine and land it back in WordPress with a minimum of hand work.
Lastly, please consider donating to the Internet Archive if you enjoy using it or rely on it in any way.