We scrape data from sites throughout the federation so that we can consult it later when searching for something of current interest. See How Search Works
Our sitemap scraper runs four times a day. Logs from each run can be viewed online. page
We report the page count and domain name of sites found to be online and reporting pages in their sitemaps. page
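The per-site page count can be sketched roughly as follows. This is a minimal illustration and not the scraper's actual code: it assumes each site's sitemap is a JSON array with one entry per page, and the function name `page_counts`, the input shape, and the sample domains are all hypothetical.

```python
import json

def page_counts(sitemaps):
    """Map each domain to its sitemap page count, skipping sites with no pages.

    `sitemaps` maps a domain name to the raw JSON text of its sitemap,
    or to None when the site could not be reached (an assumption of this sketch).
    """
    counts = {}
    for domain, raw in sitemaps.items():
        if raw is None:
            continue  # site was offline
        try:
            entries = json.loads(raw)
        except json.JSONDecodeError:
            continue  # sitemap could not be parsed
        if isinstance(entries, list) and entries:
            counts[domain] = len(entries)
    return counts

# Hypothetical sample data standing in for fetched sitemaps.
sample = {
    "a.example.com": '[{"slug": "welcome-visitors"}, {"slug": "how-search-works"}]',
    "b.example.com": None,   # offline
    "c.example.com": "[]",   # online but reporting no pages
}

for domain, count in sorted(page_counts(sample).items()):
    print(count, domain)
```

Only sites that are both online and reporting at least one page appear in the result, which matches the wording of the report described above.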
We plot statistics collected over the life of our search indexing. plots
We distribute the index files individually and in a single 48 megabyte compressed tar file. tgz
I was surprised to learn that we don't index words that appear in the title of a page. This is indeed true as confirmed by inspecting the source code. github
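The effect can be illustrated with a small sketch. It assumes a page is a JSON object with a `title` and a `story` list whose items carry a `text` field; the function `indexed_words` and its tokenizing rule are hypothetical and only mimic the behavior described here, not the scraper's real code.

```python
import re

def indexed_words(page):
    """Collect lowercase words from story item text only; the title is never read."""
    words = set()
    for item in page.get("story", []):
        text = item.get("text", "")
        # Hypothetical tokenizer: runs of ASCII letters, lowercased.
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words

# Hypothetical page whose title word does not appear in the story.
page = {
    "title": "Quokka Notes",
    "story": [{"type": "paragraph", "text": "We scrape data from sites."}],
}

words = indexed_words(page)
print("scrape" in words)  # True
print("quokka" in words)  # False
```

A search for a word that occurs only in a title would find nothing under this scheme, which is the surprise noted above.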
A new pure JavaScript version of the federation scraper was envisioned as an "outpost" service. This work has been stalled by higher priorities. See Federation Scraper