Redesigning the Index

I believe I’ve reached a point in Clew’s development where, armed with the knowledge I’ve acquired from months of crawling sites and using that data to search the index, it’s time to wipe the index and start over.

Why the heck would I do this? Well, my options are either to re-crawl every single site or to get a fresh start; the latter gives me a chance to end up with a higher-quality index rather than just an incremental upgrade of the current one.

keeping a record

The first big improvement I want to make is to keep a better record of the actions the crawler has taken in the past and the results of those actions. This record may not be used or required by the crawler itself, but it would make future decisions like this one far easier; the crawler could use it as a cache to regenerate whatever information it needs without having to spam every site in the index again.

I’d like to implement this using WARC (Web ARChive), a standard format for archiving webpages that Marginalia also uses. A standard format will be useful both in terms of library support and possible interoperability between crawlers.
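To make the idea concrete, here’s a minimal sketch of what writing a crawled response out as a WARC record could look like, assuming Python and the warcio library; Clew’s actual stack and tooling may well differ.

```python
# A minimal sketch of archiving a crawled response as a WARC record,
# assuming Python and the warcio library; Clew's actual stack may differ.
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders


def archive_page(url: str, out_path: str = "crawl.warc.gz") -> None:
    """Fetch a URL and append it to a gzipped WARC archive."""
    resp = requests.get(url, stream=True)

    with open(out_path, "ab") as output:
        writer = WARCWriter(output, gzip=True)

        # Preserve the HTTP status line and headers alongside the body,
        # so the archive can later stand in for a re-crawl.
        http_headers = StatusAndHeaders(
            f"{resp.status_code} {resp.reason}",
            resp.raw.headers.items(),
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(
            url, "response", payload=resp.raw, http_headers=http_headers
        )
        writer.write_record(record)
```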

Now, the reason I didn’t implement something like this originally is that I want to be able to publish the index without it becoming a prime source for Machine Learning training; that would be incredibly disrespectful to the sites in Clew’s index. If I keep an archive, it would have to be optional; that way I can publish the index without the archive, and people can still use it to self-host their own Clew instances.

possible fragmentation of the crawler

One of the most requested features upon Clew’s launch was something I wasn’t expecting: many people don’t have the time to contribute code to Clew or money to spare for financial support, but are willing to contribute bandwidth and computing resources by hosting their own crawler instance.

With the current architecture of the crawler, this is impossible. In fact, even I can’t run multiple crawler instances on the same machine.

If I implemented WARC, however, this would become a possibility. A centralized manager could decide what needs to be crawled, send out a batch of URLs to each crawler instance (volunteer or official), receive WARC replies, then process the information and index it. (“Ariadne@HOME”, anybody?)
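Here’s a rough sketch of what that manager/worker exchange could look like, assuming a simple HTTP-and-JSON protocol; the endpoints, field names, and token-based authentication here are all hypothetical, not anything that exists yet.

```python
# A rough sketch of how a volunteer crawler might talk to a central manager.
# Everything here is hypothetical: the URL, endpoints, and field names are
# placeholders for whatever protocol the manager ends up speaking.
import requests

MANAGER = "https://example.org/api"   # placeholder manager URL
TOKEN = "volunteer-api-token"         # issued after manual vetting

headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Ask the manager for a batch of URLs to crawl.
batch = requests.get(f"{MANAGER}/batch", headers=headers).json()
# e.g. {"batch_id": 42, "urls": ["https://example.com/", ...]}

# 2. Crawl the URLs locally, writing the responses into one WARC file
#    (see the archiving sketch above), then upload the result.
with open("batch-42.warc.gz", "rb") as fh:
    requests.post(
        f"{MANAGER}/batch/{batch['batch_id']}/results",
        headers=headers,
        files={"warc": fh},
    )
```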

There would have to be some sort of manual vetting of volunteer crawlers and authentication to be sure that there’s no one trying to poison the index, but that’s a bridge that I’m confident can be crossed.

more detailed crawling-focused information

Currently, the information that ends up in the database is focused almost entirely on its relevance to ranking eventual search results. The crawler, however, could really benefit from a couple of tables keeping track of its own progress and the information it needs.
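For illustration, the kind of crawler-side bookkeeping I have in mind might look something like this, sketched with SQLite for brevity; the actual tables, fields, and database are still open questions.

```python
# One way crawler-side bookkeeping tables could look, sketched with SQLite
# for brevity; the real schema and database choice are open questions.
import sqlite3

con = sqlite3.connect("crawler_state.db")
con.executescript(
    """
    -- Per-URL crawl progress, separate from the ranking-oriented index tables.
    CREATE TABLE IF NOT EXISTS crawl_log (
        url           TEXT PRIMARY KEY,
        last_crawled  TEXT,      -- ISO 8601 timestamp
        status_code   INTEGER,   -- last HTTP status seen
        warc_file     TEXT       -- which archive file holds the response
    );

    -- Per-host information the crawler needs but the ranker doesn't.
    CREATE TABLE IF NOT EXISTS host_info (
        host            TEXT PRIMARY KEY,
        robots_fetched  TEXT,    -- when robots.txt was last refreshed
        crawl_delay     REAL,    -- seconds between requests to this host
        asn             INTEGER  -- see the ASN section below
    );
    """
)
con.commit()
```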

better compound keyword detection

Compound keywords are something I implemented very early on in Clew’s closed beta. Searching for pages with “Benjamin Hollon” instead of just “Benjamin” and “Hollon” separately gets you far more accurate results, for example.

The tricky thing was how to know when successive words are compound keywords. I ended up just entering all two-word sequences from every page into the database and hoping it would sort itself out.

Now, having implemented code using this system, I’ve come up with a better option: take all the two-word sequences, then figure out which ones repeat most often within each specific page! That way I’m not entering all the coincidental alignments of filler words into the database and then trying to rank pages based on those words.

This would also allow me to start doing this with longer sequences; three, four, and even five-word sequences, perhaps. I don’t want to keep track of all five-word sequences, but if the same exact five-word sequence appears multiple times in the same page, it’s probably relevant.
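As a sketch of the approach, something like the following collects two- to five-word sequences and keeps only the ones that repeat within a single page; the repetition threshold here is an arbitrary placeholder, not a decision I’ve made.

```python
# A minimal sketch of the per-page approach: collect two- to five-word
# sequences and keep only those that repeat within the page. The
# repetition threshold is an arbitrary placeholder.
from collections import Counter


def compound_keywords(words: list[str], max_len: int = 5, min_count: int = 2):
    """Return word sequences that appear at least `min_count` times on one page."""
    counts: Counter[tuple[str, ...]] = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i : i + n])] += 1
    return {" ".join(seq): c for seq, c in counts.items() if c >= min_count}


# Example: "benjamin hollon" repeats, so it survives; one-off pairs don't.
page = "benjamin hollon writes about search benjamin hollon builds clew".split()
print(compound_keywords(page))
```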

tracking overall document language

I’ve already been trying to detect a general document language, but I never entered that language into the database; I thought individual keyword languages would be enough. I’ve since discovered that they aren’t, so I’ll alter this behavior.
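A minimal sketch of what that could look like, using the langdetect library as one possible option (the actual library choice is still open):

```python
# A sketch of recording one overall document language in addition to the
# per-keyword languages, using langdetect as one possible library choice.
from langdetect import detect


def document_language(text: str) -> str | None:
    """Best-effort guess at a page's overall language (ISO 639-1 code)."""
    try:
        return detect(text)  # e.g. "en", "de", "fr"
    except Exception:
        return None          # too little text to make a call

# The detected code would then be stored per document alongside the
# existing per-keyword language information.
```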

detection of server ASNs

Keeping track of each server’s Autonomous System Number (ASN) helps me know which company is providing the hosting. This is helpful for detecting spam more easily, calculating the sustainability impact of pages more accurately, and potentially telling the crawler to de-prioritize crawling sites hosted with a certain provider or company.

It could also let users filter out, say, any sites using Cloudflare, which I know some people using Clew would appreciate being able to do.
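As a sketch, the lookup could be done against MaxMind’s free GeoLite2-ASN database via the geoip2 library; both the database choice and the Cloudflare check below are illustrative assumptions rather than settled decisions.

```python
# A sketch of looking up a server's ASN with MaxMind's free GeoLite2-ASN
# database via the geoip2 library; the database choice is an assumption.
import socket

import geoip2.database

reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")


def asn_for_host(host: str) -> tuple[int, str]:
    """Resolve a hostname and return its (ASN, organization name)."""
    ip = socket.gethostbyname(host)
    resp = reader.asn(ip)
    return resp.autonomous_system_number, resp.autonomous_system_organization


# e.g. marking results served from a particular provider so users can filter them:
asn, org = asn_for_host("example.com")
if "cloudflare" in org.lower():
    pass  # flag the page so users can exclude it from their results
```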

evaluation of page/site value

Classic Google had “PageRank”, a formula that scored pages based on the links pointing to them. That strategy has come to be seen as somewhat flawed in retrospect: publishing the criteria used to value pages resulted in lots of bad-faith optimization of websites, which is part of why I didn’t originally implement anything like it in Clew.

I still don’t want to implement any kind of “reputation” feature—the purpose of the engine is to highlight independent and small websites, so that would be counterproductive.

Still, some kind of score for how well a site adheres to the values I want to reward could be useful, if not for actual result ranking, then for crawling purposes. Which brings me to the final point…

optimizing crawl order priority

This is the most compelling reason to re-crawl from scratch. Given what I know now about the web from the statistics I’ve gathered in the process of crawling it, I can re-optimize the crawler to focus on sites that I’m most interested in crawling. Taking the value score I mentioned above, I can pick which links to follow, prioritizing high-value websites.
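A minimal sketch of what a value-ordered crawl frontier could look like; the scores themselves are placeholders for whatever criteria the value rank ends up using.

```python
# A sketch of a crawl frontier ordered by a per-site value score, so
# high-value links are fetched first. The scores here are placeholders.
import heapq

# Higher score = crawl sooner; heapq is a min-heap, so negate the score.
frontier: list[tuple[float, str]] = []

def enqueue(url: str, value_score: float) -> None:
    heapq.heappush(frontier, (-value_score, url))

def next_url() -> str:
    return heapq.heappop(frontier)[1]

enqueue("https://small-personal-site.example/", 0.9)
enqueue("https://big-seo-farm.example/", 0.1)
print(next_url())  # the high-value site comes out first
```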

conclusion

I don’t know when I’ll have time to implement all this. If I do, I’ll probably leave the current index live on the site so the search engine doesn’t go down entirely while I redesign the crawler.

I’m excited about these changes, though, and for the first time in a while I really look forward to having the time to do some work on the crawler.

I hope you’re excited too. Let me know if you have thoughts or feedback, and see you all in the next update!



Benjamin Hollon

Benjamin Hollon is Clew’s creator. When not reinventing the wheel, Benjamin writes for his numerous blogs, crafts stories, plays and composes trombone, travels the world, commits atrocities in the terminal, runs a social media site, codes, studies Communication and Professional Writing at Texas A&M, forgets his family’s birthdays, gets locked into the library (not realizing it’s closed), and generally goofs around.

You can financially support his work, including the development of Clew, on Liberapay.
