<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/assets/xsl/rss.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<id>https://blog.clew.se/</id>
	<link href="https://blog.clew.se/" rel="alternate" />
	<title>Building Clew</title>
	<subtitle>behind the scenes of independent search</subtitle>
	<rights>© Benjamin Hollon. All posts usable under CC BY-SA 4.0 International.</rights>
	<updated>2026-02-05T17:36:34.213Z</updated>
	
	<author>
		<name>Benjamin Hollon</name>
		<email>me@benjaminhollon.com</email>
		<uri>https://benjaminhollon.com/</uri>
	</author>
	
	
	<entry>
		<id>https://blog.clew.se/posts/secret-web/</id>
		<link href="https://blog.clew.se/posts/secret-web/" hreflang="en" rel="alternate"/>
		<title>A Secret Web</title>
		
		<author>
			<name>Benjamin Hollon</name>
			<email>me@benjaminhollon.com</email>
			<uri>https://benjaminhollon.com/</uri>
		</author>
		
		<summary>The web is mind-bogglingly huge; let&#39;s look at how personal websites can thrive and interact despite that.</summary>
	<content type="html">&lt;p&gt;The web is &lt;a href=&quot;https://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html&quot;&gt;mind-bogglingly massive&lt;/a&gt;. So massive, in fact, that it’s nearly impossible to visualize its true scale. Even if you spent your entire lifetime perusing the web and searching every nook and cranny, you would never reach more than a minuscule fraction of the vast ocean of information available to you.&lt;/p&gt;
&lt;p&gt;There is so much information in the world that “post-scarcity” is a severe &lt;em&gt;understatement&lt;/em&gt; of the scale of our information age.&lt;/p&gt;
&lt;p&gt;To have any hope of meaningfully browsing the web, we need systems in place that artificially limit that scope—&lt;em&gt;curation&lt;/em&gt; technologies. Today, web curation happens almost entirely through search engines and social media algorithms: commercial, automated systems that narrow our focus to the specific things we know we want.&lt;/p&gt;
&lt;p&gt;Before search engines, though, there were older, more powerful discovery methods, curated by humans for our benefit; these systems and networks still exist, but they remain largely hidden from the mainstream web’s view.&lt;/p&gt;
&lt;p&gt;Let’s take a look at the secret web, the festival of personal expression and idea sharing happening below the surface, out of the mainstream view, spearheaded by people like you and me.&lt;/p&gt;
&lt;h2&gt;Many Names, One Network&lt;/h2&gt;
&lt;p&gt;Outside the grasp of social media and the commercial web sits a broad community of people with personal websites and blogs, interacting with and following each other without trying to make money or become famous.&lt;/p&gt;
&lt;p&gt;This community has received many names, each trying to capture a different side of the network.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://ar.al/2020/08/07/what-is-the-small-web/&quot;&gt;The Small Web&lt;/a&gt; contrasts this community with the “Big Web”, valuing personal ownership over scale.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://indieweb.org/&quot;&gt;The IndieWeb&lt;/a&gt; also values personal ownership of websites, providing numerous technical standards and proposals to help facilitate interaction between different people’s blogs.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://websiteconf.neocities.org/&quot;&gt;Web 1.0&lt;/a&gt; rejects the hype of “Web 2.0” apps, using simple, straightforward technologies to build websites.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Blogosphere&quot;&gt;The Blogosphere&lt;/a&gt; is a term dating back to 1999 that refers to the community of bloggers.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://wiki.melonland.net/web_revival&quot;&gt;The Web Revival&lt;/a&gt; is the concept shared by many that this community has been growing and making a comeback.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whatever the form, this idea &lt;a href=&quot;https://wiki.melonland.net/manifestos&quot;&gt;keeps coming back&lt;/a&gt;; something about a smaller, more personal web, made up of connections between real people and free of corporate interests, appeals to many.&lt;/p&gt;
&lt;p&gt;I call this a “network” because it truly is one; this community of sites has developed many, many ways for readers to discover more of it, and they all revolve around the &lt;a href=&quot;https://en.wikipedia.org/wiki/Hyperlink&quot;&gt;hyperlink&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Link Graph&lt;/h2&gt;
&lt;p&gt;The most fundamental innovation of the world wide web as a medium is &lt;em&gt;hypertext&lt;/em&gt; (HTTP, for example, stands for “HyperText Transfer Protocol”). Hypertext, put simply, is text that links to other pages. This is the technology that ties the web together: every method of discovering sites, other than word of mouth, relies on links between webpages.&lt;/p&gt;
&lt;p&gt;To really understand how these methods work, we need to understand the &lt;em&gt;link graph&lt;/em&gt;. Let’s construct a very tiny subset of the web, made up of just three websites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://benjaminhollon.com/&quot;&gt;benjaminhollon.com&lt;/a&gt; - My personal website&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://joelchrono.xyz/&quot;&gt;joelchrono.xyz&lt;/a&gt; - A friend of mine, who designed the &lt;a href=&quot;https://clew.se/&quot;&gt;Clew&lt;/a&gt; logo&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://neil-clarke.com/&quot;&gt;neil-clarke.com&lt;/a&gt; - The editor of &lt;a href=&quot;https://clarkesworldmagazine.com/&quot;&gt;Clarkesworld Magazine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Joel and I link to each other’s sites fairly frequently; we talk often and inspire each other. This link is &lt;em&gt;bidirectional&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;While Joel and I are both subscribers to Clarkesworld, I’m the only one who links to Neil Clarke’s blog on my site. Neil Clarke, of course, has probably never encountered my site, so this link is &lt;em&gt;one way&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&quot;https://blog.clew.se/assets/images/link-graph-example.jpg&quot; alt=&quot;A graph showing links both ways between benjaminhollon.com and joelchrono.xyz and one-way from benjaminhollon.com to neil-clarke.com.&quot; tabindex=&quot;0&quot;&gt;&lt;figcaption&gt;This relationship makes for a simple chart&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;This is a very simple example of a &lt;em&gt;link graph&lt;/em&gt;, a representation of how various sites link to each other. The more sites you look at, the more complicated it gets; search engines internally map these relationships across hundreds of thousands, millions, or even billions of sites.&lt;/p&gt;
&lt;p&gt;Given a graph like this, it’s possible to &lt;a href=&quot;https://www.marginalia.nu/log/69-creepy-website-similarity/&quot;&gt;find out how similar sites are&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/PageRank&quot;&gt;decide which sites are the most important or official&lt;/a&gt; (one of Google’s key innovations), or predict how a random person surfing the web might discover websites. This concept of the link graph is the most central tool in any analysis of relationships on the web.&lt;/p&gt;
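&lt;p&gt;To make the link graph concrete, here’s a minimal sketch in Python, using the three example sites above. The adjacency-set structure is purely illustrative (real search engines store their graphs far more compactly), but the bidirectional/one-way classification matches the relationships in the figure:&lt;/p&gt;

```python
# A toy link graph: each site maps to the set of sites it links to.
# Sites and links match the three-site example above.
link_graph = {
    "benjaminhollon.com": {"joelchrono.xyz", "neil-clarke.com"},
    "joelchrono.xyz": {"benjaminhollon.com"},
    "neil-clarke.com": set(),
}

def link_kind(graph, a, b):
    """Classify the relationship between site a and site b."""
    forward = b in graph.get(a, set())
    backward = a in graph.get(b, set())
    if forward and backward:
        return "bidirectional"
    if forward or backward:
        return "one-way"
    return "none"

print(link_kind(link_graph, "benjaminhollon.com", "joelchrono.xyz"))   # bidirectional
print(link_kind(link_graph, "benjaminhollon.com", "neil-clarke.com"))  # one-way
```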
&lt;p&gt;The big difference between the secret web and the large, messy corporate web is in the linking: people on the secret web frequently link and reply to each other’s websites and blog posts, creating a compact network of people with related ideas and interests, while companies on the corporate web profit from keeping you on their sites and link outward begrudgingly. That density of links makes possible projects like &lt;a href=&quot;https://indiemap.org/&quot;&gt;IndieMap&lt;/a&gt;, a link graph of 2,300 IndieWeb sites and their relationships to each other.&lt;/p&gt;
&lt;h2&gt;Classic Web Discovery: Blogrolls, Link Blogs, and Webrings&lt;/h2&gt;
&lt;p&gt;Now that we understand the importance of links to discovery on the web, let’s examine some classic tools for discovering new sites on the web without search engines.&lt;/p&gt;
&lt;p&gt;One powerful tool is the &lt;strong&gt;blogroll&lt;/strong&gt;—many personal websites have a list of other sites the author finds interesting. This “blogroll” lets readers who enjoy one site easily find a curated list of other sites they’ll enjoy. &lt;a href=&quot;https://benjaminhollon.com/blogroll/&quot;&gt;I have a blogroll&lt;/a&gt;, where I link to all the sites I most closely follow. Many of the blogs I link to also have their &lt;em&gt;own&lt;/em&gt; blogrolls, often larger than mine, providing almost endless opportunity to discover new, fascinating people.&lt;/p&gt;
&lt;p&gt;A tool for discovering sites I’ve personally been enjoying lately is the &lt;strong&gt;link blog&lt;/strong&gt;. Link blogs can either be their own dedicated sites or within a larger blog; the idea is a regular post that links to and comments on interesting articles and webpages the author encountered recently.&lt;/p&gt;
&lt;p&gt;One famous (and fantastic) example is &lt;a href=&quot;https://pluralistic.net/&quot;&gt;Pluralistic&lt;/a&gt;, a daily link blog by &lt;a href=&quot;https://craphound.com/bio/&quot;&gt;Cory Doctorow&lt;/a&gt; focused largely on topics such as digital privacy, monopolistic practices by big tech, environmental sustainability, and right to repair. A recent favorite of mine is &lt;a href=&quot;https://shellsharks.com/scrolls/&quot;&gt;Scrolls&lt;/a&gt;, by &lt;a href=&quot;https://shellsharks.com/about&quot;&gt;Mike Sass&lt;/a&gt;, a weekly roundup of fascinating posts and projects from the IndieWeb and Fediverse.&lt;/p&gt;
&lt;p&gt;Everything old is new again, and &lt;strong&gt;webrings&lt;/strong&gt; are no exception. A webring is a collection of sites that all agree to link to each other in a loop: each site links to two neighbors, and by following the links you reach every site in the ring. Some excellent current examples are the &lt;a href=&quot;https://xn--sr8hvo.ws/&quot;&gt;IndieWeb Webring&lt;/a&gt;, &lt;a href=&quot;https://fediring.net/&quot;&gt;Fediring&lt;/a&gt;, and &lt;a href=&quot;https://polyring.club/&quot;&gt;Polyring&lt;/a&gt; (which I designed the logo for). Often, webrings have a theme, letting people browsing easily surf through many related websites.&lt;/p&gt;
&lt;p&gt;And of course, no discussion of discovery on the web would be complete without mentioning &lt;strong&gt;web feeds&lt;/strong&gt; such as &lt;a href=&quot;https://www.rssboard.org/&quot;&gt;RSS&lt;/a&gt; and &lt;a href=&quot;https://www.ietf.org/rfc/rfc4287.txt&quot;&gt;Atom&lt;/a&gt;, a simple, spam-proof way to subscribe to sites we love and hear about new updates. &lt;a href=&quot;https://aboutfeeds.com/&quot;&gt;About Feeds&lt;/a&gt; is a great introduction, if you’re not already familiar with them.&lt;/p&gt;
&lt;h2&gt;The Danger of Search Engines&lt;/h2&gt;
&lt;p&gt;While all of those wonderful methods for discovering the secret web exist, it’s time to mention the elephant in the room: &lt;strong&gt;search engines&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The selling point of a search engine is essentially to remove the need for you, the reader, to worry about the link graph; when you’re looking for something specific, you can quickly find it without needing to browse numerous websites to get to what you need.&lt;/p&gt;
&lt;p&gt;Search engines are extremely valuable tools, but they also bring dangers if you let them be your &lt;em&gt;only&lt;/em&gt; discovery tool.&lt;/p&gt;
&lt;p&gt;Remember, the web is unimaginably big: you will never see all of it. Any search engine you use will, by necessity, show you a subset of information out there, and the designers have to make decisions about what to value. When a search engine is your sole source of information, you give it full control over what parts of the web, what voices and opinions, even &lt;em&gt;exist&lt;/em&gt; for you.&lt;/p&gt;
&lt;p&gt;With a corporate search engine like Google, that means giving a commercial entity, whose main interest is monetary, full control over the ideas you receive. From an epistemological standpoint, that should terrify you.&lt;/p&gt;
&lt;p&gt;Even when you start looking at alternate search engines, it’s important to realize that many, such as DuckDuckGo and Ecosia, use the corporate engines’ indexes under the hood. Mojeek’s &lt;a href=&quot;https://www.searchenginemap.com/&quot;&gt;Search Engine Map&lt;/a&gt; is a good visualization of which engines use their own indexes and which use big tech sources. Seirdy’s &lt;a href=&quot;https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/&quot;&gt;article on search engines with independent indexes&lt;/a&gt; is also a fantastic resource.&lt;/p&gt;
&lt;p&gt;All that to say: a search engine cannot be your sole source of information and discovery. Its strength is in helping you find specific things when you need them, but for a well-rounded information diet, we all need to place more faith in other discovery methods, especially for the independent, secret web.&lt;/p&gt;
&lt;h2&gt;New Solutions, Old Concepts&lt;/h2&gt;
&lt;p&gt;I’m certainly not the only one trying to find new discovery methods for the web right now; very many others are tackling the same task, and that’s good; &lt;a href=&quot;https://benjaminhollon.com/musings/its-time-to-reinvent-the-wheel/&quot;&gt;reinventing the wheel can be a force for positive change&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;First, of course, I’m developing &lt;a href=&quot;https://clew.se/&quot;&gt;Clew&lt;/a&gt;, an independent search engine that focuses on the small, independent web. Many others are currently doing the same or similar:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://marginalia-search.com/&quot;&gt;Marginalia&lt;/a&gt; - This is my favorite independent search engine right now, and the one I use the most often.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://unobtanium.rocks/&quot;&gt;Unobtanium&lt;/a&gt; - The developer and I frequently chat about our work and bounce ideas off of each other.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://stract.com/&quot;&gt;Stract&lt;/a&gt; - This one launched around the same time as Clew; I haven’t looked too closely, though.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lieu.cblgh.org/&quot;&gt;Lieu&lt;/a&gt; - A search engine aimed at searching webrings. Very cool mix of old and new discovery methods!&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://mwmbl.org/&quot;&gt;Mwmbl&lt;/a&gt; - A &lt;em&gt;user-curated&lt;/em&gt; search engine, a fascinating experiment.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://searchmysite.net/&quot;&gt;Search My Site&lt;/a&gt; - Very similar in goals to Clew, but only crawls user-submitted sites instead of trying to discover new sites.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://wiby.me/&quot;&gt;Wiby&lt;/a&gt; - A search engine for websites using older technology, great for use on vintage computers.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://yacy.net/&quot;&gt;YaCy&lt;/a&gt; - A decentralized search engine; cool!&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pearsproject.org/&quot;&gt;PeARS&lt;/a&gt; - A search engine that can be run in the browser, without needing a server. I love the concept, and I’m looking forward to seeing it develop.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.mojeek.com/&quot;&gt;Mojeek&lt;/a&gt; - Probably the biggest of the independent search engines; they’ve been around forever.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ve also been noticing a few cool new takes on older discovery technologies, many of which I’m considering integrating in some way into Clew. OPML, a format that can list RSS feeds, &lt;a href=&quot;https://opml.org/blogroll.opml&quot;&gt;has a relatively recent proposal for auto-discovering machine-readable blogrolls&lt;/a&gt;, which would be interesting to integrate into crawler code to better find which links on a page are valuable to the site’s creator.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://jamesg.blog/&quot;&gt;James&lt;/a&gt; has been developing a feed reader, &lt;a href=&quot;https://artemis.jamesg.blog/&quot;&gt;Artemis&lt;/a&gt;, and most interesting is a recent integration, &lt;a href=&quot;https://jamesg.blog/2025/03/17/artemis-link-graph&quot;&gt;Artemis Link Graph&lt;/a&gt;, a browser extension that will tell you when a page you’re reading has been linked to by sites you follow. This is an &lt;em&gt;excellent&lt;/em&gt; use of the concept of the link graph, and it would be interesting to integrate this concept into a search engine; perhaps users could specify sites they trust, and the search engine could weight or highlight results appropriately, based on where those sites link?&lt;/p&gt;
&lt;p&gt;I have a few other ideas of my own, and my redesign of Clew has made excellent progress in the last month, so I’m excited to publish those ideas relatively soon, probably this summer. It’s a bright world we’re moving into, with amazing ideas for discovering new sites.&lt;/p&gt;
&lt;h2&gt;Preserving and Growing the Secret Web&lt;/h2&gt;
&lt;p&gt;So, we’ve examined how the secret web works, looking at how readers discover new sites and even considering some new, promising ideas. Where do we go next?&lt;/p&gt;
&lt;p&gt;The secret web is exciting, and it’s unlikely that it can ever be killed, but it needs intention and attention to really thrive.&lt;/p&gt;
&lt;p&gt;Thankfully, many people are working on the technology side of this; &lt;a href=&quot;https://indieweb.org/&quot;&gt;the IndieWeb&lt;/a&gt;, particularly, is spearheading the development of wonderful technology for growing and connecting personal websites.&lt;/p&gt;
&lt;p&gt;With the technical side taken care of, we should consider the social side: much of the independent web today is made up of people with similar interests, in technology in particular. We need to lower the barrier to entry and expand access to this web. We need to make this web an &lt;em&gt;open&lt;/em&gt; secret.&lt;/p&gt;
&lt;p&gt;There are two sides to this: making it easier to learn to &lt;em&gt;create&lt;/em&gt; websites and making websites easier to &lt;em&gt;discover&lt;/em&gt;. In the past, I worked on &lt;a href=&quot;https://readable-css.freedomtowrite.org/&quot;&gt;readable.css&lt;/a&gt;, which makes building beautiful, accessible sites easier. Now, I’m working on Clew, to make discovering what real people think easier.&lt;/p&gt;
&lt;p&gt;That’s really Clew’s mission: serving the searches where you’re not trying to reach an “official” source of information but looking for blog posts or fan pages about the topic. Very early after releasing Clew, when results were worse than they are today (and, hopefully, &lt;em&gt;much&lt;/em&gt; worse than where they will be soon), the feedback I received was still overwhelmingly positive: even though people weren’t getting the results they’d hoped for, the sites they &lt;em&gt;did&lt;/em&gt; find were fascinating and ones they hadn’t seen before.&lt;/p&gt;
&lt;p&gt;So, what can you, the person reading this, do? If you have a website, fantastic! Maybe write some more code for it, or draft a blog post if you haven’t done so in a while.&lt;/p&gt;
&lt;p&gt;If you think starting a website sounds like fun, you should give it a go! There are very many resources out there. A good starting place is my own guide to &lt;a href=&quot;https://benjaminhollon.com/musings/blogging-on-a-budget/&quot;&gt;blogging on a budget&lt;/a&gt;, where I look at options to have a very personal site without breaking the bank, or even for free.&lt;/p&gt;
&lt;p&gt;And, if you’re interested in finding new, exciting sites from the secret web, perhaps &lt;a href=&quot;https://blogroll.org/&quot;&gt;look through a blogroll&lt;/a&gt;, &lt;a href=&quot;https://82mhz.net/posts/2025/04/linkdump-no-52/&quot;&gt;link blog&lt;/a&gt;, &lt;a href=&quot;https://xn--sr8hvo.ws/directory&quot;&gt;webring&lt;/a&gt;, or &lt;a href=&quot;https://clew.se/&quot;&gt;try searching the web with Clew&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I hope to see you around &lt;a href=&quot;https://benjaminhollon.com/blogroll/&quot;&gt;my corner of the web neighborhood&lt;/a&gt; soon—happy browsing!&lt;/p&gt;</content>
		<updated>2025-05-04T02:34:31.000Z</updated>
		<published>2025-05-04T02:34:31.000Z</published>
	</entry>
	
	<entry>
		<id>https://blog.clew.se/posts/the-new-ariadne-architecture/</id>
		<link href="https://blog.clew.se/posts/the-new-ariadne-architecture/" hreflang="en" rel="alternate"/>
		<title>The New Ariadne Architecture</title>
		
		<author>
			<name>Benjamin Hollon</name>
			<email>me@benjaminhollon.com</email>
			<uri>https://benjaminhollon.com/</uri>
		</author>
		
		<summary>While on a fourteen-hour international flight, I finally managed to come up with an architecture for Clew&#39;s web crawler that I&#39;m happy with. Here&#39;s the run-down.</summary>
		<content type="html">&lt;p&gt;While on a fourteen-hour international flight, I finally managed to come up with an architecture for Clew’s web crawler that I’m happy with. Here’s the run-down.&lt;/p&gt;
&lt;p&gt;The crawler will be split into three main pieces that run independently:&lt;/p&gt;
&lt;h2&gt;ariadne&lt;/h2&gt;
&lt;p&gt;Ariadne is the main process of the web crawler, overseeing tasks and processing data.&lt;/p&gt;
&lt;p&gt;When the task queue is low, Ariadne will select new tasks as needed from the database, based on following links, re-crawling pages, checking feeds, and so forth. It will calculate what web requests need to be made and log them to the database, where the other parts will process them.&lt;/p&gt;
&lt;p&gt;Then a set of parallel workers will process data from the completed requests as it comes in. This part of the process scales easily: I can set how many workers should be active at any given time, and each works in its own process.&lt;/p&gt;
&lt;p&gt;The one crucial difference from the previous architecture: Ariadne will not make any web requests itself. It simply decides what requests need to happen and processes the data once they’ve been made.&lt;/p&gt;
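&lt;p&gt;As a rough sketch of that task-selection loop, here’s the shape of the idea in Python. Everything here is illustrative, not Clew’s actual code: the source names, the low-water and target constants, and the (reason, url) record format are all made up for the example. The key property is that Ariadne only &lt;em&gt;logs&lt;/em&gt; requests; it never fetches anything itself:&lt;/p&gt;

```python
from collections import deque

# Illustrative thresholds (not from Clew): refill when pending work is
# scarce, and stop once the queue is comfortably full again.
QUEUE_LOW_WATER = 4
QUEUE_TARGET = 8

def top_up(queue, sources):
    """Refill the task queue from prioritized sources of candidate URLs.

    Each entry is a (reason, url) record; another component will later
    pick these up and actually perform the web request.
    """
    if len(queue) >= QUEUE_LOW_WATER:
        return queue  # plenty of work pending already
    for reason in ("follow-link", "re-crawl", "check-feed"):
        for url in sources.get(reason, []):
            if len(queue) >= QUEUE_TARGET:
                return queue
            queue.append((reason, url))
    return queue

queue = deque()
sources = {
    "follow-link": ["https://example.org/a", "https://example.org/b"],
    "re-crawl": ["https://example.net/"],
    "check-feed": ["https://example.com/feed.xml"],
}
top_up(queue, sources)
print(len(queue))  # 4
```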
&lt;h3&gt;behind the name&lt;/h3&gt;
&lt;p&gt;The name Clew refers to the ball of string used by Theseus to navigate the Labyrinth, an apt metaphor for the difficulty of navigating the web that this search engine tries to ease.&lt;/p&gt;
&lt;p&gt;Ariadne was the woman who wound the clew and gave it to Theseus, and feels like an appropriate name for the web crawler that fetches and organizes the data needed for the engine to be able to find anything.&lt;/p&gt;
&lt;p&gt;Ariadne will continue to be the name for the overall crawler and the user agent it uses when sending requests to sites, to avoid confusion.&lt;/p&gt;
&lt;h2&gt;daedalus&lt;/h2&gt;
&lt;p&gt;Daedalus is the first new part of the crawler’s architecture. Similarly to Ariadne, it will not make any requests itself, but neither will it process any data. Daedalus is, so to speak, a middleman. A manager.&lt;/p&gt;
&lt;p&gt;Daedalus takes items in the task queue that Ariadne generated and collates them into parcels of requests to sites that make sense together. Then it sends these off to individual crawler “nodes” to actually make the requests. This allows me to scale up crawling speeds by adding nodes, and even to accept volunteers who want to run their own nodes.&lt;/p&gt;
&lt;p&gt;Daedalus will also handle authentication with those nodes, so that we don’t have malicious nodes poisoning the index data.&lt;/p&gt;
&lt;p&gt;Daedalus is something of a bottleneck, as it can’t be parallelized very well itself, but as it doesn’t actually process any data or make web requests, that &lt;em&gt;shouldn’t&lt;/em&gt; be an issue, at least in the foreseeable future.&lt;/p&gt;
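&lt;p&gt;The collation step might look something like this in Python. This is an illustrative sketch, not Clew’s actual code: it simply groups pending request URLs by host so that all requests for one site land in a single parcel, handed to a single node. That both batches related work together and guarantees no two nodes hit the same site at once:&lt;/p&gt;

```python
from collections import defaultdict
from urllib.parse import urlsplit

def collate(urls):
    """Group pending request URLs into one parcel per host."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    # One parcel per host; sorted only to make the output deterministic.
    return [by_host[host] for host in sorted(by_host)]

pending = [
    "https://example.org/a",
    "https://example.net/x",
    "https://example.org/b",
]
print(collate(pending))
# [['https://example.net/x'], ['https://example.org/a', 'https://example.org/b']]
```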
&lt;h3&gt;behind the name&lt;/h3&gt;
&lt;p&gt;Daedalus was the designer of the Labyrinth, a brilliant inventor, and the one who designed and gave Ariadne the clew used by Theseus.&lt;/p&gt;
&lt;p&gt;After Theseus and Ariadne escape, Daedalus and his son, Icarus, are imprisoned by King Minos and escape via wings Daedalus constructs. Unfortunately, Icarus dies in the process.&lt;/p&gt;
&lt;p&gt;Daedalus felt an appropriate name for a process that manages and organizes what needs to be done, as this takes a good deal of logic and inventive skills to decide where to assign tasks.&lt;/p&gt;
&lt;h2&gt;icarus&lt;/h2&gt;
&lt;p&gt;Icarus will be the final piece of the puzzle: a small piece of code that accepts URLs from Daedalus and actually fetches them. It will record the data, probably in the WARC format (which should increase the possibility of collaboration between my crawlers and others in the future), then send that data to Daedalus once it’s collected.&lt;/p&gt;
&lt;p&gt;Icarus is very scalable, as I can host as many nodes of it as I want, even on different machines and possibly hosted by volunteers. Great care will be taken, however, to avoid requesting pages from a single site via multiple Icarus nodes at once, so as not to overload other people’s servers. This will be part of the logic in Daedalus.&lt;/p&gt;
&lt;h3&gt;behind the name&lt;/h3&gt;
&lt;p&gt;As mentioned, Icarus was Daedalus’s son, which was about the full extent of the logic for naming this section of the code. Icarus’s tragic death from flying too close to the sun, melting the wax holding the feathers to his wings, will &lt;em&gt;hopefully&lt;/em&gt; not be repeated in this project.&lt;/p&gt;
&lt;h2&gt;conclusion&lt;/h2&gt;
&lt;p&gt;And there you have it, the three pieces I plan to construct for my rewrite of Clew’s crawler. I’ve set up the code so that the parts can be easily installed via &lt;code&gt;pipx&lt;/code&gt; and am building CLIs for them, which should ease setup and use of each of the parts. Running a volunteer Icarus node should be as simple as installing the utility, entering your provided authentication key, and running a command to start the process.&lt;/p&gt;
&lt;p&gt;If you’ve got ideas sparked by this, by all means contact me with them! I eagerly await your feedback.&lt;/p&gt;</content>
		<updated>2024-12-14T00:58:29.000Z</updated>
		<published>2024-12-14T00:58:29.000Z</published>
	</entry>
	
	<entry>
		<id>https://blog.clew.se/posts/index-redesign/</id>
		<link href="https://blog.clew.se/posts/index-redesign/" hreflang="en" rel="alternate"/>
		<title>Redesigning the Index</title>
		
		<author>
			<name>Benjamin Hollon</name>
			<email>me@benjaminhollon.com</email>
			<uri>https://benjaminhollon.com/</uri>
		</author>
		
		<summary>I believe I&#39;ve reached a point in Clew&#39;s development where, armed with the knowledge I&#39;ve acquired from months of crawling sites and using that data to search the index, it&#39;s time to wipe the index and start over.</summary>
		<content type="html">&lt;p&gt;I believe I’ve reached a point in Clew’s development where, armed with the knowledge I’ve acquired from months of crawling sites and using that data to search the index, it’s time to wipe the index and start over.&lt;/p&gt;
&lt;p&gt;Why the heck would I do this? Well, my options are either to re-crawl every single site or to get a fresh start; the latter gives me a chance to end up with a higher-quality index rather than an upgrade of the current one.&lt;/p&gt;
&lt;h2&gt;keeping a record&lt;/h2&gt;
&lt;p&gt;The first big improvement I want to make is to keep a better record of actions the crawler has taken in the past and the results of those actions. This record may not be used or required by the crawler itself, but would make future decisions like this far easier; the crawler can use this record as a cache to generate the new information without having to spam every site in the index to get the information.&lt;/p&gt;
&lt;p&gt;I’d like to implement this using &lt;a href=&quot;https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/&quot;&gt;WARC&lt;/a&gt;, a standard format for archiving webpages that &lt;a href=&quot;https://www.marginalia.nu/log/94_warc_warc/&quot;&gt;Marginalia also uses&lt;/a&gt;. A standard format will be useful both in terms of library support and possible interoperability between crawlers.&lt;/p&gt;
&lt;p&gt;Now, the reason I didn’t implement something like this originally is that I want to be able to publish the index without it becoming a prime source for machine learning training; that would be incredibly disrespectful to the sites in Clew’s index. If I keep an archive, it would have to be optional; that way I can publish the index without the archive, and people can still use it to self-host their own Clew instances.&lt;/p&gt;
&lt;h2&gt;possible fragmentation of the crawler&lt;/h2&gt;
&lt;p&gt;One of the most requested features upon Clew’s launch was something I wasn’t expecting: many people don’t have the time to contribute code to Clew or money to spare for financial support, but &lt;em&gt;are&lt;/em&gt; willing to contribute bandwidth and computing resources by hosting their own crawler instance.&lt;/p&gt;
&lt;p&gt;With the current architecture of the crawler, this is impossible. In fact, even &lt;em&gt;I&lt;/em&gt; can’t run multiple crawler instances on the same machine.&lt;/p&gt;
&lt;p&gt;If I implemented WARC, however, this would become a possibility. A centralized manager could decide what needs to be crawled, send out a batch of URLs to each crawler instance (volunteer or official), receive WARC replies, then process the information and index it. (“Ariadne@HOME”, anybody?)&lt;/p&gt;
&lt;p&gt;There would have to be some sort of manual vetting of volunteer crawlers and authentication to be sure that there’s no one trying to poison the index, but that’s a bridge that I’m confident can be crossed.&lt;/p&gt;
&lt;h2&gt;more detailed crawling-focused information&lt;/h2&gt;
&lt;p&gt;Currently, the information that ends up in the database is very focused on its relevance to ranking eventual search results. The crawler, however, could really benefit from a couple of tables tracking its own progress and the information it needs.&lt;/p&gt;
&lt;h2&gt;better compound keyword detection&lt;/h2&gt;
&lt;p&gt;Compound keywords are something I implemented very early in Clew’s closed beta. Searching for pages with “Benjamin Hollon” instead of just “Benjamin” and “Hollon” separately gets you far more accurate results, for example.&lt;/p&gt;
&lt;p&gt;The tricky thing was how to know when successive words are compound keywords. I ended up just entering all two-word sequences from every page into the database and hoping it would sort itself out.&lt;/p&gt;
&lt;p&gt;Now, having implemented code using this system, I’ve come up with a better option: take all the two-word sequences, then figure out which are most repeated within each specific page! That way I’m not entering in all the coincidental alignments of filler words into the database and then trying to rank pages based on those words.&lt;/p&gt;
&lt;p&gt;This would also allow me to start doing this with longer sequences; three, four, and even five-word sequences, perhaps. I don’t want to keep track of all five-word sequences, but if the same exact five-word sequence appears multiple times in the same page, it’s probably relevant.&lt;/p&gt;
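&lt;p&gt;A sketch of the per-page approach described above, in Python (illustrative only, not Clew’s actual implementation): collect word sequences of length n and keep only the ones that repeat within the page, so one-off alignments of filler words never reach the database:&lt;/p&gt;

```python
from collections import Counter

def repeated_ngrams(text, n=2, min_count=2):
    """Return the n-word sequences that appear at least min_count times."""
    words = text.lower().split()
    grams = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    return {gram: count for gram, count in grams.items() if count >= min_count}

page = "Benjamin Hollon writes here. Benjamin Hollon also makes Clew."
print(repeated_ngrams(page, n=2))  # {'benjamin hollon': 2}
```

&lt;p&gt;The same function with n set to 3, 4, or 5 covers the longer-sequence idea: a five-word sequence only survives if the exact same five words recur in the page.&lt;/p&gt;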
&lt;h2&gt;tracking overall document language&lt;/h2&gt;
&lt;p&gt;I’ve already been trying to detect a general document language, but I never entered that language into the database; I thought individual keyword languages would be enough. I’ve since discovered it’s not really enough, so I’ll alter this behavior.&lt;/p&gt;
&lt;h2&gt;detection of server ASNs&lt;/h2&gt;
&lt;p&gt;Keeping track of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Autonomous_system_(Internet)&quot;&gt;Autonomous System Number&lt;/a&gt; of servers helps me know what company is providing the hosting. This is helpful to more easily detect spam, better calculate the sustainability impact of pages, and potentially tell the crawler to de-prioritize the crawling of sites hosted with a certain provider or company.&lt;/p&gt;
&lt;p&gt;It could also give users the ability to filter out, say, any sites using Cloudflare, which I know some people using Clew would appreciate.&lt;/p&gt;
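&lt;p&gt;The lookup itself could be as simple as a longest-prefix match against a routing-table dump. A minimal sketch (the prefixes and ASNs below are made-up documentation values, not real assignments):&lt;/p&gt;

```python
import ipaddress

# Illustrative prefix-to-ASN table; in practice this would be loaded from
# a routing-table dump. These prefixes and ASNs are documentation values.
PREFIX_TO_ASN = {
    ipaddress.ip_network("203.0.113.0/24"): 64500,
    ipaddress.ip_network("198.51.100.0/24"): 64501,
}

def asn_for_ip(ip):
    """Longest-prefix match of an address against the known prefixes."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn in PREFIX_TO_ASN.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None
```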
&lt;h2&gt;evaluation of page/site value&lt;/h2&gt;
&lt;p&gt;Classic Google had “PageRank”, a formula to evaluate the value of a page based on the links pointing to it. That strategy has come to be seen as somewhat flawed in retrospect: publishing the criteria used to value pages resulted in lots of bad-faith optimization of websites, which is part of why I didn’t originally implement anything like this in Clew.&lt;/p&gt;
&lt;p&gt;I still don’t want to implement any kind of “reputation” feature—the purpose of the engine is to highlight independent and small websites, so that would be counterproductive.&lt;/p&gt;
&lt;p&gt;Still, some kind of rank of how well a site adheres to the values I want to reward could be useful, if not for actual result ranking, then at least for crawling purposes. Which brings me to the final point…&lt;/p&gt;
&lt;h2&gt;optimizing crawl order priority&lt;/h2&gt;
&lt;p&gt;This is the most compelling reason to re-crawl from scratch. Given what I know now about the web from the statistics I’ve gathered in the process of crawling it, I can re-optimize the crawler to focus on sites that I’m most interested in crawling. Taking the value score I mentioned above, I can pick which links to follow, prioritizing high-value websites.&lt;/p&gt;
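&lt;p&gt;In practice this could be as simple as a priority queue keyed on the site’s value score. A sketch of the idea (assuming the score comes from the evaluation described above; the class and field names are hypothetical):&lt;/p&gt;

```python
import heapq
import itertools

class CrawlFrontier:
    """Priority frontier: URLs from higher-value sites are crawled first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tiebreak for equal scores

    def add(self, url, site_value):
        # heapq is a min-heap, so negate the score to pop best-first.
        heapq.heappush(self._heap, (-site_value, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```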
&lt;h2&gt;conclusion&lt;/h2&gt;
&lt;p&gt;I don’t know when I’ll have time to implement this. If I do, I’ll probably leave the current index live on the site so the search engine doesn’t go completely down while I redesign the crawler.&lt;/p&gt;
&lt;p&gt;I’m excited about these changes, though, and for the first time in a while I really look forward to having the time to do some work on the crawler.&lt;/p&gt;
&lt;p&gt;I hope you’re excited too. Let me know if you have thoughts or feedback, and see you all in the next update!&lt;/p&gt;</content>
		<updated>2024-11-15T20:52:34.000Z</updated>
		<published>2024-11-15T20:52:34.000Z</published>
	</entry>
	
	<entry>
		<id>https://blog.clew.se/posts/im-losing-faith-in-bm25/</id>
		<link href="https://blog.clew.se/posts/im-losing-faith-in-bm25/" hreflang="en" rel="alternate"/>
		<title>I&#39;m Losing Faith in BM25</title>
		
		<author>
			<name>Benjamin Hollon</name>
			<email>me@benjaminhollon.com</email>
			<uri>https://benjaminhollon.com/</uri>
		</author>
		
		<summary>The current way that result ranking works in Clew is very different from what I want.</summary>
		<content type="html">&lt;p&gt;Up till now, Clew has ranked its results primarily based on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Okapi_BM25&quot;&gt;BM25 algorithm&lt;/a&gt;, a quite brilliant formula for determining the “best match” for a set of keywords out of a set of documents.&lt;/p&gt;
&lt;p&gt;When I first implemented this I was thrilled—not least because I somehow got a complicated math formula working in pure SQL—but as time has gone on I’m less and less pleased with the actual results.&lt;/p&gt;
&lt;aside&gt;
&lt;p&gt;&lt;strong&gt;Nerd alert!&lt;/strong&gt; If you can’t already tell, this update will be rather heavy in technical details. Or, at least, I’ll be going in-depth into the weeds of the math I want to do to rank websites by relevance; you shouldn’t actually need any coding knowledge to understand this post.&lt;/p&gt;
&lt;p&gt;If you feel like sitting this update out, I getcha, skip away without worrying about hurting my feelings. I’ll try and make the next update more entertaining for a wider audience.&lt;/p&gt;
&lt;/aside&gt;
&lt;h2&gt;how things work now&lt;/h2&gt;
&lt;p&gt;I’m gonna try and give a brief overview of what’s going on behind the scenes when the Clew backend is fed a query to help you better contextualize the system I &lt;em&gt;want&lt;/em&gt; to put into place. If you want deeper technical details, do your own research into the algorithm.&lt;/p&gt;
&lt;p&gt;The basic idea behind BM25 is that, given the number of times a word appears in a document (in the case of Clew, &lt;code&gt;&amp;quot;document&amp;quot; == &amp;quot;webpage&amp;quot;&lt;/code&gt;), some basic information about the document, and some information about the word’s frequency overall, you can calculate how well the keyword matches the document.&lt;/p&gt;
&lt;p&gt;For the architecture of Clew, a model like this was a godsend, since it doesn’t require the full text of sites to search, only a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bag-of-words_model&quot;&gt;“bag of words”&lt;/a&gt; linking keywords to documents they appear in.&lt;/p&gt;
&lt;p&gt;The problem arises when a search is for &lt;em&gt;multiple&lt;/em&gt; keywords. With base BM25, you’re instructed to add up the scores for each keyword to get the overall score. In many cases, this works fine; rarer words are weighted more heavily, which keeps words like “the” from being considered more important than “armadillo”. However, an excessive number of &lt;em&gt;occurrences&lt;/em&gt; of a less-important word can outweigh that.&lt;/p&gt;
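&lt;p&gt;For reference, here’s the standard per-keyword BM25 formula sketched in Python (Clew’s real implementation lives in SQL; the parameter defaults here are the commonly cited ones):&lt;/p&gt;

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """One keyword's BM25 contribution: rarer words (low df) get a
    higher IDF weight, but a large tf can still inflate the score.
    tf: occurrences in the document; df: documents containing the word."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Base BM25 simply sums the per-keyword contributions:
#   score(doc, query) = sum of bm25_term(...) over the query's keywords
```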
&lt;p&gt;As an example, take my own name, “Benjamin Hollon”. As of today, searching for my full name on Clew does not actually bring up myself as the first result; instead, it brings up someone whose site doesn’t even &lt;em&gt;mention&lt;/em&gt; “Hollon”; it just says “Benjamin” more times than mine does.&lt;/p&gt;
&lt;p&gt;And that, I think, is the fundamental problem with BM25 for this application: it’s not actually &lt;em&gt;complex&lt;/em&gt; enough.&lt;/p&gt;
&lt;p&gt;Now, don’t get me wrong, I like simplicity. But everything multiplying and adding up to a single score isn’t really suited to a search engine that’s looking at websites from a wide range of authors and web developers.&lt;/p&gt;
&lt;h2&gt;how I want things to work&lt;/h2&gt;
&lt;p&gt;The fundamental principle of the way I &lt;em&gt;want&lt;/em&gt; to rank things is this: some factors are more important than others at an absolute level.&lt;/p&gt;
&lt;p&gt;You see, I’m not limited to sorting by only &lt;em&gt;one&lt;/em&gt; factor. I can say “rank by this, and then if it’s a tie, rank by this other measure”. So, while BM25 will likely remain a component of rankings on Clew, it’ll probably only be a small tiebreaker for when the methods I’m about to describe fail.&lt;/p&gt;
&lt;p&gt;The primary factor I want to rank by is the &lt;em&gt;completeness&lt;/em&gt; of a match. What do I mean by that?&lt;/p&gt;
&lt;p&gt;Well, let’s start with a simple model. In a search for “Benjamin Hollon”, a match is 50% complete if it only has &lt;em&gt;either&lt;/em&gt; “Benjamin” &lt;em&gt;or&lt;/em&gt; “Hollon”. It must contain both to be a 100% match.&lt;/p&gt;
&lt;p&gt;But, of course, a match containing only “Hollon” is more likely to be relevant, since it’s a relatively uncommon surname, while “Benjamin” is a fairly common given name, so in reality, perhaps it should be 30% and 70%.&lt;/p&gt;
&lt;p&gt;And, to take it further, if “Benjamin” and “Hollon” appear in separate paragraphs, that’s less complete a match than a site with “Benjamin Hollon” in order.&lt;/p&gt;
&lt;p&gt;I want to make code to take a query, extract the keywords, weigh the keywords relative to each other, and then be able to calculate given a result how &lt;em&gt;complete&lt;/em&gt; the match is.&lt;/p&gt;
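&lt;p&gt;The core of that calculation might look something like this (the weights are illustrative, standing in for whatever rarity-based weighting I settle on):&lt;/p&gt;

```python
def completeness(query_weights, page_terms):
    """Weighted completeness of a match: each keyword carries a weight
    based on its rarity, and the score is the fraction of the total
    weight actually present on the page."""
    total = sum(query_weights.values())
    matched = sum(w for term, w in query_weights.items() if term in page_terms)
    return matched / total

# "hollon" is rarer than "benjamin", so it carries more weight:
weights = {"benjamin": 0.3, "hollon": 0.7}
```

&lt;p&gt;Sequence order and adjacency would then adjust this base score up or down.&lt;/p&gt;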
&lt;ol&gt;
&lt;li&gt;If one result is more complete a match than another, it gets ranked higher, &lt;em&gt;no further questions asked&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Next, if a site has invasive ads or tracking, I want it to be ranked lower than any equally-complete matches &lt;em&gt;without&lt;/em&gt; ads or tracking.&lt;/li&gt;
&lt;li&gt;I haven’t decided yet, but I could decide to rank sites using &lt;code&gt;https://&lt;/code&gt; higher than sites with &lt;code&gt;http://&lt;/code&gt; if both of the above are tied.&lt;/li&gt;
&lt;li&gt;I want to take BM25 into account at this point to break remaining ties, but probably weighted with some other factors I also care about at around the same level, perhaps including my handy-dandy page size scores.&lt;/li&gt;
&lt;/ol&gt;
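&lt;p&gt;Because tuples compare element by element, that whole tiered ordering collapses into a single sort key. A sketch (the field names are hypothetical, not Clew’s schema):&lt;/p&gt;

```python
results = [
    {"completeness": 0.7, "has_ads_or_tracking": False, "is_https": True, "bm25": 9.1},
    {"completeness": 1.0, "has_ads_or_tracking": True,  "is_https": True, "bm25": 3.4},
    {"completeness": 1.0, "has_ads_or_tracking": False, "is_https": True, "bm25": 2.8},
]

def rank_key(result):
    """Python compares tuples left to right, so completeness always wins
    before the ad/tracking penalty is even consulted, and so on down."""
    return (
        -result["completeness"],        # 1. more complete first
        result["has_ads_or_tracking"],  # 2. False sorts before True
        not result["is_https"],         # 3. https:// first
        -result["bm25"],                # 4. tiebreak on BM25 (and friends)
    )

results.sort(key=rank_key)
```

&lt;p&gt;Here the complete, ad-free match wins even though its BM25 score is the lowest of the three.&lt;/p&gt;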
&lt;p&gt;Why put BM25 so low? Well, to put it simply… it’s a very easily-spoofed scoring system. I can say the word “armadillo” over and over on one page of my website to artificially boost how relevant that term looks to the BM25 algorithm. And while I haven’t seen any cases of this being done maliciously, I have seen times where the way a site was designed meant a word was repeated more often on one site than another.&lt;/p&gt;
&lt;h2&gt;conclusion&lt;/h2&gt;
&lt;p&gt;If you’ve made it to the end of today’s incoherent ramblings, congratulations! I may sound very intelligent and bright at this point, but it remains to be seen whether I can actually manage to &lt;em&gt;implement&lt;/em&gt; this grand vision and, perhaps more crucially, whether I can implement it without sacrificing performance.&lt;/p&gt;
&lt;p&gt;Future updates to this blog probably won’t be as technical; mainly I couldn’t sleep and needed to write my thoughts down to calm down my inner monologue, who is kinda bossy.&lt;/p&gt;
&lt;p&gt;See you next time, you glorious nerds.&lt;/p&gt;</content>
		<updated>2024-06-27T15:41:48.000Z</updated>
		<published>2024-06-27T15:41:48.000Z</published>
	</entry>
	
	<entry>
		<id>https://blog.clew.se/posts/welcome-to-the-madness/</id>
		<link href="https://blog.clew.se/posts/welcome-to-the-madness/" hreflang="en" rel="alternate"/>
		<title>Welcome to the Madness</title>
		
		<author>
			<name>Benjamin Hollon</name>
			<email>me@benjaminhollon.com</email>
			<uri>https://benjaminhollon.com/</uri>
		</author>
		
		<summary>In which we launch the insanity that is this development blog for Clew.</summary>
		<content type="html">&lt;p&gt;After a while of developing Clew I realized it would probably be a good idea to start a development blog so that you can trace the collapse of my sanity in real time (why are emojis allowed in URLs anyway??? 😩).&lt;/p&gt;
&lt;h2&gt;what to expect&lt;/h2&gt;
&lt;p&gt;I have no idea. We’ll have to figure it out together. I’ll probably post progress updates here, so you can be sure that I’m not just spending my days eating chips and playing video games instead of reinventing the wheel like I’m &lt;em&gt;supposed&lt;/em&gt; to. Perhaps the occasional announcement.&lt;/p&gt;
&lt;p&gt;If nothing else, I plan to try and make it entertaining. So hey, follow along. There’s an &lt;a href=&quot;https://blog.clew.se/&quot;&gt;Atom Feed&lt;/a&gt; to get updates as they come out, and if you don’t know what that means, read &lt;a href=&quot;https://hunden.linuxkompis.se/2020/07/29/an-introduction-to-web-feeds.html&quot;&gt;this excellent post by Hund&lt;/a&gt; (#DiscoveredWithClew) to learn all about it.&lt;/p&gt;
&lt;h2&gt;current status&lt;/h2&gt;
&lt;p&gt;So as not to leave you empty handed, here’s a quick update on what’s going on right now.&lt;/p&gt;
&lt;p&gt;I just finished coding a multithreaded, multi-queue task prioritization system for Ariadne, Clew’s crawler, and somehow (&lt;em&gt;somehow&lt;/em&gt;) it worked first try. With that out of the way, I’ve set the crawler running on my server and the index is steadily growing.&lt;/p&gt;
&lt;p&gt;There’s still a little work to do before releasing the crawler’s code publicly on &lt;a href=&quot;https://codeberg.org/Clew&quot;&gt;Clew’s Codeberg organization&lt;/a&gt;, but this was the biggest blocker to getting that done, so it shouldn’t be long now, assuming no mental breakdowns on my side.&lt;/p&gt;
&lt;p&gt;Once that’s all wrapped up and the crawler is running at full speed, I’ll be starting work on a refactor of the query parsing and matching logic for searches.&lt;/p&gt;
&lt;p&gt;You see, one of the current issues with Clew is that it can rank how &lt;em&gt;strong&lt;/em&gt; matches are given keywords, but it doesn’t actually have a way to gauge whether a result actually fully matches. For example, &lt;a href=&quot;https://clew.se/search?q=benjamin+hollon&quot;&gt;a search for “benjamin hollon”&lt;/a&gt; could have a page that just says “benjamin” over and over at the top, even if it never says “hollon”, because the &lt;em&gt;strength&lt;/em&gt; of the match for “benjamin” is so high.&lt;/p&gt;
&lt;p&gt;Once this refactor is finished, results will first be sorted by the completeness of the match, &lt;em&gt;then&lt;/em&gt; by the strength with which they match the keywords.&lt;/p&gt;
&lt;p&gt;It sure won’t be easy, but it should drastically improve the results you get when using Clew.&lt;/p&gt;
&lt;h2&gt;bye&lt;/h2&gt;
&lt;p&gt;Okay, I need to get back to bashing my head against the brick wall that is the Clew codebase. See you next time!&lt;/p&gt;</content>
		<updated>2024-06-22T00:35:43.000Z</updated>
		<published>2024-06-22T00:35:43.000Z</published>
	</entry>
	
</feed>
