The New Ariadne Architecture

While on a fourteen-hour international flight, I finally managed to come up with an architecture for Clew’s web crawler that I’m happy with. Here’s the run-down.

The crawler will be split into three main pieces, each running independently:

ariadne

Ariadne is the main process of the web crawler, overseeing tasks and processing data.

When the task queue is low, Ariadne will select new tasks from the database as needed: following links, re-crawling pages, checking feeds, and so forth. It will calculate which web requests need to be made and log them to the database, where the other parts will pick them up.
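As a rough sketch, that top-up step might look something like this (the schema, table names, and threshold here are invented for illustration, not Clew’s actual database):

```python
import sqlite3

QUEUE_LOW_WATER = 1_000  # hypothetical "queue is low" threshold

def top_up_queue(db: sqlite3.Connection) -> None:
    (pending,) = db.execute(
        "SELECT COUNT(*) FROM request_queue WHERE status = 'pending'"
    ).fetchone()
    if pending >= QUEUE_LOW_WATER:
        return
    # Candidate tasks come from newly discovered links, pages due for
    # a re-crawl, and feeds due for a check.
    db.execute(
        """
        INSERT INTO request_queue (url, reason, status)
        SELECT url, reason, 'pending' FROM (
            SELECT url, 'new-link' AS reason FROM discovered_links
            UNION ALL
            SELECT url, 're-crawl' FROM pages WHERE next_crawl <= datetime('now')
            UNION ALL
            SELECT url, 'feed-check' FROM feeds WHERE next_check <= datetime('now')
        )
        LIMIT ?
        """,
        (QUEUE_LOW_WATER - pending,),
    )
    db.commit()
```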

Then, using a pool of parallelized workers, it will process data from the completed requests as that data comes in. This part of the process is easy to scale: I can set how many workers are active at any given time, and each works in its own process.
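A minimal sketch of such a worker pool, using the standard library’s multiprocessing (the two functions are stand-ins for the real database and parsing code):

```python
from multiprocessing import Pool

NUM_WORKERS = 8  # each worker runs in its own process; scale as needed

def fetch_completed_batch() -> list:
    """Stand-in for a query pulling completed requests from the database."""
    return []

def process_response(row):
    """Stand-in for the real work: parse HTML, extract links, update the index."""
    return row

if __name__ == "__main__":
    with Pool(processes=NUM_WORKERS) as pool:
        pool.map(process_response, fetch_completed_batch())
```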

The one crucial difference from the previous architecture: Ariadne will not make any web requests itself. It simply decides which requests need to happen and processes the data once they’ve been made.

behind the name

The name Clew refers to the ball of string used by Theseus to navigate the Labyrinth, an apt metaphor for the difficulty of navigating the web that this search engine tries to ease.

Ariadne was the woman who wound the clew and gave it to Theseus, so it feels like an appropriate name for the web crawler that fetches and organizes the data the engine needs to be able to find anything.

To avoid confusion, Ariadne will continue to be the name for both the overall crawler and the user agent it uses when sending requests to sites.

daedalus

Daedalus is the first new part of the crawler’s architecture. Like Ariadne, it will not make any requests itself, but unlike Ariadne, it won’t process any data either. Daedalus is, so to speak, a middleman. A manager.

Daedalus takes items from the task queue that Ariadne generates and collates them into parcels of requests to sites that make sense together. Then it sends these parcels off to individual crawler “nodes” to actually make the requests. This allows me to scale up crawling speeds by running multiple nodes, and even to accept volunteers who want to run their own.
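A sketch of how that collation could work, assuming each queued task carries a URL; grouping by host keeps every request for a given site inside a single parcel:

```python
from itertools import islice
from urllib.parse import urlsplit

PARCEL_SIZE = 50  # hypothetical cap on requests per parcel

def build_parcels(tasks: list[dict]):
    # Group tasks by the site they target...
    by_host: dict[str, list[dict]] = {}
    for task in tasks:
        by_host.setdefault(urlsplit(task["url"]).netloc, []).append(task)
    # ...then chunk each group into fixed-size parcels for the nodes.
    for host, host_tasks in by_host.items():
        it = iter(host_tasks)
        while chunk := list(islice(it, PARCEL_SIZE)):
            yield {"host": host, "requests": chunk}
```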

Daedalus will also handle authentication with those nodes, so that we don’t have malicious nodes poisoning the index data.
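One plausible scheme (just a sketch, not a committed design) is a shared secret per node, with each submission signed via an HMAC that Daedalus verifies:

```python
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str, secret: bytes) -> bool:
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(sign(payload, secret), signature)
```

A submission with a bad or missing signature would simply be discarded before it ever touches the index.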

Daedalus is something of a bottleneck, since it can’t be parallelized very well, but because it doesn’t actually process data or make web requests, that shouldn’t be an issue for the foreseeable future.

behind the name

Daedalus was the designer of the Labyrinth, a brilliant inventor, and the one who devised the clew and gave it to Ariadne for Theseus to use.

After Theseus and Ariadne escape, Daedalus and his son, Icarus, are imprisoned by King Minos; they flee on wings that Daedalus constructs. Unfortunately, Icarus dies in the attempt.

Daedalus felt like an appropriate name for a process that manages and organizes what needs to be done, since deciding where to assign tasks takes a good deal of logic and inventive skill.

icarus

Icarus will be the final piece of the puzzle: a small piece of code that accepts URLs from Daedalus and actually fetches them. It will record the data, probably in the WARC format (which should make future collaboration between my crawlers and others easier), then send that data back to Daedalus once it’s collected.
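If I end up using the warcio library, the fetch-and-record step could be as simple as this sketch (the filename and URL list are placeholders):

```python
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so it gets patched

def fetch_to_warc(urls: list[str], warc_path: str = "icarus-batch.warc.gz") -> None:
    # Every request/response made inside this block is recorded to the WARC file.
    with capture_http(warc_path):
        for url in urls:
            requests.get(url, headers={"User-Agent": "Ariadne"})
```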

Icarus is very scalable, as I can host as many nodes as I want, even on different machines, with some possibly run by volunteers. Great care will be taken, however, to avoid requesting pages from a single site through multiple Icarus nodes at once, so as not to overload other people’s servers. This will be part of the logic in Daedalus, as sketched below.
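On Daedalus’s side, that rule could be as simple as a per-host “lease” (a sketch; the real logic will live alongside the parcel-building code):

```python
leases: dict[str, str] = {}  # host -> node currently assigned to it

def try_assign(host: str, node_id: str) -> bool:
    """Assign a host to a node only if no other node holds it."""
    if host in leases:
        return False  # another node is already crawling this site
    leases[host] = node_id
    return True

def release(host: str) -> None:
    """Free the host once its parcel comes back (or times out)."""
    leases.pop(host, None)
```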

behind the name

As mentioned, Icarus was Daedalus’s son, which is about the full extent of the logic behind naming this part of the code. Icarus’s tragic death (he flew too close to the sun, melting the wax that held the feathers to his wings) will hopefully not be repeated in this project.

conclusion

And there you have it: the three pieces I plan to build for my rewrite of Clew’s crawler. I’ve set up the code so that each part can be installed via pipx, and I’m building CLIs for them, which should make setup and use straightforward. Running a volunteer Icarus node should be as simple as installing the utility, entering your provided authentication key, and running a command to start the process.

If you’ve got ideas sparked by this, by all means contact me with them! I eagerly await your feedback.
