I'm Losing Faith in BM25

Up till now, Clew has ranked its results primarily based on the BM25 algorithm, a quite brilliant formula for determining the “best match” for a set of keywords out of a set of documents.

When I first implemented this I was thrilled—not least because I somehow got a complicated math formula working in pure SQL—but as time has gone on I’m less and less pleased with the actual results.

how things work now

I’m gonna try and give a brief overview of what’s going on behind the scenes when the Clew backend is fed a query to help you better contextualize the system I want to put into place. If you want deeper technical details, do your own research into the algorithm.

The basic idea behind BM25 is that, given the number of times a word appears in a document (in the case of Clew, "document" == "webpage"), some basic information about the document, and some information about the word’s frequency overall, you can calculate how well the keyword matches the document.

For the architecture of Clew, a model like this was a godsend, since it doesn’t require the full text of sites to search, only a “bag of words” linking keywords to documents they appear in.

The problem arises when a search is for multiple keywords. With base BM25, you’re instructed to add up the scores for each keyword to get the overall score. In many cases, this works fine; rarer words are weighted more heavily, which keeps words like “the” from being considered more important than “armadillo”. However, an excessive number of occurrences of a less-important word can outweigh that.

As an example, take my own name, “Benjamin Hollon”. As of today, searching for my full name on Clew does not actually bring up myself as the first result; instead, it brings up someone whose site doesn’t even mention “Hollon”, it just says “Benjamin” more times than mine does.

And that, I think, is the fundamental problem with BM25 for this application: it’s not actually complex enough.

Now, don’t get me wrong, I like simplicity. But everything multiplying and adding up to a single score isn’t really suited to a search engine that’s looking at websites from a wide range of authors and web developers.

how I want things to work

The fundamental principle of the way I want to rank things is this: some factors are more important than others at an absolute level.

You see, I’m not limited to sorting by only one factor. I can say “rank by this, and then if it’s a tie, rank by this other measure”. So, while BM25 will likely remain a component of rankings on Clew, it’ll probably only be a small tiebreaker for when the methods I’m about to describe fail.

The primary factor I want to rank by is the completeness of a match. What do I mean by that?

Well, let’s start with a simple model. In a search for “Benjamin Hollon”, a match is 50% complete if it only has either “Benjamin” or “Hollon”. It must contain both to be a 100% match.

But, of course, a match containing only “Hollon” is more likely to be relevant, since it’s a relatively uncommon surname, while “Benjamin” is a fairly common given name, so in reality, perhaps it should be 30% and 70%.

And, to take it further, if “Benjamin” and “Hollon” appear in separate paragraphs, that’s less complete a match than a site with “Benjamin Hollon” in order.

I want to make code to take a query, extract the keywords, weigh the keywords relative to each other, and then be able to calculate given a result how complete the match is.

If one result is more complete a match than another, it gets ranked higher, no further questions asked.
Next, if a site has invasive ads or tracking, I want it to be ranked lower than any equally-complete matches without ads or tracking.
I haven’t decided yet, but I could decide to rank sites using https:// higher than sites with http:// if both of the above are tied
I want to take BM25 into account at this point to break remaining ties, but probably weighted with some other factors I also care about at around the same level, perhaps including my handy-dandy page size scores

Why put BM25 so low? Well, to put it simply… it’s a very easily-spoofed scoring system. I can say the word “armadillo” over and over on one page of my website to artificially boost how relevant that term looks to the BM25 algorithm. And while I haven’t seen any cases of this being done maliciously, I have seen times where the way a site was designed meant a word was repeated more often on one site than another.

conclusion

If you’ve made it to the end of today’s incoherent ramblings, congratulations! I may sound very intelligent and bright at this point, but it remains to be seen whether I can actually manage to implement this grand vision and, perhaps more crucially, whether I can implement it without sacrificing performance.

Future updates to this blog probably won’t be as technical; mainly I couldn’t sleep and needed to write my thoughts down to calm down my inner monologue, who is kinda bossy.

See you next time, you glorious nerds.