Cool, I've been thinking about this topic a bit lately. Crawling is indeed not that hard of a problem. Google could do it 23 years ago. The web is a bit bigger now of course, but it's not that bad. Those numbers are well within the range of a very modest search cluster (pick your favorite technology; it shouldn't be challenging for any of them). 10x or even 1000x would not change the picture much, though it would raise your costs a little.
The hard problem is indeed separating the good stuff from the bad stuff; or rather, labeling the stuff such that you can tell the difference at query time. PageRank was nice back in the day, until people figured out how to game it. And now we have bot farms filling the web with nonsense to drive political agendas, create memes, or drown out criticism. PageRank is still a useful ranking signal; just not by itself.
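For context, the signal being discussed is simple to compute; the hard part is its gameability, not the math. Here is a minimal power-iteration sketch of PageRank on a toy four-page link graph (the graph and damping factor are illustrative assumptions, not anything from this thread):

```python
import numpy as np

# Toy link graph: page -> pages it links to. Purely illustrative.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
d = 0.85  # the customary damping factor

# Column-stochastic transition matrix:
# M[dst, src] = 1/outdegree(src) if src links to dst.
M = np.zeros((n, n))
for src, dsts in links.items():
    for dst in dsts:
        M[dst, src] = 1.0 / len(dsts)

# Power iteration: repeatedly redistribute rank along links,
# mixing in a uniform "random jump" term weighted by (1 - d).
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print(rank)  # page 2, which everyone links to, ends up ranked highest
```

The gaming problem is visible even here: adding a farm of pages that all link to page 3 would inflate its score, which is exactly the link-spam arms race the comment describes.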
The one thing no search engine has yet figured out is the reputability of sources. Most content isn't anonymous. It's produced and consumed by people, and those people have reputations. Bot content is bad because it comes from sources without a credible reputation. Reputations are built over time, and people value having them. What if we could weight people's appreciation by their own reputability? That could filter out a lot of nonsense. A simple like button plus a flag button, combined with verified domain ownership (SSL certificates), could do the trick. If you like a lot of content that other people flagged, your reputation goes down the drain. If you produce a lot of content that people like, your reputation goes up. If a lot of reputable people flag your content, your reputation tanks.
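The feedback loop described above could be sketched roughly like this. Everything here is a hypothetical illustration: the user names, the update constants, and the tie-breaking rule are all assumptions, not a worked-out design.

```python
# Toy sketch of reputation-weighted likes and flags.
# Constants (0.1, 0.05) are arbitrary illustrative choices.
reputation = {"alice": 1.0, "bob": 1.0, "mallory": 1.0}

def record_feedback(author, voters_like, voters_flag):
    """Adjust the author's reputation by the reputation-weighted
    balance of likes vs. flags, then penalize voters whose vote
    ran against the weighted consensus."""
    likes = sum(reputation[v] for v in voters_like)
    flags = sum(reputation[v] for v in voters_flag)
    total = likes + flags
    if total == 0:
        return
    # Author gains or loses in proportion to the weighted balance.
    delta = 0.1 * (likes - flags) / total
    reputation[author] = max(0.0, reputation[author] + delta)
    # Voters on the losing side of the consensus lose a little
    # credibility, so liking widely-flagged content is costly.
    losers = voters_like if flags > likes else voters_flag
    for v in losers:
        reputation[v] = max(0.0, reputation[v] - 0.05)

# Two reputable users flag mallory's content: her score drops.
record_feedback("mallory", [], ["alice", "bob"])
print(reputation["mallory"])  # 0.9
```

Note how the penalty for losing voters is what makes reputations couple together; it's also exactly where the fairness problems discussed below creep in, since the "consensus" can itself be manufactured.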
The hard part is keeping the system fair and balanced. Reputability is of course a subjective notion, and there's a danger of creating recommendation bubbles, politicizing certain topics, or even creating alternative-reality bubbles. That's basically what's happening already. But it's mostly powered by search engines and social media that completely ignore reputability.
> The hard part is keeping the system fair and balanced.
It is, which is why I think the author should stay away from anything requiring users to vote on things.
The problem with deriving reputability from votes over time is distinguishing legitimate votes from malicious ones. Voting doesn't just get gamed, it gets gamed as a service. You'll have companies selling votes and handling all the busywork necessary to evade the bad-vote detector.
Search engines and social media companies don't ignore this topic - on the contrary, they live by it. The problem of reputation-vote quality is isomorphic to the problem of ad-click quality. The "vote" is a click event on an ad, and profitability for both the advertiser and the ad network depends on being able to tell legitimate clicks and fake clicks apart. Ludicrous amounts of money went into solving this problem, and the end result is... the surveillance state. All this deep tracking on the web doesn't exist just - or even primarily - to target ads. It exists to determine whether a real would-be customer is looking at an ad, or whether it's a bot farm (including a protein bot farm, a.k.a. people employed to click on ads en masse).
We need something better. Something that isn't as easy to game, and where the mitigations don't come with such a high price for society.