For a simple test, I searched "fall of the roman empire". In your search engine, I got wikipedia, followed by academic talks, chapters of books, and long-form blogs. All extremely useful resources.
When I search on google, I get wikipedia, followed by a listicle "8 Reasons Why Rome Fell", then the imdb page for a movie by the same name, and then two Amazon book links, which are totally useless.
Good comparison. Reminds me of an analogy I like to make about today's web: it feels like browsing through a magazine store — full of top 10s, shallow wow-factoids, and baity material. I genuinely believe terrible results like this are making society dumber.
The context matters. I'd happily read "Top 10" lists on a website if the site itself was dedicated to that one thing. "Top 10 Prog Rock albums", while a lazy, SEO-bait title, would at least be credible if it were on a music-oriented website.
But no, these stories all come from cookie-cutter "new media" blog sites, written by an anonymous content writer who's repackaged Wikipedia/Discogs info into Buzzfeed-style copy writing designed to get people to "share to Twitter/FB". No passion, no expertise. Just eyeballs at any cost.
This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages. This produces the problem where instead of covering a topic and refining it over time, the incentive is to repackage it over and over again.
It reminds me of an annoyance I have with the Kindle store. If I wanted to find a book on, let's say, Psychology, there is no option to find the all-time respected books of past centuries. Amazon's algorithms constantly push the latest hot book of the year. But I don't want that. A year is not enough time for society to determine if the material withstands time. I want something that has stood the test of time and is recommended by reputable institutions.
This is just a guess, but I believe that they use machine learning and rank it by the clicks. I took some coursera courses and Andrew Ng sort of suggested that as their strategy.
The problem is that clickbait and low-effort articles can be good enough to get the click, yet low-effort enough to drag society into the gutter. As time passes, the system is gamed more and more, optimizing for the least effort per click.
But they have that signal, or could have. Google (and to a lesser extent Microsoft) sees exactly that if you are using Chrome or Bing. If you stay on the site and scroll (taking time, reading, not skimming), all of this could be a signal to evaluate whether the search result met your needs.
I've heard Google would guess with bounce rate. Put another way: if the user clicks on the link to website A, then after a few moments keeps trying other links or related searches, that would mean the result was not valuable.
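A rough sketch of how such a "long click vs. bounce" signal could be computed from click logs (the 30-second and 5-second thresholds here are made up for illustration, not anything a real engine publishes):

```python
from dataclasses import dataclass

@dataclass
class ClickEvent:
    url: str
    dwell_seconds: float    # time spent before returning to the results page
    returned_to_serp: bool  # did the user come back and keep searching?

def satisfaction_score(clicks: list[ClickEvent]) -> dict[str, float]:
    """Score each clicked result: a long dwell with no return reads as
    satisfied, a quick bounce back to the results page reads as not."""
    scores = {}
    for c in clicks:
        if c.dwell_seconds >= 30 and not c.returned_to_serp:
            scores[c.url] = 1.0   # "long click": likely satisfied
        elif c.dwell_seconds < 5 and c.returned_to_serp:
            scores[c.url] = 0.0   # quick bounce: likely not valuable
        else:
            scores[c.url] = 0.5   # ambiguous
    return scores
```

Real systems would of course aggregate millions of these per query before trusting the signal.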
It is pretty obvious if you search for any old topic that is also covered incessantly by the news. "royal family" is a good example. There's no way those news stories published an hour ago are listed first due to a high PageRank score (which necessarily depends on time to accumulate inbound links).
Even your example would depend upon the context. There are many cases where a programming question in 2021 is identical to one from 2012, along with the answer. In those instances, would you rather have a shallow answer from 2021 or an in-depth answer from 2012? This is not meant to imply that older answers offer greater depth, yet a heavy bias towards recent material can produce that outcome in some circumstances.
Yes, yet there are programming questions that go beyond "how do I do X in language Y" or "how do I do X with library Y". The language and library specific questions are the ones where I would be less inclined to want additional depth anyhow, well, provided they aren't dependent upon some language or library specific implementation detail.
There are of course a variety of factors, including the popularity of the site the page is published on. The signals related to the site are often as important as the content on the page itself. Even different parts of the same site can lend varying weight to something published in that section.
Engagement, as measured in clicks and time spent on page, plays a big part.
But you're right, to a degree, as frequently updated pages can rank higher in many areas. A newly published page has been recently updated.
A lot depends on the (algorithmically perceived) topic too. Where news is concerned, you're completely right, algos are always going to favor newer content unless your search terms specify otherwise.
PageRank, in its original form, is long dead. Inbound-link-related signals are much more complex and contextual now, and other types of signals get more weight.
Your Google search results show the date on articles do they not? If people are more likely to click on "Celebrity Net Worth (2021)" than "Celebrity Net Worth (2012)", then the algo will update to favour those results, because people are clicking on them.
The only definitive source on this would be the gatekeeper itself. But Google never says anything explicitly, because they don't want people gaming search rankings. Even though it happens anyway.
The new evergreen is refreshed sludge for bottom dollar. College kids stealing Reddit comments or moving around paragraphs from old articles. Or linking to linked blogs that link elsewhere.
It's all stamped with Google Ads, of course, and then Google ranks these pages high enough to rake in eyeballs and ad dollars.
Also there's the fact that each year, the average webpage picks up two more video elements / ad players, one or two more ad overlays, a cookie banner, and half a dozen banner/interstitials. It's 3-5% content spread thinly over an ad engine.
The Google web is about squeezing ads down your throat.
Really makes you wonder: you play whack a mole and tackle the symptoms with initiatives like this search engine. But the root of that problem and many many others is the same: advertising. Why don't we try to tackle that?
> This got me thinking that maybe one of the other big reasons for this is that the algorithms prioritize newer pages over older pages.
Actually that's not always the case. We publish a lot of blog content and it's really hard to publish new content that replaces old articles. We still see articles from 2017 coming up as more popular than newer, better treatments of the same subject. If somebody knows the SEO magic to get around this I'm all ears.
Its the "healthy web" Mozilla^1 and Google keep telling their blog audiences about. :)
1 Accept quid pro quo to send all queries to Google by default
If what these companies tell their readers were true, i.e., that advertising is "essential" for the web to survive, then how do the text-heavy sites returned by this search engine (sites that are not discoverable through Google, the default search engine for Chrome, Firefox, etc.) manage to remain online? Advertising is essential for the "tech" company middleman business to survive.
I'm not sure I agree with your example. It seems to me it is the exact same as a "Top ten drinks to drink on a rainy day" list. There's simply too many good albums and opinions differ, so a top ten would -just like the drinks- end up being a list of the most popular ones with maybe one the author picks to stir some controversy or discussion. In my opinion the world would be a smarter place if Google ranked all such sites low. Then we might at least get fluff like "Top ten prog rock albums if you love X, hate Y and listen to Z when no one is around" instead.
Google won't rank them low because they actually do serve an important purpose. They're there for people who don't really know what they want specifically, they're looking for an overview. A top 10 gives a digestible overview on some topic, which helps the searcher narrow down what they really want.
A "Top 10 albums of all time" post is actually better off going through 10 genres of popular music from the past 50 years and picking the top album (plus mentioning some other top albums in the genre) for each one.
That gives the user the overview they're probably looking for, whether those are the top 10 albums of all time or not. It's a case of what the user searched for vs what they actually really want.
So did Tim Berners-Lee. He was vehemently opposed to people shoehorning images into the WWW, because he didn't want it to turn into the equivalent of magazines — which, I believe he felt, were making society dumber.
Appropriately enough, I couldn't find a good quote to verify that since Google is only giving me newspapers and magazines talking about Sir Tim in the context of current events. I do believe it's in his book "Weaving the Web" though.
what I really want is a true AI to search through all that and figure out the useful truth. I don't know how to do this (and of course whoever writes the AI needs to be unbiased...)
I didn't say the AI should be unbiased, just whoever writes it.
I want an AI that is biased to the truth when there is an objective one, and my tastes otherwise. (that is when asked to find a good book it should give me fantasy even though romance is the most popular genre and so will have better reviews)
Cool, it appears that the trend towards JS may be causing self-selection -- if a page has a high amount of JS, it is highly unlikely to contain anything of value.
True. Unfortunately many large corporate websites through which you pay bills, order tickets, etc. are becoming infested with JS widgets and bulky, slow interfaces. These are hard to avoid.
The mostly JS-less web was fine, fast, and reliable 20 years ago and I never had ActiveX.
I hear stories about Flash and ActiveX but I literally never needed these to shop or pay bills online. Payments also didn't require scripts from a dozen domains and four redirects.
Huh. A weighted algorithm, somewhere between Google and the one linked, where you could subtract from sites by amount of JavaScript might be interesting.
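For instance, a toy version of that weighting (the penalty rate is an arbitrary knob I've invented for illustration, not anything a real engine uses):

```python
def adjusted_score(base_score: float, js_bytes: int,
                   penalty_per_kb: float = 0.002) -> float:
    """Subtract a small penalty per kilobyte of JavaScript on the page.
    penalty_per_kb is a made-up tuning knob; a real ranker would fit it."""
    return base_score - penalty_per_kb * (js_bytes / 1024)
```

With this, a lean page with a slightly worse base score easily outranks a multi-megabyte JS app, which is exactly the self-selection effect described above.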
Browsers should be cherry picking the most compelling things that people accomplish with complex code and supporting them as a native feature.
Maybe the Browser Wars aren’t keeping up anymore.
That query is too long apparently. But if you shorten to "haskell type inference", I think it delivers on its promise:
> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?
The search engine doesn't do any kind of re-ordering or synonym handling; it only tries to construct different N-grams from the search query.
So compare, for example, "SDL tutorial" with "SDL tutorials". On Google you'd get the same stuff; this search engine, for better or worse, doesn't.
This is a design decision, for now anyway, mostly because I'm incredibly annoyed when algorithms are second-guessing me. On the other hand, it does mean you sometimes have to try different searches to get relevant results.
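A minimal sketch of what "only constructing N-grams, no synonyms or stemming" might look like (this is my guess at the general approach, not the engine's actual code):

```python
def query_ngrams(query: str, max_n: int = 3) -> list[tuple[str, ...]]:
    """Build every contiguous n-gram (up to max_n words) from the query,
    with no stemming or synonym expansion -- terms match literally."""
    words = query.lower().split()
    grams = []
    for n in range(min(len(words), max_n), 0, -1):  # longer grams first
        for i in range(len(words) - n + 1):
            grams.append(tuple(words[i:i + n]))
    return grams
```

Because there's no normalization step, "SDL tutorial" and "SDL tutorials" produce different grams and hit different index entries, which matches the behavior described above.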
I’m not against a stemmer, actually, just against the aggressive concordances (?) that Google now employs, like when it shows me X in Banach spaces (the classical, textbook case) when I’m specifically searching for X in Fréchet spaces (the generalization I want to find
but am not sure exists); of course Banach spaces and Fréchet spaces are almost exclusively encountered in the same context, but it doesn’t mean that one is a popular typo for the other! (The relative rarity of both of these in the corpus probably doesn’t help. The farcical case is BRST, or Becchi-Rouet-Stora-Tyutin, in physics, as it is literally a single key away from “best” and thus almost impossible to search for.)
On the other hand, Google’s unawareness of (extensive and ubiquitous) Russian noun morphology is essentially what allowed Yandex to exist: both 2011 Yandex and 2021 Google are much more helpful for Russian than 2011 Google. I suspect (but have not checked) that the engine under discussion is utterly unusable for it. English (along with other Germanic and Romance languages to a lesser extent) is quite unusual in being meaningfully searchable without any understanding of morphology, globally speaking.
I thought you could fix that by enclosing "BRST" in quotes, but apparently not. DuckDuckGo (which uses Google) returns a couple of results that do contain "BRST" in a medical context, but most results don't contain this string at all. What's going on?
I’m not certain what DDG actually uses (wasn’t it Bing?), but in my experience from the last couple of months it ignores quotes substantially more eagerly than Google does. For this particular term, a little bit of domain knowledge helps: even without quotes, brst becchi, brst formalism, brst quantization or perhaps bv brst will get you reasonable results. (I could swear Google corrected brst quantization to best quantization a year ago, but apparently not anymore.) Searching for stuff in the context of BRST is still somewhat unpleasant, though.
I... don’t think anything particularly surprising is happening here, except for quotes being apparently ignored? I’ve had it explained to me that a rare word is essentially indistinguishable from a popular misspelling by NLP techniques as they currently exist, except by feeding the machine a massive dictionary (and perhaps not even then). BRST is a thing that you essentially can’t even define satisfactorily without at the very least four years of university-level physics (going by the conventional broad approach—the most direct possible road can of course be shorter if not necessarily more illuminating). “Best” is a very popular word both generally and in searches, and the R key is next to E on a Latin keyboard. If you are a perfect probabilistic reasoner with only these facts for context (and especially if you ignore case), I can very well believe that your best possible course of action is to assume a typo.
How to permit overriding that decision (and indeed how to recognize you’ve actually made one worth worrying about without massive human input—e.g. Russian adjectives can have more than 20 distinct forms, can be made up on the spot by following productive word-formation processes, and you don’t want to learn all of the world’s languages!) is simply a very difficult problem for what is probably a marginal benefit in the grand scheme of things.
In English, maybe; in Russian, I frequently find myself reaching for the nonexistent “morphology but not synonyms” operator (as the same noun phrase can take a different form depending on whether it is the subject or the object of a verb, or even on which verb it is the object of); even German should have the same problem AFAIU, if a bit milder. I don’t dare think about how speakers of agglutinative languages (Finnish, Turkish, Malayalam) suffer.
(DDG docs do say it supports +... and even +"...", but I can’t seem to get them to do what I want.)
Ah, OK. I don’t know anything about Russian. This is a hard problem. I think the solution is something like what you suggest: more operators allowing different transformations. Even in English, I would like a "you may pluralize but nothing else" operator.
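A "pluralize but nothing else" expansion could be as dumb as this sketch (naive English-only rules, purely illustrative — real English pluralization has many more exceptions):

```python
def pluralize_only(term: str) -> set[str]:
    """Expand a term to itself plus a naive English plural form --
    and nothing else (no synonyms, no stemming of other affixes)."""
    forms = {term}
    if term.endswith(("s", "x", "z", "ch", "sh")):
        forms.add(term + "es")
    elif term.endswith("y") and term[-2:-1] not in "aeiou":
        forms.add(term[:-1] + "ies")
    else:
        forms.add(term + "s")
    return forms
```

The point of such an operator is precisely that the expansion set is tiny and predictable, unlike an opaque synonym model.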
Well it’s not that alien, it (along with the other Eastern Slavic languages, Ukrainian and Belarusian) is mostly a run-of-the-mill European language (unlike Finnish, Estonian or Hungarian) except it didn’t lose the Indo-European noun case system like most but instead developed even more cases. That is, where English or French would differentiate the roles of different arguments of a verb by prepositions or implicitly by position, Russian (like German and Latin) has a special axis of noun forms called “case” which it uses for that (and also prepositions, which now require a certain case as well—a noun form can’t not have a case like it can’t not have a number).
There are six of them (nominative [subject], genitive [belonging, part, absence, “of”], dative [indirect object, recipient, “to”], accusative [direct object], instrumental [device, means, “by”], prepositional [what the hell even is this]), so you have (cases) × (numbers) = 6 × 2 = 12 noun forms, and adjectives agree in number and gender with their noun, but (unlike Romance languages) plurals don’t have gender, so you have (cases) × (numbers and genders) = 6 × (3 + 1) = 24 adjective forms.
None of this would be particularly problematic, except these forms work like French or Spanish verbs: they are synthetic (case, number and gender are all a single fused ending, not orthogonal ones) and highly convoluted with a lot of irregularities. And nouns and adjectives are usually more important for a web search than verbs.
Well yeah, English is kind of weird, but Finnish isn’t a Germanic language at all? It’s not even Indo-European, so even Hindi is ostensibly closer to English than Finnish. I understand Standard German (along with Icelandic) is itself a bit atypical in that it hasn’t lost its cases when most other Germanic languages did.
Re compounds, I expected they would be more or less easy to deal with by relatively dumb splitting, similar to greedy solutions to the “no spaces” problem of Chinese and Japanese, and your link seems to bear that out. But yeah, cheers to more language-specific stuff in your indexing. /s
Oh, this sounds like it could be a really cool idea! It could also subtly teach users that the engine doesn't do automatic synonym expansion, so it's worth experimenting — kind of like offering a synonyms feature while still keeping the user in full control.
> You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.
I am skeptical of this application of the theorem. Here is my proposal:
Take the top 10 Google and Bing results. If the top result from Bing is in the top 10 from Google, display Google results. If the top result from Bing is not in the top 10 from Google, place it at the 10th position. You'd have an algorithm that ties with Google, say 98% of the time, beats it say, 1.2% of the time, and loses .8% of the time.
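In code, the proposal above is just a few lines (a hypothetical helper, assuming both engines return ordered lists of URLs):

```python
def merge_results(google: list[str], bing: list[str]) -> list[str]:
    """Keep Google's top 10, but if Bing's #1 result isn't among them,
    swap it into the 10th slot. The merged list can only differ from
    Google's in that one position."""
    top10 = google[:10]
    if bing and bing[0] not in top10:
        top10 = top10[:9] + [bing[0]]
    return top10
```

Because the merged list differs from Google's in at most one slot, it can only lose badly when Bing's top hit is worse than Google's 10th result — which is the asymmetry the percentages above are gesturing at.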
Right. Arrow's theorem just says it's impossible to do it in all cases. It's still quite possible to get an improvement in a large proportion of cases, as you're proposing.
and the first conclusion is "something that you think will improve relevance probably won't"; the TREC conference went for about five years before making the first real discovery
It's true that Arrow's Theorem doesn't strictly apply, but thinking about it makes it clear that the aggregation problem is ill-defined and tricky. (Note also that a ranking function for full text search might have a range of 0-1 but is not a meaningful number like a probability estimate that a document is relevant; it just means that a result with a higher score is likely to be more relevant than one with a lower score.)
Another way to think about it is that for any given feature architecture (say "bag of words") there is an (unknown) ideal ranking function.
You might think that a real ranking function is the ideal ranking function plus an error and that averaging several ranking functions would keep the contribution of the ideal ranking function and the errors would average out, but actually the errors are correlated.
In the case of BM25 for instance, it turns out you have to carefully tune between the biases of "long documents get more hits because they have more words in them" and "short documents rank higher because the document vectors are spiky like the query vectors". Until BM25 there wasn't a function that could be tuned properly, and just averaging several bad functions doesn't solve the real problem.
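For reference, here's a single term's BM25 contribution, showing the length-normalization knob being described (this is the standard formula; the default k1/b values are just the commonly quoted choices, not anything specific to the systems discussed here):

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int,
                    doc_len: int, avg_len: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One term's BM25 contribution. b is the tuning knob the comment
    describes: b=1 fully penalizes long documents, b=0 ignores length."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)
```

With b > 0, the same term frequency scores lower in a longer document; set b = 0 and length stops mattering entirely — which is exactly the trade-off between the two biases.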
But in both cases you face the problem of aggregating preferences of many into one. In one case you are combining personal preferences in the other case aggregating ‘preferences’ expressed by search engines.
But search engines aren't voting to maximize the chances that their preferred candidate shows up on top. The mixed ranker has no requirement to satisfy Arrow's integrity constraints. It has to satisfy the end user, which is quite possible in theory.
Conditions the mixed ranker doesn't have to satisfy
"ranking while also meeting a specified set of criteria: unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives"
Sure, but the problem that conventional IR ranking functions are not meaningful other than by ordering leads you to the dismal world of political economy where you can't aggregate people's utility functions. (Thus you can't say anything about inequality, only about Pareto efficiency)
Hypothetically you could treat these functions as meaningful but when you try you find that they aren't very meaningful.
For instance IBM Watson aggregated multiple search sources by converting all the relevance scores to "the probability that this result is relevant".
A conventional search engine does horribly in that respect: you can fit a logit curve to make a probability estimator, but you might get p=0.7 at the very most, and rarely even that; in fact, you rarely get p>0.5.
If you are combining search results from search engines that use similar approaches, you know those p's are not independent, so you can't take a large number of p=0.7's and turn that into a higher p.
If you are using search engines that use radically different matching strategies (say they return only p=0.99 results with low recall) the Watson approach works, but you need a big team to develop a long tail of matching strategies.
If you had a good p-estimator for search you could do all sorts of things that normal search engines do poorly, such as "get an email when a p>0.5 document is added to the collection."
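A sketch of the calibration-and-combination idea described above (the logit coefficients are invented placeholders, and the noisy-OR step is only valid under the independence assumption the comment warns about):

```python
import math

def calibrated_p(raw_score: float, a: float = 1.5, b: float = -4.0) -> float:
    """Map a raw, order-only relevance score to a probability estimate
    via a fitted logit curve: p = sigmoid(a*score + b). The a/b values
    here are illustrative, not from any real system."""
    return 1 / (1 + math.exp(-(a * raw_score + b)))

def combine_independent(ps: list[float]) -> float:
    """Noisy-OR combination of probability estimates -- only sound if the
    estimates are independent, which similar engines' scores are not."""
    prod = 1.0
    for p in ps:
        prod *= (1 - p)
    return 1 - prod
```

Two genuinely independent p=0.7 signals combine to 0.91; two correlated ones don't, which is why stacking similar engines buys you so little.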
For now alerting features are either absent or useless and most people have no idea why.
That's an invalid application of this theorem. (It doesn't necessarily hold)
Suppose there's an unambiguous ranked preference by all people among a set (webpages, ranking). Suppose one search engine ranks correctly the top 5 results and incorrectly the next 5 results, while another ranks incorrectly the top 5 and correctly the next 5.
What can happen is that there may be no universally preferred search engine (likely). In practice, as another commenter noted, you can also have most users prefer a certain combination of results (that's not difficult to imagine, for example by combining the top independent results from different engines).
For years I wanted to try Copernic Summarizer. It seemed like it actually worked. Then software that did summaries disappeared, maybe? And about 5 years ago bots on Reddit were doing summaries of news stories (and then links in comments).
This is a pattern I see over and over again: some research group or academics show that something can be done (summaries that make sense and are true summaries, evolutionary-algorithm FPGA programming, real-time gaze prediction, etc.), there's a few published code repos and a bit of news, then 'poof' - nowhere to be seen for 15 years or more.
This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.
Still, the best way to break SEO is to have actual competition in the search space. As long as SEO remains focused on Google there is an opportunity for these companies to thrive by evading SEO braindamage.
That sort of recipe blog hasn't happened just for SEO. It's also a bit of a "two audiences" problem: if you are coming to that food blogger from a search you certainly would prefer the recipe first and then maybe any commentary on it below if the recipe looks good. If you are a regular reader of that food blogger you are probably invested in the stories up top and that parasocial connection and the recipes themselves are sometimes incidental to why you are a regular reader.
You see some of that "two readers" divide sometimes even in classic cookbooks, where "celebrity" chefs of the day might spend much of a cookbook on a long rambling memoir. Admittedly such books were generally well indexed and had table of contents to jump right to the recipes or particular recipes, but the concept of "long personal ramble of what these recipes mean to me" is an old one in cookbooks too.
I see your point, but argue you've misidentified the two audiences.
One audience matches your description and is the invested reader. They want that blogger's storytelling. They might make the recipe, but they're a dedicated reader.
The other audience is not the recipe-searcher, but instead Google. Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on. They won't even remember the blog's name. So the site isn't optimized for them. It's optimized for Google.
"Slow the parasitic recipe-searcher down. They're leeches, here for a freebie. Well they'll pay me in Google Rank time blocks."
> Food bloggers know that recipe-searchers are there to drop in, get an ingredient list, and move on.
This is not entirely true, though. If a randomly found recipe turns out particularly good, I'll bookmark the site and try out other dishes. It's a very practical method to find particularly good* recipe collections.
*) In this case "good" means what you need - not just subjectively "tasty", but e.g. low cost, quick to prepare, low calorie or in line with a particular diet and so on.
> If you are a regular reader of that food blogger
I think this assumes facts not in evidence. It certainly seems like an overwhelming number of "blogs" are not actual blogs but SEO content farms. There are no regular readers of such things because there are no actual authors, just someone who took a job on Fiverr to spew out some SEO garbage. Old content gets reposted almost verbatim because new content ranks better according to Google.
The only reason these "blogs" exist is to show ads and hopefully get someone's e-mail (and implied consent) for a marke....newsletter.
I know at least a few blogs I commonly see in top search results that friends of mine read like personalized soap operas, where most of the drama revolves around food and family and serving food to family.
It's at least half the business model of Food Network shows: aspirational kitchens and the people that live in them, and also sometimes here's their recipes. (The other half being competitions, obviously.) I've got friends that could deliver entire doctoral theses on the Bon Appetit Test Kitchen (and its many YouTube shows and blogs) and the huge soap-operatic drama of 2020's events, where the entire brand milkshake-ducked itself: falling into people's hearts as "feel good" entertainment early in 2020/the pandemic and then exploding very dramatically with revelations and betrayals that fall.
Which isn't to say that there aren't garbage SEO farms out there in the food blogging space as well, but a lot of the big ones people commonly complain about seeing in google's results do have regular fans/audiences. (ETA: And many of the smaller blogs want to have regular fans/audiences. It's an active influencer/"content creator" space with relatively low barrier to entry that people love. Everyone's family loves food, it's a part of the human condition.)
I've basically never been taken to a recipe without a rambling preamble from Google. While food blogs may serve two audiences, a long introduction seems to be a requirement to appear in the top Google search results.
Personally, I think that has a lot more to do with the fact that Google killed the Recipe Databases. There did used to be a few startups that tried to be Recipe Aggregators with advertising based business models, that would show recipes and then link to source blogs and/or cookbooks, and in the brief period where they existed Google scraped them entirely and showed entire recipes on search results and ate their ad revenue out from under them.
Such databases would get battered by demands to remove content these days, if not already back then. No one wants a database listing their stuff for ad revenue like that, because many users wouldn't follow the links to see their adverts or be subject to their tracking.
A couple of browser add-ons specifically geared around trimming recipe pages down have been taken down due to similar complaints.
That's why I use Saffron [1], which magically converts those sites into a page in my recipe book. I found it when the developer commented here on HN. Also, a lot of cooking websites have started to add a "jump to recipe" link, allowing you to skip all the crap.
I've noticed this pattern start to pop up elsewhere. I've started to train my skimming skills, skipping a paragraph or two at a time to get past the fluff.
Like an article about some current event will undoubtedly begin with "when I was traveling ten years ago...".
It's also because that's a way of trying to copyright protect recipes, which are normally not copyright protected.
> “Mere listings of ingredients as in recipes, formulas, compounds, or prescriptions are not subject to copyright protection. However, when a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a combination of recipes, as in a cookbook, there may be a basis for copyright protection.”
But that copyright protection only extends to the literary expression. The recipe itself is still not covered by copyright, even if accompanied by an essay.
>> This sort of optimization is why simple recipes are typically found at the end of a rambling pointless blog post now.
I continue to be curious about this kind of complaint. If all you want is a recipe list, without any of the fluff, why would you click on a link to a blog, rather than on a link to a recipe aggregator?
Foodie blogs exist specifically for the people who want a foodie discussion and not just an ingredients list.
Is it because blogs tend to have better recipes overall? In that case, isn't there a bit of entitlement involved in asking that the author self-sacrificingly provides only the information that you want, without taking care of their own needs and wants, also?
I think the complaint is that those blogs rank higher than nuts-and-bolts recipes now. It wasn't that way a few years ago. Yes, scrolling down the results to Food Network or Martha Stewart or whatever is possible, as is going directly to those sites and using their site search, but it's noticeable and annoying.
Not my experience. For a very quick test, I searched DDG for "omelette recipe", "carbonara recipe" and "peking duck recipe" (just to spice it up a bit) and all my top results are aggregators. Even "avgolemono recipe" (which I'd think is very specialised) is aggregators on top.
To be honest, I don't follow recipes when I cook unless it's a dish I've never had before. At that point what I want is to understand the point of the dish. A list of ingredients and preparation instructions don't tell me what it's supposed to taste and smell like. The foodie blogs at least try to create a certain... feeling of place, I suppose, some kind of impression that guides you when you cook. I wouldn't say it always works but I appreciate the effort.
My real complaint with recipe writers is that they know how to cook one or two dishes well and they crib the rest off each other so even with all the information they provide, you still can't reliably cook a good meal from a recipe unless you've had the dish before. But that's my personal opinion.
It's the same thing that people always complain about: this thing is not in a format that I like, so it must not be what anyone likes.
If you want JUST recipes, pay money instead of just randomly googling around. America's Test Kitchen has a billion vetted, really good recipes. That solves the problem.
If you almost only plant wheat, you are going to end up with one hell of a pest problem.
If you almost only have Windows XP, you are going to have one hell of a virus problem.
If you almost only have SearchRank-style search engines (or just the one), you are going to have one hell of a content spam problem.
Even though they have some pretty dodgy incentives, I don't think Google suffers quality problems because they are evil; I think they ultimately suffer because they're so dominant. Whatever they do, the spammers adapt almost instantly.
A diverse ecosystem on the other hand limits the viability of specialization by its very nature. If one actor is attacked, it shrinks and that reduces the opportunity for attacking it.
I don't think the existing media-heavy websites are gaming Google to rank higher. It's that Google itself prefers media heavy content; they don't have to "game" anything.
I also think a search engine like this would be quite hard to game. An ML-based classifier trained on thousands of text-heavy and media-heavy screenshots should be quite robust and very hard to evade. The "game" would instead become identifying the crawler so you can serve it a high-ranking page while serving crap to the real users, and that seems fairly easy to defeat if the search engine does a second pass using residential proxies and standard browser user agents to detect the cloaking (it could also threaten huge penalties, like banning the entire domain for a month, to deter attempts at this).
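You could even approximate the text-heavy-vs-media-heavy classification without any ML, straight from markup statistics. A minimal sketch of that idea (my own toy heuristic, not anything this engine actually does; the tag list and thresholds are made up):

```python
from html.parser import HTMLParser

class DensityParser(HTMLParser):
    """Counts visible text length, media embeds, and total tags in a page."""
    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.media_tags = 0
        self.total_tags = 0

    def handle_starttag(self, tag, attrs):
        self.total_tags += 1
        if tag in ("img", "video", "iframe", "picture", "svg"):
            self.media_tags += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def looks_text_heavy(html, min_chars_per_tag=25.0, max_media_ratio=0.1):
    """Crude heuristic: lots of text per tag, few media embeds."""
    p = DensityParser()
    p.feed(html)
    if p.total_tags == 0:
        return p.text_chars > 0
    chars_per_tag = p.text_chars / p.total_tags
    media_ratio = p.media_tags / p.total_tags
    return chars_per_tag >= min_chars_per_tag and media_ratio <= max_media_ratio
```

A screenshot-based classifier would of course be harder to fool than this, since it sees the rendered page rather than the markup, but even a dumb ratio like this is surprisingly annoying to game without actually writing text.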
With the advances in machine text generation that looks plausible but isn't quite accurate (aka GPT-3), it seems like it would be easily gamed (given access to GPT-3). Even without GPT-3, if the content being prioritized is mere text, I'm sure that for a pile of money I could generate something that looks like Wikipedia, in the sense that it's a giant pile of mostly text, but makes zero sense to a human reader. (Building an SEO farm to boost the ranking of not-Wikipedia is left as an exercise for the reader.)
If there were a wider variety of popular search engines, with different ranking criteria, would sites begin to move away from gaming the system? Surely it would be too hard to game more than one search engine at a time?
Which engine sites optimize for would just be a matter of numbers anyway. A/B testing is already in place and doesn't care where the traffic comes from, just which variant does better.
There should be some perfect balance where this search engine is N% as popular as Google, where Google soaks up all of the gamifiers, but this search engine is still popular enough to derive revenue and do ML and other search-engine-useful stuff.
So just add human review to the mix, if a site is obviously trying to game the system (listicles, seo spam etc) just drop and ban them from the search index.
Search engines whose revenue is based on advertising will ultimately be tuned to steer you into the ad food chain. All the incentives are aligned towards, and all the metrics ultimately in service of, profit for advertisers. Not among the 99% of people who can be convinced to consume something by ads? Welp, screw you.
Search engines should be something you pay for. Surely search engine powerusers can afford to pay for such a service. If Google makes $1 per user per month or something, that's not too high a bar to get over.
Search engines should be like libraries. At least some tiny sliver of the billions we spend on education and research should go to, you know, actually organizing the world's information and making it universally available.
I see another issue here: companies like Google prioritize information to 1) keep their users and 2) maximize their profit.
If you move data organization to another type of organization (non-profit, state, universities - private or public), then the question of data prioritization becomes highly political. What should be exposed? What should not? What to put first? ...
It is already, but to a smaller extent, since money-making companies have little interest in the meaning of the data and high interest in the commercial value of their users.
I think this is just because of the terms you searched. In my test searches Wikipedia has not come up in first position once (I think the highest was 3rd in the list).
Here's what I've tried with a few variations:
golang generics proposal
machine learning transformer
covid hospitalization germany
Or you could just search for 'rome movie'. Though for more complex disambiguation you would need to resort to, e.g. schema.org descriptions (which are supported by most search engines, and the foundation for most "smart" search result snippets).
That's a fair point. This engine would be useful if you need grep over the internet (but without regexes), i.e. when you want to find exact phrases. But that's a relatively narrow use case.
I tend to prefer Wikipedia for movies. The exception is actor headshots if I'm trying to identify someone, which Wikipedia lacks for licensing reasons, but otherwise Wikipedia tends to be better than IMDB for most needs. Wikipedia has an IMDB link on every article anyway.
Another need I guess might be reviews, for which RT or MC are better than IMDB: not sure if either of those two will fare better than IMDB in this search engine but again Wiki has links out (in addition to good reception summaries)
For me, imdb was much better when they had user comments/discussion.
I never even posted on it myself, but browsing the discussions one could learn all sorts of trivia, inside info, speculation, etc about each movie.
Since they (inexplicably) killed that feature, I rarely even visit anymore. You're right, for many purposes wikipedia is better, especially for TV series episode lists with summaries.
IMDB management thought it was their brilliant editorial work that drew people to their site. Morons. It was the comments all along. Of course they also believed they could create gravity-free zones by sheer force of executive will (and maybe still do).
Especially for old and lesser known movies, the discussion board for the movie was a brilliant addition that could give the movie an extra dimension. Context is very important in order to understand, and ultimately enjoy, something.
I think they removed it in part because new movies, like Star Wars and superhero movies, had a lot of negative activity.
I find IMDb to be more convenient than RT/MC/Wikipedia for finding release dates of movies - nearly every other website lists only the American release date, maybe one or two others if the movie was disproportionately popular in certain regions.
I think it's a case where systems diversity can be an advantage. Much like how most malware was historically written for Windows and could be avoided by using Linux, the low-quality search engine bait is created for Google and can be avoided by using a different style of search engine.
Yeah, that's just not a type of query my search engine is particularly good at. It's pretty dumb, and just tries to match as much of the webpage against the query as it can.
This used to be how all search engines worked, but I guess people have been taught by google that they should ask questions now, instead of search for terms.
I wonder how I can guide people to make more suitable queries. Maybe I should just make it look less like google.
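For anyone wondering what "match as much of the webpage against the query as it can" might mean in the simplest possible terms, here's a toy sketch (purely illustrative on my part; this is not the engine's real ranking function, and the weighting is made up):

```python
def coverage_score(query, page_text):
    """Toy keyword-coverage ranking: score is mostly the fraction of query
    terms that appear in the page, with term frequency as a small tie-breaker."""
    terms = [t for t in query.lower().split() if t]
    if not terms:
        return 0.0
    words = page_text.lower().split()
    counts = {t: words.count(t) for t in terms}
    matched = sum(1 for t in terms if counts[t] > 0)
    # cap per-term frequency so keyword stuffing doesn't dominate
    freq_bonus = sum(min(counts[t], 5) for t in terms) / (5 * len(terms))
    return matched / len(terms) + 0.1 * freq_bonus
```

Under a scheme like this, terse keyword queries ("rome fall causes") work well, while natural-language questions ("why did rome fall?") waste most of their terms on filler words, which fits the observation that people trained by Google to ask questions get worse results here.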
I had the exact opposite experience. I searched the site for "java", got a Wikipedia link first (for the island, not the programming language), and the 2nd result was a random JEP page, and all the rest of the results were random tidbits about Java (e.g. "XZ compression algorithm in Java"). Didn't get any high level results pointing to an overview of the language, getting started guides, etc.
If the goal was to remove modern web design, ok sure mission accomplished.
If your goal was to create a search engine that ignored listicles and other fluff and instead got you meatier results like "academic talks" and such, then no.
"Radiophone Transmitter on the U.S.S. George Washington (1920)
In 1906, Reginald Fessenden contracted with General Electric to build the first alternator transmitter. G.E. continued to perfect alternator transmitter design, and at the time of this report, the Navy was operating one of G.E.'s 200 kilowatt alternators
http://earlyradiohistory.us/1919wsh.htm
"
"I Looked and I Listened -- George Washington Hill extract (1954)
Although the events described in this account are undated, they appear to have occurred in late 1928. I Looked and I Listened, Ben Gross, 1954, pages 104-105: Programs such as these called for the expenditure of larger sums than NBC had anticipated. It be http://earlyradiohistory.us/1954ayl2.htm
"
Dramatically worse than Google.
---
Ok, how about a search for "Rome" then? Surely it'll pull some great text results for the city or the ancient empire.
Again, far off the mark and dramatically worse than Google.
I like the idea of Google having lots of search competition, this isn't there yet (and I wouldn't expect it to be). I don't think overhyping its results does it any favors.
What were you expecting to see for British? There must be millions of pages containing that term. Anyway the first screenful from Google is unadulterated crap, advertising mixed with the usual trivia questions.
If you are going to claim something is wide of the mark then you really ought to tell us at least roughly where the mark is.
This is not a Google competitor, it's a different type of search engine with different goals.
> If you are looking for fact, this is almost certainly the wrong tool. If you are looking for serendipity, you're on the right track. When was the last time you just stumbled onto something interesting, by the way?
I checked the results of the same query and they seem fine. Lots of speeches and articles about George Washington the US president. There's even his beer recipe.
As for the results you linked, it's part of the zeitgeist to list other entities sharing the same name. Sure, they could use some subtle changes in ranking, but overall the returned links satisfy my curiosity.
The project explicitly bills itself as a "search engine", not an "interesting and unexpected material surfacer". Moreover, projecting emotions like "angry" onto a comment in order to discredit the content of the comment (hey! is that an ad-hominem?) is just about exactly the opposite of the discussions that the HN mods are trying to curate, and the discussions that I like to see here.
If you click through to the About page, I think you'll see that "interesting and unexpected material surfacer" is a fairly apt description of the project.
I think in fairness that when "interesting and unexpected material surfacer" is merely a euphemism for "we didn't bother indexing the things you might actually be looking for", a degree of scepticism isn't unwarranted.
(Source: I looked up several Irish politicians because I run an all-text website containing every single word that they say in parliament. I got nothing of use, or even of interest, for anything.)
In the early days of google, I found what I was looking for on page 5+. On the way, I’d discover many interesting things I didn’t even know I was looking for, often completely unrelated to what I was searching for.
And now Google hides that more than one page even exists, as they populate their first page with buttons to ask similar questions and go to the first page of THOSE results.
> Hobby project leads angry person to interesting and unexpected material; angry person remains angry.
Not angry in the least. I'm thrilled someone is working on a search competitor to Google.
I understand you're attempting to dismiss my pointing out the bad results by calling me angry though. You're focusing your content on me personally, instead of what I pointed out.
The parent was far overhyping the results in a way that was very misleading (look, it's better than Google!). I tried various searches, they were not great results. The parent was very clearly implying something a lot better than that by what they said. The product isn't close to being at that level at this point, overhyping it to such an absurd degree isn't reasonable or fair to the person that is working on it.
I would specifically suggest people not compare it to Google. Let it be its own thing, at least for a good while. Google (Alphabet) is a trillion dollar company. Don't press the expectations so far and stage it to compete with Google at this point. I wouldn't even reference Google in relation to this search engine, let it be its own thing and find its own mindshare.
> I'm thrilled someone is working on a search competitor to Google.
Except the author goes to quite some lengths to explain that his search engine is not a competitor to Google, and is in fact exactly the opposite of Google in many ways: https://memex.marginalia.nu/projects/edge/about.gmi