Google rolls out algorithm change in the US (googleblog.blogspot.com)
149 points by po on Feb 25, 2011 | hide | past | favorite | 52 comments


> sites that copy others’ content

It would be interesting to know how Google determines, in an automated perspective, which site is the original and which is the copy - especially when copying could go both ways. For example, I could write an article, license it under the GFDL, and someone could copy it to Wikipedia. I might then copy the Wikipedia improvements back to my site. Technically, I had the content first - so would Wikipedia be penalised?

If there is a bias against smaller sites, this might make smaller sites reluctant to license their content under licenses that let bigger sites copy them.


This is a huge problem with Wikia. Wikia has a relatively high pagerank. Sometimes, because of Wikia's often-abusive policies, communities will decide to move their wiki off Wikia and host it separately.

But Wikia will refuse to remove the original wiki even if all the contributors want to move it, in order to maximize advertising revenue. This means that it's nearly impossible to get the new wiki to rank highly on Google, even if all links across the internet are changed to point to the new wiki, because Wikia's pagerank is so high that Google deems the new one to be a "copy". This is why Wikia moved all their wikis to subdomains: in order to piggyback on the pagerank of the main site.

As a result there are a ton of long-dead wikis on Wikia that still get more search traffic than the active equivalent. Obviously this hurts users, since they get long-outdated information as a result.

In short, once a wiki is placed on Wikia, it's basically impossible to ever move it anywhere else because of Google's anti-duplicate biasing.

See http://en.wikipedia.org/wiki/Wikia#Controversy for more info.


I assume you're saying that the Wikia admins will prevent the community from deleting or overwriting content with a pointer redirecting to the new place? This does seem extremely likely if there's significant search engine traffic and ad revenue.


Yes. Now imagine if you had a blog on Blogspot and you wanted to host it on your own site instead -- and Blogspot prevented you from deleting your posts because they brought Blogspot good ad revenue?


I noticed this one. Before you click, guess which result is at the top: http://www.google.com/search?q=share+bookmarks Is the first one the one you want to see? Ha.


I don't think that Google needs to determine the originality of any one piece of content, but rather the tendency for a site to feature content verbatim from other sources.

The sites Google seems to be targeting are those that aggregate content wholesale from a number of sources. One way to identify these would be to examine the number of different sites from which a particular site appears to have copied its content.


Agreed. What I think they should target is the set of sites that just copy content, load their pages with Google Ads, and mint money. Those are the ones people are more likely to not want to see, rather than sites like Wikia.

In any case, this is a bold step for Google. It shows they still care about the quality of their search and have the guts to make big decisions to protect it.


My 4am thoughts on the matter...

If a site has N copies associated with it, then you could compare that to the average number of copies associated with sites of that pagerank. If the difference between N and the average is high, then that's a likely spam site. Let's call this difference σ.

This gives us a problem though, because while σ is a good indicator of spamminess, it's not foolproof. A low pagerank site could have been copied from lots of times, which would unfairly earn it a high σ.

What we can do is calculate σ' by inversely weighing each copy by the σ of the site on the other end. A copy shared with a low-σ site will increase σ', and a copy shared with a high-σ site will have less of an effect on σ'.

So while our high σ may have initially suggested the site to be spammy, all its copies are from high σ sites, which are counted less towards σ', leading to an overall verdict of not spammy. [I wonder what would happen if you iterated this process]

As for your example, the smaller site wouldn't be penalized, because N would be low. If it were say, a full wikipedia mirror, then it would be penalized. Wikipedia would not be penalized, since it has an ungodly amount of pagerank. There's also no bias against smaller sites, since σ is calculated relative to pagerank.
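The scheme above can be sketched in a few lines. Everything here is a toy illustration of the 4am idea, with made-up site names, PageRank buckets, and copy counts, not anything Google actually does:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical corpus: each site has a PageRank bucket and a list of
# "copy edges" -- the sites it shares duplicated content with.
sites = {
    "smallblog":  {"rank": 1, "copies": ["wikipedia"]},
    "fullmirror": {"rank": 1, "copies": ["wikipedia"] * 40},  # wholesale mirror
    "wikipedia":  {"rank": 9, "copies": ["smallblog", "fullmirror"]},
}

# sigma: how far a site's copy count sits above the average for its
# PageRank bucket (so there is no built-in bias against small sites).
counts_by_rank = defaultdict(list)
for s in sites.values():
    counts_by_rank[s["rank"]].append(len(s["copies"]))

sigma = {name: len(s["copies"]) - mean(counts_by_rank[s["rank"]])
         for name, s in sites.items()}

# sigma': re-weigh each copy edge inversely by the sigma of the site on
# the other end, so copies shared with spammy (high-sigma) sites count less.
def sigma_prime(name):
    return sum(1.0 / (1.0 + max(sigma[other], 0.0))
               for other in sites[name]["copies"])
```

In this toy data the full mirror ends up with both a high σ and a high σ' (all of its copies are shared with a clean site), while the small blog that got copied into Wikipedia stays low on both. Iterating the process, as suggested above, would amount to a PageRank-style fixed-point computation over the copy graph.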

What have I missed?


If you're interested, there are plenty of research papers on the topic of determining the canonical source of duplicated content online.
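A common building block in that literature is w-shingling: compare documents by the overlap of their word n-grams. A minimal sketch (my own illustration, not any specific paper's method):

```python
def shingles(text, w=4):
    """All overlapping w-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a, b):
    """Jaccard similarity of two shingle sets; near 1.0 for near-duplicates."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

Resemblance only tells you that two pages are near-duplicates; deciding which one is canonical needs extra signals on top, such as first-crawl date or link structure.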


These days they crawl sites often enough to know where the info appeared first.


When PageRank was created, it worked really well at finding the most relevant sites. The old search engines such as AltaVista and Excite were drowning in spam, since the SEO experts had figured them out.

Does anybody else feel that Google has been figured out now, and that they are now just applying patches to a system that is fundamentally broken (in the same way Excite and AV did)?

There might be an opportunity here for a new search engine - one that creates a new core method of ranking, as PageRank was, to filter out duplicate content, content farms, malware sites, etc.

"Relevance = links" today looks as tired as "relevance = keyword density" did 10 years ago, but I have no idea what the new "relevance = " is


> Does anybody else feel that Google has been figured out now, and that they are now just applying patches to a system that is fundamentally broken (in the same way Excite and AV did)?

I don't. I think they do a much better job of actively improving their offering. No matter what the system is, someone will always try to game it. Even if the people running the system are very smart and try very hard, the smart gamers will (briefly) succeed. That's ok, though - it pushes the people running the system to work harder to improve it. It really is in Google's best interest (IMO) to return great search results, so I think they'll continue to improve what they do.


The original PageRank hasn't been used for a very long time-- the current system is a lot more sophisticated and blends many different ranking algorithms. I think that Google has done a good job of adapting to both the changes in web content and SEO scams. I do occasionally see noise in my search results, but there's always something useful among the top 3 links. This is quite different from the bad old days of Altavista, where you could get pages of outdated or irrelevant garbage.


There is a thing in HCI called the End-User Model Fallacy. Engineers try to create an interface based on a prototypical end-user, and the real end-users then spend time figuring out how they think the interface works. It becomes a vicious cycle: the engineers think they are improving the interface, but in reality they are just forcing the real end-users to come up with a new model of the interface that they can use to get their work done.

This Fallacy was first described to me in Brenda Laurel's Computers as Theatre. http://c2.com/cgi/wiki?BrendaLaurel


Google is the company best positioned to figure out this "new" search engine. It's vastly cheaper and easier for them to continue tweaking their current system than it would be for someone else to code a search engine from scratch. IMHO there exists no silver-bullet approach to search; instead it's an ongoing quest/war consisting of infinite trials and errors.

Also, I am not sure that the spam problem is as bad as people claim it to be. I have looked at virtually all of the lists of content farms that have been published since Google released their Personal Blocklist extension for Google Search, and it seems to me that (a) it's all very subjective and people rarely agree that a particular site is a content farm, and (b) therefore the number of sites that people do agree are farms is no more than a dozen.


The "fundamentally new" system for defining relevance is the social graph. It's why Facebook and Twitter are so hot.


That's partly true, but given the way that the social graph is used today, I don't think it can satisfy the entire need.

The social graph is essentially a personal thing, not professional. Assuming that we see a need to partition our professional and personal lives (and that should be true, in order to protect both sides), then the professional side isn't going to be reflected well in the personal graph.

So long as our networking tools don't recognize this split, they are going to be (relatively) starved for content that is strongly oriented towards the professional world. For example, I wouldn't expect my wife to be able to find deep information about Medicare reimbursement if we relied on mining social sites.


For a lot of professional needs, the most valuable information is hidden in the deep web. For example, many MIS/DSS/ERP/CRM/TLA systems don't provide a lot of public information about how they work, so if you were trying to evaluate the marketplace, it would probably take you 3-6 months of research.

For example, I work at a healthcare services start-up, and we have one person full-time researching our competitors, figuring out how to differentiate us in the marketplace, and working out what keywords someone might search for to find this kind of product. We're mainly mathematicians and smart programmers who found a way to break into the market initially, not marketers, so our terminology doesn't match the domain experts' well.

Anecdotally, I think most users are VERY bad at ranking how valuable information is. For example, we'll get paid $100,000/year for something that takes 1 week and provides zero insight into their business, and then they will refuse to pay us at all for something that makes them $30,000,000 a year! This is so common that I wonder if it's more a symptom of human nature than anything specific to our customers.


I am really curious: what makes them $30,000,000 a year? Are they getting it for free, or are they losing out on the income?


Fixing their billing procedures and optimizing their price list for services. This is pretty tricky stuff and easy to get wrong, because there are many factors that influence how hospitals get paid, such as stoploss rules, carve outs, worker's compensation groupers, various mother-baby rules, etc.

Oftentimes hospitals receive $0 for something they should be getting $10 million per year for, just because nobody ever contests the NOPAY response. Finding that sort of error isn't easy.


Brilliant use of crowd-sourcing to gather a test set via the Chrome plug-in. Makes absolute sense; otherwise you can overfit. Great turn-around time, too.


From the article:

> It’s worth noting that this update does not rely on the feedback we’ve received from the Personal Blocklist Chrome extension


The implementation of the algorithm didn't rely on it, but they did test it using that data.

>However, we did compare the Blocklist data we gathered with the sites identified by our algorithm, and we were very pleased that the preferences our users expressed by using the extension are well represented. If you take the top several dozen or so most-blocked domains from the Chrome extension, then this algorithmic change addresses 84% of them, which is strong independent confirmation of the user benefits.


edited: to respond to abraham's post directly

There's a difference between training with that data and testing against it, which is precisely what I wanted to highlight here, since many expected Google to use such data for training. You can't do that, since you can overbias/over-fit given what little data we plug-in users are able to produce versus the amounts they have access to.

2nd edit: it should also be noted that the 84% figure they mentioned for the extension data is a measure of recall. Assuming Google wants to do a better job with precision in these cases (they don't want many false positives), that recall is still fairly good to my mind, since I'm sure there's some noise in that data.
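To make the recall/precision distinction concrete, here's a toy calculation. All the sets and counts are invented for illustration; only the ~84% recall figure comes from Google's post:

```python
# Invented example data: 50 "most-blocked" domains, of which the update
# demoted 42, plus one innocent site caught in the net.
most_blocked = {f"farm{i}.example" for i in range(50)}
demoted = {f"farm{i}.example" for i in range(42)} | {"innocent-blog.example"}

caught = most_blocked & demoted
recall = len(caught) / len(most_blocked)   # share of blocked domains the change addressed
precision = len(caught) / len(demoted)     # share of demoted sites users actually wanted gone

print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# -> recall = 84%, precision = 98%
```

The 84% Google reported is the first number; the blog post says nothing about the second, which is exactly what the false-positive worry in this thread is about.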


I know this is not a popular opinion here, but I'm not sure I like Google reacting to what the tech-blogger echo chamber talks about. Are we sure this is what is good for regular users and for the creation of quality content on the Internet? This group is not representative at all.

There seems to be enough "normal" spam in the search results that Google should be focusing on first. Just yesterday I searched for "viking dishwasher clog" and the #6 result is a .info page that is nothing but obvious spam.

I previously mentioned the issue of Wikipedia poorly regurgitating content from original creators (sometimes with attribution, sometimes without) and outranking the original page, even when that page was an approachable, illustrated, in-depth article and Wikipedia's content was much weaker. This is not something that tech bloggers will notice: they are much more likely to read Wikipedia's programming, science, math, or tech articles, which are of much higher quality than those on other topics, so you won't hear them complaining when they see Wikipedia at #1 everywhere.

Another bias I started seeing lately is the high ranking of new Q&A sites (another focus of tech bloggers), while excellent topical forums with very good content, often with exactly the answers I needed, rank poorly. I'm speculating here, but I think Google's focus on links might hurt its ability to surface deeply hidden content from forums. These forums won't get many links from tech bloggers and others looking for the next big thing, but they have very valuable content for many long-tail searches.


Is the change live yet?

I remember an example provided here was "nstoolbar bottom bar". Let's take a look: http://www.google.com/search?q=nstoolbar+bottom+bar

Granted, it was probably linked from HN somewhere, which will have made matters worse.


Yes, the change is live (rolling out slowly through the day). A ton of large sites have already been affected. It's a pretty big change.


Not to nitpick, but...

    If you take the top several dozen or so most-blocked domains from the Chrome
    extension, then this algorithmic change addresses 84% of them, which is
    strong independent confirmation of the user benefits.
Well, what if they are mistakenly penalizing Common Jack's blog because it gets re-posted across the interwebs? I bet no one is manually blocking Common Jack's blog, so that factor is lost.

Example: let's say Google just randomly penalizes 84% of ALL domains. Chances are that set covers, you guessed it, 84% of the Chrome extension's most-blocked domains. Independent confirmation!
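That base-rate argument is easy to check with a quick simulation (domain names and counts here are made up):

```python
import random

random.seed(1)

all_domains = [f"site{i}.example" for i in range(10_000)]
top_blocked = set(random.sample(all_domains, 50))  # stand-in for the extension's most-blocked list

# "Algorithm" with zero signal: penalize a random 84% of all domains.
penalized = set(random.sample(all_domains, int(0.84 * len(all_domains))))

addressed = len(top_blocked & penalized) / len(top_blocked)
print(f"{addressed:.0%} of the most-blocked domains 'addressed' by pure chance")
```

The overlap lands near 84% every run, which is why the overlap figure alone says nothing about how many unblocked, innocent sites got demoted.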


That is not what it said at all. It said that of the 36 or so domains that were the MOST blocked, 84% were lowered in the search results, which means roughly 6 crappy domains escaped being penalized. With that high of a "win", they might affect 1 or 2 domains on the entire Internet that should not be affected, but I am sure it will eventually be corrected for those domains. This also means that many more than 36 domains will get punished for poor tactics.


The problem for Google is that any non-algorithmic approach (using a feedback mechanism) will just be abused by the SEO industry - they'll just crank up mechanical turk to start reporting anyone ranking above their clients.


I'm not advocating that, I'm just saying that the "independent confirmation" tells us little about the innocent sites that get negatively affected. They are, by definition, excluded.


And this somehow affects only 12% of searches?


True. It tests for completeness, not purity.


I really hope it will change for the better, but I don't trust Google anymore:

Small example: when you search for "VLC" on Google, you find scam websites shipping VLC with a toolbar/crapware/malware, whether or not they buy AdWords...

Google plainly refuses to remove those websites, because they give Google money. So Google makes money from scammers, knows it clearly, and does not want to do anything about it...

And as we are a very small team of volunteers, it is impossible to attack those websites...

NB: this happens with other software too.


> Google plainly refuses to remove those websites, because they give Google money.

You may be right about what's happening, but I don't think it's fair for you to assert as fact what you believe their motivations to be. Matt Cutts and others have categorically denied these claims, and provided good reasons.


Then I am all ears as to why Google refuses to take down, when asked, scam pages that bundle open source products with toolbars and other adware/malware.


Strangely enough, your primary example always directed me to videolan in the past, and continues to do so.

I very much prefer DDG these days, but those statements are just not true. And I do not believe that Google would purposefully provide _much_ _worse_ results for US customers, relative to EU ones.


The English results are the only ones that are fine, because those websites are in English...

Try .fr, .it, or .es and you'll see...

And on .us, just deactivate your adblock.


Same for me, the entire page is videolan.org plus wikipedia and one cnet page. And I have no ad blocker.


Google works hard (or maybe not so hard) to keep their front-page UI ultra clean. I do think users could benefit a lot from a small drop-down arrow under the search field that opens another field for the words searchers do not want included in their results. Most of us in the Hacker News community know this can be more easily accomplished by putting a "-" (minus sign) before such terms; we know how to use "-" and "site:..." to narrow query results. However, I'd say 50% of Google users don't know these tricks and waste time navigating to the advanced search link.


There is an advanced search link next to the Google search bar that provides the functionality you described for non-power users.


I mentioned that in the last line. I just feel it's not efficient for those who want to leave out certain terms, given that "-" is the most common query alteration (or maybe second after "+").


For me, it's too little too late. I know a lot of people who have already moved to different search engines. Google lost a ton of money in the process. They didn't help themselves any when it was revealed that JC Penney had successfully gamed Google for almost an entire year, and Google had no idea until the NY Times broke the story.


You may well know a lot of people, but I bet 100% of them are serious tech-interest people, which accounts for a tiny, tiny % of the internet. You could argue that tech people drive the trends that spread out to the masses, but in this case I think it's just bashing the big guy because he's an easy target.

The idea that most people really care about the purity of their search results, let alone would understand what all this means even if the BBC or NYT ran a proper article is just unrealistic.


What other search engines have your friends moved to?


While I don't agree with the opinion expressed or the tone it is expressed in, I do think the JC Penney story sounds interesting.

Personally not having heard of it before, I think this comment (accidentally?) contributes somewhat to the discussion. Has HN really become this trigger-happy on the downvotes?


Is there any reason to change the algorithm, initially, only in the US?


It looks like I have got higher up the ranks for some of my keywords, so the change must be good ;)


How will this impact sites like Rotten Tomatoes?

They have little original content, but are still a useful source.


It'll be just as interesting to see how this affects Bing rankings.


It should affect them automatically. We know Bing uses clicks on Google's search results, collected via the Bing toolbar, as a ranking signal. If a site's ranking is reduced by Google, the same should happen (maybe to a much smaller extent) in Bing, since that signal is now weaker.


It doesn't have to, and I believe you are mistaken/oversimplifying.

As far as I understood it, the whole Bing ordeal was that users with the Bing toolbar reported not the links and rankings shown, but which result the user chose as the "correct" hit for that search.

In that regard, this doesn't really have to alter Bing's results in any way at all.



