Reconsideration Requests

Google+ Hangouts - Office Hours - 06 October 2015


Transcript Of The Office Hours Hangout

JOHN MUELLER: All right. Welcome everyone to today's Google Webmaster Central Office Hours Hangouts. My name is John Mueller. I am a Webmaster Trends Analyst here at Google Switzerland, and part of what I do is talk with webmasters and publishers, like the people here in the Hangout. Today I picked as a topic something that comes up all the time, namely duplicate content. So I prepared a very brief presentation on what we consider duplicate content, where it might cause problems, what you can do to kind of resolve that. And with that, we can take a look at the questions that were submitted so that we can go through the open questions around that, too. All right. Let me try to find the right window, and I can share that with you. All right. So duplicate content. This is something that affects pretty much all websites. Pretty much all websites have some kind of so-called issue around duplicate content in the sense that they pretty much all have something where you can have multiple URLs that lead to the same content. But let's take a look at some of the specifics. So first off, to get started, if you have questions around duplicate content, I highly recommend checking out our Help Center. You can search for duplicate content there, and you can search for duplicate content on our blog. We have a number of blog posts around duplicate content, because like I mentioned, it's something that does come up again and again. So talking about duplicate content, what do we call duplicate content? What do we see as duplicate content? Essentially, if a page or if something on a site is duplicated somewhere else. And a lot of times, the normal websites have duplicate content around www, non-www, or HTTP, HTTPS. Or maybe they have /index.html for the content. Or perhaps they have tagged URLs for analytics, or for conversion tracking or whatever. 
A lot of sites also use separate, mobile-friendly URLs or they have printer-friendly URLs, or they host their content on different CDNs to reach the audience a lot faster. All of these things are essentially duplicate content. It's the same content and there are multiple URLs, but it's not something that I'd say is really problematic. Tag pages are often a source of duplicate content as well, where under multiple tags you'll have the same blog post. Press releases, if you give content out, if you syndicate content, all of these lead to duplicate content as well. And pretty much all websites have these kinds of things. So it's not something that you can say, my website doesn't have any duplicate content. Or where you'd be able to go to someone and say, well your website has duplicate content, therefore you're doing something wrong. This is essentially a part of the way that the web works, and we have to be able to deal with that in various ways without causing any problems for the web. Because we have to live with what we find on the web. Following that, things that aren't duplicate content: translations. So if you have content in German and English, even if it's the same content but translated, that's not duplicate content. It's different words. Different words on the page. These are translations, so it's not naturally a duplicate. Different pages that just have the same title or description meta tag, that's not really something that we really worry about. If you have content on your website, and in an app, or mobile-friendly pages, that's not something you'd really need to worry about there. And localized content is also generally not seen as duplicate content in the sense that, maybe it makes sense to have the same content available for multiple countries. We would try to recognize that it's the same text, but we wouldn't say, well, this is an exact duplicate. 
Because actually there's information about the users that you're targeting, which is somewhat different. And the reason I add sometimes here, is that sometimes we do think we can just fold them together and treat them as one anyway. So, just a very brief run through of some of the technical things around duplicate content, where that comes up, how that happens. This is a somewhat simplified view of web search. If you follow these steps, you can set up your own search engine. It's really quite easy. Essentially, we start off with a bunch of URLs that we know about. We schedule them for crawling. We can't crawl the whole web all the time. Then Googlebot goes off and crawls the internet, brings that to the parser. We use that for indexing, store it in our index, and then make it available for search. So, couple steps here. And duplicate content plays into multiple places here, which kind of makes sense once we see that. Because there are different ways that we can handle duplicate content, that we can recognize duplicate content. And at the moment we pick that up on the one hand in the scheduler. So before we actually crawl anything from the website or crawl those specific URLs. Then also for indexing, when we actually have the content from the website. We can of course recognize it's duplicate. And then, in the search results we can recognize that it's duplicate as well. So during crawling, or before crawling rather, the problem here is if we recognize that something is a duplicate, and we can save us from crawling that, we end up wasting fewer server resources. We end up wasting fewer of the crawl cycles that we allocate for your website. So we can use the URL parameter handling tool to recognize that these URLs are actually kind of the same. We recommend not using robots.txt, I'll get into that in a bit as well. And we have some smart systems that try to recognize the situation as well. 
Where we can recognize, well, these URL parameters or this kind of a path structure is probably a sign that it's duplicate, or a sign that we already have that content and we don't need to crawl it again. And what comes up again and again in this presentation is this is not a penalty. There is no duplicate content penalty for things that are duplicated like this. Duplicate content during indexing, so this is after we've taken the steps to crawl the content from the website. Keeping the duplicates there is a waste of our storage and the resources. We don't really need to keep a copy of exactly the same thing we've already seen before. So we fold that together and just keep one. And the tricky part here, as I mentioned in the beginning, is localization. What happens if you have exactly the same content, but you have one version for the UK and one version for Australia or for the US, or something like that? And in those cases, we do sometimes have to take into account signals that we have from other places and say, well, it's the same content, but actually it's for different countries. It's for a completely different audience, perhaps. So it's worthwhile to keep two copies of that content. And finally, we see duplication in the search results. So it might be that the whole page itself is kind of unique. But a specific block of text on the page itself is duplicated across multiple pages. That could be, for example, a description of a product. That could be that your article or your press release is on one site, but actually also on other sites. And in cases like that, what we'll try to do is filter out all but one of these copies, and show that one version in the search results. And then, at the end of the search results we'll have this link that says that we've omitted some entries that are really similar. If you want to see them anyway, you can click here. And again, this is not a penalty. 
It's just that we're trying to fold things together and pick one of those URLs from your website and show that. So we kind of heard there's no penalty around duplicate content. But, there still are some problems that could be resulting from duplicate content. So it's still recommended to clean these things up. On the one hand, unnecessary crawling is a pretty big problem sometimes. Sometimes we'll see that we crawl 100 copies of the same URL from a website. And we try to index it like that. And that's a lot of unnecessary work that we do for a website that can cause a lot of load for the website. And in turn, can also result in us not being able to pick up new content quite as quickly. So that's one aspect that could play a role there. Another aspect is that it's really a lot harder to track metrics. Especially if you're not sure which URL is actually showing in the search results. So maybe you have www or non-www, and you have metrics set up separately on your server. Or maybe you use Search Console to track metrics, and some of the visitors are hitting the www version and some visitors are visiting the non-www version. And you have these tracked as two separate sites. So it makes it a lot harder to keep track of what actually is happening there. And sometimes you have a strong preference, and you say, well, this is the URL that I really like. This is the URL I use on all my advertising. Why are you showing the other version, Google? And if you do have preferences like that, by all means let us know. So getting back to the penalties, because this is something that people always worry about. We do have some things around duplicate content that we'd say are spam or penalty-worthy in the sense that the webspam team might take manual action on this, or our algorithms might take action. On the one hand, these are things like scraper sites that just copy content from various other sites. 
Content spinning and aggregation, automatic translation, or even manual rewriting of content, that can be pretty spammy. And of course, doorway pages, doorway sites get into that same area as well. Where you take essentially the same content, you tweak it slightly, and you make it available in thousands and thousands of variations. And the thing about these things is, it's not that we're penalizing them because they're duplicate content. But we're penalizing them because it's just spam. It's just a bad tactic, bad thing to do. And it causes problems for our search results, so sometimes we do have to take manual action on it. Ways to recognize duplicate content are kind of useful as well. Because as a webmaster you might not even realize that you have duplicate content. I guess, in many cases if you don't realize that you have duplicate content, then probably it's not causing a big problem anyway. But it's always nice to clean these things up, I think. Sometimes we show the other URL in the search results. So if you search for something that you think your site should be showing up for, and you see a URL that's not quite what you actually wanted, maybe that's a sign that we're picking up duplicates and picking the wrong URL. In Search Console you have the HTML suggestions feature that gives you information about duplicate titles and descriptions. Another aspect where you might see this happening is if we're crawling a lot more than your site actually has content. So sometimes we'll see things like, we're crawling 100,000 pages every day, but your website has just maybe under 1,000 pages actually. So there's this discrepancy between what we crawl and what we actually index, or what your site actually provides. If you recognize that kind of a discrepancy, then potentially there's something around duplicate content that could be causing a problem there. One question that comes up a lot as well is, what about affiliate content or content that you syndicate on purpose? 
So you bring out a press release, or you provide it to partners, or you have to reuse content because you're working together with a provider that delegates or designs exactly the content that you have to provide. Our recommendation there is to make sure these pages stand on their own, so that they do provide some kind of unique value add by themselves. So if you sell products as an affiliate, and you have to use exactly the same description as other sites as well, then make sure that your shop provides some kind of additional value that's beyond what everyone else is doing. If you can't provide additional value, and you really can't resolve the duplicate content issues, maybe it makes sense to noindex some of that content. And finally, there are always some kinds of duplicate content that you can't really avoid completely. And generally, we just take care of these for you. So, as I mentioned before, we sort them out during indexing, or during the search results, during the serving of the search results. So we try to clean that up for you as well. Things you shouldn't do for duplicate content, before I get back to what you can do. We recommend not using robots.txt to block duplicate content. The reason behind that, which is perhaps hard to recognize at first, is that with a robots.txt block we don't know what's actually behind this URL. So we can't tell that it's a duplicate piece of content. We can't fold it together with other copies of this content, because we just don't see what's there. So if we can't see anything that's actually there, we won't know that we can fold the signals together. That we can track all the links as one URL. We don't really know what to do with that. So we'll probably go off and try to index this URL without actually being able to crawl the content. Which is even worse, I think, than having some kind of duplication in the search results. 
So I'd recommend using robots.txt in situations where you really have a severe server resource issue. Where crawling your server creates a problem for the other users of your site, and you really need to limit the crawling because it causes technical problems. But I wouldn't use it for just generic duplicate content that you might have on your website. We also don't recommend just artificially rewriting content. So if you can recognize that it's duplicate content, and you just tweak some of the letters around to make sure that it looks unique, that's probably not really the best use of your time. It quickly gets into spammy, rewriting content situations. And the thing there is also that if we recognize that it's duplicate content, and we fold those pages together, we'll make a much stronger page out of that than if you dilute your content, dilute your signals across multiple URLs where we can't recognize that they're actually duplicates. We also don't recommend using the URL Removal Tool. So in Search Console there's a way to request urgent removal of individual URLs or parts of your website. And the problem here is that these tools just temporarily hide the URLs in the search results. They don't really change how they're indexed. So essentially, you're taking this duplicate content problem and you're just hiding the version that we'd show in the search results, and nothing shows up in search. Which is probably not what you're trying to achieve there. So, if you have a duplicate content problem, don't use the URL removal tools. Instead, we have a bunch of other suggestions here. So the first one that I think is really important is being consistent. Make sure that you have one URL per piece of content, and you use that exact URL everywhere. So within the site map, you use the same URL, for the canonical tag you use the same URL, hreflang annotations, internal links, everything where you refer to that piece of content, use exactly the same URL. 
Because the more we can recognize that you really, really want this specific URL to be used, then the more we'll be able to follow your lead and actually do that. As much as possible, avoid unnecessary URL variations. It kind of goes into the previous one as well. If your CMS automatically generates /node or /index.html or those kinds of variations, then that's something that maybe they'll get linked to. Maybe we'll try to index them like that. If you can avoid that by using the right settings in your CMS, then that's something worth doing there. To clean them up after you've recognized that you have a duplicate content issue, using 301 redirects really helps us to pick up on those changes. Using the canonical tag helps us. Using hreflang helps us. This is kind of, again, the same as the first point, making sure that all the signals that you provide us point at exactly the URL that you want us to use. Search Console has other settings that you can use on a broader level. You can set a preferred domain between www and non-www. You can use the URL parameter handling tool if you have complicated parameters in your URLs. You say, well, this parameter is important, but this other one is less important. You can kind of fold that together. That's information you can give us there. And finally, using geotargeting and hreflang where relevant helps us to understand that these pages might be very similar, or even the same, but actually they're for different purposes. So that also helps us to understand a little bit better what you're trying to do there. All right. So, with that I'm kind of at the end of the presentation here. Again, these are the recommended searches to kind of look for: duplicate content on site, or site googlewebmasterc… They'll lead to various help center articles and blog posts that we have out there that help you to tackle the duplicate content problem, to explain what might be happening there, and to figure out what you can do to improve things. 
All right. With that, let me click the stop button here. Do you guys have any questions about these slides for the moment?
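The URL variants listed in the presentation (www vs. non-www, HTTP vs. HTTPS, /index.html, tracking parameters) can be collapsed in code. The sketch below is purely illustrative, not a Google tool; the tracking parameter names and the preference for HTTPS and non-www are made-up, per-site choices:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Illustrative tracking parameters; a real site would list its own.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(url):
    """Collapse common duplicate-URL variants into one canonical form."""
    scheme, netloc, path, query, _ = urlsplit(url)
    scheme = "https"                      # prefer HTTPS over HTTP
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]     # prefer non-www (a per-site choice)
    if path.endswith("/index.html"):
        path = path[:-len("index.html")]  # /index.html -> /
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

def redirect_for(url):
    """Return (301, canonical) when a request should be redirected, else None."""
    canonical = normalize(url)
    return None if url == canonical else (301, canonical)
```

Serving the 301 for every non-canonical variant is the "be consistent" and "use 301 redirects" advice from the slides expressed as one rule.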

MIHAI APERGHIS: John, some CMSs, especially e-commerce CMSs, have a bit of an issue in this department, especially when it comes to product URLs. When products are listed in multiple categories, some CMSs just create a different URL based on the category you accessed the product from. And usually they try to fix this by using a rel=canonical to a URL where no categories are present. However-- and I know we talked about this before, but since we're on the subject-- the URL without any categories isn't really accessible by a user-- by a normal user browsing the website from anywhere. So is this a good idea? Or should-- a better option would be just not to use any categories in the URL at all for anything? And just 301 redirect them to the no-- to the simple URL--

JOHN MUELLER: Yeah. So ideally we really have all the signals come together and point at one version of the URL that we can kind of pick up. That means the internal linking, the site maps, all that kind of comes together and falls into a single URL. Obviously in a case like this, with an e-commerce website where you have multiple categories for the same product, that's not easily possible. Because you want to go from that category page to that product, and then from there navigate back to the category page again without getting lost. So in a situation like that, using a rel=canonical is an alternative. That definitely makes sense. I guess it's just worth keeping in mind that depending on what other signals we pick up from that website, we might not always be able to follow that rel=canonical, because maybe we'll see, well, this is the category that everyone uses, or people link to externally. And we'll just show that page instead of the generic page.
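The pattern described here, where every category-scoped product URL declares one category-free canonical, might look like this in a template helper. The domain and path scheme are invented for the example:

```python
def canonical_tag(product_slug):
    """Emit one category-free canonical URL for a product, no matter
    which category path the page was reached through."""
    return f'<link rel="canonical" href="https://example.com/products/{product_slug}">'

# Pages at /mens/shoes/air-runner and /sale/air-runner would both carry the
# same tag, so signals for the product fold onto one URL.
```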

MIHAI APERGHIS: But will you still fold all of the signals in that URL?


MIHAI APERGHIS: OK. So it's just an issue of which URL you're going to show in the results.

JOHN MUELLER: Exactly. It's mostly just the URL that we show. It's not that your site will rank differently in cases like that.

MIHAI APERGHIS: OK. Yeah, good to know.
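The signal-folding idea in this exchange suggests a simple self-audit a webmaster could run: collect every URL the site itself emits for one piece of content and check that they agree. The signal names below are made up for illustration:

```python
def inconsistent_signals(signals):
    """signals: dict of signal name -> URL string. Returns the set of
    distinct URLs when they disagree, else an empty set."""
    urls = set(signals.values())
    return urls if len(urls) > 1 else set()

# Example: a stray trailing slash in internal links is enough to split signals.
signals = {
    "sitemap": "https://example.com/widget",
    "canonical": "https://example.com/widget",
    "internal_link": "https://example.com/widget/",
}
```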

JOHN MUELLER: All right. More questions about this before I move on to try to go through the submitted questions?

ROBB YOUNG: I'll ask a question.

JOHN MUELLER: All right.

ROBB YOUNG: It's not going to be about my stuff. Are you more concerned with duplicate content? Or who created it first? Or both equally? Because I know you've said in the past, if a really good author posts something on his own blog, but it's a tiny blog with no followers, that's still his unique content. But then if the "Huffington Post" shares it, they might get 1,000 people commenting and joining in on the discussions. In that instance, you might get more benefit out of the repeat, duplicate article than the original. So how do you guys look at that?

JOHN MUELLER: So in a case like that, what we would do is we would obviously index both of these versions. Because they're unique pages. They have unique value on their own. And we try to figure out, depending on what people are searching for, which of these versions is the best one to show in the search results. And it might even happen that we show both of them in the search results at the same time. So, that's kind of a trickier situation in the sense that we'll try to figure out, based on the query, based on what the user is actually looking for, which one of the versions is the most interesting one to show. And that could be, for example, to take an extreme example, an author publishes something in Italian. The "Huffington Post" in the US picks that up, and posts the Italian article with an English commentary. If someone is searching in English for that Italian text, then we might show that English article about that Italian text. So, we kind of try to pick up the additional information there. And similar things, for example, you will have around e-commerce, where maybe you have the same description as everyone else, but if your e-commerce site is in the country where the user is searching from, then we might be able to say, well, this is closer to the user. This kind of matches their intent a little bit better, therefore we'll show the local version, even though it has exactly the same description as maybe a global version as well. So that's where we kind of try to pick up the subtle signals as well and bring that into the search results.

MALE SPEAKER: Hey John, do you measure any percentage of duplication of overall website as a negative factor?

JOHN MUELLER: Any percentage of the what? I'm sorry.

MALE SPEAKER: Of the duplicate content.

JOHN MUELLER: Of the duplicate content. I don't think so. I mean, we try to recognize situations where websites pretty much only consist of duplicate content. Which is a strong sign that it's probably just a scraper site. But if it's a normal website, and you have duplicate content, because that's how your CMS creates this content, then that's not something where we would say, well, 20% duplicate content is better than 40% duplicate content. Because these are more technical issues, and they don't necessarily mean that the quality of the content is worse.

MALE SPEAKER: I'm coming from an e-commerce background, where most of the products are pulled in by the sellers in the marketplace. They always believe that having rich content copied from, like, Amazon and different websites always gives them a better commission rate. So we can catch them if we have a limited number of products. But with millions of products, how will we catch that if it is, like, a product detail page on an e-commerce site?

JOHN MUELLER: So this is something where I suggest that you provide some kind of unique value of your own within maybe this kind of given content that you have there. And that could be, if you're an e-commerce platform, maybe you have something that you can offer that other platforms don't offer. Or maybe you have other things that are unique, in that maybe you're targeting a specific country. Or maybe you kind of handle all of the payments for the sellers and the buyers. Those kinds of things kind of add up. So that when we look at this website overall, we see, well, there's a lot of duplication here. But actually these pages have merit on their own. And they're valuable for us to index like that and to show in the search results like that.

MALE SPEAKER: OK. And again, especially in the clothing category, and a product category, we use a size chart as, like, a common boilerplate content, I would say here. Like across the website for millions of products. So I have, like, the full size chart duplicated. And below that, the product content is also duplicate content. And I would say only about 1% is unique to those categories. So we can add all the information, but how does Google treat that as a category?

JOHN MUELLER: I mean we try to figure out which parts are unique and which parts we need to kind of value separately. So that's really hard to say in general. It'd probably be worth taking a look at the website to see what exactly is happening there. So do you have reviews? Are people interacting with this content? Is this something where you provide something overall? That's a little bit more than just the product itself that's also duplicated.

MALE SPEAKER: Yeah, one last question. So, content above the fold and below the fold. How do you weigh the value if there is duplicate content above the fold, and unique content below the fold?

JOHN MUELLER: We don't look at that so much. So we try to look at it on a per block basis, kind of. Where if someone is searching for something within a text block that's duplicated, then we'll kind of put all those pages into the same bucket. And see which one is the best one to show them. And it doesn't really matter if that text block is above the fold, or below the fold, or somewhere in the middle. We'll try to figure out which of these versions of this content is the best one to show to your user.
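The per-block comparison described here could be approximated with word-shingle fingerprints. This is a generic near-duplicate-detection technique, sketched as a toy, and not a claim about Google's actual pipeline:

```python
import hashlib

def block_fingerprints(text, n=8):
    """Hash overlapping word n-grams ("shingles"). Pages that share many
    fingerprints likely share a duplicated text block."""
    words = text.lower().split()
    return {hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(1, len(words) - n + 1))}

def shared_blocks(a, b):
    """Count fingerprints two pages have in common."""
    return len(block_fingerprints(a) & block_fingerprints(b))
```

Because the shingles are position-independent, a duplicated block is detected the same way whether it sits at the top or the bottom of the page, which matches the point about above versus below the fold.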

MALE SPEAKER: OK, cool. Thanks.

JOHN MUELLER: All right. I copied some of the questions out. They're kind of in a similar order to what we have here. So maybe I'll just run through them the way that they're submitted. On an e-commerce site, is it OK to have a canonical tag on every page where necessary pointing to itself? Therefore enforcing the rule of the preferred page. We've done this to help tidy up duplicate content issues. Sure. That works. You can set the canonical tag to itself. That doesn't cause any problems for us. That kind of confirms to us that this is really the URL that you do want to have indexed. We have a CDN, but have found Google has indexed some pages which are accessed directly, so kind of through the CDN. This caused duplicate pages. How can we fix this? And will it affect the rankings until the pages drop out? So as I mentioned, kind of before in the flow chart, we do try to recognize this early during crawling. But sometimes we recognize it during indexing as well. Or sometimes we can kind of filter these out during the serving of the search results. So that's something where, off-hand, I wouldn't say that it's necessarily a problem, unless you really want to have your specific URL shown in search. And if you do have a strong preference for the URL that you want to have shown in search, then just use one of the techniques that we talked about before. Or as many as possible that you can combine there. So 301 to your preferred page, rel=canonical, all of those things add up for us to make sure that we pick the right version. We've set our internal website search pages to be noindex, follow. Is that correct? That's a perfectly fine use case, to use the noindex and follow. So what will happen there is we won't index those search pages. But we'll follow the links to the individual article pages. Does Google's algorithm take this into account as soon as it spiders those pages? Or when they drop from search? 
This is generally something that just happens regularly as we recrawl and reindex those pages. It's not something that happens all at once, where we say, oh, we recognize that this whole block of pages has noindex on it. Therefore we'll drop it from one day to the next. It kind of happens gradually as we recrawl things. Our e-commerce site shows the same product in different sizes. In our case scaffolding and platforms, but with different size options. Therefore our landing pages are showing what might be classified as duplicate content. What should we do? So this is similar to the other situations. You can fold those together, if you want, with a rel=canonical. Where you say, well, this is my preferred version of this page. I do want to have it indexed. It might also be that these individual versions provide significant value of their own, to kind of merit them being indexed separately. So maybe you have different sizes for this scaffolding, but you can roughly separate from large to small. And you have one landing page for large. And one landing page for small. And on there you list the various options. So it kind of depends on how you want to fold these together. How you want to see that this is actually the same content or not. One thing I would avoid doing there is using the rel=canonical or other duplicate content kind of mechanisms for pages that aren't really duplicate. So for example, if you have your scaffolding in one size on one URL. And you have it in a different size on a different URL. Then using rel=canonical across those versions will mean that we'll just have one size in our index. We won't know about the other size. So in those cases, the content really isn't identical. And by folding it together we kind of lose those variations. So if you can fold it to one version that has the different sizes on it, that's a great use. 
Or if you could say, well, the different sizes are really, really important, because people are really looking for this one specific size. Then maybe it makes sense to keep those sizes separate, actually. On an e-commerce site, when using pagination with rel=next and previous and a canonical tag, is there any benefit to using different title tags on the pages in the list or on the view-all version of the page? So I try to stick to the same titles and same headings on paginated sets. Because that also helps us to confirm that this is actually a paginated set. Whereas if you swap out the titles completely or the headings completely, then we'll look at that and say, well, it says pagination. But actually these pages are very different. Or they could be very different based on the title alone. So make sure that the titles kind of match the paginated set that you're looking at. On an e-commerce site, is it a bad idea to show an excerpt product description, but then reveal the full description with the excerpt included? It would be duplicate content, but the excerpt is used to drive viewers to the product. So it seems like something that you can do. I'm not quite sure what's meant with revealing the full content. If you're using something like JavaScript or CSS to hide a part of the page and to make it visible afterwards, it might be that we'll treat that hidden content as, well, hidden content. And then we'll say, well, this content is actually kind of hidden. So maybe it's not the most important content on the page. So if there's something within your product description that you say is really critical to be indexed and to be treated with a high value in search, then make sure that it's visible from the start. How to solve problems with duplicate pages for faster algorithm reaction? Should we set those duplicate pages to 404 or 500, or redirect them? So if you have duplicate pages, I'd recommend 301 redirecting them, using rel=canonical. 
All of those kind of individual signals to let us know that you have one preferred version that you want to use. Using a 404 or 500 might result in those pages dropping out of search faster. But then we would also drop all of the signals associated with that. So for example, if you have a page that has a bunch of links and you use a 404 to kind of solve the duplicate content problem, then those links are suddenly not attached to anything anymore. And we kind of drop them. Whereas if you 301 redirect those products together to a single page, then we can forward those links as well. And kind of as I mentioned before, this is mostly an issue of us showing the URL in the search results. It's not so much an issue of ranking. So just because you have some duplication doesn't necessarily mean that those pages will rank worse in the search results. We bought a website with recipes and all was fine. Beginning in 2015 we lost 50 percent of the traffic. Now we only have this little traffic. What can we do? I'd recommend posting to the Webmaster Help Forum for that. It's not really related to duplicate content. So I don't really have anything specific that I can share there at the moment. But I'd post in the Help Forum. And make sure you mention your site. Maybe some of the queries that were leading to your site before. And some of the queries that you're seeing now, to let people see what's happening there. And usually they'll have some tips for you on what you can do to move forward. Which pagination solution for a product list is best for website visibility? Noindex on subpages two, three, four? Leave them all indexed and use rel=next and rel=previous? So mixing rel=next and rel=previous with noindex doesn't make sense. Because then it's not an indexed set of pages. So that's one thing to keep in mind there. With regards to the best solution, it really depends on your website and what you have available there. So we have a lot of options for pagination and handling sets like that. 
I'd recommend checking out the blog post and the Help Center article that we have on that for some tips on what could be done. And looking at your website and thinking about what actually does make sense or what doesn't make sense. For example, if you're selling thousands of different pairs of shoes, then probably a view-all page is not going to be very useful on a website like that. We have a website that has many pages. One for each baseball player. Each page displays statistics and mostly the same text. The content would only differ on the statistics shown. How is this treated? Well, we treat it as unique content. It's not that we just say, look at the text and see if the same words are on the page. We do try to understand, is this page unique? Or is this page not unique? And if the statistics are different, then we treat that as unique content there. What would probably be trickier is if you're searching for something like batting average. And we know that all of the pages on your website have the words batting average on them. And it's all shown in exactly the same way. Then it'll be hard for us to figure out, well, which one of these pages is the most relevant for the words batting average? But at the same time, for a user going to a website that has statistics about all players, then searching for the words batting average, they probably don't have that much of an expectation with regards to which specific player they're actually looking for. If they're looking for a specific number, then maybe. But just the words batting average won't necessarily bring something really interesting to the user.
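For the rel=next/rel=previous option mentioned a moment earlier, the markup is a pair of link elements in the head of each page in the series. A sketch with hypothetical URLs — note that putting a noindex on one of these pages would break the pair, which is why mixing the two doesn't make sense:

```html
<!-- In the <head> of page 2 of a hypothetical paginated product list -->
<link rel="prev" href="https://www.example.com/shoes?page=1">
<link rel="next" href="https://www.example.com/shoes?page=3">
```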

MIHAI APERGHIS: John, would this be an example where you could use structured data, like a table, like you do with Wikipedia and other websites?

JOHN MUELLER: Sure. The easier we can recognize this information, the more likely we'll be able to guide users there. When we can recognize, well, this is talking about this attribute batting average. And it refers to the entity of this player, or this sport, or this country, or team, or whatever. And that can help us to pull that information together. Show it in the sidebar, perhaps. So that we can guide really well targeted users to those pages directly.
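A minimal sketch of the kind of easily recognizable stats table being described, where the attribute (batting average) and the entity (the player) are clearly associated. The player name and numbers are invented for illustration:

```html
<table>
  <caption>Jane Example -- 2015 Batting Statistics</caption>
  <tr><th>Batting Average</th><th>Home Runs</th><th>RBIs</th></tr>
  <tr><td>.301</td><td>24</td><td>88</td></tr>
</table>
```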

MIHAI APERGHIS: Also, you mentioned about this last time that you generally recommend using table markup for Google to pick this up. With the newer definition lists with defined terms, would this work instead of the table?

JOHN MUELLER: Sure. That works too.

MIHAI APERGHIS: OK. It would be useful to have a support page describing this.

JOHN MUELLER: I think we have the definition lists on there too. But I can double-check. I don't actually know completely.
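The definition-list alternative being discussed is the HTML dl/dt/dd structure, which pairs each term with its value. A sketch with invented values:

```html
<dl>
  <dt>Batting average</dt>
  <dd>.301</dd>
  <dt>Home runs</dt>
  <dd>24</dd>
</dl>
```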


JOHN MUELLER: How can we recognize that our website is low quality or has a duplicate content problem? What should we look for? I kind of went into this already with regards to the Search Console HTML suggestions. And seeing if we're crawling too much. And maybe seeing that we're showing the wrong URL in the search results. With regards to lower quality, it's usually something that you, as an expert on that topic, will be able to recognize more easily. Obviously there's a lot of room with regards to lower quality content, with regards to is this kind of OK or kind of bad. And usually what I recognize, or what I see from the cases in the forums, is that people will come to me and say, well, John, look, all these other sites are just as bad as mine. Why are you showing them and not showing me? And from our point of view, that's a really bad argument, because we can't take that to the engineers. We can't take that to the search quality team and say, well, we should obviously be showing this site too, because it's just as bad as all the other weird stuff we have in the search results. That's not really a strong argument to bring up. So in a case like that, where you see, well, everyone is just as bad as me, take the effort to make sure that your site is a really significant step above everything else. So that we can recognize that by not showing your site in this set of search results, we're doing the user a big disservice by kind of not providing something that's really relevant for them. Which pages are ignored by the algorithm? Those with noindex robots tags, blocked by robots.txt, or with a 404 header? Should we have internal dofollow links to those types of pages? So you can have internal links to any kind of page on your website. That's really up to you. That's not something that I'd really worry about, if it's within your own website. With regards to what we ignore, if we don't index it, then we don't worry about it with regards to duplicate content.
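For the mechanisms named in that question, a quick sketch of the two most common setups, with hypothetical paths. The key difference: a robots.txt disallow blocks crawling entirely, while a noindex robots tag lets the page be crawled (and its links followed) but keeps it out of the index:

```html
<!-- In the <head> of a page that may be crawled but should not be indexed -->
<meta name="robots" content="noindex">
```

```
# robots.txt at the site root -- prevents crawling of a hypothetical directory
User-agent: *
Disallow: /internal-search/
```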
But at the same time, even if we were able to index it and recognize that it is duplicate content, it's not really going to cause any problems. Because we'll just filter it out in one of the steps along the way. So if you can clean it up by using the techniques we talked about before, that's probably the best approach there, compared to forcing those pages out of the index. Just make sure that we can fold them together properly. Is it true that the duplicate content penalty doesn't exist? The only thing Google is interested in is delivering the best results to [INAUDIBLE]. Let's say someone has the same content on a page as my page, but it's a site that's more authoritative-- and I think this question goes on somewhere else. But essentially, in a case like this, what will happen is it's not that we will manually penalize a website for duplicate content problems. Because our algorithms generally figure these things out on their own. But rather what will happen is we'll try to fold those pages together appropriately. And if we can recognize that these pages are really exactly the same content, then maybe it does make sense to fold them together. On the other hand, if we recognize that there's kind of a significant, unique value to these individual pages, then maybe we'll treat them as separate pages and show them separately in search. So that's something that's not a penalty per se. We'll fold those pages together when we think that they're essentially equivalent or the same. And it's not something that you would need to worry about from a webmaster standpoint. Because this is more of a technical issue on our side. I bought a new domain, uploaded content, got it into Search Console. Got the error pure spam. Under reconsideration request everything was said to be fixed. And still can't get the website to rank. Doing a site colon query doesn't show any URLs. And the sitemap doesn't show anything.
So if you got a new domain and Search Console says that it's flagged as pure spam, then generally that means that the webspam team looked at the website, probably before you bought it, at the content that was there before. And recognized that it doesn't make any sense to index this content at all. So that's where this pure spam label comes from. If you put something unique and valuable on your website and you go through the reconsideration process, then generally speaking, you should be able to get that resolved. The difficulty there, of course, is if you put the same content back online as it was before, when it was flagged as pure spam. Then the webspam team will look at that and say, well, this is still kind of pure spam. Why should we actually use our resources to try to index this content? So that's something where you have to look at your new website with a critical eye and think about what the webspam team might look at when doing that there. If the reconsideration request goes through, then generally speaking, your content will be indexable. And we'll be able to pick that up and show that in search again. Sometimes this does take a bit of time, though. It's not the case that it happens from one day to the next. Sometimes it takes a week or so, or maybe even a little bit longer, for things to start picking up again. And for things to start being recrawled at a normal rate and indexed at a normal rate. So if the reconsideration request went through and you still don't see anything right away, maybe you just need to be a bit more patient. The other aspect is maybe you have a technical issue on your website there that prevents it from being indexed. So you might want to just double check that, at least from a technical point of view, everything's OK too. What's the best way to handle duplicate content generated by different colors and sizes of a product, between two methods? We kind of went into this before as well.
And I'd recommend folding things together that are really equivalent. And keeping things separate that really have unique value of their own. For certain queries my site gets listed twice. Not sure what's causing this. Does this hurt my rankings? Would it be better to have one page rank? And would it rank a little bit higher? Yes. What generally would happen is if we can fold two pages together and say, well, these two individual pages are actually exactly the same. Then we can combine those signals as well and say, well, this is a much stronger individual page compared to these two individual, separate pages that we had before that kind of had to rank on their own. So if you can let us know that we can fold those pages together and treat them as one, we can combine those signals as well and use that for search, for rankings as well. Let me see what else we have here. In terms of using categories and tag pages, what about that? So we kind of went into this in the beginning, in the presentation, as well. If you have the same content on these category pages, on the tag pages, then we'll take those individual blocks and treat them as duplicate content. If someone is searching for something in that block, we'll try to show the appropriate page there. But it's not something that would cause any penalty for your website just because you have tag pages set up. It just means that we have to do a little bit of extra work on our side in the search results. And figure out, well, the user's actually searching for this piece of text that's on these 10 pages on this website. And we'll try to pick the most appropriate one out of that set and show that in the search results. Let's see where we have some more duplicate content questions. Something maybe we haven't picked up before. Is it true that if you're using the producer's description on your products, it's considered duplicate content? Yes. Like I mentioned before, this block of text will be considered duplicate content.
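As a sketch of folding equivalent color and size variants together, as recommended for the question above, the usual tool is a rel=canonical link on each variant pointing at the preferred product page. The URLs here are invented for illustration:

```html
<!-- On the hypothetical variant https://www.example.com/shoes/runner?color=blue -->
<link rel="canonical" href="https://www.example.com/shoes/runner">
```

Variants that carry genuinely unique value would instead be left as separate, individually indexable pages.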
But the additional value that your site provides is something that can be separate there. And the question goes on, does that mean you'll get hit by Panda and manual actions because your website is lower quality? And as I mentioned before, just because you have duplicate content doesn't mean that the website is lower quality, or that it's problematic, or that the webspam team will say, well, we need to take action on this. So if you're using the same description, but your site provides value of its own, these pages make sense on their own, then that's perfectly fine. Then that block of text might be duplicate content. But that doesn't mean that the rest of your site is irrelevant. All right, maybe we'll just open up for questions from you guys.



AUDIENCE: Hello. I have a question. We have a booking engine, so we provide a listing of things. And sometimes we have talked about that that's also content. But my question is, is Google able to crawl a field to input a city and find offers in that city? In order to recognize the difference of our listings, our results, in our booking engine, versus the other results? To discover that it's remarkably different and provides more value to the user, in order to have a better ranking? Or is that something you have to do manually?

JOHN MUELLER: So you mean, if Googlebot goes to your search forms and kind of tries to search. Sometimes.


JOHN MUELLER: So if we can recognize that we can crawl a website normally, by following the links through the website, then we'll try to do that. And there's some rare situations where we'll recognize, this is actually a search form. And none of the content is really properly linked within the website. So we kind of have to go through the search form and figure out what can we search for? To find links to that content. So that's kind of an extreme situation where I would say, if Google has to use a search form to find your content, then you could be doing a lot better by making sure that Google can kind of follow links normally to find that content directly. So that it doesn't have to guess which queries to enter into your search form.

AUDIENCE: My question is, we are a booking comparison engine for hotels. And we have landing pages with content. But we are investing our effort in doing also good content and something remarkably different in the listings we provide when people search for [INAUDIBLE], I don't know. We're worried about whether Google is able to find that, in order to recognize and evaluate if we deserve better rankings. And not only on our landing pages, plain HTML with text, but also whether it's important that our listings are not the same listings as other typical booking engines for hotels.

JOHN MUELLER: So in a case like that we probably wouldn't treat that as a duplicate content problem, because it's essentially unique content. If we look at the individual words and we take this block of text, then we can copy and paste that and search for it, and it doesn't exist anywhere else. So it's not a duplicate content problem for us. It's more a matter of understanding, well, given this HTML page that we have here, how do we evaluate it? How do we recognize its relevance to individual queries? How do we figure out which queries are relevant here? How do we rank this page among other pages that are similar? So it's not so much an issue for us that we say, well, there's duplicate content or not here. But really just that this is an HTML page that we have to figure out how to rank properly.

AUDIENCE: Thank you.

JOHN MUELLER: All right. More duplicate content questions. Go for it, Mihai.

MIHAI APERGHIS: I have one, but it's not duplicate content related.

JOHN MUELLER: Uh-oh. OK. We'll make an exception.

MIHAI APERGHIS: OK. It's actually regarding the API. So I noticed that you mention in your API documentation that when you group by query, you won't actually show all the queries and all the clicks, since you're trying to protect some user privacy there. And I was wondering, why isn't it the same for Google AdWords, for example? Where every single click and every word is available [INAUDIBLE]. Do you have any idea why that is? Or is it just different policies in the different departments?

JOHN MUELLER: I don't know. I suspect they have a threshold there too. It's something where, if you use the user interface, then those will be filtered out automatically in Search Console. So you don't really recognize that there is this set of queries that is filtered out for privacy reasons. It might be that the AdWords API does a better job of hiding that it's filtering things out for privacy? I don't know the specifics of how AdWords actually handles that.

MIHAI APERGHIS: And is the filtering that severe? Because I'm usually seeing less than 50% of the clicks when I group by query, compared to when I group without it.

JOHN MUELLER: I guess it depends on your website and how people are actually searching. So in the Help Center we have some information about what we're filtering there. I think it's things like, if we can recognize that these are unique queries that are leading to this content, then that's something we might filter. So it depends on whether your website is targeting more the short tail or the long tail. And in some situations you'll have a lot more long tail queries in comparison to all of your queries. So it might end up that this set of queries is a bit bigger than for other websites. But it's not the case that we would say, well, it's always 50% or always this chunk.
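The grouping being discussed corresponds to the dimensions parameter of the Search Analytics query method in the Search Console (webmasters v3) API. A minimal Python sketch of building the request body — the dates are hypothetical, and an actual call would additionally need an authorized API client:

```python
def build_search_analytics_request(start_date, end_date, dimensions, row_limit=5000):
    """Build a request body for the searchanalytics.query method (webmasters v3).

    The API returns at most 5,000 rows per request, and rows for
    privacy-filtered queries are simply omitted from the response.
    """
    if row_limit > 5000:
        raise ValueError("rowLimit is capped at 5000 per request")
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": dimensions,  # e.g. ["query"] or ["query", "page"]
        "rowLimit": row_limit,
    }

# Grouping by query alone vs. by query and page, as in the discussion:
by_query = build_search_analytics_request("2015-09-01", "2015-09-30", ["query"])
by_query_page = build_search_analytics_request("2015-09-01", "2015-09-30", ["query", "page"])
```

Because anonymized queries are dropped from the rows rather than flagged, the per-query totals can legitimately add up to less than the site-wide totals, which matches what Mihai observes.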

MIHAI APERGHIS: Right. And is this something that's about to change or is this permanent? The filtering queries thing.

JOHN MUELLER: I think it's been like this since the beginning. It changed a little bit with regards to what we filter. But it's been like this since the beginning. I think maybe in the beginning we were just better at hiding that we were doing this. But, especially with the API, it's a lot more visible. So I think on one hand it's confusing at first when you see that. But on the other hand I think it's important that as a webmaster you also know this is happening. And that there might be some things that are filtered out there.

MIHAI APERGHIS: OK. And one last quick thing. One of the websites I manage is a pretty big publisher in the United States. So there are obviously a lot of keywords. There's a maximum of 5,000 rows in the API that you're limited to. So when I group by query, I get 5,000. Because obviously you cannot show more. But when I group by query and page, I only get like 1,700 or something like that. So a very small number. So I was wondering if you first group by query, take those 5,000 results, then group by page, then filter out whatever is too--

JOHN MUELLER: I think that's a bug on our side somewhere. So I haven't been able to track that down completely. But if you add a combination of factors, it should be kind of like a combination in the search results. It shouldn't be less than the individual factors. But haven't been able to track down what exactly is causing that or where that's coming from. But we've been looking into that.

MIHAI APERGHIS: OK. Well, here's the URL for the site where I noticed this issue. So.

JOHN MUELLER: Yeah, I've been able to reproduce that for my own site too.

MIHAI APERGHIS: Oh OK. That's about it. Thanks.

JOHN MUELLER: All right. One last duplicate content question, if anyone has anything.

ROBB YOUNG: I'll ask.


ROBB YOUNG: Have you looked at any kind of universal way that can't be cheated to just timestamp content? So that you know for sure who published it first. Something that's not a WordPress plug-in that someone can manipulate and say, actually, I posted this a year ago. But would that even help you to determine the absolutely unique and first publisher?

JOHN MUELLER: So I looked at something like that before joining Google, way back in the day. And the big problem that we ran across there was that oftentimes spammers are more savvy than normal content providers. So we'll see things like spammers being able to get content actually indexed first, ahead of some smaller mom-and-pop site that actually has this content. So that's kind of the tricky situation where, from a technical point of view, the spammer was able to get this content into Google quickly, or into whatever system that's using it. But just because they were able to do that task first doesn't necessarily mean that they're really the original source. And that's--

ROBB YOUNG: No, but if you had-- and I don't want to give away a billion dollar idea here, but if you had, let's say, the 10 largest hosts in the world on board. And you said, this is the code we need you to attach to each timestamp when it's uploaded. We don't care if we then spider the site three months after the first one. By seeing that timestamp that you've agreed upon with Amazon, Yahoo, GoDaddy, et cetera, we'll then know that one did actually come first. And we can incorporate that into our algorithm.

JOHN MUELLER: I suspect you'd have the same problem. I don't know-- if you scrape some small, local websites and you put that content on a legitimate hoster, and it looks like you're not a spammer, which is hard, then it could look like you did that first. So it's a really tricky situation there. But I know this is something that various people are working on, to try to recognize what's the original source of the content, where does this content actually come from, how can we treat the individual websites with regards to who's creating original content or who's just copying content. But it's a tricky problem. There are some sites like Copyscape that try to attack this from another side as well. So if you find a solution to handle this in a way that absolutely works, then you'll be a big millionaire. I don't know.

ROBB YOUNG: If I find the solution, I'll give it to you, if you tell me the answer to my other problem.

JOHN MUELLER: OK. I probably can't make that deal.

MIHAI APERGHIS: John, a quick follow-up on this question, actually. So Robb mentioned earlier a question regarding a small publisher that maybe puts something up. And then a very big publisher picks it up, like the "Huffington Post" example. And you're probably going to show "Huffington Post" first in the results because they're more authoritative. And you think that's the more relevant result for the users. Is the small publisher in any way identified as the original source, and does it get any credit? Or do you just rely on your linking algorithms to make that decision?

JOHN MUELLER: We use a number of different ways to figure out which one to show there. So it's not just the case that we say, well, this is a bigger website and it copied the other guy's content, therefore we'll show it for everything. We do try to take a number of different things into account from that. So it's not quite trivial, but it's also not something that we fold into something really basic.

MIHAI APERGHIS: Right. So even if you show the "Huffington Post" website first, does the small publisher get any value, any credit for that in your internal workflow?

JOHN MUELLER: It's hard to define what you mean with value or credit. But we do try to recognize that they were the first ones, or the original source, of this content. We do try to treat that appropriately. Oftentimes there'll also be a lot of indirect results there too. Where if the "Huffington Post" writes about something that you put on your personal blog, then obviously lots of people are going to go visit your blog and check that out as well. But it's a tricky problem to figure out what is more relevant for the user at that point when they're searching. Are they looking for something mainstream? Or are they looking for the original source? Sometimes it's like, are they looking for a PDF? Or are they looking for a web page that's kind of describing the same content? It's hard to find that balance there. And it's usually not as trivial as saying, well, this content is on this unknown website, and it's also available on this really big website, and it's exactly the same content, so which one do you choose? Usually there are a lot of other factors involved there too.

MIHAI APERGHIS: Right. I was just wondering, if you're showing the bigger publisher there, but you also identify the small publisher as the original source. If you think, OK, so this small website seems to put up really great content. Maybe next time they put out content, we'll take it more into consideration.

JOHN MUELLER: Sure. That can certainly happen too. Yeah.


ROBB YOUNG: That's usually taken care of by linking. If the bigger publisher links back to the original source. Isn't it? As long as everyone's doing their job properly.

JOHN MUELLER: Sometimes.

ROBB YOUNG: Then it's just part of the normal algorithm.

JOHN MUELLER: Sometimes we can just pick it up like that. Yeah. All right. With that, let's take a break here. I set up the next couple of hangouts as well. They're not quite with this exact same rhythm because I'm traveling a bit next week, end of this week, and in two weeks again. So it's somewhat skewed a little bit. But we'll get back to the normal rhythm after this set of Hangouts. I wish you guys a good week. Thank you all for joining. Thanks for all of the questions. And if you have more questions about duplicate content, feel free to ask them in the next Hangout. Or jump to the Webmaster Help Forums and ask there, where there are always people willing to kind of help and jump in. Thanks a lot. And see you in one of the future Hangouts. Bye everyone!


MIHAI APERGHIS: Bye, John. Have a nice vacation.

JOHN MUELLER: Thanks.