Google+ Hangouts - Office Hours - 25 August 2015

Transcript Of The Office Hours Hangout


JOHN MUELLER: OK, welcome, everyone, to today's Google Webmaster Central Office Hours Hangout. My name is John Mueller. I'm a webmaster trends analyst here at Google in Switzerland, and part of what I do is talk with webmasters and publishers like you all, like some of those here in the Hangout, I guess. Some of you may be watching this somewhere else. For today, I picked a topic that comes up every now and then. I thought it would be useful to do a brief presentation about it and to focus on it for the questions. I see some of the questions are focusing on it. But I might have scared people away with the topic in that there are not that many questions yet. So there's also a chance for you to get your questions in. But before we get started, if any of you new folks here in the Hangout have any questions, feel free to ask right now. No?

BARUCH LABUNSKI: I have a question.

JOHN MUELLER: All right.

BARUCH LABUNSKI: So can you tell us more about the new Googlebot, the mobile bot?

JOHN MUELLER: It's the same bot, but it just got a new iPhone. So I guess that's what changed.

BARUCH LABUNSKI: So that's pretty much it, yeah?

JOHN MUELLER: Pretty much it. So if you use the testing tools, you'll have the new resolution. It'll use the new settings to crawl for mobile-friendly pages, but other than that it's essentially the same as before.

BARUCH LABUNSKI: OK. Thanks.

JOHN MUELLER: All right, let me find my presentation. Feel free to interrupt if you have any questions along the way, if you want to add something. I'm sure there are some aspects that I missed, but I hope this gives you some insight into robots.txt and how it works, what it does, what to watch out for. All right. So this is my brief guide. So basically I think that the most important aspect of the robots.txt is that you don't really need one in most cases. So if you have a new website, and you don't really know what to do with the robots.txt file, chances are you probably don't need to do anything there. Either it'll be set automatically by whatever system you're using, or you can even live without one. The robots.txt file is used by well-behaved search engines like Google. There are, of course, other search engines that follow this as well. And it essentially tells them where they can and can't crawl. So when they go through your website, they look at the individual pages, follow all the links on those pages, and if any of those links is blocked by robots.txt, then they won't actually look at that content.

Things to keep in mind with regards to blocking crawling. This is something that's often confused, in that blocking crawling isn't really a security mechanism. It doesn't replace a password. It doesn't really replace server-side security. It doesn't replace the need for you to keep your system up to date. So it's essentially just a mechanism to tell search engines that they shouldn't crawl. It also doesn't prevent indexing of the URL. So just because the page is blocked from crawling doesn't mean a search engine won't be able to pick that up for indexing. Essentially, it doesn't know if there's anything there, but it could still show that in search if it thinks, well, very likely there's something here that the user might want to see, but I'm not allowed to take a look at it myself. Also, it doesn't fix canonicalization. So if you have an issue with www or non-www or anything else where you have different URLs for the same content, the robots.txt file isn't the place to fix that. What it does do is prevent crawling, of course. So if your server isn't able to handle a lot of requests to a specific part of your site, that's one good reason to use robots.txt.

So about the indexing part, since that's another aspect that's often confused: we can index a URL without ever looking at the content. We might notice that there are lots of links pointing to a URL. We'll see the anchor text with those links and think there's probably something here that we would want to show in search, but we don't really know what it is. So it can still be indexed alone without the content. It can still show up in rankings. You can still find it with a site: query if you're specifically searching for that. And usually what we'll see is the title based on the anchors that we found, the URL, of course. But we don't have a snippet because we can't look at that URL, so we'll show something like "Description isn't available," because the site prevents us from actually looking at it.

So often we'll hear about people who want to find example robots.txt files so that they can copy and paste them to their own server, because they think this is important. And like I mentioned before, you don't really need one. And if your system-- your CMS-- is set up already, you don't really need to change anything there in most cases.
If you do want to look at another site's robots.txt file, you just take the domain name and add /robots.txt, and you will see what they're using. But again, I strongly recommend not copying and pasting other people's files, because your site is likely very different from theirs. I am going to go into a little bit of detail here. If you want to look at the documentation for this, you can search for controlling crawling, and it'll bring up the developers.google.com page on controlling crawling and indexing, where you have all of the technical details laid out.

The first step, when you do have a robots.txt file and you do want to have it used, is that it needs to be readable. So the robots.txt file is the first thing we look at whenever we try to crawl a website. Before we actually look at the home page, we'll double check that we're allowed to look at that home page. We usually check that maybe once a day, sometimes a little bit more often. And we'll crawl a number of pages during that time. So we'll look at it once in the morning, crawl a bunch of stuff during the day, and look at it the next day to see if it changed. That means if you need to make any urgent changes in the robots.txt file, we might not notice them until we look at it again, which might be a day later. There's a tip coming up on how you can speed that up, though. We look at it on a per-host and per-protocol level, but not per directory. So if you're hosting your website under example.com/mywebsite/ in that directory, then we won't look at the robots.txt file in that directory but rather the one that's directly on example.com.

If you don't give us a clear response to our request for the robots.txt file, we won't crawl at all. So if there's a server error, or if there's a timeout, if we can't reach the server at all, if you're telling us that we're not allowed to look at this file-- all of these things tell us that we don't know where we are. We aren't able to crawl, so we're not going to crawl at all. We'll stay on the safe side. And as far as I know, most search engines handle it like this. But this is something to watch out for. If your site isn't being crawled at all, maybe your robots.txt file is blocked in a way that prevents us from actually looking at it. We will crawl if it's a 200 response code or if it's a clear response code that says this file doesn't exist. And if there's a clear response code that this file doesn't exist, then we'll assume that there are no restrictions at all on the crawling.

In general, the robots.txt file is a simple text file. You can open it up in a text editor. You shouldn't use a word processor for it, but any normal text editor should work. It has different sections in it. They're usually separated with a user-agent line-- user-agent, colon, and some user agent name that you specify-- that says this section is for this search engine or this crawler. Then you have any number of disallow lines-- disallow, colon, and some path that's not allowed for crawling. And you might have allow directives as well, where you say, maybe disallow donotcrawlhere, but allow crawling in some exception that's within that section. So you can add several layers of allow and disallow in there. And sometimes that makes it harder to read and to understand what actually is being blocked here. And finally, you can have a sitemap listing where you say, for this website, I have a sitemap file here. This is, of course, totally optional as well.
But it helps us to crawl and index all the changes on your website just a little bit faster.
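
To make that structure concrete, a minimal file along the lines described above might look like this. The exception path and the sitemap URL are made-up placeholders, not something from the Hangout:

    User-agent: *
    Disallow: /donotcrawlhere/
    Allow: /donotcrawlhere/some-exception/

    Sitemap: https://www.example.com/sitemap.xml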

AUDIENCE: And John, just regarding crawling, you don't crawl all the pages, right? So if I have, let's say, 2,000 pages on my website, you would only crawl, let's say, 1,900 of them. Are you going to get into that later as well?

JOHN MUELLER: So we try to crawl as much as possible from a website, and sometimes we'll crawl all of them, but it's never guaranteed. So we'll try to get through to everything that we can, but we can't really guarantee that we'll be able to make it to everything. So it would be completely normal if only 1,900 of those pages were actually crawled and indexed. It could also be normal that all 2,000 of them are indexed.

AUDIENCE: So even if I fetch it, it's not going to help?

JOHN MUELLER: It's not guaranteed. Let's put it that way. So it does help us to understand that there's something here. But if our systems say, well, this website isn't really worth spending its time on, then it can happen that we'll say, well, we've seen this URL, but we've noticed it hasn't changed, or it looks like an empty page, or it's not really interesting, or it's a calendar page that we've seen up until the year 99 million. And those are all kinds of pages that we can find, but we don't necessarily index them separately.

AUDIENCE: OK, makes sense. Thanks.

JOHN MUELLER: All right. With regards to the lines that you have here with the allow and disallow and user-agents, since you can have multiple of these sections, it's important to try to figure out which one is relevant. For the user-agent section, we have the rule that the more specific user-agent trumps a less-specific one. So if you have a user-agent that finds a really good match there, then it'll follow only that section and it'll ignore everything else. So if you have a section that's for all other user-agents, then we see that as being for other user-agents, not for the one that's matched. Similarly with the allow and disallow lines: we look at the best-matching line, and we ignore the rest. Sitemap lines, as I mentioned before, can be in there as well. You can also link to sitemaps on other domains. And these sitemaps apply to all user-agents. They're not specific to individual ones. So if we look at kind of-- [GARBLED BACKGROUND SPEECH]

JOHN MUELLER: Let me just mute you for one second. So this is an example robots.txt file I made up. We have one section here for essentially all other user-agents, so with the asterisk. And they're not allowed to crawl anything with /chicken/. Googlebot is not allowed to crawl anything with /mouse/, except for this section, /mouse/for-computer, for example. And a more specific kind of Googlebot, the interstellar Googlebot, isn't allowed to crawl anything on /pluto/ and not allowed to go to /*/europa/. And the sitemap line here looks like it belongs to this section here, but it's actually generic for all crawlers that go to this website. So in this case, if maybe Googlebot for image search were to come along to this website, it would go through here and see, well, this Googlebot interstellar doesn't match. This Googlebot line looks like a good match, and this asterisk line would also match, but the Googlebot line here is the most specific one that matches Googlebot for image search, so it would only use this section. So anything that's here, maybe in the chicken section of the website, that could still be crawled because it's not explicitly disallowed in the section for Googlebot.
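
Since the slide itself isn't visible in the transcript, here is a reconstruction of the example file John describes; the exact user-agent token for the "interstellar" Googlebot, the comment wording, and the sitemap URL are assumptions:

    # robots.txt for example.com - last changed 2015-08-25
    User-agent: *
    Disallow: /chicken/

    User-agent: Googlebot
    Disallow: /mouse/
    Allow: /mouse/for-computer

    User-agent: Googlebot-Interstellar
    Disallow: /pluto/
    Disallow: /*/europa/

    Sitemap: https://www.example.com/sitemap.xml

As John explains, a crawler like Googlebot for image search would fall into the Googlebot section, the most specific token that matches it, so /chicken/ would still be crawlable for it.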

AUDIENCE: So, John, basically more specific user-agents do not inherit less-specific records, right?

JOHN MUELLER: Exactly. Yes. So if you have a set of URLs or folders that you want to block for all user-agents and you have some specific rules there, then you have to copy that set of directives to the more specific set as well. Matching URLs within those directives: we do the best matching by number of characters. That's something where, if you have a slash at the end of the URL or at the end of the directive, that can be another character and make it more specific. We don't explicitly look for folder names. So if you mention /folder as a directive, then that can also match /foldername/. So it's something where this part of the text is actually within there, so it would match. If you want to only match things that are in a specific folder, then you need to add a trailing slash. The directives here are case-sensitive. So if you have a file name in uppercase or lowercase, then depending on how that's blocked in the robots.txt, it might or might not be indexed. The asterisk is a character we saw briefly here, and it matches any number of other characters. So in this case, it's essentially disallowing anything as long as it has /europa/ at the end of that URL or somewhere within the URL. And the dollar sign matches the end of the URL. So if you explicitly want to match something that's right at the end of the URL, then you could use a dollar sign there.
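
As an illustration of those matching rules, here are a few hypothetical directives and how they behave; none of these paths come from the Hangout:

    Disallow: /folder      # prefix match: also blocks /foldername/ and /folder.html
    Disallow: /folder/     # trailing slash: only blocks URLs inside /folder/
    Disallow: /Private/    # case-sensitive: /private/ would not be blocked
    Disallow: /*/europa/   # * matches any characters, slashes included
    Disallow: /*.pdf$      # $ anchors the match to the end of the URL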

AUDIENCE: John, if you can go back to the previous slide. Since you're using kind of a [INAUDIBLE] attempt there. So the line disallow /*/europa-- does the star also include any other slashes that might be in between?

JOHN MUELLER: Yes.

AUDIENCE: OK. So it might be /solarsystem/-- I don't know, Jupiter, I think-- or is it Saturn? I'm not sure-- /europa?

JOHN MUELLER: Yes, exactly, yeah.

AUDIENCE: OK.

JOHN MUELLER: OK, some of the things that we ignore in the robots.txt: we usually ignore the order. Well, we always ignore the order in the file. So you can order them in whatever order makes sense for you, makes it easier for you to read. I know some other robots.txt parsers look at the order and try to figure out which one is more relevant based on the order in the file, but from our point of view, we don't watch the order. We ignore any unknown directives. So, for example, crawl-delay is something that some other search engines use. I believe Bing uses that. We ignore that. Also, if you have any other kind of random characters or anything that you forgot in the robots.txt file, we'll ignore that. The downside to this is also if you have a typo in the robots.txt file. So if you go back here and you have allow with one L or without the A, or maybe disallow with a dash in it, then that's something where we would ignore that line. We also ignore a single byte order mark for UTF files. So if you save a file in UTF format, and you have the BOM at the beginning of the file, we'll ignore that. If you have two of them at the beginning of the file, then we'll assume that the second one was meant to be there as a character within the file. So that might be something to watch out for. I haven't seen any issues with that in a really long time, though.

Best practice is, as you can imagine with the different orders of precedence within the file, to try to keep it as simple as possible so that you can look at the robots.txt file and easily understand, is this specific URL blocked or not? We recommend using UTF-8 encoding in the file. You can also escape the URLs if you want. Sometimes that makes it a bit easier. One recommendation I have is to include some kind of versioning information as a comment. So I did this in the example here in that I mentioned this is for this domain and maybe this change date when this file was changed. The reason I do that is because it makes it a lot easier to diagnose if something goes wrong. So if you accidentally push your development server's robots.txt file to your live server, or if you have some kind of a problem with your CDN where maybe sometimes the older version is being served and sometimes the newer version, then with the version information as a comment, you can easily see which version of the file is being served without reviewing the contents of the file completely.

In Search Console, we have a robots.txt testing tool, which is really neat. It checks the validity of the robots.txt file to see if it complies with the directives that you have. You can see the old versions of the file on top. So it says here "latest version." You can click on this to get a drop-down of the previous versions that we've seen, especially if there were changes within the robots.txt file. You can double check to see, is Googlebot always seeing the right version, or is it maybe swapping between two different versions? Or was the wrong version accidentally uploaded on Friday and left until Monday, when it was fixed again? So you can double check that. You can also submit individual URLs to check them to see if they're crawlable. And then you can directly edit this file. You can then download it, upload it to your server, and submit it. If you submit it with the robots.txt testing tool, then we'll pick that up essentially immediately. So you don't have to wait that extra cycle until the next day when we've recrawled the robots.txt file.
So that can help if you need to make urgent changes in the robots.txt file. Another neat tool we have in Search Console is the blocked resources tool. It shows you across your whole site which parts of the content are blocked by robots.txt. And that gives you a bit more insight into, are you blocking maybe specific CSS files that are used across your website? And the main reason why we give you this information is because if these files are blocked, then we can't use them for rendering. We can't use them for indexing either, so we can't see what content you're pulling in there. So we really recommend unblocking those resources as much as possible, especially if they add to your pages, be it design-wise or content-wise. One reason we have that is we recently sent out a message to a lot of sites saying we detected an issue in that we can't access your JavaScript or CSS files. And, hey, what's up? Wouldn't it be nice if we could actually index your content properly? And the reason here is really because of embedded content on these pages that's blocking us from actually being able to view the whole page the way a browser would view it.

So here's an example that I pulled up where this can really be a problem. So this is, for example, a chain, I think, of restaurants or some chain of businesses that has locations in lots of places across the world, or at least across the US. And they use the Google Maps API to show the individual locations. So the address is here, phone numbers. They have a tab for the opening hours, where you can actually click on that within Google Maps to get the opening hours for this individual location. And the problem here is that, at least when I last checked this, since it uses the Google Maps API and the Google Maps API is blocked by robots.txt, we can't actually pick up the individual locations of this business. We can't pick up the opening hours. We don't really know what to show in search if someone is searching for this specific business in this specific location, because we don't have that information at all. So that's something where embedded content that comes in from other websites and is being blocked, where it does provide significant value to your page's content, to your business perhaps, is causing problems on your website. So that's something where it might be worth looking at your website to see if you have similar issues where you're pulling in content from other sources, be it with an API or a JavaScript file or some kind of a shared server response, and where that content is actually blocked by robots.txt.

So one thing that you could do here is work around this limitation of the Google Maps API and also include these locations in static HTML on the page, also include the opening hours for these locations in static HTML, so that we can crawl those pages. We can render them. We'll see the map that we can't actually crawl because it's blocked by robots.txt, but we'll find all of the content that is also shown in the map within the HTML. So that goes into this as well. A good way to check for embedded resources that are blocked is to use the Fetch as Google feature in Search Console. It'll tell you which of the embedded resources are blocked. And you can go through those resources and try to figure out, does this significantly affect the content or the design of the page? And if so, maybe contact the owner of that site to see if that can be unblocked.
We'll probably be going out and trying to reach out to some of the bigger sites that have content that's embedded on other people's sites so that we can make that more easily crawlable and indexable. But you can also do that on your site, especially if you're working together with maybe a local business provider, a local business directory that stores the opening hours for lots of businesses and pulls them in with JavaScript. If the content is from embedding, then we won't index it separately. So it's not that you have to fear that we'll crawl and index your CSS files and suddenly they'll rank for your company name. If we know that this content is just from embedding, we'll just treat that as a part of what we use for rendering but not for search indexing. And if you want to make sure that we don't index it, you can always use the X-Robots-Tag HTTP header. There are some simple ways to apply that across a whole website and say, all of my CSS files allow crawling but block any indexing of them. All right. I think-- yes. That's the end. You made it to the end. Wow. That was pretty long, actually. All right. Any questions to the presentation or any parts in there, anything specific? None yet? OK.
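
For the X-Robots-Tag approach John mentions at the end, one common way to apply it across all CSS and JavaScript files is at the web server level. This is only a sketch, assuming an Apache server with mod_headers enabled; other servers use different mechanisms (nginx uses add_header, for example):

    # Allow crawling of CSS/JS (don't disallow them in robots.txt),
    # but keep the files themselves out of the index.
    <FilesMatch "\.(css|js)$">
      Header set X-Robots-Tag "noindex"
    </FilesMatch>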

MIHAI APERGHIS: Well, I have a few, but I wanted to give a chance to everybody else. But since no one--

JOHN MUELLER: Go for it.

MIHAI APERGHIS: Let me try to share something. Let me see if this works. Now can you see this?

JOHN MUELLER: Yes.

MIHAI APERGHIS: OK, so this is a problem that I've seen a lot of people on the product forums report. So the asterisk character, you said it matches any number of characters, but it only matches file names, not folders, when you're talking about files. So I've seen that if you have disallow /europa, that europa folder, and you want to allow the JavaScript files inside the europa folder, you need the second allow line, not-- the first one wouldn't work, right?

JOHN MUELLER: Yes. But it's not because it's in a folder. It's because the second line is shorter. The *.js is shorter than /europa/.

MIHAI APERGHIS: Oh. So it's matching first based on the specificity, so the number of characters?

JOHN MUELLER: Yes, exactly.

MIHAI APERGHIS: I see. So the first allow line wouldn't basically allow every JavaScript file on the server. If there's any-- what if I delete this one and only leave it like this? Would this work?

JOHN MUELLER: Then it would disallow-- or, well, it probably wouldn't do anything except for disallow europa, because all files are by default already allowed. And any JavaScript files within europa would be disallowed because of the europa directive. The allow line there probably wouldn't do anything.

MIHAI APERGHIS: OK, so it has to be like this to allow the JavaScript files inside the europa folder.

JOHN MUELLER: Yes, exactly.

MIHAI APERGHIS: And a second question: is there any difference between these three, the first three allow lines?

JOHN MUELLER: The first two are equivalent. So we assume that URLs start with a slash because they have to, and the one with the dollar sign at the end means that it has to end with .js. So sometimes we'll see things like versioning that you do with a question mark and a parameter with a date or a version number. So you'll see-- I don't know-- styles.css?version= whatever. And that wouldn't match, of course, if you have a dollar sign at the end.

MIHAI APERGHIS: So something like this, you're referring to that a lot of CMSes are using?

JOHN MUELLER: Exactly.

MIHAI APERGHIS: And what about this line if I don't use the star sign?

JOHN MUELLER: Then that's only URLs that start with .js.

MIHAI APERGHIS: Oh, OK.

JOHN MUELLER: Probably usually none, I would imagine.

MIHAI APERGHIS: OK. OK, that's about it. Thanks.
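
For readers who can't see Mihai's shared screen, the pattern being discussed is roughly the following; the folder name comes from John's slide, and the rest is reconstructed from the conversation:

    User-agent: Googlebot
    Disallow: /europa/
    Allow: /europa/*.js
    # Allow: *.js (or the equivalent /*.js) on its own would not help here,
    # because it is a shorter match than Disallow: /europa/, and the
    # longest matching rule wins.
    # Allow: /*.js$ additionally requires the URL to end in .js, so it
    # would not match something like /script.js?version=3.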

JOHN MUELLER: All right. Otherwise, I'll go through the questions that were submitted. And it doesn't look like we have tons of them, so we'll still have time for more questions if anything comes up along the way.

"Your guidelines say that Googlebot could not honor a disallow directive if a page contains a +1 button. Is this rule still valid? Are there other cases in which Googlebot would not respect the disallow directive?" I don't think that rule is really in place. So the tricky part with the +1 button isn't so much that we would crawl the page, but rather that, based on clicking the +1, the user, or whoever is using the web page, is telling us that they think this is a good page, and then we'll fetch that page using whatever Google+ uses to fetch pages. And we'll pull out the title and the description, the snippet, and show that on the user's +1's page, which is something that can get crawled and indexed. So essentially it's not that we're crawling and indexing the page itself for search, but we're looking at the page based on the user's click, and we're pulling out the title and description and showing that on the user's page. So if someone is searching for that title or that description, then potentially that user's page, with the listings of +1's, could show up in search. So it's a bit of a tricky situation, kind of like if you disallow crawling of one page and someone else copies and pastes it to a different page-- then obviously we don't really know that this is the same content, and we could be indexing the content that way.

"If example.com/robots.txt is a 301 redirect to example.com, can it cause confusion to Google, or will Google just treat it as if there weren't any robots.txt file?" We'll actually follow that redirect and see that it points at the home page. And if the home page is normal HTML, like most home pages are, I guess, then we won't find any relevant robots.txt directives on the page, and we won't find anything that we can use for the robots.txt. So in practice, we essentially ignore that. Theoretically, you could think of a situation where someone puts robots.txt directives on their home page. And we pick those up and we think, oh, this HTML page actually has content that matches the robots.txt, and the robots.txt file redirects there. So we could pick that up for crawling and indexing. You can also 301 redirect to a different domain. For example, if you're moving your site from one domain to another, you would redirect your robots.txt file. And then we would take the final robots.txt file that we end up on as the robots.txt file.

Following up to, I guess, a different question: "If I disallow the folder in which my CSS and JavaScript files sit but allow .js and CSS files, won't Googlebot still honor the disallow, as you always honor the most restrictive option if there are any contradictions?" This is kind of like Mihai's example, in the sense that we look at the most specific directive that matches the URL we're trying to crawl. So it's not that we look at the most restrictive one, but rather that we look at the most specific one.

As a follow-up to another question: "If indexed pages become blocked in robots.txt, are they ever dropped from the index, or do they need to be recrawled to read a noindex tag to de-index them?" Let's see.
If an indexed page becomes blocked by robots.txt, then usually what will happen is we'll drop the information that we have from previous crawls, and we won't use that for search, for crawling, indexing, and ranking. So we would lose the information that we used to have there, and we'd just index the URL, maybe with that generic snippet saying we don't know what this page is about. But sometimes that also means that we just drop them from search completely. If we think, well, this URL, without any additional information that we have about it, without any links pointing to that URL, doesn't really have any context that we could use for ranking anyway, then maybe we'll drop that from search. So they could theoretically be dropped from search, but it's not guaranteed. On the other hand, if you add a noindex tag to those pages and you allow crawling so that we can look at the noindex tag, then that does tell us explicitly that you don't want this page indexed at all. And then we'll drop that from search essentially the next time we process that URL properly.
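
A minimal sketch of the noindex route described here: the tag goes in the page's HTML head, and the page must not be disallowed in robots.txt, otherwise the tag is never seen.

    <!-- In the <head> of the page you want dropped from the index -->
    <meta name="robots" content="noindex">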

MIHAI APERGHIS: Can somebody use the remove URL tool if the robots.txt isn't-- is allowing that?

JOHN MUELLER: Yes. Yes. So you can use the public removal tool if something is blocked by robots.txt. It'll be removed, I think, for 90 days from search. But it's something where, if you use the public tool, you have to do that on a per-URL basis. So you can't say, oh, everything in this folder should be removed from search. It would have to be on a per-URL basis. And if this is your website, and you use the tool within Search Console, then you can say anything on your website should be removed from search, and it doesn't have to be blocked. It doesn't have to have noindex on it. We'll essentially trust your input and say, well, you said you're the owner of this website, and if you don't want this indexed, then fine, we'll take that off search. That's also removed, I believe, for 90 days. And then if we recrawl it and actually do find content there, we'll bring that back in search.

"Which one is the recommended method to block a page-- robots.txt or the meta noindex tag? Sometimes we see a page is blocked in robots but still crawlable due to an index, follow command." So in order for us to see the robots meta tag, we have to crawl the page first. So if a page is blocked by robots.txt, then we won't know what the robots meta tags are. We won't know if it says allow indexing or don't allow indexing. We can't know about that. We can't guess that, essentially. So if you have something specific in your meta tags that you want Google to follow, make sure that we can actually crawl that page so that we can see that content. In that sense, the robots.txt overrides any meta tags, in that if we can't crawl, we don't know what's there. We can't follow those meta tags.

"I got a warning for a BOM yesterday, so this is for the Unicode-based files." I believe we can only show that in the testing tools so that you're aware of that. But it's not something that would block processing of the robots.txt file.

"How to give access to crawl JavaScript and CSS files in the robots.txt file?" We looked at that briefly before with Mihai's great example, in the sense that you generally look at where those files are hosted and why they're blocked. A really interesting way to find out why they're blocked is to use the robots.txt testing tool. You can enter the URL that you want to have checked for your site at the bottom, and it'll highlight which line is actually blocking that URL from being crawled. So that makes it a little bit easier to figure out where you need to make adjustments in your robots.txt file.

"When will you parse a robots.txt file with a BOM properly? Most engineers don't understand the niceties of ASCII versus UTF-8." Yeah, like I mentioned before, we do follow robots.txt files if they have one BOM within the file, so at the beginning of the file. But if you have two of them, then that's something that is essentially not a part of the-- I guess the spec, and we would think that the second one is actually part of your content. But one of them is fine. We should be able to handle that perfectly.

Wow, made it through all the questions. Let's see what we have in chat. Nothing crazy. All right. More time for you all.

BARUCH LABUNSKI: So is this time specifically to ask away about robots until the end of the hour, or can we ask anything else?

JOHN MUELLER: Let's see if we have more robots.txt questions first, or robots meta tags.

BARUCH LABUNSKI: OK. I got a-- but I'll let Claire go ahead.

CLAIRE: Thanks so much. John, one of our clients has added a crawl delay in the robots file. What is it used for? Why do we still use it? Or should we take it off of the robots file?

JOHN MUELLER: What did they add again? I didn't understand that.

CLAIRE: A crawl delay.

JOHN MUELLER: Crawl delay. OK, that's something that I believe other search engines might use. We ignore that. So that's essentially a line that tells a search engine, or the search engines that support it, that they should wait so many seconds between individual crawls of a URL. But we've noticed that most sites, when they implement this, implement it in a very broken way that would actually prevent crawling of a lot of their websites. So we tend not to use that information. I don't think we use it at all. And I don't think we've ever really used that.
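
For reference, a crawl-delay line typically looks like the following in a robots.txt file. Google ignores it; the user-agent name and the value here are just an illustration of how engines that do support it are usually addressed:

    User-agent: bingbot
    Crawl-delay: 10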

CLAIRE: Perfect. Good.

AUDIENCE: Hey, John. We have a webpage which is blocked in the robots.txt, but that page has highly-- got backlinks from [? created ?] websites. So in this case, will Google [INAUDIBLE] a robot and then index the page, or just try to become and index because of more index meta tags?

JOHN MUELLER: In a case like that, it's very likely that we would index the URL. We wouldn't be able to crawl it, of course. We wouldn't see what's actually on this page. But we'll see, based on the links to that page, which anchor text was used. We can use some of that for a title that we can show in search. We can use some of that to figure out where we should rank this page. Is this a really important page, or is this just one of a million other pages that we've seen that are blocked by robots.txt? So it's not guaranteed that we would index it like that, but it's I'd say very likely.

AUDIENCE: OK. And if the same pattern is blocked in the URL parameters tool in Webmaster Tools, how does Google treat it-- like, is it robots, noindex, or the URL parameters? What is the priority on that?

JOHN MUELLER: The parameter-handling tool essentially is a signal for us. It's not a directive that would say we always must follow the directive of the information in the parameter-handling tool. But rather, you're saying, oh, Google, this is probably not that important. You can probably skip this. And we might still look at those URLs. We might still double check to see if we're missing anything. But it's not a complete block like the robots.txt would be.

AUDIENCE: OK. Thank you.

BARUCH LABUNSKI: So if you have any errors in the sitemap, does that impact the ranking?

JOHN MUELLER: Errors in the sitemap. Not really. So the sitemap file would tell us which URLs exist on your website, which ones have recently been changed or added. And if you have errors in the sitemap file, then obviously we won't get that additional information. So maybe we won't be able to pick up the new or the changed URLs that quickly, but it's not that your website as a whole would rank differently because of errors in the sitemap.

BARUCH LABUNSKI: OK.

MIHAI APERGHIS: OK. I'll go ahead now. First of all, do you take into account anything other than the anchor text of links when you're trying to decide whether you should rank a page that's blocked by robots.txt for a certain query? Anything else you can--

JOHN MUELLER: I think we really mostly look into the anchor text of the links, the type of the links that we find for those pages. Sometimes what we'll find, for example, is that government sites might not be hosted on a very strong server. They might say, oh, well, Googlebot provides an extra load on our server, and we don't want that. So they'll block individual pages or maybe the whole site from being crawled. But that's still a very important site for us. So we still need to be able to figure that out and to show that in search.

MIHAI APERGHIS: OK, but you can't really be more specific about what you take into account? Just as long as there's nothing on the page. So it's not like you'll just take the type of meta tag but not look at what's in the body content?

JOHN MUELLER: No. No. I mean, if we can't crawl it, we can't look at the content of the page. So that's something where, if you see Googlebot crawling something that's blocked by robots.txt, then that would be a really critical bug on our side, and that shouldn't be happening. So that's something that the Googlebot team takes really seriously. They'd probably wake someone up and say, hey, you need to double check what's happening here. Most of the cases where we've seen that happening are actually situations where the webmaster did something wrong, or the wrong robots.txt file was suddenly visible, those kinds of things. But we would take it really seriously if Googlebot were to peek at a page and say, well, I won't use this for indexing, but let me just peek at the page's content and maybe use some of it for ranking.

MIHAI APERGHIS: Just to make sure. But do you also take into account internal links? So do you also look at the internal links of the site?

JOHN MUELLER: Sure. Sure.

MIHAI APERGHIS: If the anchor text is just Here or Click Here, what will you show in the title tag in the results?

JOHN MUELLER: Probably, Click Here. I mean, you can try this out. I'm sure there are pages that are ranking like that, where the content is blocked from crawling, and we just rank for some terrible anchor text. Or we don't really have that much more information. I am sure there are lots of pages like that. Go ahead.

BARUCH LABUNSKI: I wanted to ask about a server issue, because we touched on it a couple of Hangouts ago. So if I have a server issue where a client is on a shared server, and it's still a reliable host, but basically the server hasn't been working for the past 24 hours, and so the page is no longer there, and then I fetch it again, that causes a ranking problem, right? Because it happened to a client where the hosting wasn't working correctly, and the site was on page number six, and now it's on the second page, number five.

JOHN MUELLER: It's hard to say. In general, if we can't reach a server, then we can't get the robots.txt file, and we won't crawl anything else from the site. So that's also an extra protection in that if your server goes down, and we can't reach your robots.txt file, we're not going to hammer the server even more to try to get that content. Because if we can't get the robots.txt file, we won't be able to look at it anyway. So that's something that might be happening there with regards to ranking. Usually that's more of an indirect effect. So I don't believe we take into account any kind of server errors that happen on a website when it comes to ranking. Because this is a technical problem that can happen to all websites. It doesn't mean that the website is less relevant for users. But there might be indirect effects there. So if a large part of your website is dropped from search because we can't crawl it anymore, then obviously we won't be able to follow those internal links to the rest of your content, and we won't be able to rank that properly. And it might be that it does drop in ranking a little bit, because we lose track of the context of that individual page. But as we can recrawl and reindex the rest of the pages that were blocked, that were dropped, then that should pick up automatically. It's not something that lingers on where Googlebot has a grudge because your server went down.

BARUCH LABUNSKI: OK. What do you think Googlebot likes more, the meta tag or the actual file? What's better to use?

JOHN MUELLER: For what?

BARUCH LABUNSKI: Having the meta robots file or just having an actual meta tag on the actual page? Is it better to have [INAUDIBLE] on the server or--

JOHN MUELLER: Is that with regards to the no index?

BARUCH LABUNSKI: Yeah, the no-index, no-follow.

JOHN MUELLER: No-index, no-follow. Well, we don't use that in robots.txt. So from that point of view, it's always on the page. But you can put it into the HTTP header. That's an option. Essentially, with the robots meta tags, we handle it a little bit differently than the robots.txt file does. So with the robots meta tags, we look at the most restrictive one, and we follow that. So if you have, for example, index follow in your HTTP headers for the robots tag, and you have no-index no-follow in the robots meta tag, then the more restrictive one will trump the less restrictive one there.

BARUCH LABUNSKI: OK. No, it's just because there's also global-- if the site is-- if there are other search engines around the world that still use that. So that's why I'm asking.

JOHN MUELLER: Yeah. We used to support the no-index directive in robots.txt as an experimental feature. But it's something that I wouldn't rely on. And I don't think other search engines are using that at all.

MIHAI APERGHIS: Is there a difference between using the robots.txt file and using a nofollow attribute, or a rel nofollow attribute, on a certain link?

JOHN MUELLER: Well, the robots.txt file prevents us from crawling that page at all. And the nofollow just means that, for that specific link, it won't pass PageRank. So it's kind of similar in the sense that we won't be able to get there, but just because that one page has a nofollow on it doesn't mean that we won't find any other pages that have a PageRank-passing link.

MIHAI APERGHIS: Right. But assuming that is the only page that links to a certain page, it would really have the same effect. You won't be able to get to that page.

JOHN MUELLER: Yeah. I mean, it's like if you have a rel nofollow on an individual link, and that link is the only link leading to that page, then it's a similar situation, where maybe we would know about this link, this URL, and we could index it separately. But we don't have any signals that we can associate with it, so it's not really going to show up in search.
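
A quick sketch of the link-level nofollow being discussed; the URL is hypothetical. The attribute only affects this particular link, so the target can still be reached through other, followed links.

    <a href="https://www.example.com/some-page" rel="nofollow">some page</a>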

MIHAI APERGHIS: OK, one last thing. This is since you talked a lot about crawling. What would you say is a good architecture for a site to have so that the crawling is as efficient as possible, and Google doesn't waste its whole budget on something that's not really [INAUDIBLE]?

JOHN MUELLER: Anything where you have clean links between the pages and where you don't have multiple URLs leading to the same piece of content. I think those are essentially the main things that really help us there in the sense that we can follow those links really easily, we can find all of the content, and we don't run across a lot of duplication that we have to drop in the end.

BARUCH LABUNSKI: [INAUDIBLE], right?

JOHN MUELLER: Sorry?

BARUCH LABUNSKI: Navigation is key.

JOHN MUELLER: Yeah, I mean, that's the internal linking within the website. That really helps us to find all of the content and understand its context a little bit better. So this shoe is within this section of men's shoes, or blue shoes, or whatever. And then we understand the context of that page based on the internal linking.

MIHAI APERGHIS: Right. But I assume it's not really good to have a-- you have a 100-page website, but it takes seven or eight clicks to get from the main page to one of the other pages. A flatter architecture is usually better than something that runs really deep.

JOHN MUELLER: I think you'd probably focus more on the user side there, on the user experience side, where anything that's easy to find for the user is something that's more useful for them. So from that point of view, those two goals align really well in that if you think about the user, then chances are you'll probably get it right for the search engines too.

MIHAI APERGHIS: OK. And since I'm working with a lot of e-commerce websites, and e-commerce websites usually focus on having some sort of faceted navigation with a lot of filters, the combination of filters sometimes creates a lot of extra pages that aren't really useful for the user to find in search results, for example, especially if the e-commerce site already has some landing page or category that already helps the user. So you might have a 1,000-page website, but the filters create another 50,000 pages. Would you recommend going with robots.txt, or maybe noindexing or canonicalizing those pages? What would be your best approach?

JOHN MUELLER: I'd rather try to use no-index, rel=canonical, no-follow to prevent that. The thing with using robots.txt for canonicalization is that whenever we find links to those pages, any of the PageRank that we would pass through that link is essentially lost. So it goes to a URL that's blocked by robots.txt. We have no idea what it is. We don't know if it's a duplicate, if we can match that, and it's essentially lost. So if someone externally links to those pages, that link doesn't really help your website. Or if you do that internally, and you say, well, all the great shoes are in this section for special deals for blue shoes for men, that's something that essentially gets lost. So with the canonicalization, with no-follow links, no-index, all of that really helps us to understand, OK, this page we can crawl. We see it's a duplicate of an existing page that we have. Therefore, we can fold all the links into that version that we do keep and really help the site in that way. Whereas if it's blocked by robots.txt, we have no idea what it is. We don't really know what to do with that link.
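
As a small sketch of the canonicalization option John prefers for filtered pages, a faceted URL can point back to the main category page; the URLs here are hypothetical:

    <!-- On a filtered URL such as /shoes/men/?color=blue&size=42 -->
    <link rel="canonical" href="https://www.example.com/shoes/men/">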

MIHAI APERGHIS: OK. So as long as it's not really excessive in the amount of extra pages you're creating with the filters, canonicalization, or no-indexing those pages is a better approach?

JOHN MUELLER: Yes. Definitely.

MIHAI APERGHIS: [INAUDIBLE] my clients well.

AUDIENCE: Yeah, John, so if a page has noindex, follow, and it has a canonical to a third-party page, the purpose of that is to ensure that it's followed from here but passes whatever value it has to a different page. Does Google understand that pattern-- noindex, follow, and follow the canonical?

JOHN MUELLER: So that's, I think, a special situation. In that case, what we would practically do is just follow that canonical and use that. But the tricky part there is that with the canonical, you're telling us these pages are equivalent. And if actually one of the pages is indexable and the other page is noindex, then they're really not equivalent at all. So that's something where you're giving us conflicting information. You're saying, these pages are exactly the same, but actually, this page is very different because it should be indexed, and this one shouldn't be indexed at all. So as much as possible, I'd really recommend being as clear as possible with the directives, and say, this is indexable; this isn't indexable; this is indexable and has a canonical pointing at a different page.

AUDIENCE: Yeah. I think the situation comes because of some [? corporate ?] issue, such [INAUDIBLE] effects in Google, right? So we are ranking really well, but because the [? corporate issue ?] was raised, we thought to ensure that, you know? We want to pass all the juice from here to page B, but at the same time remove it from the Google index as per the [? corporate issue. ?]

JOHN MUELLER: Yeah. I try not to combine those two.

AUDIENCE: Yeah. Yeah. Makes sense. Also, one thing: so for the apps, do we still use the same robots.txt that we have for the mobile site or for the web?

JOHN MUELLER: For apps?

AUDIENCE: Yeah.

JOHN MUELLER: I think we do that on a host-name basis. So depending on which host name you have linked or use for embedded content, we would do it on that.

AUDIENCE: OK. [INAUDIBLE]

JOHN MUELLER: Yeah, exactly.

BARUCH LABUNSKI: And John, by the way, your competitor does support robots meta tags.

JOHN MUELLER: Robots meta tags in the robots.txt file?

BARUCH LABUNSKI: No-index, no-follow.

JOHN MUELLER: Well, those are meta tags, yeah. We do that.

BARUCH LABUNSKI: OK. I can just place it on the home page, yeah?

JOHN MUELLER: Sure. All right. With that, we're just out of time. Thank you all for dropping by. Thanks for wading through my presentation. I hope you found it interesting. And maybe we'll see some of you again in the future Hangouts.

AUDIENCE: Bye, John.