Google has officially announced that, from 1 September 2019, it will no longer support the noindex directive in robots.txt files.
In the announcement, it said that “in the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we’re retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019. For those of you who relied on the noindex indexing directive in the robots.txt file, which controls crawling, there are a number of alternative options.”
The robots.txt file and noindex directive give webmasters the power to tell Google which pages it should crawl and which pages it should index, and therefore display in the search results.
- Noindex: tells Google not to include your page(s) in search results
- Disallow: tells them not to crawl your page(s)
- Nofollow: tells them not to follow the links on your page
Noindex (the HTML meta tag on the page) + disallow can't be combined: the disallow blocks crawling, so search engines never fetch the page and therefore never discover the tag advising them not to index it.
Noindex (robots.txt) + disallow was the way webmasters could block both crawling and indexing of certain content. With the new update, SEOs will only be able to disallow content they don't want crawled and indexed before it goes live. For content that has already been published for a while, there are a number of alternative options.
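For illustration, this is roughly what the retiring pattern looked like in a robots.txt file (the paths here are made up):

```
User-agent: *
# Blocks crawling of the section
Disallow: /archive/
# The unsupported, unpublished rule Google is retiring
Noindex: /archive/
```

After 1 September 2019, Google simply ignores the `Noindex:` line.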
- Noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
- 404 and 410 HTTP status codes: Both status codes mean that the page no longer exists; once the URLs are crawled and processed, Google drops them from its index.
- Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google's index.
- Disallow in robots.txt: Search engines can only index pages that they know exist, so blocking the page from being crawled usually means its content won’t be indexed.
- Search Console Remove URL tool: The tool is a quick and easy method to remove a URL temporarily from Google's search results.
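The first option above can be applied in two places. In the page's HTML it is a meta tag:

```
<!-- In the <head> of the page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the same directive can be sent as an HTTP response header:

```
HTTP/1.1 200 OK
X-Robots-Tag: noindex
```

Either form works only if the page is crawlable — pair it with a disallow and crawlers will never see it.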
As the Robots Exclusion Protocol has never been an official standard, there were no definitive guidelines for keeping it up to date or enforcing a specific syntax. Every major search engine has adopted robots.txt for crawl control, and it is now being formally standardised. The first proposed changes came earlier this week:
- "Requirements Language" section will be removed
- Robots.txt now accepts all URI-based protocols
- Google follows at least five redirect hops when fetching robots.txt; if no robots.txt is found after that, Google treats it as a 404 for the robots.txt
- If the robots.txt is unreachable (5XX status code), the last cached copy is used; once it has been unreachable for more than 30 days, Google assumes no crawl restrictions
- Google treats unsuccessful requests or incomplete data as a server error
- "Records" are now called "lines" or "rules"
- Google doesn't support simple errors or typos
- Google currently enforces a size limit of 500 kibibytes (KiB), and ignores content after that limit
- Updated the formal syntax to be valid Augmented Backus-Naur Form (ABNF) per RFC 5234 and to cover UTF-8 characters in the robots.txt
- Updated the definition of "groups"
- Removed references to the deprecated Ajax Crawling Scheme.
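To illustrate the "groups" terminology: a robots.txt file is made up of groups of rules (lines), each group headed by one or more user-agent lines. The crawler names and paths below are examples:

```
# Group 1 — applies to both crawlers named below
User-agent: Googlebot
User-agent: bingbot
Disallow: /search

# Group 2 — applies to every other crawler
User-agent: *
Allow: /
```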
Alongside this announcement, Google also released its robots.txt parser as an open source project. After all, Google has been saying this for years; back in 2015, John Mueller said "you probably shouldn't use the noindex in the robots.txt file".
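Google's open-sourced parser is written in C++, but Python's standard library ships a simple robots.txt parser that can be used to sanity-check disallow rules before deploying them. A minimal sketch (the rules and URLs here are made up; note that `urllib.robotparser` handles crawl rules only, not indexing directives):

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt instead of fetching one from a site
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/about"))         # True
```

This answers "may this user agent crawl this URL?" — exactly the question a disallow rule controls, and nothing about whether the URL ends up in the index.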
You can get more information on robots.txt and robots meta tags in this video from John Mueller.