
Delete pages from Google: How to block and remove content

The basics of search engine optimization: Blocking content for search engines

Website operators do not always want all the content of their website to be searchable and findable on Google & Co. But how can website owners delete content from Google or block pages from being indexed? And why would they want to do that at all?

A specialist article by Markus Hövener


The web crawlers from Google & Co., also called spiders or robots, are not necessarily selective when indexing a website: they often start on the homepage and follow almost all the links they find there. When crawling the individual pages of a website, they also pick up content that is not relevant for search engines or that is simply not meant to appear in the search results. Such content can be blocked, or deleted after the fact.

Reasons for blocking websites or individual pages on Google & Co.:

  • When developing or relaunching a website, the content of the new website is often set up under a test domain (e.g. test.mywebsite.de). In order for Google not to index the new content in the development phase, such websites are usually blocked from search engines.
  • It is possible that there is certain content on a website that is not relevant for search engines (e.g. traffic reports). If these are not protected by a login, they should be blocked.
  • Some websites have content in two forms: as an HTML file and as a PDF file. In this case, you can block the PDF files for crawlers to avoid duplicate content.

Benefits of blocking content:

  • When Google picks up a lot of irrelevant content, this ties up the search engine's "crawler energy". A crawler can only index a certain amount of data per day, and space in the index is limited as well. Whoever "floods" the index with irrelevant data therefore has a lower chance of getting the really important content into the index in a timely manner.
  • The latest Google updates have shown that the search engine evaluates a website as a whole, across all of its pages. Anyone who carries a lot of irrelevant content runs the risk of a poor overall rating and therefore poor rankings for their content.
  • Duplicate content can also cause problems, even if not in the form of a penalty, as is often assumed. If, for example, all content is offered both as HTML and as PDF files, it is possible that the search engines list the PDF files in the search results. This often makes for a poor user experience, but it also shows up in the traffic figures, because many web analytics services do not record a PDF download.

In many cases it makes sense to block content. But what options do website operators have for restricting search engines and their web crawlers when it comes to indexing? And which option should be used when?

Block content via the robots.txt

Website operators have been able to block content for search engines via a central file, the robots.txt, since 1994. This text file is located in the root directory of a website and can be accessed via the URL http://www.meinewebsite.de/robots.txt. Search engines generally retrieve this file first, before they visit any other page of a website.

 

Even if some SEO tools flag the lack of a robots.txt file as an error, that is certainly not true: if no content is to be blocked, a robots.txt file is simply unnecessary. In any case, the robots.txt should only be used to block content. Fortunately, the bad habit of stuffing this file with search terms died out a long time ago.

The "User-agent: …" line also offers the option of issuing different blocks to different search engines, so that Googlebot, for example, might be given different rules than the Bing web crawler; an engine-specific sketch follows below. In the typical entries listed further down, however, the blocks apply to all search engines ("User-agent: *").
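As an illustration of such engine-specific rules, a robots.txt file could contain separate groups for individual crawlers. This is only a sketch; the directory name is invented for the example:

    # Block one directory for Googlebot only
    User-agent: Googlebot
    Disallow: /interner-bereich/

    # All other crawlers may access everything
    User-agent: *
    Disallow: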

 

Some typical entries in the robots.txt file are as follows:

  1. All content is blocked for all search engines:
    User-agent: *
    Disallow: /

  2. The contents of a certain directory are blocked:
    User-agent: *
    Disallow: /info/

  3. All PDF files are blocked:
    User-agent: *
    Disallow: /*.pdf

Wikipedia provides a good overview of all specifications for blocking content via the robots.txt file. There you can also see that, in addition to the instructions for blocking / unblocking content, there are two other items of information that can be placed in this file:

  1. "Sitemap: URL" - to pass the URL of a sitemap file to the crawler
  2. "Crawl-delay: X" - to signal the crawler that it should wait X seconds between two page requests

Block content using robots meta tags

While the robots.txt file stores blocks and releases centrally, the robots meta tag is used to block individual pages. For crawlers, the "noindex" attribute is of particular interest, which is noted as follows:

 

<meta name="robots" content="noindex">

 

If Google & Co. finds the “noindex” attribute after crawling a page, this signals the search engine that this page should not be included in the index. With this tag, it is possible to block specific pages without having to list them in the robots.txt file.

 

In addition to the "noindex" attribute, there are other instructions, which are separated by commas in the robots meta tag (some of them are only interpreted correctly by Google):

  • nofollow: Do not follow any links on this page
  • noarchive: Do not create a cache entry for this page
  • nosnippet: Do not show a snippet (text excerpt) in the search results
  • noodp: Do not use a description from the ODP/DMOZ directory as the snippet
  • notranslate: Do not offer translation options for this page
  • noimageindex: Do not include any images from this page in the Google image index
  • unavailable_after: After the specified date, the page will no longer be displayed in the search results

Example of the robots meta tag with multiple statements:

<meta name="robots" content="noindex, follow">

Here the web crawler should not index the page, but it should follow the links on it. In this way, the crawler does not run into a "dead end".
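As a further illustration, the "unavailable_after" instruction expects a date. The following line is only an assumption based on the date format Google has documented for this directive:

    <meta name="robots" content="unavailable_after: 25-Aug-2015 15:00:00 CET">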

Block content using information in the HTTP header

The instructions that can be transmitted via the robots meta tag can also be sent in the HTTP header, although in practice this is rarely done. In that case, the web server includes the instruction in the HTTP header of its response, for example like this:
 

HTTP/1.1 200 OK

X-Robots-Tag: noindex, nofollow

... [remaining header]
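This approach is particularly useful for non-HTML files such as PDFs, which cannot carry a robots meta tag. As a sketch, on an Apache server with the mod_headers module enabled, the header could be set for all PDF files roughly like this; the exact configuration depends on the server setup:

    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>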

Comparison of robots.txt and robots meta tag

If you compare the options for blocking content via the robots meta tag with those of the robots.txt file, at first you get the feeling that both achieve the same thing. By and large that is true, but not in detail.

The robots.txt file prevents a website, a directory or an individual page from being downloaded by a search engine crawler. However, it does not prevent the page from being added to the search engine's index anyway. Since the crawler cannot (or must not) provide any information about the page, the search result usually looks "empty" (see the screenshot). The page http://www.techfacts.de/presse/relaunch is blocked in the robots.txt file, but there is still an entry for it in the Google index.
 

The robots meta tag, on the other hand, requires that the crawler first download the page in order to find the tag there. In return, the "noindex" attribute can be used to prevent an entry from being created in the Google index at all.

 

Anyone who wants to block content consistently might come up with the idea of using both methods in combination. However, this has the disadvantage that the crawler does not download the page at all because it is blocked in the robots.txt, and so it never finds the instructions in the robots meta tag either. If you want to make sure that a page does not appear in the Google index in any form, you should therefore not block it via the robots.txt file.
 

robots.txt and robots meta tag differ in another respect: in the robots.txt file, you can block content for specific crawlers only. You could, for example, make a website accessible to Google but block it for the Russian search engine Yandex. The robots meta tag does not offer this option.
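A minimal sketch of such a robots.txt could look like this, assuming Yandex's crawlers respond to the "Yandex" user-agent token:

    # Block the entire website for Yandex
    User-agent: Yandex
    Disallow: /

    # All other crawlers may access everything
    User-agent: *
    Disallow: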

 

robots.txt

  • Prevents a crawler from downloading content from a website.
  • Does not prevent empty index entries from being created for blocked content.
  • Allows content to be blocked for specific crawlers only.
  • Can be applied very flexibly to website structures.

Meta robots tag

  • The crawler must be able to download a page in order to recognize the meta robots tag.
  • "Noindex" prevents the creation of empty index entries.
  • Offers many other options, e.g. "noarchive" or "noodp".
  • Cannot be defined for specific crawlers only.
  • Must be set individually on each page.

Delete pages from Google

Getting content into the Google index can be done in seconds. But it can take months to delete content from the index. Even those who block content via the robots.txt or the robots meta tag will often be able to find empty entries in the search engine index for months. In most cases this is not a problem, but sometimes it is - for example when there are legal disputes, as a result of which certain content has to be removed from the Internet and thus also from Google.
 

In these cases, you can use the Google Webmaster Tools. They offer the option of deleting individual pages, entire directories or the contents of a complete subdomain from the index (see the screenshot).

 

However, only content that is blocked either via the robots.txt file or via the robots meta tag is actually deleted. The deletions are usually implemented within a few hours.

More tools

Incidentally, Google offers other interesting options for blocking content through its Google Webmaster Tools.

 

For example, there is the "Fetch as Google" function, with which you can have a page downloaded by Google. If the page is blocked, this is displayed directly (see the screenshot). This makes it easy to check whether the blocking of certain pages is recognized correctly.

 

There is also the option of checking the current content of the robots.txt file in the Webmaster Tools under "Status > Blocked URLs", or of changing it temporarily and then entering specific URLs. Google then checks which of these URLs are blocked by which instruction.

 

Unfortunately, Google has discontinued the robots.txt generator in the Webmaster Tools. However, there are plenty of such generators on the Internet, such as the robots-txt generator.

Conclusion


Blocking content is not as easy as it looks. Depending on the goal you want to achieve, you should either use the robots.txt file or build a robots meta tag into your pages. With the help of the Google Webmaster Tools, this can be checked very precisely so that no unintended damage is done. And anyone who wants to delete pages from Google will also find the solution for that in the Google Webmaster Tools.

About the author

Markus Hövener is editor-in-chief of suchradar magazine and managing partner of the search engine marketing agency Bloofusion Germany.