MediaWiki/Guide/SEO


A Guide to Search Engine Optimisation for MediaWiki Sites in a hosted environment

This article is part of a series compiled as a guide to encourage and assist those building a MediaWiki-based website in a hosted environment.

Each article links to relevant documentation from the MediaWiki.org website and the Wikimedia.org website. Where the official documentation does not adequately cover the issues for a hosted site, or is too 'advanced', additional information, explanation and advice is provided.

Background

It is not easy to customise MediaWiki for search engine optimisation (SEO). The Manual:Search engine optimization is short and outdated. There are very few MediaWiki extensions for SEO, and those that are available may not be actively maintained. MediaWiki pages do not support meta tags out of the box, so descriptive page titles, page descriptions and keyword lists cannot easily be added. A sitemap could be helpful, but the maintenance script used to create one (generateSitemap.php) requires shell or command-line access, which is not generally available on hosted sites.

These facts present challenges which are addressed in this article. This is not a treatise on SEO; it is a guide to doing the best with what we've got. Consequently, some topics already covered in this Guide are expanded here with a focus on SEO. The value of creating a sitemap is discussed in detail, particularly as a way of identifying and correcting crawling errors, and a list of articles for Further Reading is provided.


Quality Content

The first step towards a higher page ranking is to develop quality content. This website, for example, is still quite small as many planned sections have yet to be developed. However, if you are reading this page it shows that 1) the search engines have located and indexed the articles in this Guide for Using MediaWiki in a Hosted Environment, and 2) the content has been evaluated and considered relevant, so the page URLs have been included in a Search Engine Results Page (SERP).

However, quality content alone does not guarantee success. If internal links are not followed by a search engine bot or robot, if there are errors such as 'page not found', or if there are server errors or DNS errors, then the search engine ranking will remain low.

Without quality content it is likely that none of the following ideas will significantly improve SEO performance. But if you have quality content and are not getting good page rankings then the following strategies may be helpful.

TIPS:

  1. Develop quality content which is original, authoritative, useful and relevant.
  2. Eliminate 'page not found' errors (404). Use the special page Special:WantedPages to identify links whose target pages do not exist.


Short (and friendly) URLs

The article A Guide to using Short URLs for MediaWiki Pages explains the steps required to change the default format of a MediaWiki page URL to a more readable, and search engine friendly, URL.

However, a search engine bot or spider can still locate and index the longer form of the URL unless directed not to in a robots.txt file. Search engines like Google and Bing will generally obey the directives in robots.txt, so to implement short URLs and ensure that only the short form of a URL is indexed it is necessary to:

  1. follow all the steps in the article A Guide to using Short URLs for MediaWiki Pages, and
  2. construct a robots.txt file which is accessible to the search engines and contains the directives which achieve the result you want for your own MediaWiki website (a minimal example follows this list).
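
As a minimal sketch only, and assuming the wiki's script files live under /w/ while short URLs serve articles from /wiki/ (adjust both paths to match your own installation), a robots.txt file in the web root along these lines would keep compliant crawlers away from the long index.php form of each URL:

  # Assumes scripts under /w/ and articles served from /wiki/
  User-agent: *
  Disallow: /w/

If index.php sits in the web root instead, a directive such as Disallow: /index.php? would be needed in its place.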

TIP: Configure your MediaWiki installation to use short URLs and use page names which are relevant to the page content.
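
For reference, the LocalSettings.php side of a typical short URL setup looks something like the sketch below. The paths are assumptions, and the matching web server rewrite rules are covered in the Short URL guide mentioned above:

  # LocalSettings.php (sketch); adjust the paths to your own installation
  $wgScriptPath = "/w";           # where index.php and the other scripts live
  $wgArticlePath = "/wiki/$1";    # the short URL form presented to users and crawlers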


robots.txt

A robot is defined by The Web Robot Pages as "a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced."

The Robots Exclusion Standard was developed in 1994, and although the protocol includes directives to allow or disallow robot access to website resources, compliance is not mandatory and the 'Standard' is not official. The Web Robot Pages site is probably authoritative but has not been updated regularly.

Meta tags can also be used to influence the behaviour of robots, as described by metatags.org, specifically the robots meta tag. This tag defines indexing permissions using the terms index or noindex, and follow or nofollow. Since meta tags cannot easily be added to MediaWiki pages, the practical alternatives are Disallow directives in the robots.txt file, which control crawling rather than indexing, or the MediaWiki configuration variables which set the robots meta tag on each page (see below). Again, compliance with robots.txt directives is not mandatory, although the major search engines will comply.

The robots.txt file is not secure and can be read easily. In a browser simply insert the URL of a domain followed by /robots.txt, for example: http://www.wikipedia.org/robots.txt. In this example, many User-agents are Disallowed, but that may not be necessary on a smaller website which may be less prominent as a target.

Because the robots.txt file is so easily accessible it is unwise to list items that you do not want search engines or other spiders or robots to index. The Disallow directives intended for robots could simply identify the hidden folders or pages that a potential hacker may decide to access. Alternative ways to restrict access are explained below.

TIPS:

  1. Create a robots.txt file which, at a minimum, includes directives which prevent crawling of non-article pages, as described in the Manual:robots.txt (a sketch follows these tips).
  2. Do not use the robots.txt file as a security tool. Use variables in the MediaWiki LocalSettings.php file instead (see below for details).
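
As an illustration only (the paths, page names and sitemap URL are assumptions that must match your own installation), a robots.txt for a wiki with scripts under /w/ and articles under /wiki/ might look like this:

  User-agent: *
  # Keep crawlers out of the script directory (index.php?title=..., api.php, etc.)
  Disallow: /w/
  # Keep crawlers out of dynamically generated special pages
  Disallow: /wiki/Special:
  # Optional: tell crawlers where to find the sitemap discussed later in this article
  Sitemap: https://www.example.org/sitemap.xml

Remember the caution above: do not list private areas here; keep those out of the index with the LocalSettings.php variables described in the next section.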


Define Robot Policies in LocalSettings.php

The behaviour of robots, and therefore of the spiders or crawlers used by search engines, can be defined by configuring several variables in the MediaWiki file LocalSettings.php, as described below (a combined example follows this list):-

$wgDefaultRobotPolicy
What it does: Specifies the default robot policy for all pages on the wiki. Can be used instead of directives in the robots.txt file.
How to use it: If you set a default policy which disallows indexing and following of links site-wide, then you can define robot policies to allow indexing and following for specific content namespaces or articles.
$wgNamespaceRobotPolicies
What it does: Specifies the robot policy for each namespace. Adds a robots meta tag, for example <meta name="robots" content="index,nofollow" />, to each page within the namespace.
How to use it: Define the robot policy for each content namespace. If you have content namespaces which are not public and have defined a default robot policy that denies indexing, do not include these private namespaces in the namespace array (the default robot policy will apply to them). Alternatively, define the robot policy for each and every namespace, including NS_TALK.
$wgArticleRobotPolicies
What it does: Specifies the robot policies for specific pages.
How to use it: Can be used to override the policy which applies to the namespace containing the article. For example, if all articles in a namespace are to be indexed, use this variable to list any exceptions.
$wgExemptFromUserRobotsControl
What it does: From the Manual:$wgExemptFromUserRobotsControl 'An array of namespace keys in which the __INDEX__ or __NOINDEX__ magic words will not function, so users can't decide whether pages in that namespace are indexed by search engines. If set to null, default to $wgContentNamespaces'.
How to use it: MediaWiki has magic words which can be used within an article to specify whether or not the article should be indexed. In a collaborative environment these magic words could be inserted on a page by an author contrary to the robot policy defined by the system administrator. To prevent this, the system administrator can set $wgExemptFromUserRobotsControl for the relevant namespaces to disable the __INDEX__ and __NOINDEX__ magic words.
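
A minimal sketch of how these four variables might be combined in LocalSettings.php is shown below. The namespace constants and policy strings are standard MediaWiki values, but the particular namespaces and page titles chosen here are only examples; check the manual page for each variable before copying anything:

  # Deny indexing and link-following everywhere by default.
  $wgDefaultRobotPolicy = 'noindex,nofollow';

  # Open up the namespaces that are meant to be public; private namespaces
  # are simply left out so the default policy continues to apply to them.
  $wgNamespaceRobotPolicies = array(
      NS_MAIN => 'index,follow',
      NS_TALK => 'noindex,nofollow',
  );

  # Override the namespace policy for individual pages (the title is an example).
  $wgArticleRobotPolicies = array(
      'Main Page' => 'index,follow',
  );

  # Stop authors using __INDEX__ / __NOINDEX__ in the main content namespace.
  $wgExemptFromUserRobotsControl = array( NS_MAIN );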

TIP: Configure these variables in the LocalSettings.php file instead of writing directives in the robots.txt file. This will keep your settings private and will not reveal publicly the existence of namespaces which you want to keep private.


WikiSEO

The MediaWiki extension WikiSEO enables metadata to be added to each page including:-

  • a page title, which is the title shown in a browser, not the page name stored by MediaWiki,
  • a page description, which is useful from an SEO perspective, and
  • a list of keywords.

The additional information about each page should improve search engine page ranking if it is consistent with the article content.

Unfortunately (as at June 2019) this extension is described as functional but not actively maintained, and there is no better alternative.

TIP: If you use WikiSEO, use replace instead of append when creating custom page titles.
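
At the time of writing, WikiSEO provided a {{#seo:}} parser function which can be placed anywhere in a page's wikitext. The parameter names and values below are illustrative only, so check the Extension:WikiSEO documentation for the exact syntax supported by your version:

  {{#seo:
   |title=A Guide to SEO for MediaWiki Sites
   |title_mode=replace
   |keywords=MediaWiki, SEO, robots.txt, sitemap
   |description=How to improve search engine rankings for a MediaWiki site in a hosted environment.
  }}

Setting the title mode to replace (rather than append) produces a clean, descriptive browser title, which is the behaviour recommended in the tip above.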


Site Map Generator

SEO specialists debate the value of a sitemap. However, the process of creating a sitemap can identify problems with the website, and fixing those issues can result in a better SEO outcome.

There are different kinds of site map:-

  • Search engines like Google can read a file named sitemap.xml in the root directory of a website. This file is written in XML and can be created manually if the site is not large (a minimal example follows this list). Usually, however, it is simpler to use software to crawl the website and then create the sitemap file.
  • The site map could also be visual, which is useful for website users. A graphic can be produced by some site-mapping software. Alternatively, a sitemap graphic could be built from the sitemap.xml file using a different application. The graphic is frequently a .svg file.
  • Another popular format is HTML, which can be included in a website as an index for users.
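
As a minimal sketch of the XML format (the URLs below are placeholders), a hand-written sitemap.xml follows the sitemaps.org protocol and looks like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.example.org/wiki/Main_Page</loc>
      <lastmod>2019-06-01</lastmod>
      <priority>1.0</priority>
    </url>
    <url>
      <loc>https://www.example.org/wiki/Another_Article</loc>
      <priority>0.5</priority>
    </url>
  </urlset>

The optional priority element is the value per URL referred to in the tips below; lastmod is also optional.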

There are three different ways to create a sitemap file using software applications described as Sitemap Generators:-

  1. Online tools which may be totally free, or free to use for up to 500 URLs, or require an account and charge a fee.
  2. Software which can be installed on a server. These are typically not free of charge and not useful for a hosted MediaWiki site because the hosting provider generally does not allow the access required to run the software.
  3. Software applications which can be installed on a desktop computer. Some of these are free to use, but limited. Others may be trial versions of powerful applications and limited to 500 URLs. Most of the useful applications are expensive and marketed to corporations and professionals.

Prior to creating this article, around 20 different sitemap generators (online and desktop) were tested on several MediaWiki sites. Some observations were:-

  • Most of the free or low cost software applications for generating sitemaps were created before 2015. Many were designed for Windows 7, XP or older and have not been updated and maintained. Low-cost tools generally do not produce useful information about crawling errors or page metadata.
  • Newer applications generally do provide more information, such as metadata (page titles, page descriptions, keyword lists) and status codes, e.g. 301 for redirected pages.
  • Some spiders completely ignored the robots.txt file and indexed URLs containing /w/index.php?
  • Some spiders - including some of the more professional online tools - were unable to crawl past the home page of the MediaWiki site!
  • Most spiders easily followed traditional wiki links (enclosed in square brackets), but some did not follow links written with HTML tags. Consequently, on a MediaWiki site with custom navigation buttons containing HTML links, only some of the pages were indexed. It is likely that these links were treated as 'external', even though they pointed to internal articles, and were blocked by a nofollow directive.

TIPS: Use a sitemap generator:

  • to see what a search engine will (or will not) index.
  • to test whether your robots.txt file is functioning the way you want it to. Check that you have not disallowed indexing of areas that you do want indexed!
  • to see whether a spider will follow internal links and find all the pages you want indexed.
  • to identify pages with status code 404 (not found) so that you can fix the issues.
  • to produce a sitemap.xml file which you can open and read to check its contents. There is software which will validate a sitemap file, but reading it yourself will show you the priority assigned to each URL.
  • to produce a graphic of your website. Wiki sites can become messy with lots of cross-linking of articles, but seeing a representation of the site may indicate that some restructuring is desirable.
  • When everything is as good as it's going to get, add the sitemap.xml file to the root directory of your MediaWiki installation (the same folder or directory as the .htaccess file and the robots.txt file).
  • Finally, repeat often. Keep the sitemap up to date!


Conclusion

The strategies described here help to improve SEO by ensuring that your hosted MediaWiki site is free from errors, has appropriate policies defined for robots and search engines, allows internal links to be followed so that pages can be indexed, and includes a sitemap which can also be submitted to the search engines so that your website is included in search engine results pages.

To retain or improve SEO this whole process should be reviewed and repeated regularly. And, of course, new quality content should be added so that the website is of current interest.


Further Reading

Robots.txt
The following articles by Google for webmasters describe the purpose of the robots.txt file and how the Googlebot obeys the Robots Exclusion Protocol. They also describe the nocache and nosnippet directives.
Search Engine Optimization (SEO)
Sitemap Generators
The following articles review the most effective sitemap generators. ScreamingFrog seems to rate highly. For personal use the A1 Sitemap Generator has similar functionality and costs less. Both worked well when tested against MediaWiki sites and both identified crawling errors which would negatively affect SEO. All errors in a MediaWiki site have to be corrected manually, one at a time.
Note the use of terms like Top, Best, Awesome, Proven in article names. This is a common strategy by bloggers to attract ratings.
Fixing Errors
To be really useful, a sitemap generator should identify crawling errors. Knowing what the errors are is valuable, but knowing their significance and how to fix them is even more important. The article below is comprehensive and provides explanations and solutions which can be implemented on a hosted MediaWiki site. If a sitemap generator identifies server errors or DNS errors on a hosted MediaWiki site, you should contact the hosting provider for support. Error codes or screenshots from the sitemap tool will help hosting support find a solution.

Disclaimer

The information or advice provided in this Guide is based on, or links to, official documentation for MediaWiki and was accurate when this article was created. However, some variation may occur between versions of MediaWiki, and the specifics of web hosting vary by service provider. Consequently, you should always create an effective backup before making any changes; ensure that you can restore your database and website; read the Release Notes before upgrading; and apply best practices to the management of your website. Any action that you take based on information provided here is at your own risk and the author accepts no liability for any loss or damage.