I have just released a new feature for the Link Diagnosis Firefox extension that makes it easy to diagnose which pages of your website are indexed.
A couple of weeks ago I faced the tedious task of finding out which of the 100k pages on a site were indexed and which were not. I knew that some of them could have been marked as duplicate content, or that Google simply hadn’t indexed them because of the size of the website.
First I set up Google Webmaster Tools, hoping that Google would tell me. Unfortunately, the Indexed Pages tab just pointed me to the site: command.
I don’t trust the site: command. In particular, the page count it reports is very inaccurate: I know I have 100k pages, yet Google tells me I have 150k pages indexed.
Also, there is no easy way to see more than 1000 results (you can play with inurl: commands, but it takes ages and you can get banned).
Because of these problems I decided to code a tool that would automate the process: Site Diagnosis.
The internal algorithm of the tool works as follows:
1. Go through every URL in the XML Sitemap file and run a simple check: inurl:http://www.samplesite.com/dir/url1
2. A URL that does not show up for the inurl: command may still be indexed, so it needs a second check.
3. For every URL in the XML Sitemap, I get the page title and run this check: site:http://www.samplesite.com sample title
4. If the page does not rank in the top 10 for its own title within the site, then something is probably wrong.
This check is surprisingly accurate, and most of the pages that fail it have problems such as duplicate content, missing titles, or missing or insufficient content. These troubled pages usually don’t appear in the search results if you search for any text on the page, not even when you enclose sentences in quotes.
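To make the steps of the algorithm concrete, here is a minimal Python sketch of the same logic. It is not the extension's actual code. The function names, the `search` callable (which takes a query string and returns a ranked list of result URLs) and the `get_title` callable are all hypothetical placeholders, since how you query the search engine is up to you. The sketch also assumes that the title check only needs to run for URLs that fail the inurl: check.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List
from urllib.parse import urlsplit

# Namespace used by standard XML Sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL found in an XML Sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def inurl_query(url: str) -> str:
    """Step 1: build the inurl: query for a single page."""
    return "inurl:" + url


def title_query(url: str, title: str) -> str:
    """Step 3: build the site: query restricted to the page's own site."""
    parts = urlsplit(url)
    return f"site:{parts.scheme}://{parts.netloc} {title}"


def diagnose(
    urls: List[str],
    get_title: Callable[[str], str],     # hypothetical: returns the page title
    search: Callable[[str], List[str]],  # hypothetical: returns ranked result URLs
    top_n: int = 10,
) -> Dict[str, str]:
    """Label each URL as found by inurl:, found by title, or suspect."""
    report = {}
    for url in urls:
        # Step 1: does the page show up for its own inurl: query?
        if url in search(inurl_query(url))[:top_n]:
            report[url] = "indexed (inurl)"
            continue
        # Steps 2-4: fall back to the title check within the site.
        title = get_title(url)
        if title and url in search(title_query(url, title))[:top_n]:
            report[url] = "indexed (title)"
        else:
            report[url] = "suspect"
    return report
```

Calling `diagnose(sitemap_urls(xml_text), get_title, search)` returns a dictionary that maps each sitemap URL to one of the three labels. The pages labelled "suspect" are the ones worth inspecting by hand for duplicate content, missing titles or thin content.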
Obviously, one goal of search engine optimization is to fix these pages, so I hope Site Diagnosis will prove essential in identifying them.