
Site Diagnosis - are all your pages indexed properly?

March 30th, 2008

I have just released a new feature for the Link Diagnosis Firefox extension that allows easy diagnostics of the indexed pages on your website.

A couple of weeks ago I was facing the tedious task of finding out which pages out of 100k on a site are indexed and which are not. I knew that some of them could have been marked as duplicate content or that Google simply didn’t index them because of the size of the website.

First I installed Google Webmaster Tools, hoping that Google would tell me that. Unfortunately, the Indexed Pages tab just points me to the site: command.

I don’t trust the site: command. In particular, the page count is very inaccurate: I know I have 100k pages, yet Google tells me I have 150k pages indexed.

Also, there is no easy way to see more than 1000 pages (you can play with inurl: commands but it takes ages and you can get banned).

Because of these problems I decided to code a tool which would automate it - Site Diagnosis.

site_diagnosis_screen1.jpg

The internal algorithm of the tool works as follows:

1. Go through every URL in the XML Sitemap file and do a simple check: inurl:http://www.samplesite.com/dir/url1
2. For every URL that does not show up for the inurl: command there is still a chance that the page is indexed but simply does not appear with inurl:
3. For every URL in the XML Sitemap I get the page title and perform this check: site:http://www.samplesite.com sample title
4. If the page does not rank in the top 10 for its own title within the site, then something is probably wrong.
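
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the function names are made up for the example, and actually running the queries against a search engine is left out. It shows how the two query strings are built from a sitemap and how the top 10 rule from step 4 could be applied once a rank is known.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard XML Sitemap format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in an XML Sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def inurl_query(url):
    """Step 1: the simple inurl: check for one page."""
    return "inurl:" + url


def title_query(site, title):
    """Step 3: search for the page title restricted to the site."""
    return "site:" + site + " " + title


def needs_attention(title_rank, top_n=10):
    """Step 4: flag a page that is not in the top N for its own title.

    title_rank is the 1-based position of the page in the results of
    title_query(), or None if it was not found at all (hypothetical input;
    how the rank is obtained is outside this sketch).
    """
    return title_rank is None or title_rank > top_n
```

For example, inurl_query("http://www.samplesite.com/dir/url1") produces the query from step 1, and needs_attention(None) flags a page that did not show up for its own title at all.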

This check is surprisingly accurate, and most of the pages that don’t survive it have some problem like duplicate content, missing titles, missing content or not enough content. These troubled pages usually don’t appear in the search results if you search for any text on the page - not even when you enclose sentences in quotes.

Obviously, the goal of search engine optimization is to fix these pages so Site Diagnosis will hopefully be essential in identifying them.

Finding out the age of a page easily

March 6th, 2008

For anyone who doesn’t know, there is a date filter on Google which returns only results that were created within a chosen recent period. See screenshot:

google_advanced_search1.gif

When you use that filter and look at the search result pages, you will see that under every listing there is a date showing when Google saw that page for the first time. We can use this information to find out how old a page is if we craft the query to include the URL of the page we are interested in.

inurl:http://blog.linkdiagnosis.com/?p=16

For that query it returns the date when my blog post about Amazon was created - 15 February 2008 - which is spot on.

Now, to make it even easier to get the age of a page, I have added a new feature to the Link Diagnosis Firefox extension.

Install the extension and you will be able to find out the age of the page with one click.

See the screenshot below:

get_page_age11.gif

Enjoy!

P.S. Don’t run many queries at once, otherwise Google will ban you (for me it happened after about 70 inurl: commands with a 1 second delay). If anybody has any tips on how to prevent that from happening, please let me know and I will add the age of the page to the main report so you can see both metrics - PageRank and age - together.
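
One obvious thing to try is spacing the queries out in batches with a longer pause in between. The Python sketch below shows that idea only. The batch size and pause length are guesses made up for the example, run_query is a hypothetical function supplied by the caller, and whether this avoids the ban described above is untested.

```python
import time


def run_throttled(queries, run_query, delay=1.0, batch_size=50,
                  batch_pause=60.0, sleep=time.sleep):
    """Run queries one by one, sleeping between them.

    Waits `delay` seconds after each query and `batch_pause` seconds after
    every `batch_size` queries. All numbers are illustrative assumptions,
    not values known to keep a search engine happy.
    """
    queries = list(queries)
    results = []
    for i, query in enumerate(queries, start=1):
        results.append(run_query(query))
        if i == len(queries):
            break  # no need to wait after the last query
        sleep(batch_pause if i % batch_size == 0 else delay)
    return results
```

The sleep parameter is there only so the waiting can be swapped out when trying the function without real delays.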

Do you evaluate on-page link position when acquiring a link?

March 3rd, 2008

Over the last couple of years the trust of the website and page has been the primary factor when evaluating the value of a link. For simplicity, most people use PageRank to estimate how much benefit a link will bring to their site. Bill from SEO by the Sea has just written an article on how search engines evaluate the importance of links and content based on on-page factors.

The most important bits from the Yahoo patent (and also from the Microsoft research paper) are that search engines can and do create a visual model of the page to find out which is the “Most significant element”. Some of the factors that Yahoo confirms in the patent:

  • formatting of the text - bold, h1 etc. - GOOD
  • tables with data and other grid-like structures - GOOD
  • distance from the top and center. The most important ones are near the horizontal center of the page and also above the fold.
  • content / links that are at the absolute header/footer of the page - BAD

The Microsoft research paper talks more about extending the PageRank model so that the atomic entity is not the page but the page block. One page can have multiple blocks, each with different semantics and importance.

This information can have a serious impact on how link juice flows from a page, and the Microsoft paper in particular affects that. Most bought links are not placed in the “Most Significant Element” of the page - which is the main content bit. They are placed in sidebars, footers etc. These research papers say that such links are much less important than links within the content block. This then just raises the value of bought content links of the PayPerPost style, where they sit within the context of the blog post.

Another implication of these research papers is that search engines are able to reconstruct the HTML Document Object Model and know which content is important regardless of its position in the source code of the page. This contradicts many SEO experts who have claimed in the past that you should put the most important content first, just after the BODY element.

As always, your comments are welcome.

More bug fixes

March 2nd, 2008

I have just released version 1.0.5, which should fix the most annoying bug - the report goes to 100% and then freezes. If you get this sometimes then please update to the new version (FF should automatically pick up the new version anyway). If you have any other problems, please fill in the bug report with as much information as possible so we can make the tool more stable together.

More than 500 SEOs use this tool every day and I am glad that the numbers are rising. Please don’t be afraid to share this tool with others - the more people use it, the more I will want to include better features and spend time maintaining it.

Cheers!

Amazon S3 down

February 15th, 2008

It’s not Link Diagnosis related, but one of my sites went down because of Amazon S3. Is anyone else seeing that?

I can see that Colin Schlueter on Jaiku just asked if it’s down, so it may be the start of a major outage?

UPDATE: This thread on the Amazon forums discusses over 35 minutes of downtime.


You make me happy

February 5th, 2008

Just wanted to share the cheer when I see SEOs all around the world enjoying the extension.

Well, at least I hope they are enjoying it, as I don’t understand a word of Spanish, German or Russian. :)

Another fix

February 1st, 2008

There was a serious bug that prevented running a diagnosis on a subdomain. I have just released version 1.0.4, which fixes that. Thanks to all who reported the bug.

Bug fixes

January 30th, 2008

Finally, I got around to doing some bug fixing. With all the new users I got after Patrick gave me a traffic bump from Sphinn, lots of bugs started to appear and I got many more bug reports.

One of the biggest problems was that for some people the website sometimes didn’t detect that the extension was installed and only produced the basic data. Hopefully, that is fixed now. Please let me know if you get this error.

Also, I have tried to add support for Firefox 3 BETA. However, the folks at Mozilla don’t make it very easy and there are some serious problems with that BETA. I have decided to wait with Firefox 3 support until they sort out a release that is more compatible with the older code. Sorry :(

Link Diagnosis for SEOBook.com

January 24th, 2008

Aaron Wall is probably one of the most knowledgeable SEOs today. How did he get such a reputation? Let’s try to examine his site and backlinks and we may get some answers.

Let’s then fire a link diagnosis tool on his blog SEO Book and see what happens. After about 30 minutes (yes he has lots of backlinks to process) we get a nice report like this.

We get a total of 5294 links in the report, whereas Yahoo SiteExplorer says it has over 200 thousand backlinks. The difference is that most of the links in SiteExplorer come from the same domains (sitewide links); these would just make the analysis useless and slow.

So what can we see from that report? Let’s look at the backlinks. Here is the CSV report of all the backlinks sorted by PageRank. We immediately see the most valuable links that Aaron has.
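
To illustrate why the two numbers differ so much, here is a minimal Python sketch of collapsing sitewide links to one per host. It is an assumption about the general idea, not the tool's real logic: the post does not say which link the tool keeps per domain, so this simply keeps the first one seen, and it compares host names (with a leading "www." removed) rather than true registered domains.

```python
from urllib.parse import urlsplit


def one_link_per_host(backlinks):
    """Keep only the first backlink seen from each host."""
    seen = set()
    kept = []
    for url in backlinks:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```

A blog that links to SEO Book from every page through its sidebar would then count once instead of hundreds of times.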

seobook_backlinks1.jpg

From this report you can learn which of the most respected sites have mentioned him and pushed him up in the search engine rankings.

Another useful report shows us the most popular anchor texts that people use when linking to the SEOBook blog.
seobook_anchors.jpg

We can see here that most people link to the blog via his name. Then there are the SEOBook keyword domain anchors, which in this case are very helpful as they describe his business model, so he can rank easily for SEO books. If Aaron were getting more links with “seo” rather than his name, he would dominate the top positions for the golden goose SEO keywords even more.
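
A report like the anchor text one boils down to counting. The short Python sketch below shows one way to do it, with the assumption (mine, for the example) that anchors are compared case-insensitively and with whitespace collapsed; the real report may group them differently.

```python
from collections import Counter


def top_anchors(anchors, n=10):
    """Return the n most common anchor texts as (text, count) pairs."""
    normalized = (
        " ".join(a.split()).lower()  # collapse whitespace, ignore case
        for a in anchors
        if a and a.strip()           # skip empty anchors (e.g. image links)
    )
    return Counter(normalized).most_common(n)
```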

Last but not least are these charts which give top level information about quality of the links and also the presence of the links and no-follows.

seobook_linktypes.jpg

seobook_pagerank.jpg

From the report and by analyzing the full CSV backlink report we can see that Aaron has put lots of work into quality content and links came naturally. His backlink profile shows it nicely.

In the next blog post I will try to analyze a site which uses gray-hat link building technique so we can compare the difference (and also get a few free backlinks at the same time ;)

Launch!

January 20th, 2008

After 6 weeks of development I think the product is ready. There are still many things I would like to improve but they are not essential. I would like to get some feedback first and then I will improve on it.

It has been a long way for me, as I changed the whole technology about 3 times. I knew what I wanted to create, but initially I thought the best way would be to build the whole application as a reporting tool where you initiate a request and the application schedules your job and emails you the result. The amount of data that the app gathers is huge, so it wasn’t possible to get it all in real time. I created a Windows service that was doing all sorts of black magic, hammering Yahoo/Google servers with requests through proxies in parallel threads. It was working fine, but I wasn’t satisfied that users would have to wait for the report.

So I had an idea to move all the requests to a desktop client. Every user who wanted to get a report would need to install a windows application which would do the requests from his machine. In this way the data would be available quicker for the user and I wouldn’t have to use proxies, which is difficult to scale.

I had almost finished the app but then, while swimming (all the good ideas come from swimming :) , I realized that I could try writing a Firefox extension rather than a desktop client. From the beginning I was a bit hesitant about whether people would want to install a desktop app. I knew that there were successful SEO tools like SEOElite etc. but still I wanted to make the tool easier. I then did some quick research and found out from Matt Cutts’ blog that more than 65% of webmasters use Firefox.

The idea of a Firefox extension started to be very interesting to me, as I always want to learn new things and I hadn’t created an extension before. Soon I realized that it’s just JavaScript + XUL and it would not be that hard to build. The biggest downside was that I had to throw away a whole month of work. Ouch :(

The Firefox extension is the easiest option for the user and works seamlessly with the website, which is good. The only problem I see so far is that JavaScript is not executed in parallel threads, so there will be a performance hit.

Anyway, enough of the background. I hope it’s not my first and last post here :)

Enjoy and let me know how it works for you!