The life and thoughts of a bald SEO guy

Google Indexing of Pages Without Inlinks

I believe I can prove that Google is indexing pages that have no inbound links, which means it is finding those pages by some means other than a typical site crawl. While I can’t prove what they are doing to discover URLs outside of a typical site crawl, I can prove that they are doing it.

Review this query:

http://www.google.com/search?hl=en&q=site%3Aorigin-www.baltimoresun.com&pws=0

As of today (January 9, 2009), there are 16,200 pages indexed on the origin-www.baltimoresun.com subdomain.

Now review this query:

http://www.google.com/search?hl=en&q=link%3Aorigin-www.baltimoresun.com&pws=0

Google states that there are inbound links from these pages to origin-www.baltimoresun.com, right? Well, let’s look at the code of the cached versions of those pages and see where the links are.

They don’t exist.

Okay . . . but we all know that Google’s ‘link:’ operator doesn’t work well, so let’s look at Yahoo’s Site Explorer instead, which is considered reliable in the search industry.

http://siteexplorer.search.yahoo.com/search?p=http%3A%2F%2Forigin-www.baltimoresun.com&bwm=i&bwmo=d&bwmf=u

This query shows that there are no links from outside the subdomain or domain pointing to the site. (There will be some from within the subdomain, of course, because of relative URLs that the spider crawls and follows.)

What’s really interesting is that, of the multiple Tribune domains with an origin-www subdomain indexed (8 domains have this subdomain), the only one that Yahoo found was the baltimoresun.com version. Furthermore, these subdomains have been live for over a full year, but I only noticed them in the index over the past few days (and can prove a few have been indexed for at least a month; shame on me, I perhaps should’ve caught this sooner). This tells me that this is a fairly recent change by Google and possibly Yahoo! (though I think, for other reasons, that Yahoo is just crawling Google’s search results).

Here are some of my theories as to what Google may be doing:

  1. Google Toolbar tracking – Obviously, several Tribune employees who hit this subdomain (intended for internal use) have the Google Toolbar installed.
  2. Google Personalization – Whether it is by browsing history, cookies, etc., I’m not sure, but several Tribune employees have Google accounts.
  3. GTalk – Several Tribune employees use Google’s GTalk feature, and we send links to these subdomains around through GTalk. Perhaps Google is tracking GTalk URLs for discoverability.
  4. Gmail – We have a lot of dedicated employees at Tribune. Perhaps one of them used a personal email address when working from home to send a link from this subdomain?
  5. ???

What do you think? What could’ve caused this problem?

Also . . . seriously, the duplicate content filter didn’t catch this? Why not? You’d think with discoverability methods such as this that’d be the first thing to check.

Note: The only difference between the normal subdomain and the origin-www.baltimoresun.com subdomain is a server configuration. There is nothing public facing that shares any proprietary information. We only kept it ‘internal’ to avoid this exact problem (creating duplicate content). Now that it has happened anyway, there is no issue with us sharing it publicly (especially considering that all of the origin-www subdomains, origin-www.baltimoresun.com etc., will be removed via robots.txt early next week).
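
For anyone curious what the robots.txt removal might look like, here is a minimal sketch using Python’s standard `urllib.robotparser`. It assumes a blanket disallow rule served only on the origin-www host; the actual file isn’t shown in this post, so `ORIGIN_ROBOTS_TXT` and the helper `is_crawl_allowed` are hypothetical names used purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt body for the origin-www host only (hypothetical;
# the real file is not shown in the post).
ORIGIN_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines rather than one big string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://origin-www.baltimoresun.com/news/"
    print(url, "crawlable:", is_crawl_allowed(ORIGIN_ROBOTS_TXT, url))
```

Because robots.txt is read per host, a rule like this would have to be served on the origin-www hostname and not on the normal www one, which is the one place where the two server configurations would need to differ.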

Disclaimer: Now that this post exists, some inbound links to the origin-www subdomains may develop, but at the time of this post I went through over 20 of the Google ‘link:’ results and checked the cached pages. No links to the subdomain.
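
If you’d rather not eyeball each cached page by hand, the same check can be scripted. Below is a minimal Python sketch, assuming you have already saved the HTML of each cached page yourself. `HostLinkFinder` and `links_to_host` are made-up names for illustration, and this is only a sketch of the kind of check described above, not the exact process used for this post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HostLinkFinder(HTMLParser):
    """Collect <a href> targets that resolve to one specific host."""

    def __init__(self, base_url, target_host):
        super().__init__()
        self.base_url = base_url
        self.target_host = target_host.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative URLs against the page's own address.
                absolute = urljoin(self.base_url, value)
                if urlparse(absolute).hostname == self.target_host:
                    self.matches.append(absolute)


def links_to_host(html, base_url, target_host):
    """Return every link in html that points at target_host."""
    finder = HostLinkFinder(base_url, target_host)
    finder.feed(html)
    return finder.matches


if __name__ == "__main__":
    # Hypothetical page content standing in for a saved cached page.
    sample = '<a href="http://www.baltimoresun.com/">Home</a>'
    found = links_to_host(sample, "http://example.com/page",
                          "origin-www.baltimoresun.com")
    print(found or "No links to the subdomain.")
```

Note that a relative link such as `/sports` only counts as a match when the page itself lives on the origin-www host, which is the relative-URL effect mentioned above for links from within the subdomain.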
