Thursday, February 24, 2011

What Are Google-Friendly Sites?


Below are more detailed tips for creating a Google-friendly site.
Give visitors the information they're looking for
Provide high-quality content on your pages, especially your homepage. This is the single most important thing to do. If your pages contain useful information, their content will attract many visitors and entice webmasters to link to your site. In creating a helpful, information-rich site, write pages that clearly and accurately describe your topic. Think about the words users would type to find your pages and include those words on your site.
Make sure that other sites link to yours
Links help our crawlers find your site and can give your site greater visibility in our search results. When returning results for a search, Google uses sophisticated text-matching techniques to display pages that are both important and relevant to each search. Google interprets a link from page A to page B as a vote by page A for page B. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
Keep in mind that our algorithms can distinguish natural links from unnatural links. Natural links to your site develop as part of the dynamic nature of the web when other sites find your content valuable and think it would be helpful for their visitors. Unnatural links to your site are placed there specifically to make your site look more popular to search engines. Some of these types of links (such as link schemes and doorway pages) are covered in our Webmaster Guidelines.
Only natural links are useful for the indexing and ranking of your site.
Make your site easily accessible
Build your site with a logical link structure. Every page should be reachable from at least one static text link.
Use a text browser, such as Lynx, to examine your site. Most spiders see your site much as Lynx would. If features such as JavaScript, cookies, session IDs, frames, DHTML, or Macromedia Flash keep you from seeing your entire site in a text browser, then spiders may have trouble crawling it.
Things to avoid
Don't fill your page with lists of keywords, attempt to "cloak" pages, or put up "crawler only" pages. If your site contains pages, links, or text that you don't intend visitors to see, Google considers those links and pages deceptive and may ignore your site.
Don't feel obligated to purchase a search engine optimization service. Some companies claim to "guarantee" high ranking for your site in Google's search results. While legitimate consulting firms can improve your site's flow and content, others employ deceptive tactics in an attempt to fool search engines. Be careful; if your domain is affiliated with one of these deceptive services, it could be banned from our index.
Don't use images to display important names, content, or links. Our crawler doesn't recognize text contained in graphics. Use ALT attributes if the main content and keywords on your page can't be formatted in regular HTML.
Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.

What is rel="nofollow"?

"Nofollow" provides a way for webmasters to tell search engines "Don't follow links on this page" or "Don't follow this specific link."
Originally, the nofollow attribute appeared in the page-level meta tag, and instructed search engines not to follow (i.e., crawl) any outgoing links on the page. For example:
<meta name="robots" content="nofollow" />
Before nofollow was used on individual links, preventing robots from following individual links on a page required a great deal of effort (for example, redirecting the link to a URL blocked in robots.txt). That's why the nofollow attribute value of the rel attribute was created. This gives webmasters more granular control: instead of telling search engines and bots not to follow any links on the page, it lets you easily instruct robots not to crawl a specific link. For example:
<a href="signin.php" rel="nofollow">sign in</a>

How does Google handle nofollowed links?

In general, we don't follow them. This means that Google does not transfer PageRank or anchor text across these links. Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap. Also, it's important to note that other search engines may handle nofollow in slightly different ways.

What are Google's policies and some specific examples of nofollow usage?

Here are some cases in which you might want to consider using nofollow:
  • Untrusted content: If you can't or don't want to vouch for the content of pages you link to from your site — for example, untrusted user comments or guestbook entries — you should nofollow those links. This can discourage spammers from targeting your site, and will help keep your site from inadvertently passing PageRank to bad neighborhoods on the web. In particular, comment spammers may decide not to target a specific content management system or blog service if they can see that untrusted links in that service are nofollowed. If you want to recognize and reward trustworthy contributors, you could decide to automatically or manually remove the nofollow attribute on links posted by members or users who have consistently made high-quality contributions over time.
  • Paid links: A site's ranking in Google search results is partly based on analysis of those sites that link to it. In order to prevent paid links from influencing search results and negatively impacting users, we urge webmasters to use nofollow on such links. Search engine guidelines require machine-readable disclosure of paid links in the same way that consumers online and offline appreciate disclosure of paid relationships (for example, a full-page newspaper ad may be headed by the word "Advertisement").
  • Crawl prioritization: Search engine robots can't sign in or register as a member on your forum, so there's no reason to invite Googlebot to follow "register here" or "sign in" links. Using nofollow on these links enables Googlebot to crawl other pages you'd prefer to see in Google's index. However, a solid information architecture — intuitive navigation, user- and search-engine-friendly URLs, and so on — is likely to be a far more productive use of resources than focusing on crawl prioritization via nofollowed links.

How does nofollow work with the Social Graph API (rel="nofollow me")?

If you host user profiles and allow users to link to other profiles on the web, we encourage you to mark those links with the rel="me" microformat so that they can be made available through the Social Graph API. For example:
<a href="http://blog.example.com" rel="me">My blog</a>
However, because these links are user-generated and may sometimes point to untrusted pages, we recommend that these links be marked with nofollow. For example:
<a href="http://blog.example.com" rel="me nofollow">My blog</a>
With rel="me nofollow", Google will continue to treat the rel="nofollow" as expected for search purposes, such as not transferring PageRank. However, for the Social Graph API, we will count the rel="me" link even when included with a nofollow.
If you are able to verify ownership of a link using an identity technology such as OpenID or OAuth, however, you may choose to remove the nofollow attribute from that link.
To prevent crawling of a rel="me nofollow" URL, you can use robots.txt. Standard robots.txt exclusion rules are respected by both Googlebot and the Social Graph API.

Wednesday, February 23, 2011

What Is the Google PageRank Penalty?


PR0 - Google's PageRank 0 Penalty
By the end of 2001, the Google search engine introduced a new kind of penalty for websites that use questionable search engine optimization tactics: a PageRank of 0. In search engine optimization forums it is called PR0, and that term is also used here. The characteristic sign of PR0 is that all, or at least many, pages of a website show a PageRank of 0 in the Google Toolbar, even if they have high-quality inbound links. Those pages are not completely removed from the index, but they always appear at the end of search results and are therefore hardly ever found.
A PageRank of 0 does not always mean a penalty. Sometimes websites that seem to be penalized simply lack inbound links with a sufficiently high PageRank. But if pages of a website that were formerly placed well in search results suddenly show the dreaded white PageRank bar, and if there have not been any substantial changes to that website's inbound links, then - according to the prevailing opinion - this almost certainly means a penalty by Google.
We can do nothing but speculate about the causes of PR0, because Google representatives rarely publish new information on Google's algorithms. Nonetheless, because of its serious effects on search engine optimization, we want to give a theoretical approach for the way PR0 may work.
The Background of PR0
Spam has always been one of the biggest problems that search engines have had to deal with. When search engines detect spam, the usual response is to ban the offending pages, websites, domains or even IP addresses from the index. But removing websites from the index manually requires a lot of personnel. This causes costs and runs contrary to Google's scalability goals. So it appears to be necessary to filter spam automatically.
Filtering spam automatically carries the risk of penalizing innocent webmasters, so the filters have to treat potential spam rather cautiously. But then a lot of spam can pass the filters, and additional measures may be necessary. In order to filter spam effectively, it might be useful to take a look at links.
That Google uses link analysis to detect spam has been confirmed more or less clearly in WebmasterWorld's Google News Forum by a Google employee who posts as "GoogleGuy". Over and over again, he advises webmasters to avoid "linking to bad neighbourhoods". In the following, we want to look more closely at "linking to bad neighbourhoods" and, more precisely, discuss how spam can be identified by analyzing link structures. In particular, we want to show how entire networks of spam pages, which may even be spread across many different domains, can be detected.
BadRank as the Opposite of PageRank
The theoretical approach to PR0 presented here was initially brought up by Raph Levien (www.advogato.org/person/raph). We want to introduce a technique that, just like PageRank, analyzes link structures, but that, unlike PageRank, does not determine the general importance of a web page; instead, it measures the page's negative characteristics. For the sake of simplicity, this technique shall be called "BadRank".
BadRank is in principle based on "linking to bad neighbourhoods". If one page links to another page with a high BadRank, the first page receives a high BadRank itself through this link. The similarities to PageRank are obvious. The difference is that BadRank is not based on the evaluation of a web page's inbound links but on its outbound links. In this sense, BadRank is a reversal of PageRank. In a direct adaptation of the PageRank algorithm, BadRank would be given by the following formula:
BR(A) = E(A) (1-d) + d (BR(T1)/C(T1) + ... + BR(Tn)/C(Tn))
where
BR(A) is the BadRank of page A,
BR(Ti) is the BadRank of the pages Ti which page A links to,
C(Ti) is here the number of inbound links of page Ti and
d is, again, the necessary damping factor.
In the previously discussed modifications of the PageRank algorithm, E(A) represented the special evaluation of certain web pages. In the BadRank algorithm, this value reflects whether or not a page has been detected by a spam filter. Without the value E(A), the BadRank algorithm would be useless, because it would be nothing but another analysis of link structures that takes no further criteria into account.
Using the BadRank algorithm, spam pages are evaluated first: a filter assigns a numeric value E(A) to them, which can, for example, be based on the degree of spamming or, maybe even better, on their PageRank. Again, the sum of all E(A) values has to equal the total number of web pages. In the course of an iterative computation, BadRank is not only transferred to pages which link to spam pages. In fact, BadRank is able to identify regions of the web where spam tends to occur relatively often, just as PageRank identifies regions of the web which are of general importance.
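To make the iteration above concrete, here is a minimal Python sketch of such a BadRank computation. It is purely illustrative: the function name, the representation of the graph as a dictionary of outbound links, and the optional handling of pages with a fixed BadRank are assumptions, not anything Google has published. The two example scenarios further below reuse this function.

# A minimal sketch of the speculative BadRank iteration described above (illustrative only).
def badrank(outlinks, e=None, d=0.85, iterations=100, fixed=None):
    """Approximate BR(A) = E(A)(1-d) + d (BR(T1)/C(T1) + ... + BR(Tn)/C(Tn)),
    where T1...Tn are the pages A links to and C(Ti) is the number of inbound
    links of page Ti. Pages listed in 'fixed' keep a constant BadRank value."""
    e = e or {}
    fixed = fixed or {}
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}

    # C(T): count the inbound links of every page.
    inbound = {p: 0 for p in pages}
    for targets in outlinks.values():
        for t in targets:
            inbound[t] += 1

    br = {p: fixed.get(p, 1.0) for p in pages}
    for _ in range(iterations):
        nxt = {}
        for p in pages:
            if p in fixed:
                nxt[p] = fixed[p]  # e.g. an external page with a known, constant BadRank
            else:
                spread = sum(br[t] / inbound[t] for t in outlinks.get(p, ()))
                nxt[p] = e.get(p, 1.0) * (1 - d) + d * spread
        br = nxt
    return br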
Of course, BadRank and PageRank differ significantly, especially because they use outbound and inbound links, respectively. Our example shows a simple, hierarchically structured website that reflects common link structures pretty well. Each page links to every page which is on a higher hierarchical level and on its branch of the website's tree structure. Each page also links to the pages arranged directly below it and, additionally, pages on the same branch and the same hierarchical level link to each other.
[Figure: the example website - index page A on level 0, pages B and C on level 1, and pages D, E, F and G on level 2]
The following table shows the distribution of inbound and outbound links for the hierarchical levels of such a site.
Level   Inbound links   Outbound links
0       6               2
1       4               4
2       2               3
As expected, the number of inbound links decreases step by step from the index page downwards. In contrast, we find the highest number of outbound links on the website's mid-level. We see similar results when we add another level of pages to our website, while the linking rules described above stay the same.
Level   Inbound links   Outbound links
0       14              2
1       8               4
2       4               5
3       2               4
Again, there is a concentration of outbound links on the website's mid-level. But most of all, the outbound links are much more evenly distributed than the inbound links.
If we assign a value of 100 to the index page's E(A) in our original example, while all other values E equal 1 and if the damping factor d is 0.85, we get the following BadRank values:
Page        BadRank
A           22.39
B/C         17.39
D/E/F/G     12.21
First of all, we see that BadRank spreads from the index page to all other pages of the website. The combination of PageRank and BadRank will be discussed in detail below but, no matter how the combination is realized, it is obvious that the two can neutralize each other very well. After all, we can assume that a page's PageRank also decreases the lower its hierarchy level is, so that a PR0 can easily be achieved for all pages.
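For reference, the table above can be reproduced with the badrank sketch from earlier, spelling out the linking rules stated above for the seven pages A to G (the dictionary representation is, again, only an assumption for illustration):

site = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"],
    "E": ["A", "B", "D"],
    "F": ["A", "C", "G"],
    "G": ["A", "C", "F"],
}
br = badrank(site, e={"A": 100}, d=0.85)  # E(A) = 100, all other E values default to 1
for page in "ABCDEFG":
    print(page, round(br[page], 2))  # approximately A 22.39, B/C 17.39, D/E/F/G 12.21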
If we now assume that page G, on the lowest level, links to a page X with a constant BadRank BR(X)=10, that the link from page G is the only inbound link of page X, and that all values E for our example website equal 1, we get the following values at a damping factor d of 0.85:
Page    BadRank
A       4.82
B       7.50
C       14.50
D       4.22
E       4.22
F       11.22
G       17.18
In this case, we see that the distribution of BadRank is less homogeneous than in the first scenario. Nonetheless, BadRank is again distributed among all pages of the website. The relatively low BadRank of the index page A is remarkable, though: it could be difficult to neutralize its PageRank, which should be higher than that of the other pages. This effect is not really desirable, but it reflects the experience of numerous webmasters. Quite often we see the phenomenon that all pages of a website except the index page show a PR0 in the Google Toolbar, while the index page has a Toolbar PageRank between 2 and 4. We can therefore probably assume that this particular variant of PR0 is not caused by the website being caught in a spam filter, but that the site rather received a penalty for "linking to bad neighbourhoods". It is also possible that this variant of PR0 occurs when only hierarchically inferior pages of a website get trapped in a spam filter.
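The second scenario can be checked the same way with the badrank sketch from above: all E values stay at their default of 1, and page G additionally links to an external page X whose BadRank is held constant at 10 and whose only inbound link comes from G. The results agree with the table above to within about a hundredth:

site2 = {
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"],
    "E": ["A", "B", "D"],
    "F": ["A", "C", "G"],
    "G": ["A", "C", "F", "X"],  # G now also links to the external page X
}
br2 = badrank(site2, d=0.85, fixed={"X": 10.0})  # X keeps a constant BadRank of 10
for page in "ABCDEFG":
    print(page, round(br2[page], 2))  # compare with the second BadRank table above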
The Combination of PageRank and BadRank to PR0
If we assume that BadRank exists in the form presented here, the question is how BadRank and PageRank can be combined in order to penalize as many spammers as possible while penalizing as few innocent webmasters as possible.
Intuitively, implementing BadRank directly in the actual PageRank computations seems to make sense. For instance, it would be possible to calculate BadRank first and then, in the course of the iterative calculation of PageRank, divide a page's PageRank by its BadRank at each step. This would have the advantage that a page with a high BadRank could pass on only a little PageRank, or none at all, to the pages it links to. After all, one can argue that if one page links to a suspect page, all the other links on that page may also be suspect.
However, such a direct connection between PageRank and BadRank is very risky. Above all, the actual influence of BadRank on PageRank could not be estimated in advance. It must also be considered that we would create a lot of pages which cannot pass on PageRank to the pages they link to. In effect, these pages become dangling links and, as discussed in the section on outbound links, it is absolutely necessary to avoid dangling links while computing PageRank.
So it would be advisable to run separate iterative calculations for PageRank and BadRank. Combining them afterwards can, for instance, be based on simple arithmetic operations. In principle, a subtraction would have the desirable consequence that relatively small BadRank values can hardly have a large influence on relatively high PageRank values. But it would certainly be difficult to achieve PR0 for a large number of pages by subtraction; we would rather see a PageRank devaluation for many pages.
Achieving the effects that we know as PR0 seems easier to realize by dividing PageRank by BadRank. But this would give BadRank an extremely high importance. Moreover, since the average BadRank equals 1, a large share of the BadRank values is smaller than 1, so a normalization is necessary. Probably the best results would come from normalizing and scaling BadRank to values between 0 and 1, so that "good" pages have values close to 1 and "bad" pages have values close to 0, and then multiplying these values with PageRank.
A very effective and easy-to-implement alternative would probably be a simple stepped evaluation of PageRank and BadRank. It would be reasonable for a BadRank above a certain value to always lead to a PR0. The same could happen when the ratio of PageRank to BadRank falls below a certain value. Additionally, it would make sense for BadRank to have no influence at all if BadRank and/or the ratio of BadRank to PageRank is below a certain value.
Only if none of these cases applies would an actual combination of PageRank and BadRank - for instance by dividing PageRank by BadRank - be necessary. In this way, all unwanted effects could be avoided.
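Such a stepped evaluation could look roughly like the following sketch. All threshold values here are invented purely for illustration; nothing about any actual thresholds is known:

def combined_rank(pagerank, badrank, br_ceiling=50.0, ratio_floor=0.1, br_floor=2.0):
    """Speculative stepped combination of PageRank and BadRank (illustrative thresholds)."""
    if badrank >= br_ceiling:
        return 0.0                  # very high BadRank always means PR0
    if pagerank / badrank <= ratio_floor:
        return 0.0                  # PageRank too small relative to BadRank: PR0
    if badrank <= br_floor:
        return pagerank             # BadRank too small to have any influence
    return pagerank / badrank       # otherwise combine, e.g. by division

print(combined_rank(4.0, 60.0))  # 0.0 - BadRank above the ceiling
print(combined_rank(4.0, 1.5))   # 4.0 - BadRank ignored
print(combined_rank(6.0, 3.0))   # 2.0 - PageRank divided by BadRank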
A Critical View on BadRank and PR0
How Google might realize the combination of PageRank and BadRank is of rather minor importance. However, a separate computation and a subsequent combination of the two means that the actual effect of a high BadRank may not be visible in the Toolbar. If a page has a high PageRank in the original sense, the influence of its BadRank can be negligible. But if another page links to it, this could have quite serious consequences.
An even bigger problem is the direct reversal of the PageRank algorithm as we have presented it here: just as an additional inbound link can do nothing but increase a page's PageRank, an additional outbound link can only increase its BadRank. This is because BadRank values are added up in the BadRank formula. So it does not matter how many "good" outbound links a page has - one link to a spam page can be enough to lead to a PR0.
This problem may, however, appear only in exceptional cases. In our direct reversal of the PageRank algorithm, the BadRank of a page is divided by the number of its inbound links before it is passed on, so a single link to a page with a high BadRank transfers only part of that BadRank. Google's Matt Cutts remarked on this issue: "If someone accidentally does a link to a bad site, that may not hurt them, but if they do twenty, that's a problem." (searchenginewatch.com/sereport/02/11-searchking.html)
However, as long as all links are weighted uniformly within the BadRank computation, there is another problem: if two pages that differ widely in PageRank both link to the same page with a high BadRank, the page with the higher PageRank may suffer far less from the transferred BadRank than the page with the low PageRank. We have to hope that Google knows how to deal with such problems. Nevertheless, it should be noted that, in the procedure presented here, outbound links can do nothing but harm.
Of course, all statements regarding how PR0 works are pure speculation. But, in principle, an analysis of link structures similar to the PageRank technique is probably the way Google deals with spam.
PageRank and Google are trademarks of Google Inc., Mountain View CA, USA. PageRank is protected by US Patent 6,285,999.
The content of this document may be reproduced on the web provided that a copyright notice is included and that there is a straight HTML hyperlink to the corresponding page at pr.efactory.de in direct context.
(c)2002/2003 eFactory GmbH & Co. KG Internet-Agentur - written by Markus Sobek

What Is the PageRank Algorithm?


The PageRank Algorithm
The original PageRank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.
So, first of all, we see that PageRank does not rank web sites as a whole, but is determined for each page individually. Further, the PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.
The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less page A benefits from a link to it on page T.
The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.
Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1. This reduces the extent to which a page benefits from another page linking to it.
The Random Surfer Model
In their publications, Lawrence Page and Sergey Brin give a very simple intuitive justification for the PageRank algorithm. They consider PageRank as a model of user behaviour, where a surfer clicks on links at random with no regard towards content.
The random surfer visits a web page with a certain probability which derives from the page's PageRank. The probability that the random surfer clicks on any particular link is given solely by the number of links on that page. This is why one page's PageRank is not completely passed on to a page it links to, but is divided by the number of links on the page.
So, the probability for the random surfer reaching one page is the sum of probabilities for the random surfer following links to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model, therefore, is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random.
The probability that the random surfer does not stop clicking on links is given by the damping factor d, which, being a probability, is set between 0 and 1. The higher d is, the more likely the random surfer is to keep clicking links. Since the surfer jumps to another page at random after he stops clicking links, the probability of such a jump is implemented as the constant (1-d) in the algorithm. Regardless of inbound links, the probability of the random surfer jumping to a page is always (1-d), so a page always has a minimum PageRank.
A Different Notation of the PageRank Algorithm
Lawrence Page and Sergey Brin have published two different versions of their PageRank algorithm in different papers. In the second version of the algorithm, the PageRank of page A is given as
PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where N is the total number of all pages on the web. The second version of the algorithm, indeed, does not differ fundamentally from the first one. Regarding the Random Surfer Model, the second version's PageRank of a page is the actual probability for a surfer reaching that page after clicking on many links. The PageRanks then form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.
In contrast, in the first version of the algorithm the probability for the random surfer reaching a page is weighted by the total number of web pages. So, in this version, PageRank is the expected number of visits the random surfer makes to a page if he restarts the procedure as often as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page on average twice if he restarted the procedure 100 times.
As mentioned above, the two versions of the algorithm do not differ fundamentally from each other. A PageRank which has been calculated by using the second version of the algorithm has to be multiplied by the total number of web pages to get the corresponding PageRank that would have been calculated by using the first version. Even Page and Brin mixed up the two algorithm versions in their most popular paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", where they claim that the first version of the algorithm forms a probability distribution over web pages, with the sum of all pages' PageRanks being one.
In the following, we will use the first version of the algorithm. The reason is that PageRank calculations by means of this algorithm are easier to compute, because we can disregard the total number of web pages.
The Characteristics of PageRank
The characteristics of PageRank shall be illustrated by a small example.
We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of the damping factor d admittedly has effects on PageRank, but it does not influence the fundamental principles of PageRank. So, we get the following equations for the PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations can easily be solved. We get the following PageRank values for the single pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
It is obvious that the sum of all pages' PageRanks is 3 and thus equals the total number of web pages. As shown above this is not a specific result for our simple example.
For our simple three-page example it is easy to solve the according equation system to determine PageRank values. In practice, the web consists of billions of documents and it is not possible to find a solution by inspection.
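For the three-page example, though, the system can still be solved directly. The following sketch uses NumPy (merely one convenient choice, not anything prescribed by the original publications) to solve the three equations from above with d = 0.5:

import numpy as np

# Rearranged as a linear system M x = 0.5 for x = [PR(A), PR(B), PR(C)]:
#   PR(A)                 - 0.5 PR(C) = 0.5
#  -0.25 PR(A) +    PR(B)             = 0.5
#  -0.25 PR(A) - 0.5 PR(B) +    PR(C) = 0.5
M = np.array([
    [ 1.00,  0.0, -0.5],
    [-0.25,  1.0,  0.0],
    [-0.25, -0.5,  1.0],
])
pr = np.linalg.solve(M, np.full(3, 0.5))
print(pr)        # approximately [1.07692308, 0.76923077, 1.15384615], i.e. 14/13, 10/13, 15/13
print(pr.sum())  # approximately 3.0, the total number of pages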
The Iterative Computation of PageRank
Because of the size of the actual web, the Google search engine uses an approximate, iterative computation of PageRank values. This means that each page is assigned an initial starting value and the PageRanks of all pages are then calculated in several computation cycles based on the equations determined by the PageRank algorithm. The iterative calculation shall again be illustrated by our three-page example, whereby each page is assigned a starting PageRank value of 1.
Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
We see that we get a good approximation of the real PageRank values after only a few iterations. According to publications of Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.
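The iteration table above can be reproduced with a short sketch. Note that the numbers in the table come out as shown if the pages are updated one after another within each cycle, in the order A, B, C, with each update already using the values computed earlier in the same cycle. The function and its dictionary representation are, again, only an illustrative assumption:

def pagerank(inlinks, outcount, d=0.5, cycles=12):
    """PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) for the pages Ti linking to A."""
    pr = {page: 1.0 for page in inlinks}  # starting value 1 for every page
    print(0, pr)
    for i in range(1, cycles + 1):
        for page, sources in inlinks.items():  # pages updated in place, in order A, B, C
            pr[page] = (1 - d) + d * sum(pr[t] / outcount[t] for t in sources)
        print(i, {page: round(value, 8) for page, value in pr.items()})
    return pr

# A links to B and C, B links to C, C links to A.
inlinks = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}  # the pages linking to each page
outcount = {"A": 2, "B": 1, "C": 1}                  # C(T): number of outbound links per page
pagerank(inlinks, outcount)                          # prints the iteration table above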
Also, with the iterative calculation, the sum of all pages' PageRanks still converges to the total number of web pages. So the average PageRank of a web page is 1. The minimum PageRank of a page is given by (1-d). Therefore, there is a maximum PageRank for a page, which is given by dN+(1-d), where N is the total number of web pages. This maximum can theoretically occur only if all web pages link solely to one page, and this page also links solely to itself.
PageRank and Google are trademarks of Google Inc., Mountain View CA, USA. PageRank is protected by US Patent 6,285,999.
The content of this document may be reproduced on the web provided that a copyright notice is included and that there is a straight HTML hyperlink to the corresponding page at pr.efactory.de in direct context.

(c)2002/2003 eFactory GmbH & Co. KG Internet-Agentur - written by Markus Sobek