SEO Blog: 2012

Wednesday, October 3, 2012

Google crawlers

See which robots Google uses to crawl the web

"Crawler" is a generic term for any program (such as a robot or spider) used to automatically discover and scan websites by following links from one webpage to another. Google's main crawler is called Googlebot. This table lists information about the common Google crawlers you may see in your referrer logs, and how they should be specified in robots.txt, the robots meta tags, and the X-Robots-Tag HTTP directives.

Here several user-agents are recognized in the robots.txt file, Google will follow the most specific. If you want all of Google to be able to crawl your pages, you don't need a robots.txt file at all. If you want to block or allow all of Google's crawlers from accessing some of your content, you can do this by specifying Googlebot as the user-agent. For example, if you want all your pages to appear in Google search, and if you want AdSense ads to appear on your pages, you don't need a robots.txt file. Similarly, if you want to block some pages from Google altogether, blocking the user-agent Googlebot will also block all Google's other user-agents.robots.txt

But if you want more fine-grained control, you can get more specific. For example, you might want all your pages to appear in Google Search, but you don't want images in your personal directory to be crawled. In this case, use robots.txt to disallow the user-agent Googlebot-image from crawling the files in your /personal directory (while allowing Googlebot to crawl all files), like this:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /personal

To take another example, say that you want ads on all your pages, but you don't want those pages to appear in Google Search. Here, you'd block Googlebot, but allow Mediapartners-Google, like this:

User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Disallow:

robots meta tag

Some pages use multiple robots meta tags to specify directives for different crawlers, like this:

<meta name="robots" content="nofollow"><meta name="googlebot" content="noindex">

In this case, Google will use the sum of the negative directives, and Googlebot will follow both thenoindex and nofollow directives

Controlling how Google crawls and indexes your site.

Search engines generally go through two main stages to make content available for users in search results: crawling andindexing. Crawling is when search engine crawlers access publicly available webpages. In general, this involves looking at the webpages and following the links on those pages, just as a human user would. Indexing involves gathering together information about a page so that it can be made available ("served") through search results.

The distinction between crawling and indexing is critical. Confusion on this point is common and leads to webpages appearing or not appearing in search results. Note that a page may be crawled but not indexed; and, in rare cases, it may be indexed even if it hasn't been crawled. Additionally, in order to properly prevent indexing of a page, you must allow crawling or attempted crawling of the URL.

The methods described in this set of documents helps you control aspects of both crawling and indexing, so you can determine how you would prefer your content to be accessed by crawlers as well as how you would like your content to be presented to other users in search results.

In some situations, you may not want to allow crawlers to access areas of a server. This could be the case if accessing those pages uses the limited server resources, or if problems with the URL and linking structure would create an infinite number of URLs if all of them were to be followed.

In some cases it may be preferable to control how content is indexed and made available in search results. For instance, you may not want your pages to be indexed at all, or want them to appear without a snippet (summary of the page shown below the title in search results); or you may not want users to be able to view a cached version of the page.

Warning: Neither of these methods is suitable for controlling access to private content. If content should not be accessible by the general public, it's important that proper authentication mechanisms are in place. Our Help Center has more information on blocking Google from accessing or showing private content.

Note: Pages may be indexed despite never having been crawled: the two processes are independent of each other. If enough information is available about a page, and the page is deemed relevant to users, search engine algorithms may decide to include it in the search results despite never having had access to the content directly. That said, there are simple mechanisms such as robots meta tags to make sure that pages are not indexed.

Controlling crawling

The robots.txt file is a text file that allows you to specify how you would like your site to be crawled. Before crawling a website, crawlers will generally request the robots.txt file from the server. Within the robots.txt file, you can include sections for specific (or all) crawlers with instructions ("directives") that let them know which parts can or cannot be crawled.

Location of the robots.txt file

The robots.txt file must be located on the root of the website host that it should be valid for. For instance, in order to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located athttp://www.example.com/robots.txt. A robots.txt file can be placed on subdomains (likehttp://website.example.com/robots.txt) or on non-standard ports (http://example.com:8181/robots.txt), but it cannot be placed in a subdirectory (http://example.com/pages/robots.txt). There are more details regarding the location in the specifications.

Content of the robots.txt file

You can use almost any text editor to create a robots.txt file. The text editor should be able to create standard ASCII or UTF-8 text files; don't use a word processor (word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which may cause problems for crawlers). A general robots.txt file might look like this:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /onlygooglebot/

Sitemap: http://www.example.com/sitemap.xml

Assuming this file is located at http://example.com/robots.txt, it specifies the following directives:

No Googlebot crawlers should crawl the folder http://example.com/nogooglebot/ and all contained URLs. The line "User-agent: Googlebot" starts the section with directives for Googlebots.
No other crawlers should crawl the folder http://example.com/onlygooglebot/ and all contained URLs. The line "User-agent: *" starts the section for all crawlers not otherwise specified.
The site's Sitemap file is located at http://www.example.com/sitemap.xml

For more information, see the robots.txt specifications.

Some sample robots.txt files

These are some simple samples to help get started with the robots.txt handling.

Allow crawling of all content

User-agent: *
Disallow:

User-agent: *
Allow: /

The sample above is valid, but in fact if you want all your content to be crawled, you don't need a robots.txt file at all (and we recommend that you don't use one). If you don't have a robots.txt file, verify that your hoster returns a proper 404 "Not found" HTTP result code when the URL is requested.

Disallow crawling of the whole website

User-agent: *
Disallow: /

Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.

Disallow crawling of certain parts of the website

User-agent: *
Disallow: /calendar/
Disallow: /junk/

Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.

Allowing access to a single crawler

User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /

Allowing access to all but a single crawler

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Disallow:

Controlling indexing and serving

Indexing can be controlled on a page-by-page basis using simple information that is sent with each page as it is crawled. For indexing control, you can use either:

a special meta tag that can be embedded in the top of HTML pages
a special HTTP header element that can be sent with all content served by the website

Note: Keep in mind that in order for a crawler to find a meta tag or HTTP header element, the crawler must be able to crawl the page—it cannot be disallowed from crawling with the robots.txt file.

Using the robots meta tag

The robots meta tag can be added to the top of a HTML page, in the <head> section, for instance:

<!DOCTYPE html>
<html><head>
<meta name="robots" value="noindex" />
...

In this example, robots meta tag is specifying that no search engines should index this particular page (noindex). The namerobots applies to all search engines. If you want to block or allow a specific search engine, you can specify a user-agent name in the place of robots.

For more information, see the robots meta tag specifications.

Using the X-Robots-Tag HTTP header

In some situations, non-HTML content (such as document files) can also be crawled and indexed by search engines. In these cases, it's not possible to add a meta tag to the individual pages—instead, an HTTP header element can be sent with the response. This header element is not directly visible to users as it's not a part of the content directly.

The X-Robots-Tag is included with the other HTTP header tags. You can see these by checking the HTTP headers, for example using "curl":

$ curl -I "http://www.google.com/support/forum/p/Webmasters/search?hl=en&q=test"
HTTP/1.1 200 OK
X-Robots-Tag: noindex
Content-Type: text/html; charset=UTF-8
(...)

For more information, see the X-Robots-Tag specifications.

Getting started

Most websites will not need to set up restrictions for crawling, indexing or serving, so getting started is simple: you don't have to do anything.

There's no need to modify your pages if you would like to have them indexed.

Wednesday, May 9, 2012

Video about pagination with rel=“next” and rel=“prev”

If you’re curious about the rel=”next” and rel=prev” for paginated content announcement we made several months ago, we filmed a video covering more of the basics of pagination to help answer your questions. Paginated content includes things like an article that spans several URLs/pages, or an e-commerce product category that spans multiple pages. With rel=”next” and rel=”prev” markup, you can provide a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page. Feel free to check out our presentation for more information:

This video on pagination covers the basics of rel=”next” and rel=”prev” and how it could be useful for your site.

Slides from the pagination video

Additional resources about pagination include:

Webmaster Central Blog post announcing support of rel=”next” and rel=”prev”
Webmaster Help Center article with more implementations of rel=”next” and rel=”prev”
Webmaster Forum thread with our answers to the community’s in-depth questions, such as:

Does rel=next/prev also work as a signal for only one page of the series (page 1 in most cases?) to be included in the search index? Or would noindex tags need to be present on page 2 and on?

When you implement rel="next" and rel="prev" on component pages of a series, we'll then consolidate the indexing properties from the component pages and attempt to direct users to the most relevant page/URL. This is typically the first page. There's no need to mark page 2 to n of the series with noindex unless you're sure that you don't want those pages to appear in search results.

Should I use the rel next/prev into [sic] the section of a blog even if the two contents are not strictly correlated (but they are just time-sequential)?

In regard to using rel=”next” and rel=”prev” for entries in your blog that “are not strictly correlated (but they are just time-sequential),” pagination markup likely isn’t the best use of your time -- time-sequential pages aren’t nearly as helpful to our indexing process as semantically related content, such as pagination on component pages in an article or category. It’s fine if you include the markup on your time-sequential pages, but please note that it’s not the most helpful use case.

We operate a real estate rental website. Our files display results based on numerous parameters that affect the order and the specific results that display. Examples of such parameters are “page number”, “records per page”, “sorting” and “area selection”...

It sounds like your real estate rental site encounters many of the same issues that e-commerce sites face... Here are some ideas on your situation:

1. It’s great that you are using the Webmaster Tools URL parameters feature to more efficiently crawl your site.

2. It’s possible that your site can form a rel=”next” and rel=”prev” sequence with no parameters (or with default parameter values). It’s also possible to form parallel pagination sequences when users select certain parameters, such as a sequence of pages where there are 15 records and a separate sequence when a user selects 30 records. Paginating component pages, even with parameters, helps us more accurately index your content.

3. While it’s fine to set rel=”canonical” from a component URL to a single view-all page, setting the canonical to the first page of a parameter-less sequence is considered improper usage. We make no promises to honor this implementation of rel=”canonical.”

Remember that if you have paginated content, it’s fine to leave it as-is and not add rel=”next” and rel=”prev” markup at all. But if you’re interested in pagination markup as a strong hint for us to better understand your site, we hope these resources help answer your questions!

Google Search quality highlights: 52 changes for April 2012

We’ve had a zerg rush of 52+ launches this month in search. One of the big changes for me was our latest algorithm improvement to help you find more high-quality sites. But, that’s not all we’ve been up to. As you may recall, a couple months back we shared uncut video discussion of a spelling related change, and now that’s launched as well (see “More spell corrections for long queries”). Other highlights include changes in indexing, spelling, sitelinks, sports scores features and more. We even experimented with a couple more radical features, such as Really Advanced Search and Weather Control, but ultimately decided they were a little too foolish.

Here’s the real list for April:

Categorize paginated documents. [launch codename "Xirtam3", project codename "CategorizePaginatedDocuments"] Sometimes, search results can be dominated bydocuments from a paginated series. This change helps surface more diverse results in such cases.
More language-relevant navigational results. [launch codename "Raquel"] For navigational searches when the user types in a web address, such as [bol.com], we generally try to rank that web address at the top. However, this isn’t always the best answer. For example, bol.com is a Dutch page, but many users are actually searching in Portuguese and are looking for the Brazilian email service, http://www.bol.uol.com.br/. This change takes into account language to help return the most relevant navigational results.
Country identification for webpages. [launch codename "sudoku"] Location is an important signal we use to surface content more relevant to a particular country. For a while we’ve had systems designed to detect when a website, subdomain, or directory is relevant to a set of countries. This change extends the granularity of those systems to the page level for sites that host user generated content, meaning that some pages on a particular site can be considered relevant to France, while others might be considered relevant to Spain.
Anchors bug fix. [launch codename "Organochloride", project codename "Anchors"] This change fixed a bug related to our handling of anchors.
More domain diversity. [launch codename "Horde", project codename "Domain Crowding"] Sometimes search returns too many results from the same domain. This change helps surface content from a more diverse set of domains.
More local sites from organizations. [project codename "ImpOrgMap2"] This change makes it more likely you’ll find an organization website from your country (e.g. mexico.cnn.com for Mexico rather than cnn.com).
Improvements to local navigational searches. [launch codename "onebar-l"] For searches that include location terms, e.g. [dunston mint seattle] or [Vaso Azzurro Restaurant 94043], we are more likely to rank the local navigational homepages in the top position, even in cases where the navigational page does not mention the location.
Improvements to how search terms are scored in ranking. [launch codename "Bi02sw41"] One of the most fundamental signals used in search is whether and how your search terms appear on the pages you’re searching. This change improves the way those terms are scored.
Disable salience in snippets. [launch codename "DSS", project codename "Snippets"] This change updates our system for generating snippets to keep it consistent with other infrastructure improvements. It also simplifies and increases consistency in the snippet generation process.
More text from the beginning of the page in snippets. [launch codename "solar", project codename "Snippets"] This change makes it more likely we’ll show text from the beginning of a page in snippets when that text is particularly relevant.
Smoother ranking changes for fresh results. [launch codename "sep", project codename "Freshness"] We want to help you find the freshest results, particularly for searches with important new web content, such as breaking news topics. We try to promote content that appears to be fresh. This change applies a more granular classifier, leading to more nuanced changes in ranking based on freshness.
Improvement in a freshness signal. [launch codename "citron", project codename "Freshness"] This change is a minor improvement to one of the freshness signals which helps to better identify fresh documents.
No freshness boost for low-quality content. [launch codename “NoRot”, project codename “Freshness”] We have modified a classifier we use to promote fresh content to exclude fresh content identified as particularly low-quality.
Tweak to trigger behavior for Instant Previews. This change narrows the trigger area forInstant Previews so that you won’t see a preview until you hover and pause over the icon to the right of each search result. In the past the feature would trigger if you moused into a larger button area.
Sunrise and sunset search feature internationalization. [project codename "sunrise-i18n"] We’ve internationalized the sunrise and sunset search feature to 33 new languages, so now you can more easily plan an evening jog before dusk or set your alarm clock to watch the sunrise with a friend.
Improvements to currency conversion search feature in Turkish. [launch codename "kur", project codename "kur"] We launched improvements to the currency conversion search feature in Turkish. Try searching for [dolar kuru], [euro ne kadar], or [avro kaç para].
Improvements to news clustering for Serbian. [launch codename "serbian-5"] For news results, we generally try to cluster articles about the same story into groups. This change improves clustering in Serbian by better grouping articles written in Cyrillic and Latin. We also improved our use of “stemming” -- a technique that relies on the “stem” or root of a word.
Better query interpretation. This launch helps us better interpret the likely intention of your search query as suggested by your last few searches.
News universal results serving improvements. [launch codename "inhale"] This change streamlines the serving of news results on Google by shifting to a more unified system architecture.
UI improvements for breaking news topics. [launch codename "Smoothie", project codename "Smoothie"] We’ve improved the user interface for news results when you’re searching for a breaking news topic. You’ll often see a large image thumbnail alongside two fresh news results.
More comprehensive predictions for local queries. [project codename "Autocomplete"] This change improves the comprehensiveness of autocomplete predictions by expanding coverage for long-tail U.S. local search queries such as addresses or small businesses.
Improvements to triggering of public data search feature. [launch codename "Plunge_Local", project codename "DIVE"] This launch improves triggering for the public data search feature, broadening the range of queries that will return helpful population and unemployment data.
Adding Japanese and Korean to error page classifier. [launch codename "maniac4jars", project codename "Soft404"] We have signals designed to detect crypto 404 pages (also known as “soft 404s”), pages that return valid text to a browser, but the text only contains error messages, such as “Page not found.” It’s rare that a user will be looking for such a page, so it’s important we be able to detect them. This change extends a particular classifier to Japanese and Korean.
More efficient generation of alternative titles. [launch codename "HalfMarathon"] We use a variety of signals to generate titles in search results. This change makes the process more efficient, saving tremendous CPU resources without degrading quality.
More concise and/or informative titles. [launch codename "kebmo"] We look at a number of factors when deciding what to show for the title of a search result. This change means you’ll find more informative titles and/or more concise titles with the same information.
Fewer bad spell corrections internationally. [launch codename "Potage", project codename "Spelling"] When you search for [mango tea], we don't want to show spelling predictions like “Did you mean 'mint tea'?” We have algorithms designed to prevent these “bad spell corrections” and this change internationalizes one of those algorithms.
More spelling corrections globally and in more languages. [launch codename "pita", project codename "Autocomplete"] Sometimes autocomplete will correct your spelling before you’ve finished typing. We’ve been offering advanced spelling corrections in English, and recently we extended the comprehensiveness of this feature to cover more than 60 languages.
More spell corrections for long queries. [launch codename "caterpillar_new", project codename "Spelling"] We rolled out a change making it more likely that your query will get a spell correction even if it’s longer than ten terms. You can watch uncut footage of when we decided to launch this from our past blog post.
More comprehensive triggering of “showing results for” goes international. [launch codename "ifprdym", project codename "Spelling"] In some cases when you’ve misspelled a search, say [pnumatic], the results you find will actually be results for the corrected query, “pneumatic.” In the past, we haven’t always provided the explicit user interface to say, “Showing results for pneumatic” and the option to “Search instead for pnumatic.” We recently started showing the explicit “Showing results for” interface more often in these cases in English, and now we’re expanding that to new languages.
“Did you mean” suppression goes international. [launch codename "idymsup", project codename "Spelling"] Sometimes the “Did you mean?” spelling feature predicts spelling corrections that are accurate, but wouldn’t actually be helpful if clicked. For example, the results for the predicted correction of your search may be nearly identical to the results for your original search. In these cases, inviting you to refine your search isn’t helpful. This change first checks a spell prediction to see if it’s useful before presenting it to the user. This algorithm was already rolled out in English, but now we’ve expanded to new languages.
Spelling model refresh and quality improvements. We’ve refreshed spelling models and launched quality improvements in 27 languages.
Fewer autocomplete predictions leading to low-quality results. [launch codename "Queens5", project codename "Autocomplete"] We’ve rolled out a change designed to show fewer autocomplete predictions leading to low-quality results.
Improvements to SafeSearch for videos and images. [project codename "SafeSearch"] We’ve made improvements to our SafeSearch signals in videos and images mode, making it less likely you’ll see adult content when you aren’t looking for it.
Improved SafeSearch models. [launch codename "Squeezie", project codename "SafeSearch"] This change improves our classifier used to categorize pages for SafeSearch in 40+ languages.
Improvements to SafeSearch signals in Russian. [project codename "SafeSearch"] This change makes it less likely that you’ll see adult content in Russian when you aren’t looking for it.
Increase base index size by 15%. [project codename "Indexing"] The base search index is our main index for serving search results and every query that comes into Google is matched against this index. This change increases the number of documents served by that index by 15%. *Note: We’re constantly tuning the size of our different indexes and changes may not always appear in these blog posts.
New index tier. [launch codename "cantina", project codename "Indexing"] We keep our index in “tiers” where different documents are indexed at different rates depending on how relevant they are likely to be to users. This month we introduced an additional indexing tier to support continued comprehensiveness in search results.
Backend improvements in serving. [launch codename "Hedges", project codename "Benson"] We’ve rolled out some improvements to our serving systems making them less computationally expensive and massively simplifying code.
"Sub-sitelinks" in expanded sitelinks. [launch codename "thanksgiving"] This improvement digs deeper into megasitelinks by showing sub-sitelinks instead of the normal snippet.
Better ranking of expanded sitelinks. [project codename "Megasitelinks"] This change improves the ranking of megasitelinks by providing a minimum score for the sitelink based on a score for the same URL used in general ranking.
Sitelinks data refresh. [launch codename "Saralee-76"] Sitelinks (the links that appear beneath some search results and link deeper into the site) are generated in part by an offline process that analyzes site structure and other data to determine the most relevant links to show users. We’ve recently updated the data through our offline process. These updates happen frequently (on the order of weeks).
Less snippet duplication in expanded sitelinks. [project codename "Megasitelinks"] We’ve adopted a new technique to reduce duplication in the snippets of expanded sitelinks.
Movie showtimes search feature for mobile in China, Korea and Japan. We’ve expanded our movie showtimes feature for mobile to China, Korea and Japan.
MLB search feature. [launch codename "BallFour", project codename "Live Results"] As the MLB season began, we rolled out a new MLB search feature. Try searching for [sf giants score] or [mlb scores].
Spanish football (La Liga) search feature. This feature provides scores and information about teams playing in La Liga. Try searching for [barcelona fc] or [la liga].
Formula 1 racing search feature. [launch codename "CheckeredFlag"] This month we introduced a new search feature to help you find Formula 1 leaderboards and results. Try searching [formula 1] or [mark webber].
Tweaks to NHL search feature. We’ve improved the NHL search feature so it’s more likely to appear when relevant. Try searching for [nhl scores] or [capitals score].
Keyword stuffing classifier improvement. [project codename "Spam"] We have classifiers designed to detect when a website is keyword stuffing. This change made the keyword stuffing classifier better.
More authoritative results. We’ve tweaked a signal we use to surface more authoritative content.
Better HTML5 resource caching for mobile. We’ve improved caching of different components of the search results page, dramatically reducing latency in a number of cases.

And here are some other changes we’ve blogged about since last time:

What type of links should we be getting?

In the wake of Penguin, I’ve been getting asked this question A LOT. I think it’s finally time for my answer.

Penguin has changed everything – oh wait, no it hasn’t. All it did was penalize those for doing what they knew was wrong, but the only difference is that Google actually followed through with some of their threats on bad links.

So what SHOULD we be doing? We know what links we SHOULDN’T be pursuing, so it’s finally time to break down the links that you can get.

Non keyword rich anchor text for mid to low quality links

I’m not talking about an Xrumer blast, but for the lower quality links you can get, never go for anchor text. That era is over (see image above). Here are a few examples:

Directory links (Give me a minute before you bust my chops)

If you’re still going for directory links solely so you can get anchor text, then stop. Your goal for directories (stick to higher quality & niche related though) should be 1^st to make sure they get accepted into the directory and 2^nd that you can get branded anchor text, never keyword rich. You can get those elsewhere.

Dofollow blog comments

In limited numbers they work, and they work great, but once again, don’t go for exact anchors. Always use your name, because not only will they have a higher chance of getting accepted, but it’ll help diversify your anchor distribution.

Tumblr

You know you can still get easy followed links from Tumblr, right? I’m not going to reinvent the wheel, so go check out this video by Ana Hoffman (scroll down a little bit). To clarify: just use your name. I know you can get anchor text, but don’t (if you HAVE to, do partial). Also, don’t abuse this. You’re ratio of total links to linking root domains will start to look very unnatural very quick if you start getting a ton of these each day.

Easy content-led links

You hear a lot about content-based link building strategies, and a lot of them are vague and hard to actually put into play (not to mention completely hit or miss), so here’s a quick list of actionable content-based link building tips.

Research FAQs, build content around it, set up alerts, reference it when you enter the conversation

This one’s a little old in my book, but create content around certain questions that get asked a lot in your niche, create a thorough post on the topic, then set up Alerts so when it’s mentioned, you can swoop in & answer, as well as referencing your post either in the middle of your response or as a source at the end (i.e. Yahoo! Answers makes you cite the source at the end). They might be nofollow links in some cases, but seeing as they shouldn’t take you more than a few minutes to get each time, they’re worth it IMO.

Target natural linkers, build content that they share

Here’s the process for this strategy:

Identify those in your niche, or those in similar niches, that frequently link out freely.
Sort them by influence (for obvious reasons)
Build content around the topics that they most frequently share
Reach out to them via social channels or email (they’ll do the rest)

An easy identifier of those who link out freely are the ones who do weekly or monthly link roundup posts.

Also, I recommend (as you probably already know) to make sure you’re not just sending over a cold call email when you suggest the content. Make sure you develop those relationships, then start recommending that content. Sure, it might take a month or so, but if you do it right, you’ll be able to tap into these linkers time & time again.

Mega Broken Link Building

I call this “mega” because if done correctly, you might never run out of opportunities (depending on the size of your niche). Here we go:

Start with a list of 25 or so related blogs
Input them all into Screaming Frog, Xenu, or the IMJ Crawler (my new favorite)
Build a spreadsheet of all the broken links you find on those blogs (depending on the tool, you can segment & export)
Use the OSE API to sort the list of broken links by LRDs or Page Authority (to help find most linked to content)

You now have a sortable list of previously linked to content that no longer exists. The top of your list of broken links should have a few highly linked to pieces of content, so use the Wayback machine to find what was there, recreate it (but make it even better; even add other types of media like images, video, audio, etc), then reach out to everyone linking to the broken resource and ask for them to replace the old one with a link to you.

You can also do this for links pages:

Find 40-50 links pages
Find broken links using Check My Links or DHP
Copy & paste the broken URLs into a spreadsheet
Use the OSE API to sort them

Seeing as it was a links page, they’re probably broken homepages, so start out at the top of the list and go down until you find a website that’s completely broken. Put them into OSE and find their Top Pages, then recreate that content & reach out to the old linkers to replace it with a link to your new content.

Existing Opportunities

Reclaiming Twitter Mentions

If someone ever links to your Twitter page (i.e. twitter.com/pointblankseo), then do what Jon Cooper did hereby creating a page on your site for your Twitter account, then reach out to them and ask if they could switch out the Twitter.com link with one to the page you just created, it’s genius.

You can also do this with any other accounts you have online. Sure, they might not have a widget that Twitter has that makes it a quick & easy set up, but get creative (i.e. if they link to you in a forum, create a page on your site all about your activity and accomplishments on it, as well as a link at the top of the page to your account).

Brand Mentions

I know most of you’ve heard it before, but because it’s a quick set up & they’re practically free links, I’m hoping one person hears this for the first time:

Set up Google Alerts for your brand name
When it gets mentioned without a link to you, reach out to the webmaster/blogger and ask if they could please link

Super simple & super actionable.

Conclusion

I might have repurposed a lot of these strategies from past posts, but the point is that there are tons of ways to build links that are still OK in Google’s eyes, so if you stock up on these, you’ll never have to worry about algorithm updates again (actually, you’ll start looking forward to them!).

Thursday, March 29, 2012

Syllabus of SEO Training

Part I: Understanding SEO

Chapter 1: Search Engine Basics

What Is a Search Engine?
Anatomy of a Search Engine.
Query interface
Crawlers, spiders, and robots
Databases
Search algorithms
Retrieval and ranking
Characteristics of Search
Classifications of Search Engines
Primary search engines
Secondary search engines
Targeted search engines
Putting Search Engines to Work for You
Manipulating Search Engines

Chapter 2:

Creating an SEO Plan
Understanding Why You Need SEO
Setting SEO Goals
Creating Your SEO Plan
Prioritizing pages
Site assessment
Finishing the plan
Follow-up.
Understanding Organic SEO
Achieving Organic SEO
Web-site content
Google Analytics
Internal and external links
User experience
Site interactivity

Part II: SEO Strategy

Building Your Site for SEO .
Before You Build Your Site
Know your target
Page elements
Understanding Web-Site Optimization
Does hosting matter?
Domain-naming tips
Understanding usability
Components of an SEO-Friendly Page
Understanding entry and exit pages
Using powerful titles
Creating great content
Maximizing graphics
Problem Pages and Work-Arounds
Painful portals
Fussy frames
Cranky cookies
Programming Languages and SEO
JavaScript
Flash
Dynamic ASP
PHP .
Other Design Concerns
Domain cloaking
Duplicate content
Hidden pages
After Your Site Is Built
Beware of content thieves
Dealing with updates and site changes

Chapter 4: Keywords and Your Web Site .

The Importance of Keywords
Understanding Heuristics
Using Anchor Text
Picking the Right Keywords
What’s the Right Keyword Density?
Taking Advantage of Organic Keywords
Avoid Keyword Stuffing
More About Keyword Optimization

Chapter 5: Pay-per-Click and SEO

How Pay-per-Click Works .
Determining visitor value
Putting pay-per-click to work
Pay-per-Click Categories
Keyword pay-per-click programs
Product pay-per-click programs
Service pay-per-click programs
Understanding How PPC Affects SEO
Keyword Competitive Research
Keyword suggestion tools
Choosing Effective Keywords
Creating your first keyword list
Forbidden search terms and poison words
Forecasting search volumes
Finalizing your keyword list
Writing Ad Descriptions
Monitoring and Analyzing Results

Chapter 6: Maximizing Pay-per-Click Strategies .

Understanding Keyword Placement
Alt and Other Tags and Attributes
Title tags
Meta description tags
Anchor text
Header tag content
Body text
Alt tags
URLS and File Names

Chapter 7: Increasing Keyword Success . . .

Writing Keyword Advertisement Text
Create Great Landing Pages .
Understanding and Using A/B Testing
Avoiding Keyword Stuffing

Chapter 8: Understanding and Using Behavioral Targeting . .

What Is Behavioral Targeting?
Taking Advantage of Behavioral Targeting
Additional Behavioral Targeting Tips

Chapter 9:Managing Keyword and PPC Campaigns . . .

Keyword Budgeting
Understanding Bid Management
Manual bid management
Automated bid management
Tracking Keywords and Conversions
Reducing Pay-per-Click Costs
Managing PPC campaigns
Negative keywords .
Improving Click-Through Rates
The ROI of PPC

Chapter 10: Keyword Tools and Services . .

Google AdWords
Campaign management
Reports
Analytics
My Account
Print ads
Yahoo! Search Marketing
Dashboard
Campaigns
Reports
Administration
Microsoft adCenter
Campaign
Accounts & Billing
Research
Reports

Chapter 11: Tagging Your Web Site . .

What’s So Important About Site Tagging?
How Does Site Tagging Work?
Additional HTML Tags
No-follow
Strong and emphasis
No-frames
Table summary tag
Acronym and abbreviation tags
Virtual includes
Using Redirect Pages

Chapter 12: The Content Piece of the Puzzle

How Does Web-Site Content Affect SEO?
Elements of Competitive Content
To Use or Not? Duplicate Content
Stay Away from Search Engine Spam
Doorway pages
Hidden and tiny text
SEO over submission
Page jacking
Bait and switch
Cloaking
Hidden links
Considerations for Multi-Lingual Sites
Content Management Systems
When should you use CMS?
Choosing the right CMS
How CMS affects SEO
Understand and Use Viral Content

Chapter 13: Understanding the Role of Links and Linking . . .

How Links Affect SEO
How Links and Linking Work
Snagging inbound links
Creating outbound links
Taking advantage of cross-linking
The skinny on link farms
The Basics of Link Building
Using Internal Links
Judging the Effectiveness of Your Links
Part III: Optimizing Search Strategies

Chapter 14: Adding Your Site to Directories . . .

What Are Directories?
Submitting to directories
Major online directories
Paid vs. free directories
Geo-Targeting SEO Strategies
Using Submission Tools

Chapter 15: Pay-for-Inclusion Services .

When to Use Pay-for-Inclusion Services
Understanding the Business Model
Managing Paid Services
Hiring the Right Professionals
Contract Considerations
When the Relationship Isn’t Working

Chapter 16: Robots, Spiders, and Crawlers. . .

What Are Robots, Spiders, and Crawlers?
What’s the Robot Exclusion Standard?
Robots Meta Tag
Inclusion with XML Site Mapping
Creating your own XML site map
Submitting your site map

Chapter 17: The Truth About SEO Spam

What Constitutes SEO Spam?
Why Is SEO Spam a Bad Idea?
Avoiding SEO Spam

Chapter 18: Adding Social-Media Optimization . . .

What Is Social-Media Optimization?
What’s different about social-media optimization?
The Value of Social Media
Social-Media Strategies
Measuring Social-Media Optimization

Chapter 19: Automated Optimization

Should You Automate?
Automation Tools

Part IV: Maintaining SEO
Chapter 20: SEO Beyond The Launch

It’s Not Over
Using Content Management Systems
SEO Problems and Solutions
You’ve been banned!
Content scraping
Click fraud

Chapter 21:Analyzing Success

Analyzing SEO Successes
Managing SEO expectations
Find yourself
Analyzing web stats
Competitive Analysis
Conversion Analysis
Analyzing Server Logs

Optimization for Major Search Engines . . .
Optimization for Google
Understanding Google Page Rank
Calculate the value of each page
Algorithm of Page Rank
Optimization for MSN
Optimization for Yahoo!
The Yahoo! Search Crawler

SEO Software, Tools, and Resources

Major Search Engines and Directories
Secondary Search Engines
Meta Search Engines
Keyword Tools
Content Resources
RSS Feeds and Applications
RSS Feeds optimization
XML Feed Creation
XML Feeds optimization
Search Engine Marketing Resources and Articles
Registration Services and Programs
Link Resources and Software
Pay-per-Click..
Social-Media Tools
SEO Plan ..
SEO Checklist .
Current State
Keyword Research
Web-Site Design
Write Clean Code
Make Use of Tags and Attributes
SEO-Approved Content
Manual Submissions
Linking Strategies
Conversions
Keyword Worksheet
PPC Keyword Worksheet
Keyword Checklist
Keyword Performances Worksheet
A/B Testing Worksheet
PPC Competition Worksheet
Link-Tracking Worksheet
Rank-Tracking Worksheet

Wednesday, October 3, 2012

Google crawlers

robots meta tag

Controlling how Google crawls and indexes your site.

Controlling crawling

Location of the robots.txt file

Content of the robots.txt file

Some sample robots.txt files

Allow crawling of all content

Disallow crawling of the whole website

Disallow crawling of certain parts of the website

Allowing access to a single crawler

Allowing access to all but a single crawler

Controlling indexing and serving

Using the robots meta tag

Using the X-Robots-Tag HTTP header

Getting started

Wednesday, May 9, 2012

Video about pagination with rel=“next” and rel=“prev”

Google Search quality highlights: 52 changes for April 2012

What type of links should we be getting?

Non keyword rich anchor text for mid to low quality links

Easy content-led links

Existing Opportunities

Conclusion

Thursday, March 29, 2012

Syllabus of SEO Training

The Official Google Blog

Inside Search

SEOmoz Daily SEO Blog

Search Engine Land: News & Info About SEO, PPC, SEM, Search Engines & Search Marketing

SEO Book.com

SEO Blog by Dave Naylor - SEO Tools, Tips & News

SEOmoz User Generated SEO Blog

Mihmorandum: Local SEO + Web Design

Matt Cutts: Gadgets, Google, and SEO

SEER Interactive

Blogstorm

Distilled - Monitoring your Reputation Online

Conversation Marketing: Internet Marketing with a Twist of Lemon