8 Mbits on the left lane
Web development, Google, PHP, Firefox, HD DVD, Canon HV20, HTML, NAS, satay, privacy
I've recently been working on optimizing my site for search engines, and more specifically on how to get the best possible URLs. Making your site user-friendly and accessible is a good way to rank high in search engines, and clean, well laid-out URLs will certainly help both your users and your ranking. After searching through the forums I learned a few tricks:
Dynamic URL: pages such as article.php?id=3 are properly indexed, but search engines seem to prefer static-looking ones like Article-3.html.
URL keywords: keywords in the URL do count, even more so if they are in the domain name itself (but nowadays all the good domains with valuable keywords are taken...). That's why most blogging platforms use the article title to build the URL, even though it makes the URL very hard to type manually.
URL depth counts: it seems search engines penalize pages that sit too deep in a directory structure. That was an issue for me because, like many people, I was using mod_rewrite to cleanly pass parameters to scripts (such as script.php/param1/param2/param3/ ).
Dashes are the word separator: another problem was that I was separating keywords in URLs with underscores. I went this route because it is easier to read in the address bar, but it turns out that Google treats a dash as a keyword separator, whereas an underscore is just another character.
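To make the keyword and dash points concrete, here is a minimal sketch of how a URL could be built from an article title. The function name slugFromTitle and the sample title are my own inventions for illustration, not something taken from a particular blogging platform, and the sketch only handles plain ASCII letters and digits.

```php
<?php
// Hypothetical helper (not part of any blogging platform): turn an article
// title into a dash-separated, lowercase URL slug.
function slugFromTitle(string $title): string
{
    // Lowercase the ASCII letters.
    $slug = strtolower($title);
    // Replace every run of characters other than a-z and 0-9 (spaces,
    // underscores, punctuation, non-ASCII bytes) with a single dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Drop any dash left at either end.
    return trim($slug, '-');
}

echo slugFromTitle('Search engines & clean_URLs: a few tricks'), ".html\n";
// prints "search-engines-clean-urls-a-few-tricks.html"
```

Keeping everything in one flat file name like this also avoids adding directory depth to the URL.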
Of course I could have just changed the URLs and been done with it, with the old ones returning a "404 Not Found" error. However, besides breaking external links, this would hurt ranking in search engines, because the new URLs would be treated as brand-new pages and might even be considered duplicate content.
The fix for this is a permanent redirect, which search engines properly understand as "the page you are looking for has moved, but is otherwise the same thing". mod_rewrite can do just that, using something like:
RewriteRule ^old_url\.html$ new-url.html [R=301,L]
You would obviously use regular expressions to handle multiple redirects within this single line. In my situation, regular expressions were not flexible enough for the rewrite I needed: I had to process the old URL in something like PHP to work out the new URL and do a proper redirect. I found two roads to get there. The first is the RewriteMap directive in mod_rewrite, which lets you use an external program of your choice to handle URL rewriting. The other was a bit simpler and involved changing this in the .htaccess file:
RewriteRule ^regexp_for_old_url$ fixit.php [L]
This silently hands all the old URLs over to fixit.php, a simple PHP script that does the rewriting work:
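<?php
// REQUEST_URI still holds the old URL the visitor asked for
$newURL = $_SERVER["REQUEST_URI"];
// Process $newURL here
// Send the permanent redirect, then stop so nothing else is output
header('HTTP/1.1 301 Moved Permanently');
header('Location: ' . $newURL);
die();
?>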
Job's done!
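For the curious, here is a rough sketch of what the "Process $newURL here" step might look like. It assumes the only difference between old and new URLs is that underscores become dashes in the path, which is just one possible scheme; the function name newUrlFor is made up for this example, and the 404 fallback is my own choice to avoid a redirect loop when nothing changes.

```php
<?php
// Hypothetical mapping from an old URL to its new form. Assumption: only
// the underscores in the path change; the query string is left untouched.
function newUrlFor(string $requestUri): string
{
    // Split off the query string so only the path is rewritten.
    $parts = explode('?', $requestUri, 2);
    $path  = str_replace('_', '-', $parts[0]);
    return isset($parts[1]) ? $path . '?' . $parts[1] : $path;
}

// Only act when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_URI'])) {
    $oldURL = $_SERVER['REQUEST_URI'];
    $newURL = newUrlFor($oldURL);
    if ($newURL !== $oldURL) {
        // The third argument of header() sets the 301 status code.
        header('Location: ' . $newURL, true, 301);
        exit;
    }
    // Nothing to rewrite: answer 404 rather than redirect to the same URL.
    http_response_code(404);
}
```

With this in place, a request for /my_old_page.html would be answered with a 301 pointing to /my-old-page.html.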
There's a great (and useful) post from Hamlet Batista about how Google handles the anchor text of incoming links. Apparently, in an attempt to get around so-called Google bombing, repetitive anchor text in incoming links is penalized. That means that to rank well for the word A, it is better to have incoming links with a mix of keywords ("A", "A B", "C A", etc.) than just "A" everywhere, which Google takes as a hint that someone is trying to manipulate its results.
I always thought that Google didn't take accents on letters into account (e.g. à is treated the same as a). Until today, when I realized that searching for "tai shogun" returns pages with "Taï Shogun" but not "Taï Shõgun" (to get those, you need to type in the õ, and then you only get pages with õ as a result).
I'm not sure what causes this behavior, but since most people will skip the accent when searching, I'm replacing every õ on my site with a plain o.
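If your pages are generated by PHP, the replacement can be done in one place. This is only a sketch: the function name plainO is invented for the example, the list of characters is an assumption (I added the macron ō, which is often used for the same word), and it expects both the script and the text to be UTF-8.

```php
<?php
// Hypothetical helper: swap accented o's for plain ones before output.
// The character list is an assumption; extend it to whatever your pages use.
function plainO(string $text): string
{
    // strtr with an array replaces whole UTF-8 sequences, so other
    // accented letters (like the ï below) are left alone.
    return strtr($text, [
        'õ' => 'o', 'Õ' => 'O',
        'ō' => 'o', 'Ō' => 'O',
    ]);
}

echo plainO('Taï Shõgun'), "\n";
// prints "Taï Shogun"
```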
I've long had an ambivalent position regarding Google: on one hand they are a great company, making great products (and free - it's hard to argue with that). On the other hand they are becoming a bit too important on the Internet. They not only run the most successful search engine, but they also have two important applications that appear on a lot of other people's web sites: their advertising program AdSense (visible on this very blog), and their web traffic analyzer Google Analytics. What this means is that not only do they know what you are searching for, but they also have the technical means to know what sites and pages you are viewing even when you are visiting someone else's web site.
While Google is known for its "don't be evil" policy, it's not exactly clear what data they are collecting (or not) from the combination of these services, and, if they are collecting it, in whose hands it might end up someday. In "My Life Without Google", someone decided to just not take the risk anymore and wrote about his life without Google. It's apparently not as easy as one might think.