Book Meme

  • Grab the nearest book.
  • Open it to page 56.
  • Find the fifth sentence.
  • Post the text of the sentence in your blog along with these instructions.
  • Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.

So, my sentence is this:

"Spring Deer is almost like drinking flavored water; there's tons of flavor in a creamy and velvety package."

The book is Sake: A Modern Guide by Beau Timken and Sara Deseran (ISBN 0-8118-4960-0).

Debunking SEO

I've discussed previously how the SEO industry constructs its advice. Now I want to take the industry to task on actual advice I've received from SEO companies. SEO companies make claims that are hard to verify scientifically, because it's very difficult to isolate the causal factors behind changes in search result positions.

To validate these claims we could imagine a study in which we compare the rankings of two groups of websites that differ only in whether they implement a given SEO suggestion. If the recommendation really did affect ranking, we would expect to see a statistically significant improvement in search engine ranking for the group that implements it.

I don't believe this is possible. For one thing, the web is so complex and there are so many factors at play that it would be hard to extract a clear picture, so any results would be unlikely to be "statistically significant": any effect observed would be smaller than the experiment's margin of error. The results would be swamped by independent and much more important considerations like inbound links and accessibility. Nor can you get a good appreciation of how much a ranking is affected: you only see the order of results, not how much better one result is considered than the next, which statistically should widen the margins of error further.

I am skeptical about a lot of these things. I don't think I can disprove them given the doubts I've expressed above, but I do contest them. I believe they are unlikely and I believe SEO people believe them for invalid reasons.

Using meta keywords tags increases ranking for those keywords.

No major search engine uses meta keywords. It's far too easy to manipulate and does not reflect the actual page content.

Using meta keywords tags increases search engine traffic if the keywords also appear in the page content.

I doubt this would predict relevance well enough to be useful. Anyone could mirror the page content into the meta keywords and rank higher for it, while sites that omit the keywords would be penalised.

Using meta description tags increases search engine traffic.

The meta description tag appears in place of an excerpt from the content in several major search engines. This undoubtedly increases the apparent quality and apparent relevance of a site in a search engine's result pages, and that could persuade more people to click on a link.

Using keywords in <title> increases search engine ranking.

Unfortunately, I think this is very plausible. However, I frequently see page titles wrecked by keyword stuffing. There's a trade-off here between providing something that reflects the content of your site and making your site untidy and the search result listing unclear. For example, I would prefer to see

<title>Fireplaces - Mobstone Marble</title>

in search results to

<title>Mobstone Marble for cheap fireplaces, fireguards, hearths, gravestones and more - Call 01234 567890</title>

which is the sort of thing I've seen recommended by some SEO companies. I think there's an argument that as long as you're putting the terms "Fireplaces" and "Mobstone Marble" into the title, you've covered the relevant keywords for that page, plus the page is described clearly and unambiguously in search results.

Keywords in <h1> tags are more heavily weighted for relevance than keywords in <h2> tags and so on down to <h6> and any other tag.

This is definitely a good predictor of relevance, but it should be remembered that it can be generalised to all tags, not just <h1-6> tags. You could deduce a weighting scheme like this through statistical analysis of a corpus of HTML. In particular, you might find that <th> or <dt> or maybe even just nested <b><i> trumps <h6>.
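
To make that concrete, here's a rough sketch (in Python, with tag weights I've invented purely for illustration; no search engine publishes its real values) of how an indexer might weight terms by the most prominent tag containing them:

# Illustrative sketch only: invented tag weights, applied to an lxml document
TAG_WEIGHTS = {
   'title': 10.0, 'h1': 6.0, 'h2': 4.0, 'h3': 3.0, 'h4': 2.5,
   'h5': 2.0, 'h6': 1.8, 'th': 1.6, 'dt': 1.6, 'b': 1.3, 'i': 1.2,
}

def term_weights(doc):
   """Score each term by the heaviest tag it appears in."""
   weights = {}
   for element in doc.iter():
      w = TAG_WEIGHTS.get(element.tag, 1.0)
      for term in (element.text or '').lower().split():
         weights[term] = max(weights.get(term, 0.0), w)
   return weights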

Putting keywords into bigger <h1-6> tags increases ranking for those terms.

This is the kind of thing that non-programmers/non-statisticians assume is implied by the previous claim, but it is not. Tag weights certainly guide how a search engine apportions weight, but it would be fairly naïve to treat them as a simple boost in the rankings. Search engines strive to assess relevance from page content the way humans do, and bigger headings don't imply more relevance to humans. They catch your attention more, but you assess their relevance in a more holistic fashion.

Putting keywords into URLs makes a page appear more relevant for those keywords.

It's a fact that URLs are intended to be opaque: there's no reason to believe http://work-safe-images.org/racoon.png is not a JPEG image of a vagina. Humans don't treat them this way, of course. Filenames would be useless if they didn't help us to identify the content of a file. However, one problem with treating something that is defined to be opaque as a relevance signal is that you would get a significant rate of misprediction. If you take a URL of /animals/racoon.html as lending credence to an assessment that it's a page about racoons, what happens when you discover it's not a page about racoons at all? In short, a search engine must assess the relevance of a page based on the page itself. Since it has to do that, does it really get more information from the URL? Let's say URLs are relevant 70% of the time. Something that is wrong 30% of the time and otherwise merely confirms what you already know is pretty worthless. I think friendly URLs are good from a usability perspective, and they confer a certain element of quality as far as I'm concerned, but I don't think it's very plausible that they affect rankings.

Putting keywords in image URLs makes it appear more relevant for those keywords.

This is possibly a lot more plausible than with HTML. When spidering images you will really struggle to find enough information, and since we can't assess relevance directly from the image itself, the example above changes: URLs may still be wrong 30% of the time, but 70% of the time they carry information you could not otherwise find. Still, one way around this, if I were writing a search engine, would be simply not to index images where I cannot gather enough information to assess relevance.

Putting keywords in image URLs and alt tags will get images to appear in combined searches and thereby boost conversions.

Images that are likely to produce conversions don't appear in combined search results. Have a look on Google now, if you want, and convince yourself of that. I suspect you would not be able to find any image that promotes one specific vendor. There must be some heuristic which ensures that images in combined search results are vendor-neutral encyclopaedia-type images. Googling "Britney Spears" gets you pictures of Britney Spears. Googling "Asus eee 700" doesn't get you pictures. I suspect there's a reason for that.

Providing buttons to "Bookmark this page" boosts conversions.

It's obvious that users who bookmark pages come back more than those who don't, but I doubt a great many people use these buttons, unless they visit frequently enough to know where the site's "bookmark this page" button is better than they know where the star is in their own web browser. That kind of user doesn't need the encouragement to come back.

Opening external sites in new windows encourages people to return when they have finished reading an external page.

There's no doubt that if you can keep your site in a background window, it can allow visitors to pick up where they left off when they close the foreground window. However, the most heavily used navigational tool is the back button, not desktop windows, and opening a new window disables the ability to use the back button to return. Instead of closing windows to return to what they were doing, users fall into a pattern of piling up windows and then using "Close Group" from the Windows taskbar, or just closing a batch of windows at once. Which approach is best can be established by usability research, and on this issue usability analyst Jakob Nielsen is unambiguous: don't break the back button! Don't open new windows!

Absolute URLs are better than relative URLs.

Software can convert between relative and absolute URLs as necessary. This only affects broken software that needs to convert and doesn't. There is a lot of broken software in the world, but anything that's been tested against the wild wild web shouldn't fall into this trap. The amount of software that's broken in this way is negligible compared to the amount broken for hundreds of other reasons that you also need not support.

Image spidering in Python

I have several useful tools in Python for working with websites. Today I needed a script to report the images on a website, along with their corresponding alt tags. The script was extremely quick to write using the available tools, which makes it a fairly good example of how powerful Python is.

I have based this script on a pre-existing webspider class I have written:

import re
import sys       # used by the report methods below
import urllib2

from lxml.html import ElementSoup


class Spider(object):
   def __init__(self, base_url):
      self.base_url = base_url

   def pages(self):
      """Yield (path, document) for every HTML page found on the site."""
      queue = [self.base_url]
      seen = set(queue)

      while queue:
         url = queue.pop(0)
         f = urllib2.urlopen(url)
         if f.info().gettype() not in ['text/html', 'application/xhtml+xml']:
            continue
         doc = ElementSoup.parse(f)
         doc.make_links_absolute(url)

         # Queue any on-site links we haven't already seen
         for element, attribute, link, pos in doc.iterlinks():
            if not link.startswith(self.base_url):
               continue
            if element.tag == 'a' and attribute == 'href':
               l = re.sub(r'#.*$', '', link)  # strip any fragment identifier
               if l not in seen:
                  queue.append(l)
                  seen.add(l)

         path = url[len(self.base_url):]
         yield path, doc

This class effectively wraps a generator which yields every pair of path and web page it finds on the site. Generators are incredibly useful for keeping code simple without being memory hungry. It's easier to type yield than to build up a list of items, but in this case it's better than that: this code returns one parsed lxml document at a time, rather than reading and parsing them all up front.
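
Used on its own, it can be consumed like any other iterable (the URL here is just a placeholder):

# Hypothetical usage of the Spider class
spider = Spider('http://example.com/')
for path, doc in spider.pages():
   print path, doc.findtext('.//title')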

Generators encapsulate state as local variables, which generally means you don't even need to wrap them in a class like I've done. I only do this because I like to add functionality by subclassing. This may be a throwback to my days of programming Java.

It should be noted that most of the heavy lifting here is being done by lxml and BeautifulSoup. lxml.html makes it extremely easy to work with HTML. BeautifulSoup's excellent broken-HTML parser is used not because my HTML demands it, but to allow this one script to work with any site I want to use it with.

class ImageSpider(Spider):
   def images(self):
      """Yield (path, images) for each spidered page that contains images."""
      seen = set()
      for path, doc in self.pages():
         imgs = []
         for img in doc.findall('.//img'):
            src = img.get('src')
            alt = img.get('alt')
            title = img.get('title')
            i = (src, alt, title)
            if i not in seen:
               seen.add(i)
               imgs.append(i)

         if imgs:
            yield path, imgs

This is another generator that effectively filters the list of pages, yielding a list of images within each page. Generators calling generators is again very elegant. Each time the caller asks for the next page of images, ImageSpider will go back to the original Spider for a new page until it has one with images.

def text_report(self, out=sys.stdout):
   for path, imgs in self.images():
      print >>out, 'In', path
      for src, alt, title in imgs:
         print >>out, '- src:', src
         if alt is not None:
            print >>out, '  alt:', alt
         else:
            print >>out, '  alt is MISSING'
         if title is not None:
            print >>out, '  title:', title
      print >>out

Other methods of ImageSpider generate reports. Here I use the handy print chevrons to write to any file-like object. File-like objects are a particularly handy piece of duck typing. By default these methods will write to stdout, which is the same as printing normally, but you can pass in any other file-like object for very simple redirection.
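
For example, the same report can just as easily be captured in memory (again, the URL is only a placeholder):

# Hypothetical usage: any file-like object will do, such as a StringIO buffer
from StringIO import StringIO

buf = StringIO()
spider = ImageSpider('http://example.com/')
spider.text_report(buf)
report = buf.getvalue()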

def html_report(self, out=sys.stdout):
    from cgi import escape
    print >>out, """
    <html>
        <head>
            <title>Image Report for %(base_url)s</title>
        </head>
        <body>
            <h1>Image report for %(base_url)s</h1>
    """ % {'base_url': escape(self.base_url)}

    for path, imgs in self.images():
        print >>out, '\t\t<h2>%s</h2>' % escape(path).encode('utf8')
        for src, alt, title in imgs:
            idict = {'src': escape(unicode(src)).encode('utf8'),
                 'alt': escape(unicode(alt)).encode('utf8'),
                 'title': escape(unicode(title)).encode('utf8')}
            print >>out, '\t\t<img src="%(src)s" alt="%(alt)s">' % idict
            if alt is not None:
                print >>out, '\t\t<p><strong>alt:</strong> %(alt)s</p>' % idict
            else:
                print >>out, '\t\t<p><strong>alt is MISSING</strong></p>'
            if title is not None:
                print >>out, '\t\t<p><strong>title:</strong> %(title)s</p>' % idict
            print >>out
    print >>out, """
        </body>
    </html>
    """

Again, similar, but this method demonstrates a simple form of templating: the string formatting operator, %, allows you to retrieve values from a dictionary.
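
If you haven't seen it before, the dictionary form of % looks like this trivial, made-up example:

# Dictionary-based string formatting, as used in html_report() above
values = {'base_url': 'http://example.com/', 'count': 3}
print '%(count)d images found on %(base_url)s' % values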

Finally, there's the commandline interface to all this:

from optparse import OptionParser

op = OptionParser()
op.add_option('-f', '--format', choices=['text', 'html'])
op.add_option('-o', '--outfile')

options, args = op.parse_args()

if len(args) != 1:
   op.error('You must provide a site URL from which to spider images.')

s = ImageSpider(args[0])

if options.outfile:
   out = open(options.outfile, 'w')
else:
   out = sys.stdout

if options.format == 'html':
   s.html_report(out)
else:
   s.text_report(out)

In a few lines, the amazing optparse module turns a quick script into a flexible commandline tool.
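
A typical run (with a placeholder URL) looks something like this:

python siteimages.py --format html --outfile report.html http://example.com/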

Download the source: siteimages.py

Bubble Background Animation

I was pondering concepts for interesting web designs when the idea occurred to me that an animated bubble effect might lend a peaceful ambience to a webpage. I experimented with placing a Javascript-controlled SVG animation into the background of a page. You might like to judge for yourself whether this is successful or not (SVG-enabled browser required and a reasonably fast CPU recommended).

If you were around at the dawn of dynamic HTML you will probably have stumbled across amateur websites whose authors thought it was really rather stylish to add a Javascript snow or bubble effect over the top of the page content.

Fortunately, those days are gone. By and large, it seems that amateur webmasters today know that just a nice colour scheme and a consistent, simple style trump a jumble of styles, javascript effects and stock animated GIFs that we all remember too well. Nice graphic design is done for you if you just install a blog and browse existing themes. Some may not even remember effects like this (Warning: Not safe for work or indeed any other time you require functioning eyeballs).

It's well-known that animations draw the user's attention on webpages. That doesn't mean we always want to avoid them: sometimes we want to direct the user's attention in one direction or another, particularly when the page is being updated dynamically with Javascript. This is not one of those special cases. Since the goal of this experiment is to build a fully-animated webpage, we will have to ignore that inconvenient little fact, but it does suggest we should keep the animation as unobtrusive as possible. Keeping it nice and slow may help, and it should certainly be in the background and not the foreground.

SVG is useful for this kind of effect because it has a feature (<svg:use>) for manipulating independent clones of a symbol. It is therefore simple to draw the original shape using an SVG editor, and the Javascript merely needs to manage instances of the clones.

Using Inkscape, I drew up a bubble that looks like this:

[Image: Bubble]

There's a certain knack to drawing bubbles like this, of course. Air bubbles in water are colourless, but they are reflective due to total internal reflection. The amount of reflection increases with the angle of incidence, up to the critical angle, at which all light is reflected. At a water-air boundary the critical angle is 48.6°, and since sin 48.6° ≈ 0.75, the bubble should appear totally reflective from about 75% of its radius outwards.

If this sends you into a bit of a panic as you struggle to remember your school physics lessons, don't worry. I'm not recommending a mathematically accurate implementation of Fresnel's Equations. With a lot of art (not just on computers), an appreciation of the physics can go a long way towards adding realism. But a 100% accurate simulation is not necessary for an effect to seem convincing - trial and error is much easier. The gradient as I've drawn it is not accurate but looks alright. Similarly, bubbles have two specular highlights corresponding to the water-air boundary and the air-water boundary.

As an aside, one day it may be possible to depict fully reflective and refractive bubbles. Using SVG's incredible feDisplacementMap filter, you could distort the background using a pre-computed "lens" image. But that is unlikely to run at interactive speeds today, even if the filters required were fully and accurately supported, which they are not. The bubbles I've drawn are intended to be a compromise between rendering simplicity and attractiveness.

The bubble system (really just the SVG on its own) animates 20 clones of the bubble symbol. Again, this is based on some physical principles. The smaller bubbles are subject to less drag so have a higher terminal velocity, bubbles grow slightly as they rise and the pressure decreases and so on. One of the most effective things is that the bubbles drift with a random walk: they can randomly drift to one side or the other. They don't go straight up nor do they oscillate sinusoidally like the classic DynamicDrive script. For the most effective animation, bubbles would drift with the currents but this is simpler and reasonably effective.

I am quite pleased with the results. To really rid ourselves of the legacy of Javascript-animated GIF images, it would be important for this effect to tie in with the graphic design of the page, which I haven't shown.

I don't think this is realistically ready for production websites: Internet Explorer cannot display SVG, for one thing, and the intensive CPU requirement is also a problem. But I do think that sharp SVG graphics allow us to produce a wholly better standard of animation than what was possible before. With this, I think it's possible to make a bubble animation complement rather than detract from a web page.

SVG Buttons

With SVG filters, it's easier than ever to create stylish graphical buttons for the web.

Using images for buttons is a much more pragmatic approach than attempting to style buttons with CSS, at least until widespread support for CSS3's draft-but-stable border-image property is available.

Up until a couple of years ago, I had generally created buttons using a PHP script that glued them together:

[Image: Example of Add To Basket button]

This was useful when working with XSL, allowing me to simply call a template to include an arbitrary button text, rather than linking to a static button image.

Because I now use Django for most of my sites, this technique is no longer relevant. I'm no longer writing templates to transform an arbitrary XML model, but templates to render specific models, so I know when writing a template what buttons it will require. A typical button, designed for editing convenience, would look like this:

[Image: Example of Add To Basket button]

This button is a rounded rectangle with a gradient. The label is typed twice to give it a slightly inset look, so changing a button means retyping the label, but it still takes only a few seconds to change the text and adjust the width of the rectangle to fit.

Inkscape 0.46 provided access to a wide range of SVG filters, making the process even simpler. Buttons are now never more complicated than a rectangle, a label, and the SVG filters to make them look pretty and three-dimensional:

[Image: Example of View Products button]

Changing a button is as simple as it can be. Or is it?

I sometimes like to connect adjacent buttons into one strip, something which will be familiar to Mac OS X users:

[Image: Example of connected buttons]

SVG filters can make this a doddle too. By using SVG filters to create all of the graphical effects, including the rounded corners, these buttons can be dragged together and automatically connect with one another. The filter is applied to the layer, and the above buttons are editable simply as rectangles.

Try it: Download the SVG (Inkscape 0.46+ recommended).

Google Chrome

Over the past few months the web browser industry has shifted up a gear. After years of stagnation and limited diversity the market is blossoming, first with Safari opening up to the mass-market of Windows, then Opera 9.5, then Firefox 3 and now Google Chrome. And soon, Internet Explorer 8, which is verging on counting as a real browser (so tacit congratulations for Microsoft are probably in order).

If you look at the features promised by Google Chrome there's precedent for all of them:

  • One Box for everything - Opera 9.5 / Firefox 3
  • New Tab page - Opera 9.5
  • Application shortcuts - Mozilla Prism
  • Dynamic tabs - Opera 9.5
  • Crash control - Internet Explorer 8
  • Incognito mode - Stealther Firefox extension, InPrivate mode in IE8
  • Safe browsing - Firefox 2, IE7...
  • Instant bookmarks - Firefox 3
  • Simpler downloads - Ok, I don't know of a suitable precedent for this.

But that's not really the point. What Google have done is cherry-picked the features to adopt in order to paint a clear picture of the way they see the web developing (and they are probably right, not least because they are pouring money into making it develop that way).

I wrote a webrunner (to borrow terminology from Mozilla) way back in 2001 or thereabouts, embedding the Internet Explorer component CHTMLView in a chromeless frame. The merit of the site-focused approach was obvious then even though my implementation was mainly just an exercise in MFC, and it was merely for a forum I used to frequent. The point is that not all webpages are equal. People don't spend most of their web time "browsing": 90% of the time they are just logged into the same sites and applications. Chrome's user interface, architecture, and selection of features seem better prepared for this than any other browser on the market.

Google Chrome is a leap towards a much more minimal browser that knows you aren't running it so that you can use a browser, you're running it so that you can use websites. This web-centric future has been predicted so often, it comes as a shock to see that this is what it feels like, and that other web browser manufacturers have never taken the initiative to fully deliver on this promise.

SVG Goo

It's a well known computer graphics technique that blobby shapes can be drawn as the isosurface of a scalar field.

It's actually possible to create a similar effect using SVG filters:

[Image: Blobby, ketchuppy shapes]

The field is created using Gaussian-blurred circles. Where these soft edges overlap, the alpha channels are composited and this creates the necking effect which is key to blobby shapes like this.

The thresholding is done using a high-contrast filter on the alpha channel. The specular highlight was added just to emphasise the gooey, ketchup-y effect.
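
The same blur-and-threshold trick is easy to demonstrate in raster form. Here's a small numpy sketch of the principle (this is not the SVG filter itself, just an illustration of why summing soft edges and thresholding produces the necking):

# Raster illustration of the blur-and-threshold principle behind the SVG goo
import numpy as np

def goo(centres, size=(200, 200), radius=20.0, blur=8.0, threshold=0.5):
   yy, xx = np.mgrid[0:size[0], 0:size[1]]
   field = np.zeros(size)
   for cx, cy in centres:
      # A soft-edged disc stands in for a Gaussian-blurred circle
      d = np.hypot(xx - cx, yy - cy)
      field += np.clip((radius + blur - d) / (2.0 * blur), 0.0, 1.0)
   # Hard threshold of the summed field gives the blobby, necked outline
   return field > threshold

# e.g. goo([(80, 100), (120, 100)]) produces two blobs necking together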

If you have a copy of Inkscape, it's fun to play with dragging the circles. Feel free to download the SVG.

Seam Carving

I've just come across this awesome technique: seam carving, also known as content-aware image resizing.

One of the reasons designers like to use fixed rather than fluid website layouts is because of the difficulty in providing attractive images at unknown aspect ratios. This technique offers a really beautiful solution.

The presentation shows that all that is needed to apply the effect is an image plane containing the priority of each pixel: effectively, which pass of seam removal each pixel is to be eliminated in.
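
To see roughly what's going on under the hood, here is a minimal numpy sketch of removing a single low-energy vertical seam. This is only an illustration of the basic algorithm, not the authors' implementation, and it ignores the priority plane described above:

# A minimal sketch: remove one vertical seam from a greyscale image
# (a 2D float array). Illustrative only.
import numpy as np

def remove_seam(img):
   h, w = img.shape

   # Energy: gradient magnitude of the image
   dy, dx = np.gradient(img)
   energy = np.abs(dy) + np.abs(dx)

   # Dynamic programming: cheapest seam cost ending at each pixel
   cost = energy.copy()
   for y in range(1, h):
      left = np.roll(cost[y - 1], 1)
      right = np.roll(cost[y - 1], -1)
      left[0] = right[-1] = np.inf
      cost[y] += np.minimum(np.minimum(left, cost[y - 1]), right)

   # Trace the cheapest seam back up, deleting one pixel per row
   out = np.empty((h, w - 1), dtype=img.dtype)
   x = int(np.argmin(cost[-1]))
   for y in range(h - 1, -1, -1):
      out[y] = np.delete(img[y], x)
      if y > 0:
         lo, hi = max(x - 1, 0), min(x + 2, w)
         x = lo + int(np.argmin(cost[y - 1, lo:hi]))
   return out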

I demand that:

  1. There be a PNG extension chunk defined to encode this plane.
  2. Web browsers support this new chunk when non-proportionally scaling images.

Model Downcast in Django

Django has had multi-table inheritance for three months now, but I haven't used it because I can't picture many use cases for inheritance which don't rely on dynamic polymorphism.

In my current project, however, I've found a fairly convincing use case. I can do all of the basic listing and manipulation with the base class, and I only need the subclass for one particular operation.

Rather than writing this in a different way, which would be less intuitive, I pushed ahead with the conceptually simple approach and fought with Django to make the inheritance polymorphic. The way I did this was to brute-force it: check all of the relations for one that exists (wrapped up below as a downcast() helper).

from django.db import models

def downcast(inst):
   # inst is an instance of the base model
   cls = inst.__class__
   for r in cls._meta.get_all_related_objects():
      if not issubclass(r.model, cls) or \
            not isinstance(r.field, models.OneToOneField):
         continue
      try:
         return getattr(inst, r.get_accessor_name())
      except models.ObjectDoesNotExist:
         continue

There are faster ways of doing this if you do it at the database level (left joining all tables in the inheritance hierarchy and sorting out the mess into the correct subclasses), but this is simple and self-contained and forces me to think about when I need this. And looking at the query cache, the overhead is not too bad anyway.
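
To make the idea concrete, here is a hypothetical pair of models (not from my actual project) showing how the helper would be used:

# Hypothetical example models, purely to illustrate the downcast() helper
class Place(models.Model):
   name = models.CharField(max_length=50)

class Restaurant(Place):
   serves_pizza = models.BooleanField(default=False)

# Listing and manipulation happen on the base class as usual...
place = Place.objects.get(pk=1)
# ...and downcasting recovers the subclass instance when it is needed:
restaurant = downcast(place)  # a Restaurant, if this Place has one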

Social Calendaring

The current generation of social networks is based on the assumption that giving you reams of data about what other people are doing and have done really puts you in touch with them. As a user, I do get the impression that I am in touch with people, even though I may not actually be communicating with them. So we don't always bother to make stuff happen.

There are social networks that tell you where friends have been, some that tell you where they are, some that tell you what they are doing right now and some that tell you where they will be. But very few, it seems, that tell you where to be.

Taking Facebook as an example (as it's the only social network I'm using heavily at the moment), it does not make it easy to set up things to do. Creating events is a very laboured process. It takes perhaps 15 minutes to set up an event. It's an individual rather than a collaborative task. Invitations to events get ignored because of the way they are delivered. People get blanketed with invitations that they don't want. And I can't even set up an event until I've set up a group to arrange the event.

Social calendaring is not a new concept as the promise of electronic calendars has always included ease of scheduling, via e-mail invitations of some sort. Google Calendar and 30 Boxes represent the state of the art in this regard, which is simply calendar sharing and event invitations.

The thing social calendars really should address is scheduling of events, because I'm lazy and also busy and I always say things like "we really should ..." but it never happens.

My ideal social calendar could fulfil these user stories:

  • Find me something to join in with on any given day.
  • Schedule things with the knowledge of when I'm most likely to be free or busy, even when nothing is scheduled.
  • Arrange online games with my brother in Australia (7 hours ahead in summer or 9 hours in the winter).
  • Pick a date and time to do a thing I want to do, with friends who want to and who may have to travel to do it.
  • Remind me to book the venue for an event.
  • Book out an event I'm hosting.
  • Nail down the fuzziness inherent in saying something like "Let's have dinner on Thursday evening" so that we can say "Dinner at 8pm, and Alice will be joining us at around 10pm for drinks".
  • Suggest when to actually go to bed so that I can get up the next morning.
  • Pin-point exactly where an event is so that I can work out how to get there.
  • Don't keep trying to schedule things my skint friends can't afford to do.
  • Suggest things I might like to do.
  • Mildly favour a schedule where I can watch my favourite TV programmes.
  • Create entirely new groups of local people with similar tastes (say Buffy, or Linux) in such a way as to be actually kind of fun and neither awkward nor annoying.

As is perhaps evident, I'm a strong believer in heuristic tools that do really innovative stuff. What are the chances that all of the above are possible in a calendar application that doesn't automatically book me to go pole-dancing in Alaska moments before it has me watching Mork and Mindy with total strangers in my home on a Friday night? Hmm.