Caching and Expires

How does the Expires HTTP header work together with browser cache settings? I did some tests today because I thought I was going crazy. A customer of mine questioned my advice on installing mod_expires on their central server to avoid long load times for static content over high-latency connections. I tried to find a weblog on the subject, but could not find one. Most people seem to simply explain how mod_expires works, and that it should be installed. They do not explain how it cooperates with the different (Internet Explorer) browser cache settings. The customer specifically asked the question:

If the IE browser cache setting is set to 'check for newer versions of stored pages' 'every time I visit the page', then does IE actually do a conditional GET request of cached images, if the Expires HTTP header is set to a time in the future?

A good question. And hard to find the answer to. Well, the first question is of course: why on earth would you have this setting on? Starting with IE 5.5, the browser setting 'Automatic' has become so good and intelligent that you should not need it anymore. Well, it turns out this customer has legacy intranet applications that do not work correctly under 'Automatic'... at least that is what they tell me. This of course is a problem of those other applications (maybe an incorrect mimetype setting on dynamic pages? Generated javascript with a .js extension?), and probably this is either no longer true today, or can be fixed by putting an Apache in front of them. Whatever. The customer claims it is necessary to have all IE caches set to 'check every time'. We'll just have to deal with that.

I am assuming you all know how this works, of course. Please read this excellent tutorial on caching first. Let me try to recap.

If a browser requests a page that contains some images (without any Expires headers set), the first time
  • it will request that page with a GET HTTP request.
  • The server will respond with a 200 OK and the content of the page.
  • the browser reads the data (saves it to the browser cache), understands that it now needs to request the images too, and sends the additional GET requests for the images
  • the server responds with 200 OK responses and sends the bytes of the image
  • the client renders the page
If then the user goes again to that page (the same day, or in the same session), using 'automatic' (default) settings in IE, this is what happens:
  • the client sees that it already has that page in the cache. But maybe the page has changed since it was last accessed. It uses some logic to determine whether to ask the server. Images usually do not change very often, so if you request the same .html page again (it looks at the mime type and extension), it will not go back to the server, and simply serves from cache. Really fast.

So what happens if you go to the same page later, the next morning?
  • client sees it has the page already in the cache, but it may have changed. IE thinks: let's ask the server if it has changed.
  • So it sends a conditional GET request with an If-Modified-Since header
  • the server looks up the page on disk, checks the last-modified date, compares with the date in the request, and if nothing is changed, it responds with a 304 response: no; the file was not changed, please mr browser feel free to use your cached version. Note the server does not send the bytes!
  • the browser uses the locally cached version of the file and uses that to render
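The server side of that conditional GET can be sketched in a few lines of Java. This is only an illustration of the decision, not how Apache implements it; the class and method names are made up for this example, and it ignores ETags and other validators.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of any real server.
final class ConditionalGet {

    // Returns the HTTP status for a GET: 304 if the browser's copy is still
    // current, 200 if the full content has to be sent.
    static int statusFor(String ifModifiedSince, ZonedDateTime lastModified) {
        if (ifModifiedSince == null) {
            return 200; // unconditional GET: first visit, nothing cached yet
        }
        // HTTP dates look like "Fri, 05 Dec 2008 13:27:52 GMT"
        ZonedDateTime since = ZonedDateTime.parse(
                ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME);
        // File changed after the browser cached it: send the bytes again.
        return lastModified.isAfter(since) ? 200 : 304;
    }
}
```

Note that even the 304 answer costs a full round trip to the server; only the bytes of the file are saved.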
I used the excellent HttpWatch Basic plugin for IE and FF to see what happens.
Do you see the three 304 responses there? Browser cache works here to save bandwidth and byte loading time, but the request and response still have to travel to the server and back. Normally this is not a big issue, but if you're in China and the main server is in Europe, it can easily take up to 200ms to travel back and forth (that's called latency) and if your page has a lot of images (or js), this can easily cause your cached (!) page to load in seconds, not milliseconds. That is with cache on. Note you can buy bandwidth, but you cannot buy latency (it simply takes some time for light to travel across the earth, so until someone invents antimatter TCP/IP packets that travel faster than light, there is 'nothing' you can do about poor latency).
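To put a rough number on that: a back-of-the-envelope sketch, assuming every cached resource needs one conditional GET and the browser only keeps a few connections open at a time. The 30 resources and 2 parallel connections used in the note below are assumptions for illustration, not measurements; the class name is made up.

```java
// Hypothetical estimate; ignores server time, TCP setup and pipelining.
final class LatencyEstimate {

    // Time spent purely on round trips when revalidating cached resources.
    static long revalidationMillis(int resources, int parallelConnections,
                                   long roundTripMillis) {
        // Requests go out in "rounds" of at most parallelConnections at once.
        int rounds = (resources + parallelConnections - 1) / parallelConnections;
        return rounds * roundTripMillis;
    }
}
```

With 30 cached resources, 2 connections and a 200ms round trip, revalidationMillis(30, 2, 200) gives 3000, so three seconds of waiting without a single image byte being transferred.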

So what is the difference when you set IE cache to 'every time'? This:
(note that the reload of the root page from cache came from me traveling 'back' on the browser)

The result is: the server checks back if the file and images have changed every time you visit the page. Well isn't that magic... that's just what you set IE to do ;-)

But now. To fix the problem for IE automatic settings for loading static content, it is highly recommended (and incredibly easy to do) to set up the mod_expires module on Apache to set the expires date on the content. This saves bandwidth costs and frees up the webserver connections. This is what I changed in the httpd.conf file of Apache:

uncommented this line:

LoadModule expires_module modules/mod_expires.so

and added

ExpiresActive On
ExpiresDefault "access plus 1 week"

This is the web server saying: please assume that all static content will not change in the next week, no need to bother me with conditional GET requests all the time. Let's see what happens. The behaviour of a correctly set IE cache does not change that much; only if you revisit the page the next day, no more 304-requests are sent. Excellent. Load times of the page for users far away will be much better. The response of the server for an image now becomes:

Date: Thu, 23 Jul 2009 08:43:03 GMT
Server: Apache/2.2.8 (Win32)
Last-Modified: Fri, 05 Dec 2008 13:27:52 GMT
Etag: "1700000005e472-56789-45d4ca4e2dbd5"
Accept-Ranges: bytes
Content-Length: 354185
Content-Type: image/jpeg
Expires: Thu, 30 Jul 2009 08:43:03 GMT
Cache-Control: max-age=604800

200 OK
instead of

Date: Thu, 23 Jul 2009 08:41:22 GMT
Server: Apache/2.2.8 (Win32)
Last-Modified: Fri, 05 Dec 2008 13:27:52 GMT
Etag: "1700000005e472-56789-45d4ca4e2dbd5"
Accept-Ranges: bytes
Content-Length: 354185
Content-Type: image/jpeg

200 OK
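The two extra headers follow directly from "access plus 1 week": Expires is the time of the request plus seven days, and max-age is the same period in seconds. A small Java sketch of that arithmetic (the class and method names are made up for this example; this is not mod_expires code):

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that mimics "access plus <lifetime>".
final class ExpiresHeaders {

    static final Duration ONE_WEEK = Duration.ofDays(7);

    // Value for the Expires header, given the Date header of the response.
    static String expires(String dateHeader, Duration lifetime) {
        ZonedDateTime accessed = ZonedDateTime.parse(
                dateHeader, DateTimeFormatter.RFC_1123_DATE_TIME);
        return DateTimeFormatter.RFC_1123_DATE_TIME.format(accessed.plus(lifetime));
    }

    // Value for the Cache-Control header: the same lifetime in seconds.
    static String cacheControl(Duration lifetime) {
        return "max-age=" + lifetime.getSeconds();
    }
}
```

Feeding in the Date header from the response above, expires("Thu, 23 Jul 2009 08:43:03 GMT", ONE_WEEK) gives "Thu, 30 Jul 2009 08:43:03 GMT" and cacheControl(ONE_WEEK) gives "max-age=604800", matching what Apache sent.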

But now the big question. What does IE do if you set it to 'every time'... will it actually check every time? Or will it listen to the server? The proof of the pudding is in the eating. Let's try it out:
So: IE does not check back to see if the page is changed, even if we set it to 'check every time'... well that is exactly what I would expect of course, because the server previously said so...

Conclusion: installing mod_expires is a no-brainer on all high traffic sites, especially if you have users that live farther than 20ms away. And it has a dramatic effect on page load times, even if you have to have IE cache settings set to check 'every time'.

Bouncy

Eliza is still alive! Turns out my little quick fix yesterday was not good enough. I mentioned I used a static singleton and recognized there was an issue if the server restarted? Well, it turns out this is just what happens on AppEngine; Google decides to restart the VM, probably once every 24 hours, so using VM statics for persistence is not very good. Yes, I know, I know. But I had very little time yesterday and tried to make a quick fix.

Well, I'll resolve it, I can even kill Eliza completely and remove her from AppEngine, but it is also interesting to see how the Google Wave community responds to this. This type of problem is exactly why Google opened up Wave in this alpha release... to see how it would work and be prepared for a global release, and to get developer buy-in of course.

Google just added a new robot called Bouncy (bouncy-wave@appspot.com) that can actually ban robots from a wave. I believe it has special rights that no other robot has, and I'm not sure how you can handle a bounce that should be undone, or is not correct according to the 'controllers' of a wave. Look here for a brief discussion about it on Wave and an actual bouncing of Eliza:

So I think even though Eliza can be irritating, she is actually helping to point out an issue with the open design of Google Wave: spambots like her. I'm certainly keeping a close eye on how 'we' as the wave community will find the best way to deal with this.
Look here for people making jokes about Eliza on the Google Wave daily office hours.

Shut up, Eliza!

Ha! Something happened that I had not anticipated. Remember I told you about the Eliza shrink robot I created a few weeks ago for the Google Wave?

Well, she is still alive, and used every day. However, because in the beta release of Wave it is not yet possible to remove participants from a Wave, a single funny user who adds Eliza to a public wave will cause an enormous amount of Shrink Spam in that wave, because she responds to every blip of every user to ask if he feels okay :-D

Well, upon request I changed it today, so that if you type in "bye", "quit" or "shut up", she stops in that particular wave. Well, until the next restart of her VM, that is (implemented by a simple static singleton HashSet of wave-ids for now).
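In outline, the quick fix looks something like the sketch below. It is a simplified reconstruction, not Eliza's actual code: the class and method names are made up, and the real robot gets its wave-id and blip text from the Wave robot API, which is left out here.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the robot's "shut up" bookkeeping.
final class MuteList {

    // Static, so it lives only as long as the VM: a restart forgets everything.
    private static final Set<String> MUTED_WAVES = new HashSet<>();

    private static final Set<String> STOP_PHRASES = Set.of("bye", "quit", "shut up");

    // Returns true if the robot may answer this blip in this wave.
    static synchronized boolean shouldRespond(String waveId, String blipText) {
        if (MUTED_WAVES.contains(waveId)) {
            return false; // already told to be quiet in this wave
        }
        String normalized = blipText.trim().toLowerCase(Locale.ROOT);
        if (STOP_PHRASES.contains(normalized)) {
            MUTED_WAVES.add(waveId); // remember it, but only in memory
            return false;
        }
        return true;
    }
}
```

The weak spot is visible in the first field: nothing is written to a datastore, so the set is empty again as soon as the VM is recycled.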


This is one of the lessons learned. An exit strategy is helpful. And the Internet is a place for spam of course. Still smiling... :-D

My first Android app: Shackr

One of the things I wanted to try out was programming for the Android platform... to see how easy it is to create native applications for my new HTC Magic smartphone. The SDK is based on Java, and comes with a nice Eclipse integration.

I decided to create a Flickr application, similar to my successful Flickr Wallpaper program. It allows you to set a tag or term (or series of tags), and it will use the handy Flickr API to return an XML document with the top 500 'most interesting' images for that tag. The interestingness is a nice computed feature that returns really nice images: it gets rid of photo spam and mostly surfaces professional-looking, beautiful photos (that people favorite, view or discuss a lot).

From this list, it will download (in thumbnail format, to save bandwidth) a random image. If you then shake the phone (I really wanted to try out some of these new input methods like shaking), it will display a different photo.

A first version was created in about 2 hours last Friday, showing how easy indeed it is to get it working. It took me a while to understand the way the view and widget IDs work together in Eclipse, but the end result after some more tweaking looks like this:
on the Android emulator of course, but similar on my actual phone. I could not get the shaking input working on the emulator, so for now it also works if you just click on the image.

I had some issues with displaying a progress indicator, the little icon in the title bar, because I was downloading the XML (which can take up to 15-20 seconds) in the main thread, and my call to

setProgressBarIndeterminateVisibility(true);

had no effect because the screen did not update. So I needed to do the download in a background thread. Sure, no problem, and thanks to Java's anonymous inner classes, that was done easily. But then: when the thread is done, it needs to signal the UI thread that the image has to be drawn again. And my initial implementation yielded an error where Android says that only the original thread can interfere with the graphic canvas. Sure, that is logical. How to do this? Fortunately, a nice signaling method is built in, using a Handler callback mechanism:

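// A Handler is bound to the thread that creates it (here: the UI thread)
final Handler mHandler = new Handler();

// Create runnable for posting; run() is executed on the Handler's thread
final Runnable mUpdateResults = new Runnable() {
    public void run() {
        selectPhoto();
    }
};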

And in the thread:

//post back
mHandler.post(mUpdateResults);

This does the trick just fine.

Another hurdle to take was that when I actually shook the phone, sometimes the screen would go blurry and after that, nothing was shown. This turned out to be the automatic orientation change of the Android: it thinks you are changing its orientation, and it will change the app from portrait to landscape mode. In fact, it turned out that the VM simply kills my Activity object and creates a new one! Again, a nice trick is available, using a method

@Override
public Object onRetainNonConfigurationInstance() {
    return photostate;
}

that allows you to pass on cached information to your future self.

I had fun, and the application sort of works. It still has some issues with double refreshing, and if you shake too much, it will download the XML too often in multiple threads, especially after you change the tags. I will clean that up; some people are beta testing now.

It was great fun, and I can feel the potential of being able to program for this new type of device, with new input devices and events... It is the end of the keyboard and mouse, the only input devices we have been using for the past 30 years...

I do have a great new idea for a game on Android. I am looking into what it would take to program that, and will keep you posted.

Nonsense generator

I created a little program called Monkeys a few months ago that can generate nonsensical documents based on word ngrams. It requires some sample text as input, and will generate new texts based on the word-by-word transition frequencies, so the text looks like normal text. It works best on large English business type documents, but this morning I tried to use some Blof lyrics as input. I've always found many of their lines to seem completely random and nonsensically deep, so it seems fitting.
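The idea behind such a generator fits in a few lines. The sketch below is not the Monkeys source, just a minimal word-bigram version of the same technique with made-up names: record which words follow which in the sample, then walk that table at random.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal word-bigram text generator (illustration only).
final class NonsenseSketch {

    // For every word, the list of words that followed it in the sample.
    // Duplicates are kept on purpose: frequent successors get picked more often.
    static Map<String, List<String>> buildTransitions(String sample) {
        Map<String, List<String>> transitions = new HashMap<>();
        String[] words = sample.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            transitions.computeIfAbsent(words[i], k -> new ArrayList<>())
                       .add(words[i + 1]);
        }
        return transitions;
    }

    // Random walk over the table, starting at the given word.
    static String generate(Map<String, List<String>> transitions, String start,
                           int maxWords, Random random) {
        StringBuilder out = new StringBuilder(start);
        String current = start;
        for (int i = 1; i < maxWords; i++) {
            List<String> successors = transitions.get(current);
            if (successors == null) {
                break; // dead end: this word never had a successor
            }
            current = successors.get(random.nextInt(successors.size()));
            out.append(' ').append(current);
        }
        return out.toString();
    }
}
```

Longer ngrams (keying on the last two or three words instead of one) give text that stays closer to the sample and reads more naturally.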

The monkeys came up with the following philosophical lyrics for the next Blof hit single "Altijd vanavond":

onder de wereld draait door
alles moet weg, alles moet weg
wat eerst prachtig was
en wat ik langzaam doodga
het geeft niet
mijn engel voor de stoep
bewaar niets in mijn engel
gedenkt de hemel
Altijd vanavond

Deep. Very deep.

to DDMMYY or MMDDYY, that's the question

Besides the little and big endian dispute in computer land, one of the largest problems with computers that have to cooperate across the world (i.e. the Atlantic) is the date format problem.

Americans have a completely stupid system of saying it is 7/3/2009 meaning July 3rd, so they use MM/DD/YYYY. This is stupid because it does not make sense: the faster changing day-of-the-month now sits in between the month and the year. It is like having a time of 06:00:15 meaning quarter past 6 (using HH:SS:MM). Putting the month first is not logical, as if it were more important or more 'key' than the day-of-the-month.
Fortunately in Europe we have all the ISO standards and in Holland 7/3/2009 would mean March 7th, 2009. Wow, I just noticed, the Americans do it with the MMM months too: March 7th 2009.

A better way is DD/MM/YYYY. But even better is

YYYY-MM-DD because that allows for alphabetical sorting to come out right as well... That is the programmer's way. And programmers rule. Or should rule at least.
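Both points are easy to show in Java. The sketch below (class and method names are made up for this example) parses the same string with the American and the European pattern, and sorts dates as plain YYYY-MM-DD strings.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration of the ambiguity and of why ISO dates sort correctly as text.
final class DateFormats {

    static final DateTimeFormatter US = DateTimeFormatter.ofPattern("M/d/uuuu");
    static final DateTimeFormatter EU = DateTimeFormatter.ofPattern("d/M/uuuu");

    static LocalDate parseUs(String text) {
        return LocalDate.parse(text, US);
    }

    static LocalDate parseEu(String text) {
        return LocalDate.parse(text, EU);
    }

    // LocalDate.toString() yields YYYY-MM-DD; a plain string sort then
    // gives chronological order without knowing anything about dates.
    static List<String> sortedIso(List<LocalDate> dates) {
        List<String> iso = new ArrayList<>();
        for (LocalDate date : dates) {
            iso.add(date.toString());
        }
        Collections.sort(iso);
        return iso;
    }
}
```

parseUs("7/3/2009") is July 3rd, parseEu("7/3/2009") is March 7th, and neither call fails, which is exactly why the mistake goes unnoticed until the day-of-the-month is larger than 12.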

Why am I telling you this? Well, today I had the nice task of entering my time and expenses for my trip to France last week in an American system. In the manual it says you should have your date format settings set to US, and I have had terrible problems with it when I did it wrong; it all looked okay on the screen, but any time entered in March for 200x-03-08 would end up in August of that year, and the day after that in September...

Wiser, I now knew what to do. I changed my settings from this to this:
And then it sort of works. Well, I still see this in my Internet Explorer:

With dashes? But still d-m-yyyy?

Whatever. It works. Sort of.
And I hope I get my money back.

It's a shame that we cannot get this to work correctly. Why aren't the Americans doing it the right way? They crashed a space vessel once already.
And while we're at it, why do the Americans have different timezones than we have, why can't everyone just work from 9 to 5 CET like ordinary people do ;-)

beyond tree structures

I'm currently working on optimizing my Linx scalable network engine based on Lucene. I believe it has some interesting features that no other engine has, and may be helpful if
  • you have a large (>1M nodes) network
  • that changes continuously
  • you have many queries that are more complex than just asking "who are your 1st level connections"
Anyway, I'm having great fun implementing this. Now I am working on a complex cloud merging algorithm using triple-sorted linked lists on disk (without requiring a cache or in-memory structure of all nodes). And racking my brain over it. No worries, I'll sort it out (literally ;-))

Reading interesting articles on the ways to store trees and networks or graphs in old fashioned relational databases:
We'll see. Initial tests of my engine are quite good. Looking for clients with large test networks to validate the scalability of Linx.

Chrome OS


Last week Google made a long-expected announcement (we were actually expecting it at Google I/O): they are entering the operating system market with an OS that consists of merely a browser: Google Chrome OS.

I think it was about time. The computer experience is more and more about what happens inside the browser, connected to the cloud of the Internet, and less about the apps that you have installed on your system itself. Sure, some apps are still better than their online counterparts: Paint Shop Pro, video processing (though did you see the HTML 5 online video tracing demo in client side Javascript?), programming IDEs like Eclipse, and some more. But the mainstream of what people do with computers (communication and text processing, information gathering) is done inside the browser. So why indeed do we have to pay a lot of money for a program that takes 4 hours to install, eats up half of my C drive and half of my RAM, has an enormous amount of features I do not need, is prone to viruses and gets slower over time no matter what I do, causing me to do a clean install every year?

It's a shame it takes until 2010 for it to be released. The new windowing system on top of Linux will hopefully end the stupid Linux graphic war. And what I think would be interesting is if the OS had a built-in BigTable implementation, and maybe even used some distributed computing or allowed you to share computation with other users.

Many people do not believe that the OS no longer matters. And it will continue to matter for a few years to come. But this is definitely the direction we'll go. And I want to take the ride.

Nice evaluation of open source search engines

Interesting article on the comparison of a couple of open source search engines, comparing indexing and search performance for a set of Twitter data.

Of course Lucene comes out as the fastest. And the indexing performance measurements are not really comparable because Lucene allows for incremental indexing (only indexing the changes or adds) and some of the other tools only allow you to do a full reindex in case something changes.