A Primer on Web Caching

A lot of this article will apply to any network-based application, but since I'm a web developer, I'm going to put a web spin on things. Also, I couldn't help but get into compression a little while discussing browser caching.

Caching is a vital part of any high-traffic web app, especially one which sends a lot of data over the wire. I don't just mean communication between the browser and the web server, but really any traffic over the local network, or even traffic between different services on the same machine. Network requests and disk I/O are among the slowest operations an application performs, and we can sacrifice a little RAM to avoid repeating them.

In-Memory Data Caches

Storing data in memory is the fastest way to retrieve it. There are several different in-memory caching tools depending on your environment. If you are doing PHP development, APC can be used for storing data in memory. For more generic caching, there is memcached, which can be distributed across different machines. There's also a cool new tool called Redis, which exposes its data structures (strings, lists, sets, hashes) through a lower-level command API. Redis can also persist its data to disk, whereas memcached makes no guarantee that data will be kept (something I never considered a limitation).

Disk Caches

Sometimes other operations can be so slow that caching data to disk is acceptable, such as grabbing a list of recent tweets over HTTP. If you are developing an app using an MVC framework, the framework might provide a mechanism for caching partially rendered views to disk. Or, if you just want to cache data, you can use a tool like SQLite to write and query the data. You can always roll your own cache which writes serialized objects to text files if need be.

In general, disk caches are useful for making up for slow network requests, but using them to cache database requests probably won't provide much of an efficiency gain.
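As a rough illustration of rolling your own disk cache, here is a minimal Node.js sketch that stores a JSON-serialized result in a file and reuses it until it is older than a given age. The `CACHE_DIR` location, the `cachedOnDisk` function, and the `produce` callback are all hypothetical names invented for this example, and the slow operation is modeled as a synchronous function to keep the sketch short; a real HTTP request in Node would be asynchronous and need an async variant.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical cache directory; any writable location would do.
const CACHE_DIR = path.join(os.tmpdir(), 'primer-disk-cache');

// Return the cached value for `name` if it is younger than `maxAgeMs`,
// otherwise call `produce()` (the slow operation) and store its result.
function cachedOnDisk(name, maxAgeMs, produce) {
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  const file = path.join(CACHE_DIR, encodeURIComponent(name) + '.json');
  try {
    const ageMs = Date.now() - fs.statSync(file).mtimeMs;
    if (ageMs < maxAgeMs) {
      return JSON.parse(fs.readFileSync(file, 'utf8'));
    }
  } catch (err) {
    if (err.code !== 'ENOENT') throw err; // only "file missing" counts as a miss
  }
  const value = produce();
  fs.writeFileSync(file, JSON.stringify(value));
  return value;
}
```

Calling `cachedOnDisk('recent_tweets', 5 * 60 * 1000, fetchTweets)` twice in a row would run the hypothetical `fetchTweets` only once; the second call is served from the file.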

Cache DB Results in RAM

The most commonly cached type of data is what we retrieve from the database. Database queries can be expensive, and almost always involve disk I/O, so caching results can be extremely beneficial. Also, the database often isn't on the same machine as the app, and these caches can reduce network I/O. Here's some example code from the NeoInvoice project. The multicache class stores data in memory, using either memcache or APC.

<?php
/**
 * @param int $company_id The ID of the company
 * @return int The number of clients belonging to this company
 */
function get_total($company_id) {
    $key = "count_client_by_company:$company_id";
    $count = $this->multicache->get($key);
    // Assumes get() returns FALSE on a miss (as memcache and APC do).
    // The strict check lets a cached count of 0 still count as a hit.
    if ($count === FALSE) {
        $sql = "SELECT COUNT(*) AS count FROM client WHERE company_id = " . $this->db->escape($company_id);
        $query = $this->db->query($sql);
        $data = $query->row_array();
        $count = (int) $data['count'];
        $this->multicache->set($key, $count);
    }
    return (int) $count;
}

Caching DB results can be a little tricky though. When you update database data, you want to be able to clear the related data from the cache. In the example below, notice how I delete any data from the cache which could be related to the deleted entries:

<?php
/**
 * @param int $client_id The ID of the client to be deleted
 * @return bool True or False for Success or Failure of delete
 */
function delete($client_id) {
    $sql = "DELETE FROM client WHERE id = " . $this->db->escape($client_id) . " LIMIT 1";
    if ($this->db->simple_query($sql)) {
        $this->multicache->delete("client:$client_id");

        $company_id = $this->session->userdata("company_id");
        $this->multicache->delete("count_client_by_company:$company_id");
        $this->multicache->delete("list_client_by_company:$company_id");
        // Return a bool, as documented: TRUE only if a row was actually deleted
        return $this->db->affected_rows() > 0;
    } else {
        return FALSE;
    }
}

Notice how our keys have a common naming convention, so that we can remember how to clear out our cache in different locations.
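One way to keep the naming convention from drifting is to build every key in a single place. The following is a minimal JavaScript sketch of the same read and invalidate pattern as the two PHP examples, not the NeoInvoice code: the `keys` helper, the `Map` standing in for memcached or APC, and the `db` object standing in for the database are all hypothetical.

```javascript
// Hypothetical key builders: one place that knows the naming convention.
const keys = {
  client: (clientId) => `client:${clientId}`,
  countByCompany: (companyId) => `count_client_by_company:${companyId}`,
  listByCompany: (companyId) => `list_client_by_company:${companyId}`,
};

// Stand-in for memcached/APC: a plain in-process Map.
const cache = new Map();

// Stand-in for the database: an array of rows plus a query counter.
const db = {
  queries: 0,
  clients: [
    { id: 1, companyId: 7 },
    { id: 2, companyId: 7 },
  ],
  countClients(companyId) {
    this.queries += 1;
    return this.clients.filter((c) => c.companyId === companyId).length;
  },
  deleteClient(clientId) {
    this.queries += 1;
    const before = this.clients.length;
    this.clients = this.clients.filter((c) => c.id !== clientId);
    return this.clients.length < before;
  },
};

// Read path: check the cache first, fall back to the database on a miss.
function getTotal(companyId) {
  const key = keys.countByCompany(companyId);
  if (cache.has(key)) return cache.get(key); // a cached 0 is still a hit
  const count = db.countClients(companyId);
  cache.set(key, count);
  return count;
}

// Write path: change the database, then drop every key that may now be stale.
function deleteClient(clientId, companyId) {
  if (!db.deleteClient(clientId)) return false;
  cache.delete(keys.client(clientId));
  cache.delete(keys.countByCompany(companyId));
  cache.delete(keys.listByCompany(companyId));
  return true;
}
```

Because both the read path and the write path go through `keys`, a renamed key cannot be updated in one place and forgotten in the other.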

APC PHP File Caching

Another thing which can be cached, if you are doing PHP development, is the actual PHP files. Using the file caching features of APC (Alternative PHP Cache), you can cache your compiled PHP opcodes in RAM. The effect of doing this is twofold: your scripts don't need to be compiled with each execution, and disk I/O is reduced.

From here you have two options. In the first, each time a script is executed APC checks whether the file has been modified. If so, the cached copy is discarded and the file is read again; if not, the copy in RAM is executed. This is the easiest to set up, however there are still some disk reads with every script execution.

Alternatively, you can have APC cache files when they are first read and then keep them in RAM without checking each file's modification time. When the files do change, you'll want to tell APC to clear out the cache so that it is rebuilt as each script is loaded. If you are deploying PHP files via git pushes, you can set up a URL which git will hit with each push (using a post-receive hook on the server). This URL could be a simple one-line PHP script which clears the cache (seen below). Of course, this is a little more complex, and when your updated site doesn't act any differently, it might take a few more minutes of debugging before you figure out what went wrong.

<?php
apc_clear_cache(); # Clears cached PHP scripts
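The trade-off between the two modes can be illustrated with a small sketch. This is JavaScript rather than PHP and only imitates the idea; it is not how APC is implemented, and `createFileCache` and its `stat` option are hypothetical names that merely echo the `apc.stat` setting.

```javascript
const fs = require('node:fs');

// Sketch of the two modes: `stat: true` re-checks the file's modification
// time on every load; `stat: false` trusts the cache until clear() is called.
function createFileCache({ stat }) {
  const entries = new Map(); // path -> { mtimeMs, contents }
  return {
    load(file) {
      const hit = entries.get(file);
      if (hit && !stat) return hit.contents;          // no disk access at all
      const mtimeMs = fs.statSync(file).mtimeMs;      // one stat call per load
      if (hit && hit.mtimeMs === mtimeMs) return hit.contents;
      const contents = fs.readFileSync(file, 'utf8'); // miss or modified: re-read
      entries.set(file, { mtimeMs, contents });
      return contents;
    },
    clear() {
      entries.clear();
    },
  };
}
```

With `stat: false`, an edited file keeps returning its old contents until `clear()` is called, which is exactly the situation the cache-clearing URL is meant to handle.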

When working with the APC cache, you'll want to either blacklist or whitelist different directories. For example, with the NeoInvoice project, I had it set up so that the server administration scripts (e.g. phpMyAdmin) weren't cached, otherwise this would have increased the cache size in RAM dramatically. The NeoInvoice project itself, once every page had been hit, only used 48MB of RAM. Make sure your APC cache is bigger than what your project requires, otherwise it will have to evict cached files and you'll lose much of the benefit.

Here's an example of the APC cache config file as used by the old production NeoInvoice.com server:

apc.enabled=1      ; Enable APC
apc.shm_size="64M" ; Cache size
apc.stat=0         ; Don't check whether files were modified (default = 1 checks every time)

Check out the APC Configuration page for more settings.

Caching files in Browser

Once a file is served up to the client, we usually want them to keep it for a while, since there is no reason for them to download the same CSS file multiple times. We can do this by setting the cache expiration headers to some time in the future, chosen according to how soon you anticipate a change will be made. You could have your application code set these headers when rendering a page, but really, you want your web server to handle this stuff for you.

Here's a truncated example of the lighttpd.conf file used for the NeoInvoice project for handling different cache times for different directories:

server.modules = (
    "mod_expire"
)

expire.url = (
    "/css/"     => "access 1 days",
    "/scripts/" => "access 2 days",
    "/images/"  => "access 3 days"
)
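To make it concrete what an "access plus N days" rule means for a response, here is a JavaScript sketch that derives caching headers from a similar rule table. It is an illustration only, not lighttpd's implementation; the `rules` table and the `cacheHeadersFor` function are hypothetical names, and the exact headers a given server emits may differ.

```javascript
// Hypothetical rule table mirroring the expire.url block:
// URL prefix -> lifetime in days, counted from the time of access.
const rules = [
  ['/css/', 1],
  ['/scripts/', 2],
  ['/images/', 3],
];

// Build caching headers for a request path, measured from `now`.
function cacheHeadersFor(urlPath, now = new Date()) {
  const rule = rules.find(([prefix]) => urlPath.startsWith(prefix));
  if (!rule) return {}; // no rule: leave caching to the browser's defaults
  const maxAgeSeconds = rule[1] * 24 * 60 * 60;
  return {
    'Cache-Control': `max-age=${maxAgeSeconds}`,
    Expires: new Date(now.getTime() + maxAgeSeconds * 1000).toUTCString(),
  };
}

console.log(cacheHeadersFor('/css/site.css', new Date('2020-01-01T00:00:00Z')));
```

A stylesheet requested at midnight UTC on 1 January 2020 would get a `max-age` of 86400 seconds and an `Expires` date one day later, so the browser would not ask for it again within that window.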

Caching data in Browser

Your JavaScript objects can be cached in the browser using the localStorage API. This API allows you to store key/value pairs, where both the key and the value are strings. If you want to store an object using the API, you can first convert it into a JSON string and store that, then parse it again when retrieving it. localStorage is really useful for offline apps or as a general replacement for cookie data, but keep in mind older browsers don't support it.

You can even do some crazy stuff like cache CSS and HTML in there as well. I personally feel this might be a little too intense, but here's a link to an explanation if you're curious.

Keep in mind that browsers cap localStorage at roughly 5MB per origin (the exact limit varies by browser), and that this data doesn't get cleared away unless the user manually clears it, or you clear it. So if you find yourself caching a ton of data in there you may hit that limit, and you may be doing your user a disservice. Only store data which the user dynamically changes and which needs to be available to the client; don't go sticking just anything in there.
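The JSON round trip described above can be wrapped in a small helper. This is a minimal sketch: `createObjectStore` and `createMemoryStorage` are hypothetical names, and the in-memory stand-in exists only so the example also runs outside a browser. In a page you would pass `window.localStorage` instead.

```javascript
// Wrap any Storage-like object (getItem/setItem/removeItem) so callers can
// store plain objects instead of strings.
function createObjectStore(storage) {
  return {
    set(key, value) {
      storage.setItem(key, JSON.stringify(value));
    },
    get(key, fallback = null) {
      const raw = storage.getItem(key);
      if (raw === null) return fallback; // getItem returns null for a missing key
      try {
        return JSON.parse(raw);
      } catch (err) {
        return fallback; // the stored string was not valid JSON
      }
    },
    remove(key) {
      storage.removeItem(key);
    },
  };
}

// Minimal in-memory stand-in so the sketch also runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
    removeItem: (key) => { data.delete(key); },
  };
}

// In a browser this would be createObjectStore(window.localStorage).
const store = createObjectStore(createMemoryStorage());
store.set('draft_invoice', { client: 'Acme', total: 120.5 });
console.log(store.get('draft_invoice').total); // 120.5
```

Passing the storage object in, rather than reaching for the global, also makes it easy to fall back to something else in browsers that lack localStorage.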

Compressing data for Browser

Before sending data to the browser from the server, it can be gzip compressed. When doing this, there is some extra CPU overhead (not only for the server compressing the data but also for the client decompressing it), but the time spent on compression is usually dwarfed by the time it takes to send the data over the network. You don't want to worry about doing this yourself; have the web server do it for you.
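To get a feel for the savings, here is a small Node.js sketch using the built-in `zlib` module to gzip a text payload and restore it. The payload is a made-up, highly repetitive stylesheet, so the ratio it shows is more flattering than what real files would typically see.

```javascript
const zlib = require('node:zlib');

// A repetitive payload, standing in for a typical CSS or JavaScript file.
const original = Buffer.from('.button { color: #336699; padding: 4px 8px; }\n'.repeat(500));

const compressed = zlib.gzipSync(original);   // what the server would send
const restored = zlib.gunzipSync(compressed); // what the browser recovers

console.log(`original:   ${original.length} bytes`);
console.log(`compressed: ${compressed.length} bytes`);
console.log(`round trip intact: ${restored.equals(original)}`);
```

Text formats such as HTML, CSS, and JavaScript tend to compress well, while already-compressed formats like JPEG or PNG gain little, which is why the server is usually told which mime-types to compress.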

Here's a truncated example of the lighttpd.conf file from NeoInvoice. Notice how it needs to know a directory for storing the compressed files, which URLs to match, and which mime-types to check.

server.modules = (
    "mod_compress"
)

compress.cache-dir = "/var/cache/lighttpd/compress/"

$HTTP["url"] =~ "^/(scripts|css)/" {
    compress.filetype = (
        "text/plain",
        "text/html",
        "application/javascript",
        "text/javascript",
        "application/x-javascript",
        "text/css",
        "text/xml"
    )
}

For more info on lighttpd configuration, check out the NeoInvoice lighttpd configuration.

Thomas Hunter II

Thomas has contributed to dozens of enterprise Node.js services and has worked for a company dedicated to securing Node.js. He has spoken at several conferences on Node.js and JavaScript and is an O'Reilly published author.