Photodb Architecture

by Jin Choi, jsc@alum.mit.edu, and Philip Greenspun; November 2006 Rajeev Surati,
This is a draft specification for servers to support the photo sharing module of photo.net (http://www.photo.net/photodb/ is the URL for the currently running system).

Requirements

How Everyone Else Does It

[[there are many sites doing photodb and large amounts of disk today -- EMC is definitely not what these people use [raj]] It seems as though the standard approach to this is a fancy disk array from a company such as EMC or HP. The array would be loaded up with drives and would talk to multiple separate servers via SCSI or fiber channel. Data security would be provided via RAID 5 at a cost of 33 percent of capacity.

Problem with this approach: As far as we've seen, these disk arrays are extremely expensive and require expensive disk drives that have limited capacity. Good if you're running an RDBMS and need very fast drives and the data are precious. Not good for storing digital photos that might be seldom examined after being uploaded. Possible exception? "FATA" drives? A 500 GB FATA drive is $1300+ compared to $300 for an SATA drive of the same size. photo.net could only dream of being profitable enough to be an EMC customer...

Advantage of this approach: One big logical volume exported to the servers, the fancier disk arrays send email to the vendor when a disk drive fails, and a replacement appears

Benchmark Price

Amazon S3 charges 15 cents/month per GB ($1800/TB per year for reliable storage). If we are spending more than this on the hardware and sysadmin/management time, we should question our engineering talent!

If we buy 500 GB SATA drives at $300 each and use RAID 1, it will cost us about $1200 just for the disk drives to support 1 TB. So we can't spend too much more on the servers, etc., without busting through the $1800/TB/year limit. [[If disks last 3 years... then you have a budget of 5400.00]] [[how much does amazon charge for the bandwidth]]

Proposed Architecture

Our proposed architecture is a front-end load balancer that recognizes requests for photos and sends them directly to the computers whose disk drives hold the photodb. These machines could be running any Web server program, i.e., not necessarily AOLserver. This avoids the data going through the switch and through multiple servers more than necessary (compare to a system where the photodb is available via NFS to all the front-end servers). If a photodb machine fails, the load balancer should be smart enough to take the machine out of its rotation. If a disk drive fails within a photodb machine, that machine's internal monitors should be smart enough to shut down the HTTP server, which would thus cause the machine to be taken out of the load balancer's rotation. [[very unclear to me that the raw images can't use NFS thumbnails are a different question entirels]

It would be nice to have one big logical volume spread among all the hard drives of a photodb machine. I.e., if there are 6 500 GB drives, the computer should see one directory with 3 TB of free space. On the other hand, this might not scale well as we keep adding disks, on the assumption that the failure of one disk drive in a logical volume will require us to rebuild the entire volume. [[yes] but you can do this in a meta way if necessary]]

We want to have at least two computers inside our colo cage and one offsite in a friend's cage. The offsite machine can lag behind the live servers and be kept up to date with rsync or similar.

Given that we will have at least three copies of the data, it would be nice to avoid the overhead of RAID 5 within a photodb machine. If a hard drive fails, we take the machine offline and rebuild it within a day or two.

What would be nice is a "network RAID 1" file system running across the two live servers. An update to a directory on Machine A is immediately applied to the corresponding directory on Machine B. The file system shouldn't return to the application program until an update has been made on both sides of the mirror (i.e., on both computers). [[This is overkill, you can make two copies available, and make the servers smart betwixt themselves for photodb ids > then say the highets in the last 10 minutes]

Hardware Basis

It is fairly cheap to buy a 2U rackmount server with 8 SATA slots as the base unit for the photodb, filling 5 slots now, leaving 3 to be filled in the future when we need the capacity (and, presumably, when hard drive capacities will have grown).

Architecture Specifics

The drives will be mounted singly, without RAID or LVM, and the photodb directories will be allocated among them automatically, using symlinks to abstract away their physical location (we want to be able to recover from a drive failure by replacing it with another drive with same data from a backup. We want to rely on redundancy at the machine-level for fault tolerance, not at the drive-level). This restricts the straightforward growth of the photodb to the capacity of 8 drives (see next section), but that will give us 4-6 TB at current single drive sizes, more in the future.

We would like to do the file system replication (network RAID 1) with some sort of well-tested layer of software such as Andrew File System (AFS). The off-site backup could be done with rsync. If a user deletes a photo in the live cluster, we want that photo deleted in the offsite backup. If a disk drive dies in the live cluster, we do not want rsync noticing that a whole series of directories have disappeared and deleting those from the backup!

Growth Beyond 8 drives

We can replicate this entire architecture and program the load balancer to look carefully at the actual ID of the photo. These are allocated sequentially. We can say "if the request is for a photo with an ID > 3,000,000, send it to Photodb Cluster #2".

Criticism

This is beginning to sound like a sysadmin headache. Skilled tech people should have gotten cheaper after the dotcom bust, but instead what seems to have happened is that so many people left the field that it is no easier or cheaper to hire good tech help. [[this is needless posturing]]

Questions for Investigation

Does any network filesystem exist that would be suitable? We do not really care about remote-client access to the filesystem, as each server will have their own copy. We would be using the network filesystem only to provide a transparent replication facility. Possibilities:

What journaled file system do we run on each drive? ext3?