Photodb Architecture
by Jin Choi, jsc@alum.mit.edu, and Philip Greenspun; November 2006
Rajeev Surati,
This is a draft specification for servers to support the
photo sharing module of photo.net (http://www.photo.net/photodb/
is the URL for the currently running system).
Requirements
- Scale: We currently store approximately 700 GB of photos. Usage of
the site is growing and we are planning to allow people to upload larger
images (30" LCD monitors are down to $1200; these folks should have
something to fill their screens). We want to be able to scale at least
to 2 TB without buying any new hardware.
- Fault-tolerance: Failure of a single drive or server should not
bring down the photodb.
- Efficiency: data should not be copied needlessly, and should be
served as close to where it resides as possible. All servers hosting
data should be helping to serve that data, except any off-site backup
machines. [[why is this such a big deal]]
- Expandability: it should be easy to add capacity to the photodb.
- Quick to Restart: no fsck'ing! (implies a journaled file system)
How Everyone Else Does It
[[there are many sites doing photodb and large amounts of disk today --
EMC is definitely not what these people use [raj]]
It seems as though the standard approach to this is a fancy disk array
from a company such as EMC or HP. The array would be loaded up with
drives and would talk to multiple separate servers via SCSI or fiber
channel. Data security would be provided via RAID 5 at a cost of 33
percent of capacity.
Problem with this approach: As far as we've seen, these disk arrays are
extremely expensive and require expensive disk drives that have limited
capacity. Good if you're running an RDBMS and need very fast drives and
the data are precious. Not good for storing digital photos that might
be seldom examined after being uploaded. Possible exception? "FATA"
drives? A 500 GB FATA drive is $1300+ compared to $300 for an SATA
drive of the same size. photo.net could only dream of being profitable
enough to be an EMC customer...
Advantage of this approach: One big logical volume exported to the
servers, the fancier disk arrays send email to the vendor when a disk
drive fails, and a replacement appears
Benchmark Price
Amazon S3 charges 15 cents/month per GB ($1800/TB per year for reliable
storage). If we are spending more than this on the hardware and
sysadmin/management time, we should question our engineering talent!
If we buy 500 GB SATA drives at $300 each and use RAID 1, it will cost
us about $1200 just for the disk drives to support 1 TB. So we can't
spend too much more on the servers, etc., without busting through the
$1800/TB/year limit. [[If disks last 3 years... then you have a budget of 5400.00]] [[how much does amazon charge for the bandwidth]]
Proposed Architecture
Our proposed architecture is a front-end load balancer that recognizes
requests for photos and sends them directly to the computers whose disk
drives hold the photodb. These machines could be running any Web server
program, i.e., not necessarily AOLserver. This avoids the data going
through the switch and through multiple servers more than necessary
(compare to a system where the photodb is available via NFS to all the
front-end servers). If a photodb machine fails, the load balancer
should be smart enough to take the machine out of its rotation. If a
disk drive fails within a photodb machine, that machine's internal
monitors should be smart enough to shut down the HTTP server, which
would thus cause the machine to be taken out of the load balancer's
rotation. [[very unclear to me that the raw images can't use NFS thumbnails are a different question entirels]
It would be nice to have one big logical volume spread among all the
hard drives of a photodb machine. I.e., if there are 6 500 GB drives,
the computer should see one directory with 3 TB of free space. On the
other hand, this might not scale well as we keep adding disks, on the
assumption that the failure of one disk drive in a logical volume will
require us to rebuild the entire volume. [[yes] but you can do this in a meta way if necessary]]
We want to have at least two computers inside our colo cage and one
offsite in a friend's cage. The offsite machine can lag behind the
live servers and be kept up to date with rsync or similar.
Given that we will have at least three copies of the data, it would be
nice to avoid the overhead of RAID 5 within a photodb machine. If a
hard drive fails, we take the machine offline and rebuild it within a
day or two.
What would be nice is a "network RAID 1" file system running across the
two live servers. An update to a directory on Machine A is immediately
applied to the corresponding directory on Machine B. The file
system shouldn't return to the application program until an update has
been made on both sides of the mirror (i.e., on both computers).
[[This is overkill, you can make two copies available, and make the servers smart
betwixt themselves for photodb ids > then say the highets in the last 10 minutes]
Hardware Basis
It is fairly cheap to buy a 2U rackmount server with 8 SATA slots as the
base unit for the photodb, filling 5 slots now, leaving 3 to be filled
in the future when we need the capacity (and, presumably, when hard
drive capacities will have grown).
Architecture Specifics
The drives will be mounted singly, without RAID or LVM, and the photodb
directories will be allocated among them automatically, using symlinks
to abstract away their physical location (we want to be able to recover
from a drive failure by replacing it with another drive with same data
from a backup. We want to rely on redundancy at the machine-level for
fault tolerance, not at the drive-level). This restricts the
straightforward growth of the photodb to the capacity of 8 drives (see
next section), but that will give us 4-6 TB at current single drive
sizes, more in the future.
We would like to do the file system replication (network RAID 1) with
some sort of well-tested layer of software such as Andrew File System
(AFS). The off-site backup could be done with rsync. If a user deletes
a photo in the live cluster, we want that photo deleted in the offsite
backup. If a disk drive dies in the live cluster, we do not want rsync
noticing that a whole series of directories have disappeared and
deleting those from the backup!
Growth Beyond 8 drives
We can replicate this entire architecture and program the load balancer
to look carefully at the actual ID of the photo. These are allocated
sequentially. We can say "if the request is for a photo with an ID >
3,000,000, send it to Photodb Cluster #2".
Criticism
This is beginning to sound like a sysadmin headache. Skilled tech
people should have gotten cheaper after the dotcom bust, but instead
what seems to have happened is that so many people left the field that
it is no easier or cheaper to hire good tech help.
[[this is needless posturing]]
Questions for Investigation
Does any network filesystem exist that would be suitable? We do not
really care about remote-client access to the filesystem, as each
server will have their own copy. We would be using the network
filesystem only to provide a transparent replication facility. Possibilities:
- http://www.linux-ha.org/
talks about DRDB
- AFS: has replication
features, but may not be appropriate. Requires atomic updates of
all read-only volumes, and seems inappropriate for frequent
updates. AFS is also complex to set up and administer, and it stores
files in its own format.
- Coda:
supports read/write replication servers. Not very widespread. As of a
year ago, was unstable and not very active; status may have changed
since then?
- NFSv4: replication features not scheduled to be available in Linux until v. 2.6.19, according to this
- Userspace replication: manually transfer files to all servers. Not
transparent, subject to failure. Could be backed-up by a periodic
rsync job. Deletions could be manually controlled to omit offsite
backup, in case of runaway deletion.
What journaled file system do we run on each drive? ext3?