$Cambridge: hermes/doc/newsletter/2003-06-hermes.txt,v 1.3 2003/07/14 09:24:35 dpc22 Exp $

Hermes restructuring
====================

Executive summary
=================

Over the last twelve months a substantial amount of work has gone into overhauling the decade-old Hermes design and replacing the ageing hardware. The objective is a 100 MByte default quota for Hermes users and no hard limits on individual folders within that 100 MByte quota. (Individual mail messages will still be limited to 10 MBytes, at least for the time being).

The good news is that this work should be largely transparent to end users: user agents which have been correctly configured should not need to be changed at all. It is however important that people use the canonical names for Hermes services as advertised in Computing Service documentation:

  IMAP: imap.hermes.cam.ac.uk
  POP:  pop.hermes.cam.ac.uk
  SMTP: smtp.hermes.cam.ac.uk
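
For anyone who wants to confirm that these names are reachable from a given machine, here is a minimal Python sketch (purely illustrative, and not an official Computing Service tool) which connects to each canonical service on its default port:

    import imaplib, poplib, smtplib

    # Canonical Hermes service names, as advertised above.
    SERVICES = [
        ("IMAP", "imap.hermes.cam.ac.uk", imaplib.IMAP4),
        ("POP",  "pop.hermes.cam.ac.uk",  poplib.POP3),
        ("SMTP", "smtp.hermes.cam.ac.uk", smtplib.SMTP),
    ]

    for label, host, factory in SERVICES:
        try:
            factory(host)   # each constructor opens the connection itself
            print("%s: %s is reachable" % (label, host))
        except Exception as e:
            print("%s: %s FAILED (%s)" % (label, host, e))

A client configured with any other hostname should be corrected to use the names above.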

The past
========

The existing Hermes architecture is now rather dated as a message store. Conventional Berkeley format mailfolders were the only really sensible option for a Unix based mailstore when Hermes was established in 1993, but this format is increasingly a liability as flat text files perform very poorly with large messages and large numbers of messages in a single file.

The current Hermes system deals with a number of the most serious issues using locally developed index and cache files for each mail folder, which were introduced in 1999. Some update operations, specifically expunging messages from the start or middle of a mail folder rather than the end (common for POP user agents running in "leave mail on server" mode), are still very expensive for large folders. This is the reason for the current, very tight, limits on individual mail folders.

The existing physical hardware is also now getting rather old: the two NetApp F740 file servers that are the backbone of the current Hermes design were installed in 1999 and are reaching the end of their natural life.

The future
==========

The long term plan for a number of years has been to migrate from legacy Berkeley format mailfolders to a Cyrus mailstore. Cyrus is a well established open-source IMAP server platform which is used at many sites and has an active development community. Cyrus is a dedicated database design specifically optimised for IMAP and POP traffic. (It is a file per message store where each folder has an index and two associated cache files). It also provides safe concurrent access to mail folders, a feature sadly missing from Berkeley format folders.

The new service consists of large numbers of relatively small and cheap Intel-based rackmount servers (running Redhat Linux). Consequently, the failure of a single system will be much less disastrous than on the current system, which has two large fileservers which are single points of failure.

A small downside of running large numbers of relatively cheap systems is that individual failures are likely to be more common. Each of the servers will have a RAID disk array and redundant power supplies and cooling, which should increase the mean time between failures. The commodity nature of the hardware means that we will be able to have a spare system running as a warm standby: the solution to many simple hardware faults will simply be to move the disks from the affected system to the spare.

Compatibility Notes:
--------------------

The Cyrus mailstore does not provide a vacation log mechanism and there is no sensible way to add one. It does however provide a duplicate suppression system which does more or less the same thing without any user interface: only a single message will be sent to a particular correspondent in a given time frame.

PINE on Hermes no longer has direct access to mail folders and runs as an IMAP client. Consequently it prompts for a password to use against the IMAP mailstore. There is unfortunately no sensible, reliable way to pass on a password which may have already been quoted to the SSH, telnet or rlogin service.

The migration to a physically separate mail store within Hermes will mean that people will have to use the correct names for the various Hermes services: imap.hermes.cam.ac.uk, pop.hermes.cam.ac.uk and smtp.hermes.cam.ac.uk. Up until now IMAP and POP traffic to hermes.cam.ac.uk has worked simply because the same set of physical hardware was involved (this has never been supported or encouraged by the Computing Service). We will contact people who are currently using incorrect names in phases before they can be moved into the Cyrus world. It is also likely that we will introduce some infrastructure which allows us to lock down use of the old IMAP and POP servers to prevent accidental use of the wrong names.

Local Modifications
===================

A large amount of work has taken place in the last 12 months to modify the basic Cyrus distribution to meet our requirements on Hermes. These modifications fall into four broad classifications:

Compatibility:
--------------

The major obstacle that we have faced in moving from the UW server to Cyrus is that the two servers have very different ideas about folder namespaces. The UW server that we currently use on Hermes presents folders as a conventional filesystem hierarchy: it distinguishes between folders and directories and uses "/" as a Unix-style hierarchy separator, for example "~/mail/ARCHIVE/personal". In contrast the Cyrus server follows the presentation of Usenet news, where a mailbox can contain both messages and subsidiary mailboxes and the hierarchy delimiter is '.'. For example "user.dpc22.mail.ARCHIVE" could be a mailbox which contained a subsidiary mailbox "user.dpc22.mail.ARCHIVE.personal".

We have installed a compatibility layer so that clients configured to use the existing Unix namespace will continue to work without reconfiguration. We also distinguish between folders and directories, so that user agents such as PINE which struggle with dual-use mailboxes will continue to work without reconfiguration or problems.
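
To illustrate the idea (and only to illustrate it: the real compatibility layer lives inside the server, and details such as escaping literal dots in folder names are ignored here), the mapping between the two namespaces for an ordinary personal folder looks roughly like this in Python:

    def uw_to_cyrus(path, user):
        """Map a UW-style path such as "~/mail/ARCHIVE/personal" to a
        Cyrus mailbox name such as "user.dpc22.mail.ARCHIVE.personal"."""
        if path.startswith("~/"):
            path = path[2:]
        return "user." + user + "." + path.replace("/", ".")

    def cyrus_to_uw(name, user):
        """The reverse mapping, assuming a "user.<userid>." prefix."""
        prefix = "user." + user + "."
        if not name.startswith(prefix):
            raise ValueError("not a personal mailbox: " + name)
        return "~/" + name[len(prefix):].replace(".", "/")

    # For example:
    #   uw_to_cyrus("~/mail/ARCHIVE/personal", "dpc22")
    #       -> "user.dpc22.mail.ARCHIVE.personal"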

Performance:
------------

The Cyrus cache system has been expanded to cache most message headers. This is particularly important for IMAP clients such as PINE running on Hermes, which request large numbers of headers for each message when displaying lists of messages and would otherwise cause expensive cache misses. This does mean that the cache files are larger than on a vanilla Cyrus installation, but a separate optimisation means that the cache files are rewritten far less frequently, which is particularly important for POP clients running in "leave mail on server" mode.

A vanilla Cyrus installation implements folder renaming by creating a target folder and copying messages across one at a time, just in case the source and target folders live on separate disk partitions. This isn't going to happen in the foreseeable future on Hermes, so the slow copying process has been replaced by a fast rename operation. This will be particularly important at the start of each month, given the folder rotation system used by Mulberry IMAP clients and by PINE running on Hermes.

Data recovery
-------------

A vanilla Cyrus installation removes messages immediately when they are expunged and removes entire mail folders immediately when they are deleted. This compares unfavourably with the snapshot system which is available on Hermes and the CUS. Consequently we have introduced a "two phase expunge" system where messages and folders are not removed from disk immediately (though they are hidden from clients) and can be recovered by the system administrator. Eventually we hope to add options to the Webmail interface which will allow people to unexpunge messages and undelete folders without needing to get in touch with postmaster@hermes.cam.ac.uk.
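
The essence of two phase expunge, reduced to a file-per-message sketch in Python (a hypothetical on-disk layout for illustration only; the real mechanism is built into our modified Cyrus, and the retention period shown is an invented example):

    import os, time

    RETENTION = 7 * 24 * 3600   # hypothetical retention period: one week

    def expunge(path):
        # Phase one: hide the message from clients but keep its data,
        # so that the system administrator can still recover it.
        os.rename(path, path + ".expunged")

    def unexpunge(path):
        # Recovery is just the reverse rename.
        os.rename(path + ".expunged", path)

    def cleanup(spool):
        # Phase two: a periodic task removes expunged files for good
        # once they are older than the retention period.
        now = time.time()
        for name in os.listdir(spool):
            if not name.endswith(".expunged"):
                continue
            full = os.path.join(spool, name)
            if now - os.path.getmtime(full) > RETENTION:
                os.remove(full)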

Disaster recovery
-----------------

At the moment Hermes contains a pair of large file servers which are single points of failure: loss of either fileserver probably means total loss of service, as a single fileserver would be unable to cope with the load that we see in 2003. In addition, data is only dumped to backup media (relatively slow tape devices) once a day. Consequently, any disaster recovery process which involved installing new fileserver hardware and recovering from tape could take several days to complete, despite the very good support contracts that we have with Network Appliance.

The fact that our Cyrus mailstore will consist of large numbers of small systems reduces the impact of any single system failing. To help protect us further we have invented a data replication system for Cyrus: all data on one system is copied to completely separate hardware in a way that ensures that the database on the target system is always self-consistent, though it may be a few seconds (currently 15 seconds) out of date compared to the master copy. The long term hope is that we would be able to install half of Hermes in one machine room and half in a separate machine room, which would allow us to recover reasonably gracefully from fires and natural disasters which destroy an entire building.

Additional safeguards should be possible on PPSW to ensure that mail which has recently been delivered to a Cyrus machine which dies is redelivered to the replica system. This would rely on the duplicate suppression system in Cyrus to remove messages which have already been replicated. Messages uploaded to sent-mail folders represent a far more substantial challenge.

The same replication engine will be used for migrating users between different backend Cyrus systems for load balancing, and for regular nightly dumps to a backup spooling system for tape dumps: the second and third level backup systems in case a system and its replica fail simultaneously.

Quotas
======

The goal of this exercise is a mail system which has a 100 MByte default quota and no limits on individual mail folders within that 100 MByte quota. Larger quotas will be available on request. The limit on individual messages is currently 10 MBytes to match CUS; this will probably be increased when the new service is up and running.

Hardware
========

The first block of hardware has already been installed for testing purposes. A second block of hardware should arrive mid-July. Together this represents a total of 16 2U Intel servers for the live service, plus a single warm standby server. Each server will have 3 GBytes of RAM, hardware RAID and 6 x 72 GByte active disks (plus a hot spare), providing 350 GBytes of RAID 5 protected spool space. This represents the minimum comfortable level for a 25,000 user system with a 100 MByte default quota, given the overhead associated with a file per message mailstore which uses extensive indexing and caching for performance, and which has at least a factor of two overhead for the replication and two phase expunge subsystems. (A back of the envelope check of these figures appears at the end of this newsletter.)

It is possible that an additional block of 8 systems will be purchased next year, which would give us approximately 1,000 active users on each system, and 250 active users per inch of rack space in the Computing Service machine room.

We are currently in the process of specifying a backup spool system for tape backups when substantial quantities of data are stored in the new systems: this will probably be an Intel server running Linux attached to a large FCAL <-> IDE-RAID external disk array and an LTO tape stacker.

Timescales
==========

At the time of writing about 30 people have been moved to the new system in order to test a wide variety of mail user agents. This figure will increase in exponential steps over the summer and we hope to have substantial numbers of users by the start of the Michaelmas term. At the same time it is important that the Cyrus mailstore is properly tested before large numbers of users are moved: any complications may delay this timetable.

The good news is that the existing Hermes system doesn't appear to be in imminent danger of overload, so the only hurry here is the desire to provide a better service to our users.
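
Finally, the back of the envelope check promised in the Hardware section, using only the figures quoted above (a Python sketch; the real capacity planning naturally involves more detail than this):

    users    = 25000                        # target user community
    quota_mb = 100                          # default quota in MBytes
    raw_gb   = users * quota_mb / 1000.0    # ~2,500 GBytes of mail data
    overhead = 2                            # replication + two phase expunge
    needed_gb = raw_gb * overhead           # ~5,000 GBytes, before indexes
                                            # and caches are accounted for
    have_gb  = 16 * 350                     # 16 live servers x 350 GBytes

    print("need at least %d GBytes, have %d GBytes" % (needed_gb, have_gb))
    # -> need at least 5000 GBytes, have 5600 GBytes

The modest margin between 5,000 and 5,600 GBytes is why the Hardware section describes this as the minimum comfortable configuration rather than a generous one.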