$Cambridge: hermes/doc/talks/2003-11-cyrus/BrainDump,v 1.2 2003/11/19 09:36:39 dpc22 Exp $

Introduction
============

Overview of Talk
================

Introduction
  Summary of existing Hermes system
  Requirements for new system
  Possible directions

The Cyrus Mailstore
  History.
  Design.
  Mailbox structure. (XXX trim this down)

Local Customisations

Overview of the new system (XXX dump this if we run out of time)
  Functional Components
  Hardware
  Operating system issues.

Future directions.

Existing Hermes system
=======================

Diagram showing NetApps, Suns, maroon, PPSW

Explain from bottom up:
  NetApps
  Installed 1999.
  Webmail bolted on as afterthought :: complicated design.

2 x scale model of CUS, 25 x active users
  - differences: Menu system + index files

Note that PPSW separate from Hermes in this design.

Problems with design
====================

NFS efficiency versus NetApp cost.
NetApps are single points of failure

Problems with Unix format mail folders.
  SELECT cost (offset by index files)
  Lack of BODYSTRUCTURE/header cache.
  Expunge can be expensive
  POP leave mail on server for N days worst case.

Consequences: 10 MByte quota, 4 MByte message size limit

Hermes load
============

Active users in 7 day period:
  Webmail:   17000 (1700 concurrent, 3700 logins/hour)
  IMAP:       7800 (5000 concurrent sessions)
  POP:        6700
  Terminal:   5580 (500 concurrent)
  Total:     24700

Message volume
  Approx. 300k messages (550k deliveries) a day: 10 GBytes a day total volume

Message sizes:
  <  4096 bytes :: 76%
  <  8192 bytes :: 90%
  < 16384 bytes :: 93%
  < 32768 bytes :: 95%

Note: Hermes underwent exponential growth between 1996 and 2002
  Total concurrent users and messages/day now more or less stable.
  Remaining pressure is on message size, archive.

Objectives for new system
=========================

Larger quotas and message size limits.
  Quota: 100 Mbyte -> 1 Gbyte range on current hardware.

Better data recovery options
  Users are very good at throwing away their mail by accident.
Better disaster recovery options
  What happens if A301 burns down?
  Store messages on multiple independent hardware.
  Good recovery tools for corrupt database (e.g: software bugs).

Remove single points of failure.
  Reduce pain/stress of any single node failing.

Move from exotic hardware to commodity hardware
  (and Linux/FreeBSD rather than Solaris: kernel memory leak sob story)

Constraints:
  Cost.
  Manpower.

Not addressing:
  High availability
    Power. Cost. STOMITH complexity & amusing failure modes.

Possible directions
===================

Eliminating NFS (and expensive NetApps)
  Two approaches: IMAP proxy versus dpc22.hermes partitioning

Mailstore technology:
  UW variants:
    Unix + index
    MBX.
    Mythical Son of MBX.
  Maildir :: File/msg store without index/cache files.
    Great for POP and disconnected mode IMAP clients
    Poor for online IMAP clients which do not maintain local caches:
      PINE, Mulberry, Webmail clients.
  Cyrus: Maildir on steroids. Rest of this talk.

Blue Sky options:
  Store mail in SQL database.
    Abstraction for abstraction's sake: IMAP is all about sequences of
    messages, not relations between messages.
    None of the major (Unix) players do this.
    Recovery options?
  Home brew design:
    Did paper design for own IMAP server. Huge amount of work involved.
  Exchange
    Cost. Management. Unix sys. admins.

/* ====================================================================== */

Cyrus Design
============

Cyrus History
=============

Developed by CMU as replacement for their Andrew message system

1994  First public release.
      Cyrus IMAP client planned, later dropped.
      First (open source?) IMAP4 implementation

1997  First year full intake of CMU undergraduates went into Cyrus store.

Jan 2001: Cyrus 2.0 released
  Global Mailbox list moved from single flat file to Berkeley DB.
  Initial Cyrus Murder prototype design (refined during 2001, 2002)
    - Used ACAP (dead end, removed)
  Replaced Tcl management tools with Perl!
Jan 2002: Cyrus 2.1 released
  altnamespace and unixhiersep options (will describe later)
  Squat indexes.
  Skiplist database backend
  Current stable version: 2.1.15

Late 2003: Stable Cyrus 2.2 expected
  Virtual domain support
  NNTP <-> IMAP gateway
  Byte compiled Sieve files

Cyrus Design
============

Blackbox server: Only user access to mailstore is IMAP + POP.

File/msg store where each folder has two index files and a cache.
  (more detail later)

No (Unix) user accounts on the system:
  Everything runs as single UID, GID
  All files owned by this single user.

Single Instance store:
  Mail messages can be shared using simple hard links
  Large mailing lists e.g: ucam-itsupport: 50 links to single message.

Access Control and Quotas implemented by application not O/S.
  - Much Greater flexibility
  - Removes O/S security boundaries between users.

Excellent shared folder infrastructure
  - Any user can access any folder given suitable access permissions
  - Not visible yet on Hermes: need full Cyrus Murder or equivalent.
  - Each user has separate database of Seen flags.

Supports dual use mailboxes (mailboxes can contain submailboxes)
  - Hacked out at the moment for compatibility with PINE, Prayer.

Uses mmap() heavily for performance.

Apache style prefork process model:
  Individual IMAPD and POPD processes reused many times (for many users)
  Only a few thousand new processes each day on 1,000 user system.

Management Interface:
  Collection of Perl scripts which talk modified IMAP dialect.
  Very useful for small system. We use replication engine instead.

Namespaces
==========

Cyrus canonical namespace based on USENET news
  - User mailfolders stored under "user."
  - Shared mailfolders (or newsgroups) can be anywhere else.
  - Hierarchy separator is "."

All mailboxes "dual use", may contain subsidiary mailboxes.
  e.g: user.dpc22, user.dpc22.saved-messages, user.dpc22.sent-mail

Shortcuts:
  INBOX                --> user.dpc22
  INBOX.saved-messages --> user.dpc22.saved-messages.

Altnamespace and Unixhiersep introduced in 2001.
  - raise user folders to same level as inbox
  - Change hierarchy separator from "." to "/"
  - need escape mechanism to get at other users and shared folders
      Shared Folder/fred
      Other Users/dpc99/sent-mail

ACL system
==========

Each mailbox has its own Access Control List.

Example:
  user.dpc99 "dpc99 lrswipcda anonymous 0 "

  l  lookup - The user may see that the mailbox exists.
  r  read - The user may read the mailbox. The user may select the
     mailbox, fetch data, perform searches, and copy messages from the
     mailbox.
  s  seen - Keep per-user seen state. The "Seen" and "Recent" flags are
     preserved for the user.
  w  write - The user may modify flags and keywords other than "Seen"
     and "Deleted" (which are controlled by other sets of rights).
  i  insert - The user may insert new messages into the mailbox.
  p  post - The user may send mail to the submission address for the
     mailbox. This right differs from the "i" right in that the delivery
     system inserts trace information into submitted messages.
  c  create - The user may create new sub-mailboxes of the mailbox, or
     delete or rename the current mailbox.
  d  delete - The user may store the "Deleted" flag, and perform
     expunges.
  a  administer - The user may change the ACL on the mailbox.

  0  Finger Utility Daemon is allowed access.

Shared Folders
==============

Each mailbox can provide access to arbitrary number of users
  No netgroup style aliases available.

ACL manipulation done using IMAP ACL extension.
  - Mulberry one of the few user agents which supports this
  - will have to add support into Webmail interface :(

"admin" users have automatic "administer" ACL on all mailboxes.

Subsidiary mailboxes inherit the ACL of the parent
  Includes all mailboxes created under a given user inbox
  Admin user can override ACLs.

Cyrus Murder
============

Required to implement shared mailboxes in cluster configuration.

Design:
  Murder Backend servers
    standalone server for fraction of users.
  Murder Frontend Servers
    intelligent proxy server
    - have full list of mailboxes and their locations
    - will disconnect and reconnect as needed on SELECT

  Mailbox list distributed from master using homebrew MUPDATE protocol.

MUPDATE Master server is single point of failure:
  - unable to create, delete or rename mailfolders if it fails.
  - Not end of the world, but painful for PINE, Prayer

Currently running our own simple (stateful) IMAP proxy server
  Code available in 2001 was immature
  Numerous problems especially with altnamespace and unixhiersep set.
  Stopgap measure
  Will come back and think about this next year.

/* ====================================================================== */

Cyrus Mailbox Structure : IMAP fundamentals
===========================================

UIDvalidity
  Unique identifier for given mailbox name. Typically 32 bit timestamp.

UID
  ID of message in given folder. Monotonically increasing counter.

UIDvalidity + UID (64 bits) uniquely identify message in given folder name.

Mailbox UniqueID: UIDvalidity + 32bit hash of folder name.
  Used as key for Seen database lookup
  - otherwise we would have to rewrite all SeenDB keys on folder rename

Cluster wide Message UUIDs for replication.
  UniqueID + UID should suffice
  UniqueID not quite Unique enough: potential for collisions
  - Collision not end of the world for SeenDB
    - minimal problem for single user who reads a few thousand mailboxes.
    - big problem if we need guarantees about Uniqueness.
  Fine for single user who has access to a few thousand mailboxes.
  - hash algorithm has potential for collisions
  - Have to build own, orthogonal

Cyrus Mailbox Structure : On disk representation
================================================

/var/spool/imap-hermes/
  stage./
  user/
    dpc22/
      1. 3. 19. 26. ...
      cyrus.header cyrus.index cyrus.cache cyrus.squat
      sent-mail/
        1. 2. 3. ...
        cyrus.header cyrus.index cyrus.cache cyrus.squat
      saved-messages/
        1. 9. 17. ...
        cyrus.header cyrus.index cyrus.cache cyrus.squat
    dpc99/

stage./ directory used for staging on message delivery and APPEND
  - Message gets fsync()ed to disk before being linked into place.

Numbered files correspond to Message UIDs.
Messages in CRLF canonical IMAP format.

Index and cache files
=====================

cyrus.header contains
  User flag names and folder UniqueID backup.
  ACL backups.

cyrus.index contains (read-write) IMAP FAST information:
  Timestamps.
  System flags bitmask (\Deleted \Answered \Draft)
  User Flags bitmask (e.g: Mozilla uses $MDNSent $Forwarded NonJunk)
  Message size.

cyrus.cache contains (readonly) cache information for each message:
  IMAP ENVELOPE
  IMAP BODYSTRUCTURE
    allows IMAP clients to pick out individual MIME bodyparts.
  Assorted message headers

cyrus.squat.
  Hash table system for SEARCH TEXT prefilter.
  Generates list of messages which might generate hits, search them by hand
  Approx 200k messages/sec from a standing start given useful input.

Updating index and cache files
==============================

Mixture of rewrite and rename and lock & update
  - self consistent state important for concurrent access.

Append to end of file.

Rewrite entire file by writing to tmp file and rename

Flag bitmasks in index can be updated in place
  Assumes small writes to disk are atomic (though it wouldn't be a
  disaster if one fails part way through).

Index and cache files contain generation numbers.
  Note generation numbers, what happens if power fails between two renames.

fsync() critical!
  => need sympathetic filesystem or battery backed write-back cache on
     RAID controller.

Locking       XXX Waste of time...
=======

3 separate flock() or fcntl() locks used for sequencing.
  cyrus.header  Mailbox meta lock (e.g: flag updates)
  cyrus.index   Mailbox update lock
  cyrus.cache   POP lock

Database abstraction layer used for Ancillary databases
=======================================================

Simple key-value associative array, not relational DB.
Database backends implement transaction updates or atomic commit.

Three different backends available:
  Flat file (one entry per line)
    Transactions implemented by writing temporary file and rename.
    Low performance
    Fine for small files (one or two disk blocks).
  Berkeley DB
    Fast inserts and lookups.
    Multiple concurrent readers and writers.
    Slow at enumerating keys.
    High complexity (and rumours of problems on info-cyrus list)
  Skiplist.
    Fast enumeration.
    Single lock for database update
    - single writer, multiple readers.
    - Fine with lots of small databases (e.g: \Seen data) or
      infrequent updates (mboxlist).

Ancillary Databases
===================

Mailboxlist
  Complete list of mailboxes --> ACL.
  Shared database: Skiplist preferred.

Seen Database
  Per user database: Skiplist or flat file preferred.
  Maps mailbox name to list of Message UIDs for messages which we have
  seen + some timestamps.

IMAP Subscription List
  Per user database: Flat file fine (few dozen entries at most).

Duplicate suppression system
  Single shared database: Berkeley DB preferred.

TLS session cache
  Single shared database: Berkeley DB preferred.
  (Not actually used in Cyrus backends, but code is ripped out and used
  in Proxy server and Prayer).

Sieve filtering.
================

Cyrus implements Sieve filtering language
  (RFC 3028 plus numerous extensions)

Designed to work on blackbox servers.
  No logging primitives or other assumptions about file access
  Sieve file either succeeds entirely or defers without effect
  No need for local message queuing or retry
  Designed to work on LMTP target

Each user has collection of potential Sieve files
  Single Active file
  We use inactive files to store Webmail filter metadata

Problems with the vanilla Cyrus design
======================================

Compatibility:
  Old versions of PINE (< 4.50) and Prayer can't cope with dual use
  mailboxes.
  No support for "mail/", "~/mail", "~dpc22/mail/" forms of addressing.
Messages/Mailboxes removed immediately from disk on expunge/delete
  - compares poorly to snapshot system available on NetApps.
  - LVM snapshots possible but fairly weak in comparison.

Single copy of the data for each user (eggs and basket).

Standard Unix backup tools have problems with lots of small files.

Cyrus designed for well behaved IMAP clients:
  Few IMAP user agents are well behaved.
  Some POP clients are particularly obnoxious
    (leave mail on server for N days: severe pain for cache file).

/* ====================================================================== */
/* ====================================================================== */

Local Customisations (introduction slide)
====================

Topics:
  Performance
  Compatibility
  Two phase expunge
  Replication
  Miscellaneous.

1.2 MByte patch submitted back to CMU.
  CMU particularly keen about replication patch
  Largest single patch, likely to take some time to integrate.

/* ====================================================================== */

Performance 1 : HERMES_CACHE_MOST
=================================

Vanilla Cyrus cache only contains limited number of headers:
  In-reply-to: Priority: References: Resent-from: Newsgroups: Followup-to:

Most MUAs fetch headers which are not in this list.

PINE 4.56 fetches:
  (UID ENVELOPE BODY.PEEK[HEADER.FIELDS (Newsgroups Content-MD5
   Content-Disposition Content-Language Content-Location resent-to
   resent-date resent-from resent-cc resent-subject List-Help
   List-Unsubscribe List-Subscribe List-Post List-Owner List-Archive
   Followup-To References)] INTERNALDATE RFC822.SIZE FLAGS)

Causes expensive cache miss (essentially maildir emulation).

Solution: store everything except:
  Received, Envelope-To, Return-Path, Delivery-date, Mime-Version.
  X- headers.

Cache files grow by 10% to 20%.

CMU have applied improved version of this patch to Cyrus 2.2 BETA:
  - backwards compatibility for existing cache files
  - 5 x performance increase in synthetic benchmark.
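
The "store everything except" rule above can be sketched in a few lines.
This is only an illustration in Python, not the actual patch (Cyrus is
written in C). The function names should_cache() and cacheable_headers()
are made up for this example, and case-insensitive header matching is an
assumption.

```python
from email.parser import HeaderParser

# Headers the slide lists as excluded from the cache. Matching is done
# case-insensitively here; that is an assumption of this sketch.
SKIPPED = {"received", "envelope-to", "return-path",
           "delivery-date", "mime-version"}


def should_cache(name: str) -> bool:
    """Return True if a header would be kept under the rule above."""
    lowered = name.lower()
    return lowered not in SKIPPED and not lowered.startswith("x-")


def cacheable_headers(raw_headers: str) -> list[tuple[str, str]]:
    """Parse an RFC 822 header block and keep only cacheable headers."""
    msg = HeaderParser().parsestr(raw_headers, headersonly=True)
    return [(name, value) for name, value in msg.items()
            if should_cache(name)]
```

With this rule the List-* and Content-* headers that PINE asks for are
served from the cache, while bulky trace headers such as Received stay
out of it.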
Performance 2 : HERMES_LAZY_CACHE
=================================

Vanilla Cyrus always rewrites entire cache files on expunge event.
  - expensive because fsync() required.
  - Compares poorly to UW which only rewrites end of mail folder
  - Particular problem with large folders where only a few messages change:
      People who archive everything in single INBOX folder.
      POP leave mail on server for N days

Solution:
  Leave dead space in cache file
  Garbage collect files overnight or when given threshold reached
    (currently 32 KBytes of dead space)

CMU currently have test version of this patch installed

Really needs two phase expunge patch (covered in a minute) to fly
  Otherwise unlink() on message files is a major bottleneck.

Performance 3 : HERMES_FAST_RENAME
==================================

XXX Snip this: really not very interesting.

Vanilla Cyrus implements mailbox rename as copy and then delete
  Safe but very slow for mailbox containing many messages.
  Motivation is that mailboxes can span multiple disk partitions
    (but we don't care about this)

Problem: Large numbers of PINE and Mulberry users rename folders at the
  start of the month.

We replace this with fast rename system call.
  Rely on reconstruct to sort out problems on power failure

/* ====================================================================== */

Compatibility
=============

HERMES_DIRECTORY
  Problem: PINE (<4.50) and Prayer struggle with dual use mailboxes
  Solution: Make all live mailboxes leaves in tree and hack in some
    emulation for directory objects.
  Nasty: hope to dump this some day.

HERMES_UNIX_NAMESPACE
  Problem: Large numbers of installed user agents look for mail in
    "mail/", "~/mail" or "~dpc22/mail/". This is meaningless to Cyrus.
  Solution: Translate names of these forms into valid Cyrus format, and
    translate back when printing mailbox names.
  Smoke and Mirrors, but works rather well in practice.
  Long term solution will be to get people to reconfigure user agents
    and selectively disable this feature.
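
The HERMES_UNIX_NAMESPACE translation can be pictured roughly as follows.
This is a hypothetical Python sketch, not the patch itself. The names
to_cyrus() and from_cyrus() are invented for illustration, and it assumes
altnamespace and unixhiersep are set, so that a user's own folders sit at
the top level with "/" as separator. It also handles only the logged-in
user's own "~user/mail/" form.

```python
def to_cyrus(name: str, user: str) -> tuple[str, str]:
    """Strip a Unix style mail directory prefix from a mailbox name.

    Returns (prefix, cyrus_name). The prefix is remembered so that the
    same form can be put back when names are printed to the client.
    """
    for base in (f"~{user}/mail", "~/mail", "mail"):
        if name == base:
            # Bare directory name: refers to the top of the folder tree.
            return base, ""
        if name.startswith(base + "/"):
            return base + "/", name[len(base) + 1:]
    # Anything else (INBOX, plain folder names) passes through untouched.
    return "", name


def from_cyrus(prefix: str, cyrus_name: str) -> str:
    """Reapply the prefix the client originally used."""
    return prefix + cyrus_name
```

For example, to_cyrus("~/mail/sent-mail", "dpc22") yields
("~/mail/", "sent-mail"), and from_cyrus("~/mail/", "sent-mail") gives
the client back the name it asked for.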
/* ====================================================================== */

Two phase expunge : Concept
===========================

Delete+Expunge model.
  Predates popular Desktop Trashcan metaphor.
  Does provide opportunities for server side optimisation.
  Doesn't stop people from throwing away their mail :(

Want reliable recovery from:
  Accidental message expunge
    Add expunge.index, expunge.cache to each folder
    mailbox_expunge(): copy from live to expunged folder
  Accidental folder deletion
    Translate folder delete into rename into reserved area.
    Add timestamp to avoid clashes on e.g: postponed-msgs.

Two phase expunge : Operation
=============================

Need to extend quota system:
  expunged messages, deleted folders should not count against live quota

Asynchronous expire job clears out old data.
  Runs nightly, but also automatically if expunged data threshold reached.

Occupies disk blocks: overhead around 30% for 28 days.

No additional disk I/O
  - unlink(2) is expensive on most filesystems
  - deferring to overnight is actually a performance boost in most cases.

Two phase expunge : Variables
=============================

Global defaults can be tuned for individual accounts:

  expunge_vol_min
    Try to hold onto at least this much expunged data
    (currently 50 MBytes)
  expunge_vol_max
    Hold onto at most this much expunged data during regular operation
    (currently 75 MBytes)
  expunge_vol_overflow
    Force immediate expiration when this much expunged data recorded
    (100 MBytes)
  expunge_timeout
    Hold onto expunged data for at most this long
    (Default: 28 days)

Two phase expunge : User Interface
==================================

Modified Cyrus presents a number of special (invisible) directories
  - Invisible to avoid confusing naive users, user agents.
  - ACLs restricted to avoid abuse.

  .EXPUNGED/            Expunged mail messages
  .DELETED/             Deleted mail folders
  .EXPUNGED/.DELETED/   Expunged messages in Deleted mail folders

Can get at these special directories using User Agents, Webmail.
  - currently rather messy
  - Webmail Interface will be extended later.

Two phase expunge : Example
============================

. list ".DELETED/" %
* LIST (\Noinferiors) "/" ".DELETED/postponed-msgs-20031117-08:46:24"
. select .EXPUNGED/INBOX
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS ()]
* 35 EXISTS
* 35 RECENT
* OK [UNSEEN 1]
* OK [UIDVALIDITY 1069097426]
* OK [UIDNEXT 7240]
. OK [READ-ONLY] Completed

/* ====================================================================== */

Replication : Possible Approaches
=================================

Four basic approaches to replication:

  SAN replication:
    High end SAN boxes can do automatic block level replication

  Linux DRBD
    Network block device. Poor man's SAN replication.

  Appliance or filesystem replication:
    NetApp SnapMirror.
    Veritas Volume Replicator.

  Application level replication:
    Typically transaction based replication:
      Replica system is always in self consistent state.
      May be possible to aggregate transactions to catch up from backlog.
    Greatest flexibility.
      Read-write access to both ends of link for sanity checks.
    Typically most work:
      Doesn't build on existing replication system.
      In Cyrus case we build an entire IMAP like protocol for replication.

Synchronous Versus Asynchronous Replication
===========================================

Synchronous replication:
  Only acknowledge write on master when committed to disk on replica.
  Better safeguards versus higher latency.

Asynchronous replication:
  Acknowledge writes immediately, replicate later
  Much faster, replica may end up in unknown state.
  Important that:
    Order of writes is preserved
    Journaling filesystems are used

Application level replication as afterthought typically asynchronous
  Synchronous requires very careful application design
  Race conditions a big problem for something as stateful as IMAP.
The big picture:
  Interval between Cyrus rolling replication runs: 1 to 3 seconds
  Mailscanner on PPSW batches messages: 10 -> 30 seconds under load.
  Replayable log of messages delivered by LMTP on PPSW would be cool.

Applications of Cyrus application level replication
===================================================

Migration of data between Cyrus backends
  Load balancing
  System maintenance.

Efficient Transfer to tape staging system.
  (large SATA disk array plus LTO-2 stacker)

Initial upload of mailboxes from old Hermes into Cyrus

Rolling replication in real time to spare system.
  Picks up log of actions generated by IMAP, POP, LMTP services
  Systems are installed in pairs:
    Half the accounts are master copies, half are replicas.
    Each system maintains a list of "master" accounts to prevent
    replication in the wrong direction.

Replication : Maintaining the single instance store
===================================================

96 bit UUID value assigned to each message body and cache entry.
  - Unfortunately Folder UniqueID + UID not good enough.

Upload single message multiple times (e.g: mailing list delivery)
  Uploads message body once then references cached copy.

Copy from one folder to another
  - Messages passed by reference (UUID) rather than value on replica

Similar techniques used when renaming entire mail folders.
  - Track mailbox UniqueID to avoid expensive DELETE + UPLOAD

Replication : Sanity checks
===========================

Replication system is critical component on Hermes.
  Hotspare and tape backups rely on it.

Maintain database of message MD5 checksums on all systems
  - Master system, replica system, backup system
  - Incremental updates

Overlay MD5 checksums once every 24 hours
  - whinge if checksums mismatch
  - early warning system

/* ====================================================================== */

Miscellaneous patches
======================

Mailbox quota.
  - 200 MByte largest practical size on current hardware.
  - Most user agents die before this point is reached.

Use GCC "unsigned long long" for quota arithmetic
  - 64 bit quotas available on x86 architecture
  - Gives CPU something to do (very bored most of the time)

Make POP3 UIDL format compatible with UW server
  - stops POP client configured to leave mail on server downloading
    additional copy of messages.

Implement undocumented IMAP SCAN extension
  - Allows PINE to search large numbers of folders for given text
  - relies on SQUAT engine for performance.

Autocreate target folders named in user Sieve file.
  - Doesn't fit well within Cyrus ACL system
  - Will cause some amusement when shared folders become available.

Check for potential overquota condition before message delivery
  Vanilla Cyrus will deliver the first message which causes account to
  exceed quota. Amusing if 10 MByte message filtered to spam folder.

Outlook patch (will probably take a slide to explain this :()

Fix FUD to do something halfway sane.

/* ====================================================================== */

Implementation
==============

Picture (lots more potential interconnections...)

  World / Cambridge
    User Agents    Webmail    PPSW ... PPSW    Terminal --rsync--> Spare

    Cyrus <-- Sync --> Cyrus
      |                  |
      +-Sync> Otanes (SATA disk box)
              2 x LTO-2 tape stacker

All currently running Redhat 9 (Gack!)
  - Kickstart installation very useful
  - kernel+libc+Berkeley DB only infrastructure Cyrus needs. Self contained.
  - LVM used for snapshots on Otanes.

Easy to replicate data to other system:
  FreeBSD 5 might be a very nice alternative

Linux Filesystems
=================

ext3
  Lots of logging options: writeback, ordered, data
  Doesn't cope with large numbers of files in directory.
  - still waiting for stable htree patch (Linux 2.6?)

ReiserFS:
  Best small file performance.
  Official release only provides writeback logging
  Full data logging available as unofficial patch
  Fsck is hopeless: can lose entire directory trees.

XFS:
  Good dump+fsck utilities
  Good large file performance
  Resizeable file
  Stability?

ReiserFS 4 should be interesting
  Write anywhere atomic filesystem
  Full data logging without 2 x write penalty.
  Transaction API available to application
  fsync() performance currently very poor

Veritas File system

SCSI vs SATA? XXX
=============

Issues
Graph

/* ====================================================================== */

Migration
=========

Collection of Perl and C running on red
  Interfaces to Cyrus replication system using SSH (RSA authentication)
  Uses c-client to access folder on Old Mailstore as User

Problem cases:
  People with misconfigured user agents
  People using POP3 LAST extension.
  People active on CUS: Tony writing migration tool there
  Manually maintained Exim filter files.

Migration Schedule
==================

Summer 2003:
  CS staff (other than CUS users)
  People with quotas > 10 MByte on Hermes

October 2003:
  New undergraduates and postgraduates

November -> December 2003:
  Undergraduates and pgrads who arrived in or before 2001

January -> April 2004:
  Everyone else!

/* ====================================================================== */

Still to do
===========

User Admin scripts

Replace telnet/SSH system with pair of FreeBSD servers.

Improved Webmail Interface
  - Support for charsets other than ISO-8859-1
  - Support for dual use mailfolders

New List System

Shared Mail folders!