Removing Duplicate E-Mail Messages

April 7, 2009

Update (2014/1/31): I would not attempt any of the Gmail-related procedures below, as I suspect Google has made changes to the IMAP interface since this post.

I have saved most of my e-mail for the past ten years. Currently, I keep this collection in Maildir format and use OfflineIMAP to keep it synchronized with my Gmail account. These messages have been through a number of transformations which have, over the years, resulted in many duplicates. Unfortunately, finding and removing these is not a trivial task for a few reasons.

  • Some messages in the collection have no Message-ID field. Finding duplicates based on this header (using, for example, Mutt’s ~= pattern) is therefore not a reliable solution.
  • Many duplicates do not share the same set of headers. For example, both Gmail and OfflineIMAP have added different headers to messages.
  • Gmail's custom IMAP interface departs from standard IMAP behavior. With the advanced IMAP settings inactive, removing a message from the local archive is not enough to delete it: the message must be moved to the trash folder before synchronizing. Upon synchronization, Gmail then removes it from all other IMAP folders.
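
To get a feel for how widespread the first problem is, something like the following could list the messages that lack the header. This is only a sketch: missing_message_id is a name made up for illustration, and it assumes one message per file with the header block ending at the first empty line.

```shell
# Print the path of every message under directory $1 whose header
# block contains no Message-ID field. (Hypothetical helper.)
missing_message_id() {
  find "$1" -type f | while IFS= read -r msg; do
    # Look only at the headers, which end at the first blank line,
    # so that a quoted "Message-ID:" in the body is not counted.
    if ! sed '/^$/q' "$msg" | grep -qi '^message-id:'; then
      printf '%s\n' "$msg"
    fi
  done
}
```

Running missing_message_id "$HOME/mail/[Gmail].All Mail/cur" | wc -l would then give a count.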

The plan, then, for finding duplicates with potentially different headers is to match messages on just their contents. I first used formail to create a directory of files sharing the same names as their Maildir equivalents, but with only the most basic set of headers. (Eliminating all headers would not be wise, because many messages have no contents but meaningful subjects, e.g., “How are you? EOM.”)

$ mkdir -p ~/mail-tmp
$ find "$HOME/mail/[Gmail].All Mail/cur" -type f -print0 \
| xargs -0 -I{} basename "{}" \
| xargs -d '\n' -I{} sh -c 'formail -ck -X From: -X Subject: -X To: -X Date: \
< "$HOME/mail/[Gmail].All Mail/cur/{}" > "$HOME/mail-tmp/{}"'

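I then used fdupes to find files with identical contents in that directory. Its -f option omits the first file in each set of matches, so one copy of every message is left alone. The originals corresponding to the remaining file names were moved to a temporary Maildir called ~/dupes.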

$ mkdir -p ~/dupes/cur ~/dupes/new ~/dupes/tmp
$ fdupes -f ~/mail-tmp \
| grep -v '^$' \
| xargs -d '\n' -I{} basename "{}" \
| xargs -d '\n' -I{} mv "$HOME/mail/[Gmail].All Mail/cur/{}" ~/dupes/cur
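
Where fdupes is not available, roughly the same list can be produced by hashing each stripped file and printing every file whose hash has already been seen. The following is a sketch rather than what I actually ran: list_dupes is a hypothetical helper, and it assumes GNU sha256sum and file names without newlines or backslashes.

```shell
# Print all but the first file in each group of identical files under
# directory $1, similar in spirit to "fdupes -f". (Hypothetical helper.)
list_dupes() {
  find "$1" -type f -exec sha256sum {} + \
  | sort \
  | awk 'seen[$1]++ { print substr($0, 67) }'
  # sha256sum prints a 64-character hash and a two-character separator,
  # so the file name starts at column 67. Sorting makes the choice of
  # which copy counts as "first" deterministic.
}
```

Its output could be fed to the same basename and mv steps in place of the fdupes and grep lines.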

I then opened up both ~/dupes and ~/mail/[Gmail].All Mail in Mutt to see that things were in order. I checked 5 messages or so (of the 3000+ duplicates found) and confirmed that each existed in both collections. Satisfied, I emptied the Gmail trash, moved all the duplicates in ~/dupes to ~/mail/[Gmail].Trash, and ran OfflineIMAP.

It seems like this should have done the trick, but alas, Gmail rejected many of these new additions to its trash folder, and only a few deletions were actually registered. I never figured out why, but I did find a procedure that worked: turn on the advanced IMAP controls; make sure only the Inbox, All Mail, and Trash labels are visible in IMAP; set Gmail to “Move the message to the Trash” when it is “expunged from the last visible IMAP folder”; and move everything in ~/mail/[Gmail].Trash to a folder outside the purview of OfflineIMAP. With that done, all messages were successfully moved to the trash upon synchronization.

One way a duplicate could have slipped through is if its From:, Subject:, To:, and Date: headers appeared in a different order from those of the original. The most reliable fix would be to put these headers in the same order for every message after passing it through formail.
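
That reordering could be sketched as a small filter placed after formail in the pipeline. I have not run this against the archive: normalize_headers is a made-up name, and the filter assumes each header field already sits on a single line (which formail's -c option takes care of) and that the headers end at the first empty line.

```shell
# Read a message on stdin and write it back out with its From, Subject,
# To, and Date fields in that fixed order, followed by the unchanged
# body. Any other header fields are dropped. (Hypothetical helper.)
normalize_headers() {
  awk '
    function emit(   i) {
      for (i = 1; i <= n; i++)
        if (order[i] in hdr) print hdr[order[i]]
    }
    BEGIN { n = split("from subject to date", order, " ") }
    # First blank line: headers are complete, so print them in order.
    !body && /^$/ { emit(); print ""; body = 1; next }
    # Everything after that is body text and passes through untouched.
    body { print; next }
    # Header line: remember the first occurrence of each field name.
    {
      name = tolower($0)
      sub(/:.*/, "", name)
      if (!(name in hdr)) hdr[name] = $0
    }
    # A message with no body never reaches a blank line.
    END { if (!body) emit() }
  '
}
```

In the formail step, the output would then be piped through normalize_headers before being written to ~/mail-tmp, so that two copies differing only in header order come out byte-for-byte identical.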