
Downloading Chat Logs from GMail

August 17, 2009

Update (Summer 2013): Google has again disabled IMAP access to chat logs with their introduction of Hangouts.

Update (2011/9/16): It appears chat logs are back in IMAP! (Settings->Labels->Check “Show in IMAP” for “Chats”)

GMail disabled access to chat logs via IMAP some time ago. At that point, I stopped using the GMail interface and started using Pidgin. Unfortunately, I unknowingly logged 2098 chat conversations on GMail servers between the moment GMail disabled access and my realization they had done so. (I was not using IMAP during this period, but assumed the functionality was still there and that I would download them all for safekeeping at some point.) Below is the process I used to recover these data.

I should mention that while this task succeeded, it was quite a messy process. Keep in mind that any change to the code on Google’s side could render any or all of the steps below useless. :(

First, I tagged all of my chat logs with an arbitrary label (“cl”), since this appears to be the only way to get them to show up in GMail’s HTML mode.

I then made sure GMail was set to display 100 conversations per page and downloaded the conversation indices. GMail reported a total of 2098 chat conversations. The random string below appears to change from user to user and possibly from session to session. It can be determined by looking at URLs in HTML mode.

$ cd tmp
$ for st in $(seq 0 100 2098); do w3m -dump_source "[random string]/?s=l&l=cl&st=$st" > "index.$st.html" 2>> index.log; sleep .$(($RANDOM % 3))$(($RANDOM % 9)); done
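
To make the arithmetic in that loop concrete, here is a minimal sketch that touches no network. It only prints the st= offsets the loop walks through and builds one sample delay the same way the loop does. The function names page_offsets and random_delay are made up for illustration.

```shell
#!/bin/bash
# Print the st= offsets used for the index pages: 0, 100, ..., up to the total.
page_offsets() {
    seq 0 100 "$1"
}

# Build the same fractional delay as the loop above:
# first digit is 0-2, second digit is 0-8, so the pause is .00 to .28 seconds.
random_delay() {
    echo ".$((RANDOM % 3))$((RANDOM % 9))"
}

page_offsets 2098 | wc -l    # 21 index pages for 2098 conversations
page_offsets 2098 | tail -n 1    # last offset requested: 2000
random_delay                 # one sample delay string
```

The pause never exceeds about a quarter of a second, which is worth keeping in mind given how Google reacted to the later transcript loop.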

Next, I scraped the indices for links to chat conversations and downloaded these links. Again, you may have to look at URLs to find the appropriate random string.

$ grep -h '?v=c&' index.*.html | sed 's/<a href="\(.*\)">/\1/' | while read -r u; do w3m -dump_source "[random string]/$u" > "con.$u.html" 2>> con.log; sleep .$(($RANDOM % 3))$(($RANDOM % 9)); done
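
The link extraction can be tried offline on a sample line before pointing it at the real index pages. The sketch below is only an illustration: both sample lines and the th value are invented, and the actual GMail markup may well differ. extract_href repeats the grep and sed from the command above; extract_href_strict is a hypothetical, tighter variant that stops at the first closing quote, which matters if a link shares its line with other tags.

```shell
#!/bin/bash
# Same filter as the pipeline above: keep conversation links, strip the tag.
extract_href() {
    grep '?v=c&' | sed 's/<a href="\(.*\)">/\1/'
}

# Assumed alternative: capture only up to the first closing quote and
# discard everything else on the line.
extract_href_strict() {
    grep '?v=c&' | sed 's/.*<a href="\([^"]*\)".*/\1/'
}

# Invented sample lines, not real GMail output.
bare='<a href="?v=c&th=123abc">'
nested='<td><a href="?v=c&th=123abc"><b>chat</b></a></td>'

printf '%s\n' "$bare" | extract_href
printf '%s\n' "$nested" | extract_href_strict
```

Both calls print the relative URL ?v=c&th=123abc, which is what the download loop appends to the random string.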

I then downloaded the plain text chat transcripts. Careful: Google temporarily disabled my account after the loop below had grabbed about 1500 conversations, so you may want to increase the sleep time. Notice that the random string changed again…

$ grep -h '^<a name="m_.\+"></a>' con.*.html | sed 's/^<a name="m_\([[:alnum:]]\+\)"><\/a>$/\1/' | while read -r th; do w3m -dump_source "[random string]/?v=om&th=$th" > "trans.$th.txt" 2>> trans.log; sleep .$(($RANDOM % 3))$(($RANDOM % 9)); done
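
Since this was the loop that got the account temporarily disabled, a longer pause is probably wise. The sketch below shows the thread ID extraction on an invented anchor line, plus a hypothetical slow_delay helper that picks a whole number of seconds. The 2 to 6 second range is a guess, not a known safe value, and nothing here contacts GMail.

```shell
#!/bin/bash
# Pull the thread ID out of an anchor line, as the pipeline above does.
# \{1,\} is the portable spelling of the GNU \+ used above.
thread_id() {
    sed -n 's/^<a name="m_\([[:alnum:]]\{1,\}\)"><\/a>$/\1/p'
}

# Pick a whole-second pause between $1 and $2 inclusive.
slow_delay() {
    min=$1
    max=$2
    echo $((min + RANDOM % (max - min + 1)))
}

# Invented sample anchor; real IDs come from the con.*.html files.
printf '%s\n' '<a name="m_11f0a2b3c4d5e6f7"></a>' | thread_id
echo "would sleep $(slow_delay 2 6) seconds"
```

In the download loop, the sleep expression would then become sleep "$(slow_delay 2 6)".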

So I then had each conversation in a file. The next step was to put these in a maildir.

$ mkdir -p maildir/tmp maildir/new maildir/cur
$ for f in trans.*.txt; do tail -n +2 "$f" | sed 's/^M$//' | perl -e 'use MIME::Words qw(decode_mimewords);while(<>){if($_=~/^Subject: (.*ANSI_X3.*)$/){$r=decode_mimewords($1);print "Subject: $r\n"}else{print $_}}' | perl -e 'use utf8;use Encode;while(<>){if($_=~/^Subject: (.*) \?\?\?\?\?\?\?\?$/){print "Subject: ",encode("MIME-Header","$1 さんとのチャット"),"\n"}else{print $_}}' > "maildir/cur/$f"; done

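What a mess. The plain text version of the messages has a blank line at the top, so I used tail to filter it out. They also have DOS line endings (a carriage return before each newline). I am not sure whether GMail or w3m added these, but they were easily removed with sed. Remember to type Ctrl+v <enter> to insert the ^M above. Google also appears to have some bugs in the MIME header code for chat logs, which is what the two Perl filters work around. The first decodes Subject headers whose MIME encoded words mention ANSI_X3. The second handles conversations that took place while GMail was set to present a Japanese interface: their subjects ended in eight question marks, so I had to dig out the Perl to put the eight characters of さんとのチャット (“chat with …”) back and re-encode the header.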
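
If typing a literal ^M is a nuisance, the first two cleanup steps can be sketched with tr instead of sed. This is a swapped-in technique rather than what I ran: tr deletes every carriage return in the file, not only those at the end of a line, which should be harmless for these transcripts but is an assumption. The function name clean_transcript and the sample message are invented.

```shell
#!/bin/bash
# Drop the leading blank line, then delete all carriage returns.
clean_transcript() {
    tail -n +2 | tr -d '\r'
}

# Invented sample: a blank first line, then DOS line endings throughout.
printf '\r\nSubject: Chat with someone\r\n\r\nhello\r\n' | clean_transcript
```

The output starts directly with the Subject header and uses plain Unix newlines, ready for the Perl filters.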
