[swish-e] Reassessment of the problem - Indexing emails

From: Troy Wical <troy(at)not-real.wical.com>
Date: Wed Nov 03 2010 - 12:49:59 GMT
	I've spent a bit of time with Swish-E in an attempt to make an ezmlm mailing list archive searchable, and Peter and others have been kind enough to wade through the journey with me. I've learned a whole lot along the way, for which I am very grateful. However, after 6 months of off-and-on work on this project, I want to get the opinion of others on this list who likely have a more professional approach. I can hack, use Google, and follow directions fairly well, but I am in no way a coder of any sort. I think that has led me to try approaches to my problem that may not have been the most resourceful. With that in mind, let me provide a brief description of the problem and what I have tried so far. If others have ideas that they are willing to share, I thank them in advance.

THE GOAL
--------
	To make the email archives of our ezmlm mailing list searchable. The archives are in a format related to maildir. There are 100 emails per folder, with each email file named 00 through 99. Below is an example of the email format:

######## BEGIN EXAMPLE EMAIL ##########
Return-Path: <user@domain.com>
Mailing-List: contact mailinglist-help@domain.com; run by ezmlm
Delivered-To: mailing list mailinglist@domain.com
Received: (qmail 71965 invoked from network); 12 Feb 2002 02:33:14 -0000
Received: (QMFILT: 1.1); 12 Feb 2002 02:33:14 -0000
Received: from imo-m08.mx.aol.com (64.12.136.163)
  by pon.type2.com with SMTP; 12 Feb 2002 02:33:13 -0000
Received: from user@domain.com
        by imo-m08.mx.aol.com (mail_out_v32.5.) id w.168.8a7a97e (4012)
         for <message@domain.com>; Mon, 11 Feb 2002 21:33:08 -0500 (EST)
From: user@domain.com
Message-ID: <168.8a7a97e.2999d8e3(-at-)aol.com>
Date: Mon, 11 Feb 2002 21:33:07 EST
To: email@email.com
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: AOL 5.0 for Mac sub 28
Subject: Re: subject line

message body......
######## END EXAMPLE EMAIL ##########
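
Since each archive file is a plain RFC 822 style message like the example above, the fields that matter for search results (subject, sender, date, body) can be pulled out with a stock mail parser. The following is a minimal sketch using Python's standard `email` module; `SAMPLE` is a trimmed copy of the example message and `extract_fields` is a hypothetical helper name, not part of Swish-E or ezmlm.

```python
from email import message_from_string

# Trimmed copy of the example message above (most Received: headers omitted).
SAMPLE = """\
Return-Path: <user@domain.com>
Mailing-List: contact mailinglist-help@domain.com; run by ezmlm
From: user@domain.com
Date: Mon, 11 Feb 2002 21:33:07 EST
To: email@email.com
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Subject: Re: subject line

message body......
"""


def extract_fields(raw_text):
    """Return the subject, sender, date and plain-text body of one message."""
    msg = message_from_string(raw_text)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages return a list of parts; this sketch
        # only handles the simple single-part case.
        body = ""
    return {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "date": msg.get("Date", ""),
        "body": body,
    }


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE).items():
        print(f"{key}: {value.strip()}")
```

Multipart and encoded messages would need extra handling (walking the parts and decoding the payload), which is left out here to keep the sketch short.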

SOLUTION #1
-----------
	This is what is currently in place. It can be viewed at http://type2.com/search . It's an index of all the flat text email files (roughly 250k). It's not ideal, but it makes all the content available, which wasn't the case before. Aside from that, it's not much of a step forward. Search results list emails by the email's file name, which is a number between 00 and 99. There is no excerpt and therefore no phrase highlighting. The user has to click and view each email to get any idea of whether the message is relevant to what they are searching for. Boolean operators do help narrow down the results, but today's users expect more, and rightly so. This is using Swish-E 2.4.5.

	In an attempt to get the subject line of the email indexed as the swishtitle, I started a thread on this discussion forum (see http://swish-e.org/archive/2009-12/12787.html). Peter set me on track to try Swish3. The steep learning curve provided plenty of frustration (I learned not to manually install Perl modules and use CPAN at the same time), but in the end we fought through the errors and finally got an index created. That index can be seen at http://type2.com/cgi-bin/searchv2/search.cgi, using the swish3 checkbox. This was a great improvement! There were now excerpts, and swishtitle was populated by something more useful than a number. However, the link provided for each message was not working properly. It was pointing to a location that appeared to be a combination of the directory that the message resides in and the sender's email address. For example, if you search for "test" against the swish3 index (http://type2.com/cgi-bin/searchv2/search.cgi?query=test&submit=Search!&sort=swishrank&si=0), the first message links to:
	http://type2.com/mail-archives/test.v01510114b88e17bce393(-at-)[66.44.105.253]

	I would rather have it be http://type2.com/mail-archives/test/20, which is the actual message. Granted, right now that URL does not work. However, that does not appear to be related to the fact that swish3 is indexing the message ID as the location. I'm not sure whether ReplaceRules can take care of that, but I get the feeling that something else needs to take place for it to link the way I would like.
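
The desired link is just the message's path relative to the archive root, appended to the public URL prefix. A minimal sketch of that mapping follows, assuming the indexer (or a wrapper around it) knows each message's file path; `ARCHIVE_ROOT` is a hypothetical on-disk location and `message_url` a hypothetical helper name. Whether the web server actually serves that URL is a separate matter, as noted above.

```python
from pathlib import PurePosixPath

# Hypothetical on-disk location of the archive; adjust to the real layout.
ARCHIVE_ROOT = PurePosixPath("/var/archive")
# Public prefix taken from the desired link in the message above.
BASE_URL = "http://type2.com/mail-archives"


def message_url(file_path, archive_root=ARCHIVE_ROOT, base_url=BASE_URL):
    """Map an archive file path such as .../test/20 to its public URL."""
    relative = PurePosixPath(file_path).relative_to(archive_root)
    return f"{base_url}/{relative.as_posix()}"


if __name__ == "__main__":
    print(message_url("/var/archive/test/20"))
```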

	For a period of time, I looked into how to convert an email to XML or HTML, so that properties could be more easily defined. The problem I ran into, though, was that none of the code I found was built to convert emails with the structure I was dealing with. I was able to put together a script that ripped out the headers and cleaned up the email a bit. The result is what you see for any of the emails returned by a search at http://type2.com/search, as opposed to the example given above.
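
One possible shape for such a conversion is sketched below: each message is wrapped in a small HTML document whose <title> is the subject line, so that an HTML-aware parser can pick it up as the title, with sender and date carried as meta tags. This is a sketch under assumptions, not a tested Swish-E setup. The function names are hypothetical, multipart mail is not handled, and the record layout in `prog_record` (Path-Name, Content-Length, Last-Mtime and Document-Type headers followed by a blank line) reflects the input format described for `-S prog` in the Swish-E 2.4 documentation as best recalled here, so it should be checked against those docs before use.

```python
from email import message_from_string
from html import escape


def email_to_html(raw_text):
    """Wrap one plain-text email in minimal HTML, subject as <title>."""
    msg = message_from_string(raw_text)
    subject = msg.get("Subject", "(no subject)")
    sender = msg.get("From", "")
    date = msg.get("Date", "")
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are not handled in this sketch
    return (
        "<html><head>\n"
        f"<title>{escape(subject)}</title>\n"
        f'<meta name="from" content="{escape(sender)}">\n'
        f'<meta name="date" content="{escape(date)}">\n'
        "</head><body>\n"
        f"<pre>{escape(body)}</pre>\n"
        "</body></html>\n"
    )


def prog_record(path_name, html_doc, mtime):
    """Build one document record for an external-program indexing run.

    Assumption: header names follow the Swish-E 2.4 "-S prog" docs,
    and Content-Length counts bytes, not characters.
    """
    data = html_doc.encode("utf-8")
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(data)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        "Document-Type: HTML*\n"
        "\n"
    )
    return header.encode("utf-8") + data


if __name__ == "__main__":
    sample = "Subject: Re: subject line\nFrom: user@domain.com\n\nmessage body......\n"
    print(prog_record("test/20", email_to_html(sample), 0).decode("utf-8"))
```

Passing the archive-relative path (for example test/20) as the path name would also give the search results a location that matches the message file rather than the Message-ID.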


#### INSERT LIFE HERE (several-month departure from the project) ####

SOLUTION #2
-----------
	I'm not a fan of ezmlm, but I didn't build the environment; I took it over from a previous generation of volunteers who wanted to move on after decades of service. And, to the credit of the environment, it's very stable. As much as I would prefer Mailman on Debian, this setup of FreeBSD and qmail has run rock solid for a long time. That being said, I've worked to learn how it works instead of changing everything. In an effort to find a solution that would make these archives more searchable, I found ezmlm-web, a web-based solution for viewing the archives. You can sort by thread, date, or user (our implementation is at http://type2.com/ezmlm-archives/index.cgi). It was a great start. However, you can only search the subject line, which just doesn't cut it. There is no timeline from the developer regarding possible full-text search options, so I continued to try to get swish-e to do the job.
	Liking the idea of using ezmlm-web's existing ability to browse by thread, user, date, etc., I thought about using spider.pl to crawl the ezmlm-web portal. I've implemented spider.pl at work to index tens of internal documentation sites and make them all searchable from one location, saving admins loads of time finding info. Perhaps it could work here too. So I tried it. I set no depth limit for the spider and ran it in a detached screen session. Several days later, I reattached to the screen session to find the spider still crawling away. For those searching the swish-e archives, I ran into issues this way too (http://swish-e.org/archive/2010-10/12946.html). Once again, Peter and Bill came through. Still, without incremental index updates, this was going to take too long to use. It provides the output I am looking for (excerpt, highlighting, swishtitle = subject, etc.), but at the expense of taking a very long time to complete. The "Type2" index at http://type2.com/cgi-bin/searchv2/search.cgi is the small example I did.
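
Because a mailing list archive only grows by new message files, one way to avoid re-crawling everything is to work from the files on disk and pick out only those added since the previous indexing run. The sketch below shows that selection step only. It assumes message files are not modified after delivery, and `files_modified_since` is a hypothetical helper. How the new files are then folded into the existing index (for example, by building a small index and merging it) depends on the indexer and is not shown.

```python
import os


def files_modified_since(root, since_epoch):
    """Return sorted paths under root whose mtime is newer than since_epoch."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) > since_epoch:
                changed.append(full)
    return sorted(changed)


if __name__ == "__main__":
    import sys
    import time

    # Usage: script.py ARCHIVE_DIR [HOURS]; lists files newer than HOURS ago.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    hours = float(sys.argv[2]) if len(sys.argv) > 2 else 24.0
    for path in files_modified_since(root, time.time() - hours * 3600):
        print(path)
```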


A SHORT EMAIL TURNED LONG
-------------------------
	My intention was not to create another long email here; I've done plenty of that already. Too late, though. My steep learning curve, combined with the amount of time it is taking me to implement something that I consider successful, has me a bit frustrated. So, this long email culminates with the following question...

What is your opinion on the best way to index this type of email archive?

Your time is greatly appreciated. As someone who is volunteering his time to maintain a site, and make it more functional, I understand how valuable time is.

Troy
Received on Wed Nov 3 08:50:04 2010