Stemming - Varying Results

From: Antonio Barrera <abarrera(at)not-real.princeton.edu>
Date: Tue Oct 04 2005 - 20:14:35 GMT
Swishers,
 
I have a database which I dump to XML files and then index.  The following
is an example of two of the xml files:
 
File: 6.xml
 
<?xml version="1.0" encoding="ISO-8859-1"?>
<record id='4'>
<id>4</id>
<link>http://www.accessscience.com/server-java/Arknoid/science/AS</link>
<title>
<maintitle>AccessScience</maintitle>
<alttitle>McGraw-Hill Encyclopedia of Science and Technology</alttitle>
</title>
<subjects>
Aerospace Engineering Chemical Engineering Chemistry Civil Engineering
Electrical Engineering Environmental Engineering
Mechanical Engineering Nanotechnology Operations Research General
Engineering </subjects>
<dates>2002+</dates>
<location_access></location_access>
<brief_description>Web version of McGraw-Hill Encyclopedia of Science &amp;
Technology, covers 9th ed. (2002) onward
 </brief_description>
<more_information></more_information>
<keywords></keywords>
<notes></notes>
<is_avail>0</is_avail>
<availability></availability>
<long_description>Web version of McGraw-Hill Encyclopedia of Science &amp;
Technology, covers 9th ed. (2002) onward.
 It contains articles, dictionary terms, and biographies plus research
updates, links to relevant web sites, and a &quot;Student Center.&quot;
Includes over 7100 articles in 81 major areas of
science and technology.</long_description>
<libwebcomp>0</libwebcomp>
</record>

File: 398.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<record id='416'>
<id>416</id>
<link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?otool=njpulib</link>
<title>
<maintitle>Medline (PubMed)</maintitle>
<alttitle>PubMed</alttitle>
</title>
<subjects>
Chemistry Environment History of Science Psychology Public Policy General
Engineering Biological &amp;amp; Life Sciences
Health &amp;amp; Medicine Nanotechnology Population Research Sociology
</subjects>
<dates>1951+</dates>
<location_access></location_access>
<brief_description>A service of the National Library of Medicine, includes
over 14 million citations for biomedical
articles back to the 1950's. </brief_description>
<more_information></more_information>
<keywords></keywords>
<notes></notes>
<is_avail>0</is_avail>
<availability></availability>
<long_description>A service of the National Library of Medicine, includes
over 14 million citations for biomedical
articles back to the 1950's. </long_description>
<libwebcomp>0</libwebcomp>
</record>
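
Before indexing, it may be worth confirming that each dumped file parses as
well-formed XML, since a broken tag or entity could change what gets indexed.
The following is a minimal Python sketch using only the standard library; the
helper names check_xml_text and check_directory are hypothetical and are not
part of Swish-e.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_xml_text(data):
    """Return None if data parses as XML, otherwise the parser's error text."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None


def check_directory(directory):
    """Map each *.xml file name in directory to its parse error (None if OK)."""
    results = {}
    for path in sorted(Path(directory).glob("*.xml")):
        # Read bytes so the encoding declared in the XML prolog is honored.
        results[path.name] = check_xml_text(path.read_bytes())
    return results
```

For example, check_directory("/var/www/search_indexes/dbs") would return a
dictionary in which any file with a non-None value failed to parse.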

I am using Stemming_en. A search for "Environmental" includes 6.xml in
the results, but not 398.xml. A search for "environment" returns 398.xml,
but not 6.xml. In the live version, "Environment" returns 22 hits and
"Environmental" returns 30. Shouldn't stemming result in the same number of hits?
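
One possible explanation, offered as an assumption rather than a confirmed
diagnosis of Swish-e: Porter-style stemmers strip at most one suffix per step,
so two related words need not reduce to the same stem. The Python sketch below
implements only a simplified version of Porter's step 4 (the function names
are hypothetical, and this is not Swish-e's actual stemmer code). Under it,
"environmental" loses "al" while "environment" loses "ment", which leaves
two different stems.

```python
VOWELS = set("aeiou")

# Suffixes from step 4 of Porter's algorithm (simplified illustration only).
STEP4_SUFFIXES = (
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
    "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
)


def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' counts as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions (Porter's m)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_step4(word):
    """Remove the longest matching step-4 suffix if the remaining stem has m > 1."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # 'ion' is only removed after 's' or 't'.
            if suffix == "ion" and not stem.endswith(("s", "t")):
                return word
            # Only the longest match is considered, even if it is rejected.
            return stem if measure(stem) > 1 else word
    return word


for w in ("environmental", "environment"):
    print(w, "->", strip_step4(w))
```

If the index behaves this way, a query for "Environmental" would be matched
against the stem of "environmental" only, and a query for "environment"
against a shorter stem, which would be consistent with the differing hit
counts described above.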
 
My config file for indexing is below:
 
IndexDir /var/www/search_indexes/dbs
IndexContents XML* .xml
FollowSymLinks yes
 
IndexFile /var/www/search_indexes/db.xml.index
 
IndexName "Resource DBs Index"
IndexDescription "This is an index to the databases and indexes." 
IndexAdmin "Antonio Barrera (abarrera@princeton.edu)"
 
UndefinedMetaTags index
PropertyNames maintitle alttitle brief_description long_description id dates notes is_avail availability libwebcomp more_information location_access
 
ConvertHTMLEntities yes
IndexReport 3
 
FuzzyIndexingMode stemming_en
 
StoreDescription XML2 <brief_description> 800
 
WordCharacters abcdefghijklmnopqrstuvwxyz\&;0123456789.@|,-'"[](~!@$%^{}_+?
 
BeginCharacters abcdefghijklmnopqrstuvwxyz\&;0123456789.@|,-'"[](~!@$%^{}_+?
 
EndCharacters abcdefghijklmnopqrstuvwxyz\&;0123456789.@|,-'"[](~!@$%^{}_+?
 
IgnoreLastChar .@|,-'"[](~!@$%^{}_+? 
 
IgnoreFirstChar .@|,-'"[](~!@$%^{}_+? 
 
IndexOnly .xml

 
thanks,
Antonio Barrera
Princeton University Library
Received on Tue Oct 4 13:14:53 2005