Hello swish-e community,
we have problems with swish-e 2.4.3 (compiled with large file support:
./configure CPPFLAGS='-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64').
We have ~3 million XML records in one file, each with header information like:
Content-Length: 1407
Path-Name: 3
Document-Type: XML*
<?xml version="1.0" encoding="utf-8"?>
...
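For readers who have not used the "-S prog" input method, here is a minimal Python sketch of how one such record could be framed. The helper name format_record and the sample XML are hypothetical and not taken from our setup. The sketch assumes that Content-Length counts the bytes (not characters) of the document body and that a blank line separates the headers from the body.

```python
def format_record(path_name: str, xml_body: str) -> bytes:
    """Frame one XML document as a `-S prog` style record (hypothetical helper)."""
    # Encode first: the length header must describe bytes, and UTF-8
    # characters such as "ß" take more than one byte.
    body = xml_body.encode("utf-8")
    headers = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"
        "Document-Type: XML*\n"
        "\n"  # assumed: blank line ends the header block
    )
    return headers.encode("utf-8") + body


# Illustrative record only; the element name <rec> is made up.
sample = '<?xml version="1.0" encoding="utf-8"?>\n<rec>Straße</rec>\n'
record = format_record("3", sample)
```

If the length were computed from the character count instead of the encoded bytes, every record containing non-ASCII text would be framed too short.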
The file size is ~ 5 GB (zipped ~ 700 MB).
Our Linux server has 6 GB of main memory.
We've tried to build the index both ways, with and without the -e option:
zcat <zipped-xml-file> | swish-e [-e] -v 3 -c <conf-file> -S prog -i stdin
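One way to sanity-check such an input stream before piping it to swish-e is to walk it record by record and confirm that every Content-Length matches the data that follows. The sketch below is only an illustration: iter_records and count_records are hypothetical names, and it assumes the same framing as above (header lines, a blank line, then exactly Content-Length bytes of body).

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in a `-S prog` style stream."""
    while True:
        line = stream.readline()
        # Skip stray blank lines between records.
        while line in (b"\n", b"\r\n"):
            line = stream.readline()
        if not line:
            return  # end of input
        headers: Dict[str, str] = {}
        # Collect header lines up to the blank separator line.
        while line not in (b"", b"\n", b"\r\n"):
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        length = int(headers["Content-Length"])
        body = stream.read(length)
        if len(body) != length:
            raise ValueError(f"truncated record {headers.get('Path-Name')}")
        yield headers, body


def count_records(path: str) -> int:
    """Count records in a gzip-compressed stream file (path is a placeholder)."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in iter_records(stream))
```

The generator holds only one record in memory at a time, so it can be run over a multi-gigabyte file. It says nothing about how swish-e itself behaves on that input; it only rules out a malformed stream.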
In both cases we got incomplete index/prop files with a ".temp" extension.
Without "-e", swish-e does not process all XML records.
It ran out of memory while working on the XML record with id 4050780 (== path name).
end of logfile:
4050780 - Using XML2 parser - (60 words)
err: Ran out of memory (could not allocate 262144 more bytes)!
.
Using the "-e" option, swish-e processed all XML records (id 1111111135
is the last record in our XML file), but it then stopped working without
any error message in the logfile and without generating a core dump.
end of logfile:
1111111135 - Using XML2 parser - (20 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 14,395,157 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: ...
Perhaps someone can help us.
Thanks a lot in advance.
Best regards, Uwe Dierolf
--------------------------------------------------------------------------
Uwe Dierolf Tel 0721/608-6076
University Library of Karlsruhe Fax 0721/608-4886
Straße am Forum 76049 Karlsruhe / Germany
--------------------------------------------------------------------------
Received on Mon Jan 31 00:41:23 2005