Feature Proposal: Boost Search Performance in Huge Webs by Searching in Parallel
With many searches for dynamic pages and forms with dynamic content, performance degrades as webs grow larger.
Description and Documentation
We use a lot of formatted searches, even to set up form options. With more than 7000 topics in a web, calling an edit link now takes more than 8 seconds. Analysis shows that searching the topics is what takes the time. What do you think about boosting performance by using Perl's thread implementation (built in since 5.8) to do the search in Foswiki/Store/SearchAlgorithms/Forking.pm? We usually have plenty of cores idling.
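As a rough illustration of the idea (Python standing in for Perl here, with a hypothetical in-memory topic list rather than Foswiki's real store), the search could be fanned out over a pool of workers and the partial results joined afterwards:

```python
# Illustrative sketch only, not Foswiki code: split a topic list into
# per-worker chunks, search each chunk on a thread pool, join results.
import re
from concurrent.futures import ThreadPoolExecutor

def search_chunk(pattern, topics):
    """Return the names of topics in this chunk whose text matches."""
    rx = re.compile(pattern)
    return [name for name, text in topics if rx.search(text)]

def parallel_search(pattern, topics, workers=4):
    # Partition the topic set into one chunk per worker (round-robin).
    chunks = [topics[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: search_chunk(pattern, c), chunks))
    # Join the partial results and restore a deterministic order.
    return sorted(name for chunk in results for name in chunk)

topics = [("WebHome", "welcome"), ("SearchHelp", "query syntax"),
          ("FormDef", "query options"), ("OldTopic", "archive")]
print(parallel_search("query", topics))  # ['FormDef', 'SearchHelp']
```

The open question, as noted below, is how to merge and re-sort the per-chunk results without eating the gains.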
-- Contributors: GeorgOstertag
- 30 Mar 2011
It's an interesting idea; the challenge is adding some additional logic to manage sorting & grouping the partitioned results coming back from the various threads. Do you have any thoughts on how to partition the work within a web?
For what it's worth, MongoDBPlugin is working towards completely delegating queries to an external database server. It's mostly working; we have many webs of ~2000 topics, and at least one approaching 26,000. Depending on the topology of a given MongoDB deployment, a single query can be split automatically among several "shards", although we're not using this yet (rather, we're trying to load-balance queries among several servers).
The benefit of this approach is that we can configure the database solution to manage search partitioning & load-balancing, without complicating Foswiki itself. Once we have delegated ACL checking to the database, we will in fact avoid Foswiki Perl code having to do much handling of topics at all (other than the ones it actually needs to render data from). For example, if you have a query that matches 26,000 topics and you want to display topics 1 to 10 of 26,000, you don't want Foswiki to parse & process all 26,000 of them just to determine how many of the matching topics may actually be shown to the user...
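A toy sketch of that point (plain Python with hypothetical names, no real store): if the store applies the match, the ACL check and the limit itself, the application only ever touches the page of topics it will actually render:

```python
# Illustrative sketch (hypothetical names, not the MongoDBPlugin API):
# the store yields at most one page of matching, viewable topics, so
# the application never processes the other 25,990.
def store_query(topics, predicate, allowed, skip=0, limit=10):
    """Generator: yield at most `limit` matching, viewable topics."""
    shown = 0
    for topic in topics:
        if predicate(topic) and allowed(topic):
            if shown >= skip:
                yield topic
            shown += 1
            if shown >= skip + limit:
                return

topics = [f"Topic{i:05d}" for i in range(26000)]
page = list(store_query(topics, lambda t: True, lambda t: True,
                        skip=0, limit=10))
print(page)  # first 10 topics only; iteration stops there
```

(The "of 26,000" total would likewise be a count delegated to the database, not something the application tallies itself.)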
- 31 Mar 2011
Yes, I too have been pondering how to multi-thread Foswiki rendering.
One thing to know is that it's not actually the searching itself that takes up most of the time; it's more awkward things like testing for permissions (which today still requires loading the topic, but MongoDBPlugin will do as an in-database query before July 2011), and then the iteration and rendering of all the topics.
Taken together, these (imo) call for thinking about an async-worker / map-reduce-like architecture for macros, rather than just for the search for topic hits.
Mind you, I also believe this heads towards a component result caching design I did a few years ago, where I re-wrote the rendering loop to use server-side includes to re-combine dynamic fragments that were built on save; very scatter-gather like.
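The async-worker / map-reduce idea for macros could be sketched roughly like this (a minimal Python illustration, not Foswiki internals; `render_hit` is a hypothetical stand-in for the per-topic rendering work):

```python
# Illustrative map-reduce sketch: render each topic hit independently
# (map), then stitch the fragments back together in hit order (reduce).
from concurrent.futures import ThreadPoolExecutor

def render_hit(topic):
    # Stand-in for expensive per-topic rendering.
    return f"| {topic} |"

def render_search(hits, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map step: fan the rendering out; map() preserves input order.
        fragments = list(pool.map(render_hit, hits))
    # reduce step: re-combine the fragments into one result.
    return "\n".join(fragments)

print(render_search(["WebHome", "WebIndex"]))
```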
That said, the MongoDBPlugin performance improvements have been quite dramatic. The listed response times are for a web with 25,000 topics, running on a Core 2 Duo 1.8 GHz desktop with 2 GB RAM: we've gone from 5.5 seconds (grep & RCS) to 1.8 seconds (MongoDB) using plain CGI, so with fastcgi / mod_perl / speedy_cgi preloading, that should result in ~1.3 second response times.
- 31 Mar 2011
7000 topics is peanuts for a real database, even for something like SQLite, and certainly for a large-scale database like MongoDB. Well, most people aren't into large-scale problems, but that's something different.
Frankly, Foswiki itself should not care about parallelizing a (single) search. This kind of optimization should be offloaded entirely to a database backend that does proper indexing. As Paul already mentioned, Foswiki's search layer currently processes far too much of a search result at the application layer using Perl, rather than delegating it completely to a database layer, including sorting + limiting + ACL checking. Once that is done, I doubt there will still be a strong need for parallel search.
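To illustrate what delegating it completely to a database layer buys (SQLite used here purely as an example backend, with a hypothetical `viewable` flag standing in for ACL checking):

```python
# Illustrative sketch: pushing matching, ACL filtering, sorting and
# limiting down to the database in one query, so the application never
# post-processes the result set.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (name TEXT, text TEXT, viewable INTEGER)")
conn.executemany("INSERT INTO topics VALUES (?, ?, ?)",
                 [("WebHome", "welcome", 1),
                  ("SecretPlans", "query data", 0),
                  ("SearchHelp", "query syntax", 1),
                  ("FormDef", "query options", 1)])

# Match + ACL check + sort + limit, all inside the database layer.
rows = conn.execute(
    "SELECT name FROM topics "
    "WHERE text LIKE '%query%' AND viewable = 1 "
    "ORDER BY name LIMIT 2").fetchall()
print([name for (name,) in rows])  # ['FormDef', 'SearchHelp']
```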
For now the only search engine for Foswiki that does sorting + limiting + ACL checking (and much more) is SolrPlugin, by totally ignoring the standard %SEARCH layer of Foswiki. Solr itself can be parallelized using master-slave replication; see SolrReplication. Note that this is not parallelization of a single search in itself, but rather distribution of search services among nodes in a cluster.
- 31 Mar 2011
I had a look at MongoDBPlugin and it looks interesting. However, setting up an extra database (on extra servers, if we go that far), migrating all the data there and operating a DBMS comes at some extra cost, and I am wondering if this is really necessary when the machine Foswiki is running on still has 7 cores idle while one is occupied with collecting the data for the page to display.
You are right, putting all the searching into a database-based solution might be the generic approach that scales to several tens of thousands of topics. However, it also means migrating from a cgi-script-plus-flat-files-only solution (which is one of the special and IMHO very attractive features of Foswiki) to a system where you have to operate a far more complicated setup (I am thinking of maintaining updates for the DBMS, backing up data and so on).
Having a storage solution in Perl (cgi-script only) inside the application is very attractive for Foswiki users who are interested in simplicity. This storage solution could be treated as an alternative to a DBMS-based solution, and it could contain performance optimizations like parallel searching.
While not being an expert, looking at SearchAlgorithms/Forking.pm's "search", a first approach could be to split the input topic set, which contains M topics, into chunks of N topics, where N > 10 and M/N < numCpu * 10, or something like that. Then create numCpu threads and feed them the chunks, collecting the results. Afterwards, join the results and order them according to the requested order parameter of the search.
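A minimal sketch of that chunking scheme (Python standing in for Forking.pm's Perl; the topic structure and helper names are hypothetical):

```python
# Sketch of the partitioning described above: split M topics into
# chunks of N (N > 10, fewer than num_cpu * 10 chunks), search the
# chunks on worker threads, then join and re-order the results.
import math
import os
import re
from concurrent.futures import ThreadPoolExecutor

def chunk_size(m, num_cpu):
    # Enforce N > 10 and M/N < num_cpu * 10, per the proposal.
    return max(11, math.ceil(m / (num_cpu * 10)))

def forked_search(pattern, topic_set, order_key=None):
    num_cpu = os.cpu_count() or 1
    n = chunk_size(len(topic_set), num_cpu)
    chunks = [topic_set[i:i + n] for i in range(0, len(topic_set), n)]
    rx = re.compile(pattern)
    with ThreadPoolExecutor(max_workers=num_cpu) as pool:
        partials = list(pool.map(
            lambda c: [t for t in c if rx.search(t["text"])], chunks))
    hits = [t for part in partials for t in part]
    # Join the partial results, then apply the requested sort order.
    return sorted(hits, key=order_key or (lambda t: t["name"]))

topics = [{"name": f"Topic{i}",
           "text": "form options" if i % 2 else "misc"}
          for i in range(40)]
hits = forked_search("options", topics)
print(len(hits))  # 20 (the odd-numbered topics match)
```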
- 05 Apr 2011
Using DBMS techniques doesn't necessarily mean dropping flat files. Instead, all solutions discussed and implemented so far cache the content data while indexing it for faster search. The credo followed here is to use 3rd-party technology that is already available and built for the purpose of fast searches over large amounts of data, instead of reinventing the wheel. So in such a setup, flat files remain the authoritative location for your data, which can always be used to bootstrap your DBMS by reindexing all available flat files. That way you get the benefit of both worlds, i.e. being more robust in cases where your system crashes and leaves you with just the bare filesystem on your hard drive, at the cost of a more complex setup depending on your choice of DBMS. Costs vary from no-brainer-do-it-yourself to need-to-ask-your-IT-department, of course. What you get is a migration path from simple to more complex while extending the capabilities of the system.
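A toy illustration of that bootstrap path (SQLite and the file layout here are hypothetical stand-ins): the index is disposable and can be rebuilt from the flat .txt topic files at any time:

```python
# Sketch of "flat files stay authoritative": the cache/index can be
# thrown away and rebuilt from the topic files after a crash.
import pathlib
import sqlite3
import tempfile

web = pathlib.Path(tempfile.mkdtemp())
(web / "WebHome.txt").write_text("welcome to the web")
(web / "SearchHelp.txt").write_text("query syntax help")

def reindex(web_dir):
    """Bootstrap (or rebuild) the cache from the flat files."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE topics (name TEXT, text TEXT)")
    for f in sorted(web_dir.glob("*.txt")):
        conn.execute("INSERT INTO topics VALUES (?, ?)",
                     (f.stem, f.read_text()))
    return conn

conn = reindex(web)  # after losing the index, just reindex again
names = [n for (n,) in conn.execute("SELECT name FROM topics ORDER BY name")]
print(names)  # ['SearchHelp', 'WebHome']
```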
Besides, even without indexing your content into a separate DBMS, there are internal requirements within the Foswiki core that motivate a DBMS: for example, tracking backlinks, child topics, parent topics and preferences, or caching users, groups and ACLs coming from external sources.
From what I see, there's not much way around making use of some sort of DBMS ... and using Foswiki as an application layer only for displaying results, not for doing the number crunching.
- 05 Apr 2011
Michael is correct: MongoDBPlugin, and DBIStoreContrib (an SQL equivalent), are only caches. You can throw away the content in either plugin/contrib and you don't lose any data. In fact, DBIStoreContrib doesn't even require a special import step; it will automagically import any topics it doesn't have, "on the fly". MongoDBPlugin has a rest handler which you must invoke to populate it the first time (topics are kept in sync thereafter, unless you hack the text files on the server).
- 06 Apr 2011
Thank you for all your explanations. So I will have a look at MongoDBPlugin
- 14 Apr 2011
Cool, hope you can provide some feedback. Check out the open tasks: Tasks.MongoDBPlugin. FWIW I'm running Foswiki trunk (@svn Foswikirev:11285) in production (with MongoDBPlugin), which seems reasonably stable. The actual HEAD of MongoDBPlugin & core on trunk right now has had some significant work done to it since these revs, so I'm waiting for the dust to settle from that before I update.
- 15 Apr 2011