We recently started a new project named WikiTip at wikitip.info. The project grew out of a long search for the best way to provide a comprehensive glossary tooltip for several websites we administer, which have content in multiple languages and use different character sets, including simplified Chinese.
We came up with a client/server architecture built on WordPress at the server side, with collaboration tools for developing the dictionaries and a RESTful, JSON-based API for consuming the definitions from a target website. We also developed a client module, the WikiTip Knowledge Cluster ToolTip for WordPress plugin, which gives webmasters an easy way to integrate remote definition tooltips into their WordPress-based websites. We call each WikiTip site a Knowledge Cluster, since its definitions are clustered around a central concept.
At this point, webmasters should request an account at wikitip.info; they will receive a free site such as mycluster.wikitip.info where they can start building (or importing) their glossaries and dictionaries. Then, on their own WordPress blogs, they need to install the plugin and configure it with the proper credentials, pointing it to their knowledge cluster. Every post in the cluster is a definition, and every tag associated with that post is a term to be explained. Terms can be in any language and use any kind of characters. For instance, an English definition of the term ming men (life gate in Chinese) will be tagged with ming, ming men, mingmen, and 命门, so that each of these terms receives the definition of ming men on the target websites.
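The post/tag relationship can be pictured as a simple lookup table. The following is a minimal Python sketch of that idea, not the plugin's actual code: the `Post` class, the `build_term_index` function, and the placeholder definition text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical cluster post: the body is the definition, the tags are the terms."""
    title: str
    definition: str
    tags: list[str]


def build_term_index(posts):
    """Map every tag (term) to the definition of the post that carries it."""
    index = {}
    for post in posts:
        for tag in post.tags:
            # casefold() makes the lookup case-insensitive; it leaves Chinese text unchanged.
            index[tag.casefold()] = post.definition
    return index


posts = [
    Post(
        title="ming men",
        definition="Life gate (placeholder definition text).",
        tags=["ming", "ming men", "mingmen", "命门"],
    )
]
index = build_term_index(posts)
```

With this index, looking up `index["命门"]` and `index["ming men"]` yields the same definition, which is the behavior described above for the target websites.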
By using the WPML plugin, every cluster can hold translated definitions in several languages while keeping the tags as the defined terms (some related features are still in development).
The best feature of this system is that a knowledge cluster can be reused on multiple websites, so webmasters who work in the same field or administer several websites can easily consume definitions from a single cluster at wikitip.info.
We are still developing the solution, and our current concern is scalability. To test it, we imported a number of dictionaries that are freely available in Romania at http://dexonline.ro and created our own version of dexonline at http://dexonline.wikitip.info. So far we have imported over 250000 definitions and over 145000 base terms with over 900000 inflected forms. We decided to implement an algorithm that is independent of language and punctuation, so we need to preprocess our dictionaries before actual use in order to reduce the time needed to identify terms in the text of a target website's page.
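To illustrate why preprocessing pays off, here is a minimal Python sketch of one common technique for this kind of matching: a character trie built once from the terms, then scanned against the page text. This is not necessarily the algorithm WikiTip uses; the names `build_trie`, `find_terms`, and `END` are assumptions made for the example.

```python
END = ""  # marker key for "a term ends here"; cannot collide with a one-character key


def build_trie(terms):
    """Preprocessing step: build a nested-dict character trie from the terms."""
    root = {}
    for term in terms:
        node = root
        for ch in term.casefold():
            node = node.setdefault(ch, {})
        node[END] = term
    return root


def find_terms(text, trie):
    """Scan the text once, keeping the longest term that starts at each position.

    Returns (start, end, term) tuples; offsets refer to the case-folded text.
    """
    folded = text.casefold()
    matches = []
    i = 0
    while i < len(folded):
        node, best, j = trie, None, i
        while j < len(folded) and folded[j] in node:
            node = node[folded[j]]
            j += 1
            if END in node:
                best = (i, j, node[END])
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched term
        else:
            i += 1
    return matches


trie = build_trie(["ming", "ming men", "mingmen", "命门"])
found = [term for _, _, term in find_terms("The ming men (命门) point.", trie)]
```

Because the scan works on characters rather than on words, it does not depend on spaces or punctuation, which matters for Chinese text. The same property means this sketch would also match a term inside a longer word, so a real implementation would likely need a boundary check for languages that separate words with spaces.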
Below are some results measured on our VPS, which has one processor at around 2.4 GHz and 2 GB of RAM:
Number of terms: 145000
Terms source | Preprocessing/loading dictionary object (sec) | Actual search duration within the text (sec) | Dictionary object size | Used memory |
---|---|---|---|---|
DB (preprocessing) | 38 | – | 2 MB | 161 MB |
Object retrieved from file cache | 1.74 | 0.37 | 2 MB | 130 MB |
Object retrieved from memory cache | 1.32 | 0.44 | 2 MB | 128 MB |
Number of terms: 1050000
Terms source | Preprocessing/loading dictionary object (sec) | Actual search duration within the text (sec) | Dictionary object size | Used memory |
---|---|---|---|---|
DB (preprocessing) | 409 | – | 12 MB | 1 GB |
Object retrieved from file cache | 5.19 | 1.2 | 12 MB | 790 MB |
Object retrieved from memory cache | 2.18 | 1.09 | 12 MB | 778 MB |
Our strategy for large clusters is therefore to preprocess the dictionary into a binary format once a day and save it to the file cache. The first reader then loads it from the file cache into the memory cache, and all subsequent readers use the object from memory. If the memory cache fails for any reason, the object is retrieved again from the file cache in reasonable time. The system thus scales to quite large dictionaries while staying within normal user expectations.
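The two-level caching strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: a module-level dict stands in for the real memory cache, `pickle` stands in for the binary format, and the function names and the `dexonline.bin` file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

_memory_cache = {}  # stand-in for a memory cache shared between requests


def preprocess_from_db(terms):
    """Expensive step (hypothetical): build the dictionary object from raw terms."""
    return {term.casefold(): term for term in terms}


def refresh_file_cache(terms, cache_file):
    """Daily job: preprocess the terms and write the binary object to the file cache."""
    cache_file.write_bytes(pickle.dumps(preprocess_from_db(terms)))


def load_dictionary(cluster, cache_file):
    """Return the dictionary object, preferring the memory cache over the file cache."""
    obj = _memory_cache.get(cluster)
    if obj is None:
        # Memory miss (first reader, or the memory cache was lost): reload from file.
        obj = pickle.loads(cache_file.read_bytes())
        _memory_cache[cluster] = obj
    return obj


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "dexonline.bin"
    refresh_file_cache(["ming men", "命门"], cache_file)
    first = load_dictionary("dexonline", cache_file)   # loaded from the file cache
    second = load_dictionary("dexonline", cache_file)  # served from memory
    _memory_cache.clear()                              # simulate a memory cache failure
    third = load_dictionary("dexonline", cache_file)   # falls back to the file cache
```

The point of the design is that the slow database preprocessing never sits on a reader's path: readers only ever pay for a memory lookup or, at worst, a file load.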
Of course, specialized (and therefore small) dictionaries will be processed by the same algorithm in a fraction of a second.
When we launch in production, the VPS will be upgraded to 8 GB of RAM to accommodate several simultaneous requests against the largest knowledge clusters.