jQuery Thesaurus Fix for Huge Terms Database

I am using a modified version of jQuery Thesaurus in one of my projects, and I want to explain my solution for the case where the terms database is huge.

The original code provides the following functions:

1. checks the content of an HTML node in a web page for terms from a dictionary

2. marks the terms it finds and constructs links for AJAX calls to the term definitions

3. on mouseover, constructs a tooltip on the fly and populates it with the term definition

My modified version uses a JSON feed instead of the default DB controller, but that is not the subject of this article.

The JS code waits for the page to finish loading, then downloads (or loads from the DB) the full list of terms as a JavaScript object. If the database holds a large number of terms, execution slows down until the tool becomes unusable. There are reports that beyond 400-500 terms the solution is already out of the question.

Here is my solution to this problem. I assumed that the content of any web page is much smaller than a list of terms from a database with several thousand entries (or even 130k entries, as mentioned in the above report). In that case it makes sense to pass the page text to the DB controller and reduce the list of terms to only those that actually occur in the target web page.

Therefore I have modified the project to handle this request (the code was updated to address the issue mentioned in the first comment):

1. Change this function as follows

    /**
     * 1) Collects the unique words from the containers
     * 2) Loads the list of matching terms from the server
     * 3) Searches terms in DOM
     * 4) Marks up found terms
     * 5) Binds event handlers to them
     */
    bootstrap : function() {
        var words = ""; // start from an empty string so "undefined" is not prepended
        var list;
        $.each(this.options.containers, function(i, node) {
            // replace digits and punctuation with spaces
            words += " " + $(node).text().replace(/[\d!;:<>.=\-_`~@*?,%"'\\(){}]/g, ' ');
        });
        // collapse whitespace, then keep each word only once
        list = $.trim(words.replace(/\s+/g, ' ')).split(" ");
        list = removeDuplicates(list); // helper that is not shown in this article
        // URL-encode the words before adding them to the query string
        words = encodeURIComponent(list.join(" "));
        $.getJSON(this.options.JSON_TERMS_URI + '&words=' + words + '&callback=?', $.proxy(function(data) {
            this.terms = this._processResponseTerm(data);
            $.each(this.options.containers, $.proxy(function(i, node) {
                this._searchTermsInDOM(node);
                this._markup(node);
            }, this));
            this.bindUI('body');
        }, this));
    },

You can see that I am accessing my JSON feed instead of the DB controller, but that does not matter; the idea remains the same. I am passing the text extracted from the containers declared in the Thesaurus options.
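The bootstrap function calls removeDuplicates, which is not shown in this article. Here is a minimal sketch of what such a helper could look like, assuming it only needs to drop empty strings and repeated words while keeping the original order. The extractWords wrapper is a hypothetical name, added only so the clean-up can be tried on a plain string without a DOM.

```javascript
// Hypothetical helper: the article calls removeDuplicates() but does not show it.
function removeDuplicates(list) {
    var seen = Object.create(null); // no inherited keys such as "constructor"
    var unique = [];
    for (var i = 0; i < list.length; i++) {
        var word = list[i];
        // skip empty strings and words that were already collected
        if (word !== "" && !seen[word]) {
            seen[word] = true;
            unique.push(word);
        }
    }
    return unique;
}

// Hypothetical wrapper: same clean-up as in bootstrap(), applied to a plain string.
function extractWords(text) {
    var cleaned = text
        .replace(/[\d!;:<>.=\-_`~@*?,%"'\\(){}]/g, ' ') // digits and punctuation become spaces
        .replace(/\s+/g, ' ');                          // collapse runs of whitespace
    return removeDuplicates(cleaned.split(" "));
}

console.log(extractWords("The cat (and the other cat) sat; the end.").join(" "));
```

The comparison is case sensitive, so "The" and "the" are both kept. Lowercasing the words first would shrink the list further, since the server-side check in step 2 ignores case anyway.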

2. Filter the terms in the DB controller (syntax is for generating a JSON feed)

$tgs = $this->get_thesaurus(); // array of all terms in the dictionary
$words = isset($_GET['words']) ? $_GET['words'] : ''; // space-separated list of unique words from the target page
$tags = array();

foreach ($tgs as $tag) {
	$list = explode(" ", $tag); // split each term into its words
	foreach ($list as $word) {
		// skip empty pieces; stripos() is a case-insensitive substring test,
		// so a term is kept if any of its words occurs anywhere in the page words
		if ($word !== '' && stripos($words, $word) !== false) {
			$tags[] = $tag;
			break;
		}
	}
}

return array( // data for the JSON feed
  'count' => count($tags),
  'tags' => $tags
);

With this method the number of dictionary terms loaded into the JavaScript object drops to a small number, and speed is no longer compromised. It is true that for web pages with massive content the list of words cannot be sent to the server, but in most cases this solution works well.
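One possible way to live with the size limit of a GET request is to split the word list into several smaller requests. The sketch below only shows the splitting step, and it rests on assumptions: the 2000-character limit is an illustrative value (real limits depend on the browser and the server), requestLength and chunkWords are hypothetical names, and the words are assumed to be URL-encoded when the request is built.

```javascript
// ASSUMPTION: illustrative limit only; real limits vary by browser and server.
var MAX_URL_LENGTH = 2000;

// Length of the request URL built for these words, assuming they are URL-encoded.
function requestLength(baseUri, words) {
    return (baseUri + '&words=' + encodeURIComponent(words) + '&callback=?').length;
}

// Split a word list into space-joined chunks whose request URLs stay within maxLength.
// A single word that is too long on its own still gets a chunk of its own.
function chunkWords(baseUri, list, maxLength) {
    var chunks = [];
    var current = [];
    for (var i = 0; i < list.length; i++) {
        var candidate = current.concat([list[i]]);
        if (current.length > 0 && requestLength(baseUri, candidate.join(" ")) > maxLength) {
            chunks.push(current.join(" ")); // close the current chunk
            current = [list[i]];            // start a new one with this word
        } else {
            current = candidate;
        }
    }
    if (current.length > 0) {
        chunks.push(current.join(" "));
    }
    return chunks;
}

// Tiny limit so the split is visible; "/terms?format=json" is a made-up URI.
console.log(chunkWords("/terms?format=json", ["alpha", "beta", "gamma", "delta"], 50));
```

Each chunk would then need its own request, and the returned term lists would have to be merged and de-duplicated before the markup step.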


Comments

6 responses to “jQuery Thesaurus Fix for Huge Terms Database”

  1. Well, it appears that if the web page has too much content, the script is unable to send the text to the server. Therefore I need to find a way to eliminate duplicate words so that the payload is smaller.
    I will be back on that.

  2. So I managed to find a solution to POST the page content no matter its size. It involves a PHP proxy installed on the content server. The jQuery script POSTs the text content to this proxy together with the remote URI information, and at the other end a special page accepts the POST and saves the content and a key in a database table.

    After the POST is completed the callback function will construct the thesaurus object in the client browser and the rest will function normally.

    If anybody is interested I will add the code snippets for this solution as well.

  3. Felix

    I would like to see that solution here 🙂

    I am playing with thesaurus and a special database with many terms.

  4. Hi,

    I just published a plugin in the WordPress plugin repository here: http://wordpress.org/extend/plugins/wikitip-knowledge-cluster-tooltip-for-wordpress/. This plugin performs these tasks:

    1. it loads the necessary JS and CSS files
    2. when the page is ready, it extracts the text from the containers into a variable and posts it back to the server. In fact I need to post it to the remote server, but that is impossible because of cross-domain scripting restrictions, so I post it to a local PHP proxy page that in turn transfers everything to the remote server. If you choose to use a local DB controller, you will not need this trick.
    3. on the server where the dictionary is stored, the terms from the dictionary that also occur in the page are matched, and only a small array of relevant terms is returned to the browser.
    4. the script then identifies the terms in the text as usual and links them to the definitions.

    Of course step 3 is trickier, since we need to use callbacks to request the terms only once the POST has succeeded. But it is not that hard to understand.
    Let me know if you have more questions.

  5. I am not using WordPress.

    Is it possible to use the script properly outside of WordPress, not as a plugin?

  6. Yes, but you need someone to write the code for your platform. Whatever platform you use, you need to process the dictionary and reduce its large number of terms to the small number of terms that actually occur in the target text. This is done in the backend. Then you send only the list of filtered terms to the original script and process it normally.
