[Screenshot of the wikitip.info home page]

Recently we started a new project named WikiTip at wikitip.info. The project grew out of a long search for the best way to provide comprehensive glossary tooltips for some websites we administer that have content in multiple languages and use different scripts, including simplified Chinese.

We came up with a server/client architecture based on a WordPress platform on the server side, with collaboration tools for developing the dictionaries, and a RESTful JSON API for consuming the definitions from a target website. We also developed a client module for WordPress, the WikiTip Knowledge Cluster ToolTip for WordPress plugin, which gives webmasters an easy way to integrate remote definition tooltips into their WordPress-based websites. We call each wikitip site a Knowledge Cluster, since the definitions there are clustered around a central concept.
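
As a rough illustration of the consuming side, here is a minimal sketch of how a client could fetch a definition from a cluster over the JSON API, using the WordPress HTTP API; the endpoint and parameter names are hypothetical, and the actual plugin also handles credentials, caching, and tooltip markup:

// Hypothetical endpoint and parameters; the real feed URL differs.
$cluster  = 'http://mycluster.wikitip.info';
$term     = rawurlencode('ming men');
$response = wp_remote_get($cluster . '/?feed=json&term=' . $term);

if (!is_wp_error($response)) {
    // Decode the JSON definition and render it, e.g. inside a tooltip.
    $definition = json_decode(wp_remote_retrieve_body($response), true);
}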

At this point webmasters can request an account at wikitip.info and receive a free site such as mycluster.wikitip.info, where they can start building (or importing) their glossaries and dictionaries. Then, on their own WordPress blogs, they install the plugin and configure it with the proper credentials, pointing it at their knowledge cluster. Every post in the cluster constitutes a definition, and every tag associated with that post is a term to be explained. The terms can be in any language and use any kind of characters. For instance, an English definition of the term ming men (life gate in Chinese) would be tagged with ming, ming men, mingmen, and 命门, so that all of these terms receive the definition of ming men on the target websites.

By using the WPML plugin, every cluster can hold definitions translated into several languages while keeping the tags as the defined terms (some related features are still in development).

The best feature of this system is that a cluster of knowledge can be reused on multiple websites, so webmasters working in the same field, or administering several websites, can easily consume definitions from a single cluster at wikitip.info.

We are still developing the solution, and we are now concerned with scalability. For this reason we imported a number of dictionaries freely available in Romanian at http://dexonline.ro and created our own version of dexonline at http://dexonline.wikitip.info. So far we have imported over 250,000 definitions and over 145,000 base terms with over 900,000 inflected forms. We decided to implement an algorithm that is language and punctuation independent, so we need to prepare our dictionaries before actual use in order to reduce the time needed to identify terms in the text of a target page.

Below are some results obtained on our VPS with one processor at around 2.4 GHz and 2 GB of RAM:

Number of terms: 145,000

Terms source                       | Preprocessing/loading the dictionary object (s) | Search within the page text (s) | Dictionary object size | Memory used
DB (preprocessing)                 | 38   | –    | 2 MB | 161 MB
Object retrieved from file cache   | 1.74 | 0.37 | 2 MB | 130 MB
Object retrieved from memory cache | 1.32 | 0.44 | 2 MB | 128 MB

Number of terms: 1,050,000

Terms source                       | Preprocessing/loading the dictionary object (s) | Search within the page text (s) | Dictionary object size | Memory used
DB (preprocessing)                 | 409  | –    | 12 MB | 1 GB
Object retrieved from file cache   | 5.19 | 1.2  | 12 MB | 790 MB
Object retrieved from memory cache | 2.18 | 1.09 | 12 MB | 778 MB

Therefore our strategy for large clusters is to preprocess the dictionary into its binary format once a day and save it to the file cache. The first reader then loads it from the file cache into the memory cache, and all subsequent readers use the object from memory. If the memory cache fails for any reason, the object is retrieved again from the file cache in reasonable time. The scalability of the system thus allows quite large dictionaries to be used within normal user expectations.
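
As a minimal sketch of this three-level strategy, assuming the WordPress object cache for the memory level and hypothetical helper and path names:

function load_dictionary_object($cluster) {
    // 1. Memory cache: the fastest path for every reader after the first one.
    $dict = wp_cache_get('dict_' . $cluster, 'wikitip');
    if ($dict !== false) {
        return $dict;
    }

    // 2. File cache: the binary object serialized by the daily preprocessing job.
    $file = WP_CONTENT_DIR . '/cache/wikitip/' . $cluster . '.dict'; // hypothetical path
    if (is_readable($file)) {
        $dict = unserialize(file_get_contents($file));
    } else {
        // 3. Worst case: full preprocessing from the database
        //    (about 38 s for 145,000 terms, 409 s for 1,050,000 terms).
        $dict = preprocess_dictionary_from_db($cluster); // hypothetical helper
        file_put_contents($file, serialize($dict));
    }

    // Repopulate the memory cache for all subsequent readers.
    wp_cache_set('dict_' . $cluster, $dict, 'wikitip');
    return $dict;
}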

Of course, specialized, and therefore small, dictionaries are processed in a split second by the same algorithm.

When we launch into production, the VPS will be upgraded to 8 GB of RAM to accommodate several simultaneous requests against the largest knowledge clusters.


I am using a modified jQuery Thesaurus in one of my projects, and I want to explain my solution for the case where the terms database is huge.

The original code provides the following functions:

1. checks the content of an HTML node in a webpage for terms against a dictionary

2. marks the found terms and constructs links for AJAX calls to the term definitions

3. on mouseover, constructs a tooltip on the fly and populates it with the term definition

My modified version uses a JSON feed instead of the default DB controller, but that is not the subject of this article.

The JS code waits for the page to load, then downloads (or loads from the DB) the full list of terms as a JavaScript object. If the database contains a large number of terms, execution slows down until the tool becomes unusable. There are reports that beyond 400-500 terms the solution is already out of the question.

Here I want to explain my solution to this problem. I reasoned that the content of any webpage should be much smaller than a list of terms from a database with several thousand entries (or even the 130k entries mentioned in the report above). In that case it makes sense to pass the page text to the DB controller and filter the list of terms down to only those that actually exist in the target webpage.

Therefore I modified the project as follows (the code was updated to address the issue mentioned in the first comment):

1. Change this function as follows

/**
     * 1) Collects and cleans the text of the configured containers
     * 2) Sends the de-duplicated word list to the server, which returns
     *    only the terms actually present in the page
     * 3) Searches those terms in the DOM and marks them up
     * 4) Binds event handlers to them
     */
    bootstrap : function() {
        var words = ""; // must be initialized, otherwise "undefined" gets concatenated
        var list;
        $.each(this.options.containers, $.proxy(function(i, node) {
            // strip digits and punctuation, then collapse whitespace
            words += " " + $(node).text()
                .replace(/[\d!;:<>.=\-_`~@*?,%"'(){}]/g, ' ')
                .replace(/\s+/g, ' ');
        }, this));
        // de-duplicate the word list once, after all containers have been read;
        // removeDuplicates() is a small helper defined elsewhere in my build
        list = removeDuplicates(words.split(" "));
        words = list.join(" ");
        $.getJSON(this.options.JSON_TERMS_URI + '&words=' + encodeURIComponent(words) + '&callback=?', $.proxy(function(data) {
            this.terms = this._processResponseTerm(data);
            $.each(this.options.containers, $.proxy(function(i, node) {
                this._searchTermsInDOM(node);
                this._markup(node);
            }, this));
            this.bindUI('body');
        }, this));
    },

You can see that I am accessing my JSON feed instead of the DB controller, but this is not an issue; the idea remains the same. I am passing along the text extracted from the containers declared in the Thesaurus options.

2. Filter the terms in the DB controller (the syntax here is for generating a JSON feed):

$tgs = $this->get_thesaurus(); // array of all terms in the dictionary
$words = $_GET['words']; // list of unique words extracted from the target page
$tags = array();

foreach ($tgs as $tag) {
	$list = explode(" ", $tag); // split each term into its component words
	foreach ($list as $word) {
		// keep the term if any of its words appears in the page text
		if ($word !== '' && stristr($words, $word)) {
			$tags[] = $tag;
			break;
		}
	}
}

return array( // returned as the JSON feed
  'count' => count($tags),
  'tags' => $tags
);
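
For example, if the target page contains the phrase ming men and the cluster holds the tags from the WikiTip example above (ming, ming men, mingmen, 命门), the words ming and men are found in the page, so the controller keeps ming and ming men, drops the other two, and returns roughly:

{"count":2,"tags":["ming","ming men"]}

Note that stristr() does substring matching, so a few extra terms may occasionally pass the filter; they are harmless, since the client-side DOM search never matches them.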

By using this method, the number of dictionary terms loaded into the JavaScript object falls back to a small figure, and the speed of the solution is no longer compromised. It is true that for webpages with massive content the list of words cannot be sent to the server, but in most cases this solution will work well.


A Romanian translation of the bbPress 2.0.x plugin for WordPress.

It installs into the …/plugins/bbpress/bbp-languages/ directory.

bbPress20-ro_RO.zip

733 strings translated (100%).


A while ago we discussed replacing HDDs with SSD devices in the enterprise environment. Now, with the launch of a new generation of Flash chips called eMLC (from enterprise MLC), high-performance hard disks can already be replaced with SSD devices at a more advantageous price.

Of course, for now there are many limitations, and the RAMSAN-810, the first eMLC storage system from TMS, is recommended only for read-intensive applications such as data warehousing.

Below is an interesting chart on the price evolution of storage devices.


[Relevanssi - WordPress Search Done Right]

Relevanssi is a high-performance search plugin for WordPress and WordPress Multisite. It can be installed in two variants, a free one and a premium one, the latter requiring an annual subscription.

The plugin is compatible with the latest WordPress versions and can index any kind of classic or custom content (custom post types, custom taxonomies, user profiles, comments, etc.). Results are ordered by relevance, according to a score computed from several criteria configurable in the admin area.

In the multisite version, searches can cover several blogs simultaneously by configuring the search form and the results template.

The plugin can highlight the searched terms both in its own searches and in external ones coming from Google, AOL, Bing, or Yahoo.

We offer here a professional translation of the plugin into Romanian. The translation package contains localizations for both Relevanssi versions, so it does not matter which version you use: simply place the files in the plugin folder and the translation will work automatically.

To access the plugin, follow the link in the image on the right.

Relevanssi free 2.9.9 / Relevanssi Premium 1.6 ro_RO localization


In this short article we present the results of speed tests on storage devices of different technologies, usable in personal computers.

The test consists of writing and reading a 256 MB file, using blocks of different sizes (on the Y axis) and a queue of 4 write commands (one executing, three waiting). We have written before about queueing theory as it applies to storage devices, in the corresponding case studies.

The results are expressed as available bandwidth, namely how many MB can be transferred in one second on that storage device. IOPS measurements are not very interesting for personal computers, so they were ignored in this test.

In conclusion, an SSD disk is very useful for increasing the performance of a personal computer. Backup systems connected over USB 3 are also already a necessity for modern computing configurations.

Moreover, the old idea of using a memory stick to speed up the Windows operating system seems to no longer have any applicability at all.

[Chart: speeds for the 4 storage technologies]


This plugin creates and maintains one or more hierarchies of blogs in a WPMS network.

Download the plugin here: nsh.zip v0.1.0

The newest version can always be found here: http://wordpress.org/extend/plugins/wpmswpmu-network-sites-hierarchy/

