On more than one occasion I have found myself in need of a small, handy search engine for my custom, PHP-based websites. If the main content of our pages is stored in a database table (as is the case with any CMS), then we just need a few SQL queries to search the pages… but what if we’re building a heavily customized website from scratch, and many of our pages are handled manually with a lot of custom PHP code?
There are obviously a lot of great search engines already out there that will be able to do the dirty indexing job for you. I’m particularly in love with Sphider, a lightweight PHP-MySQL web spider that in just 300KB features advanced stuff like word autocompletion, spelling suggestions, etc.
But what if we’d like an even simpler search engine, something to use with our simple 50-page website, that:
- Could be written in fewer than 100 lines of code
- Could use a single database table
- Could preserve all the major features, like natural language searches, common word removal and optional boolean search?
Today I’d like to introduce you to a super-easy technique that uses PHP’s ob_get_contents() output buffering function. I really don’t know whether it’s a common solution or not: I tried to google it, but I couldn’t come up with anything quite similar. Let’s start.
How does it work? The idea.
Read this simple piece of code:
```php
<html>
<head>...</head>
<body>
<?php ob_start(); ?>
<h1>Basic Page Example</h1>
<p>This is your custom page code, something with basic XHTML mixed with PHP:</p>
<ul>
<?php for ($i = 0; $i < 100; $i++): ?>
    <li>List Item <?= $i ?></li>
<?php endfor; ?>
</ul>
<?php
// copy everything printed since ob_start(), then send it to the browser
$content = ob_get_contents();
ob_end_flush();
?>
</body>
</html>
```
The result is that the $content variable holds the whole rendered content of the page. As you might be starting to guess, that’s a potentially perfect start for our website indexing process!
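The same capture step can be wrapped in a small function. This is only a sketch: captureOutput() and its $passThrough flag are hypothetical names that do not appear in the code above, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: runs $render and returns everything it printed.
// With $passThrough = true the output is also sent on to the browser
// (ob_end_flush), otherwise it is discarded (ob_end_clean).
function captureOutput(callable $render, bool $passThrough = true): string
{
    ob_start();
    $render();
    $captured = (string) ob_get_contents();
    if ($passThrough) {
        ob_end_flush();
    } else {
        ob_end_clean();
    }
    return $captured;
}
```

Calling captureOutput() with a closure that echoes a fragment of the page returns that fragment as a string, which can then go through the same cleaning steps as $content.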
If we take this text, remove the tags that should not be indexed (<script>, <object>, <form>, etc.), strip all the remaining tags and store the result in our database table, we’re already halfway through the problem!
Let’s create our tag stripper function (that’s just a basic stub, it could be improved to handle useless whitespace removal, etc.):
```php
// Removes the listed tags together with everything inside them.
// With $invert = FALSE the listed tags are kept and all the others are removed.
function removeTagWithContent($text, $tags = '', $invert = FALSE) {
    // turn '<script><embed>' into array('script', 'embed')
    preg_match_all('/<(.+?)[\s]*\/?[\s]*>/si', trim($tags), $tags);
    $tags = array_unique($tags[1]);
    if (is_array($tags) AND count($tags) > 0) {
        if ($invert == FALSE)
            return preg_replace('@<(?!(?:'. implode('|', $tags) .')\b)(\w+)\b.*?>.*?</\1>@si', '', $text);
        else
            return preg_replace('@<('. implode('|', $tags) .')\b.*?>.*?</\1>@si', '', $text);
    } elseif ($invert == FALSE) {
        return preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $text);
    }
    return $text;
}

// Reduces a chunk of HTML to the plain text we want to index.
function tagStripper($content) {
    $content = removeTagWithContent($content, '<script><embed><form>', true);
    $content = strip_tags($content);
    $content = html_entity_decode($content);
    return $content;
}
```
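The whitespace cleanup mentioned above could look roughly like this. It is a sketch only: normalizeWhitespace() is a hypothetical name, and it assumes UTF-8 text and PHP 7 or later.

```php
<?php
// Hypothetical helper: collapses the whitespace left behind after stripping tags.
function normalizeWhitespace(string $text): string
{
    // html_entity_decode() turns &nbsp; into a non-breaking space (U+00A0)
    $text = str_replace("\u{00A0}", ' ', $text);
    // collapse any run of spaces, tabs and newlines into a single space;
    // preg_replace() returns null on invalid UTF-8, so fall back to the input
    $collapsed = preg_replace('/\s+/u', ' ', $text);
    return trim($collapsed ?? $text);
}
```

It would be called as the last step of tagStripper(), just before the return, so that the stored text stays compact.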
The MySQL Table
This is the basic SQL code you need to create a proper table to store all the indexed data:
```sql
CREATE TABLE `search_pages` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT,
    `page` mediumtext NOT NULL,
    `title` mediumtext NOT NULL,
    `content` longtext NOT NULL,
    `indexed` timestamp NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `page` (`page`(255)),
    FULLTEXT KEY `fulltext_title` (`title`),
    FULLTEXT KEY `fulltext_content` (`content`)
) ENGINE=MyISAM;
```
As you can see, we added two separate full-text indexes, one for the page title and one for the page content. This gives us finer-grained control over how MySQL ranks the results.
Indexing the pages
Everything is ready for our page indexing. Let’s create the functions to call around every part of the page you want to index:
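```php
$pagecontent = "";
$title = "";

// Placeholder: decides whether the current request should be indexed.
// The real logic is site-specific (see the ideas discussed further down).
function isIndexingEnabled() {
    return true;
}

// Sets the title stored alongside the indexed content.
function setIndexTitlePage($t) {
    global $title;
    $title = $t;
}

// Starts capturing the output of an indexable part of the page.
function startIndexingContent() {
    if (!isIndexingEnabled()) {
        return;
    }
    ob_start();
}

// Stops capturing, sends the output to the browser and keeps a tag-free copy.
function stopIndexingContent() {
    global $pagecontent;
    if (!isIndexingEnabled()) {
        return;
    }
    $text = ob_get_contents();
    ob_end_flush();
    $pagecontent .= tagStripper($text);
}

// Saves the collected content for the current URL.
function storeIndexedContent() {
    global $pagecontent, $title;
    if (!isIndexingEnabled()) {
        return;
    }

    // the page URL, without the query string
    $page = preg_replace('#\?.*$#D', '', $_SERVER['REQUEST_URI']);
    $now = time();

    // mysqli replaces the old mysql_* functions, which were removed in PHP 7
    $link = mysqli_connect('host', 'user', 'pass', 'database');

    // a prepared statement keeps quotes in the content from breaking the query;
    // ON DUPLICATE KEY UPDATE refreshes the row if the page is already indexed
    $stmt = mysqli_prepare($link,
        "INSERT INTO search_pages (page, title, content, indexed)
         VALUES (?, ?, ?, FROM_UNIXTIME(?))
         ON DUPLICATE KEY UPDATE title = VALUES(title),
             content = VALUES(content), indexed = VALUES(indexed)");
    mysqli_stmt_bind_param($stmt, 'sssi', $page, $title, $pagecontent, $now);
    mysqli_stmt_execute($stmt);
}
```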
You then have to wrap the relevant part of your pages between calls to startIndexingContent() and stopIndexingContent(). This avoids repeatedly indexing common parts like headers, sidebars, etc. Set the page title with setIndexTitlePage() and, at the end of your pages, just call storeIndexedContent() to save everything for later use.
Wait! There’s something wrong here…
Yep. That’s right. If we really do it this way, then every visit to every page of your website will cause a write to the DB, which is not that good. We have to find a way to know when it’s really time to re-index the pages (this is the job of the isIndexingEnabled() function called above). This is obviously quite a website-specific task. I’m leaving this part out of the tutorial, as it’s the only one that could cause a noticeable slowdown on your server if not set up the right way, but let me share some ideas about it:
- Enable website re-indexing after a fixed amount of time, e.g. 2 days: check the timestamp of the indexed version in the DB table and, if it’s older than that, update the row with the new content.
- Make a SHA1 hash of the content of the page; if it differs from the one stored in the DB, store the new version!
- If your site is under SVN, then just use the SVN revision number to find out whether the page stored in the DB needs to be refreshed.
- Keep indexing disabled at all times, except when the page is being requested by a specific cron task you set up to spider the content (it’s just a matter of retrieving the pages recursively with wget).
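Three of these ideas can be sketched as small checks. Everything here is an assumption made for illustration: the function names, the 'SiteIndexer/1.0' user agent string, and the idea that a hash of the stored content is available (the table above has no hash column, so it would have to be added or computed from the stored text).

```php
<?php
// Time-based check: the stored copy is stale once it is at least
// $maxAge seconds old. Both timestamps are Unix timestamps.
function isStale(int $indexedAt, int $now, int $maxAge = 172800): bool
{
    return ($now - $indexedAt) >= $maxAge; // 172800 s = 2 days
}

// Hash-based check: the page changed if the SHA1 of the freshly
// captured text differs from the hash of the stored version.
function hasChanged(string $storedHash, string $newContent): bool
{
    return $storedHash !== sha1($newContent);
}

// Spider-based check: only index requests made by our own cron job,
// e.g. one that runs wget with --recursive and a matching --user-agent.
function isSpiderRequest(array $server, string $agent = 'SiteIndexer/1.0'): bool
{
    return ($server['HTTP_USER_AGENT'] ?? '') === $agent;
}
```

With the last one, isIndexingEnabled() could be as short as returning isSpiderRequest($_SERVER). The first two need a read of the stored row first, so they still cost one query per visit.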
I’m open to other suggestions, also.
Last but not least: the searching functions!
All the work is done within MySQL: you just need to use the full-text specific syntax MATCH (...) AGAINST (... IN BOOLEAN MODE) to perform a full-text search on specific table columns.
```sql
-- each %s is the quoted, escaped search string (filled in with sprintf)
SELECT *,
    0.6 * MATCH(title) AGAINST(%s IN BOOLEAN MODE) +
    0.9 * MATCH(content) AGAINST(%s IN BOOLEAN MODE) AS `rank`
FROM search_pages
WHERE
    MATCH(title) AGAINST(%s IN BOOLEAN MODE) OR
    MATCH(content) AGAINST(%s IN BOOLEAN MODE)
ORDER BY `rank` DESC
```
As you can see, thanks to the two separate full-text indexes instead of a single one, we are now able to weight the page title differently from the content of the page. The query returns the rows sorted by the rank field.
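Here is one way the query above could be run from PHP. Treat it as a sketch under assumptions: toBooleanQuery(), searchSql() and searchPages() are hypothetical names, the code uses mysqli prepared statements (with ? placeholders in place of %s), get_result() needs the mysqlnd driver, and requiring every word with + is just one possible policy. It needs PHP 7.4 or later.

```php
<?php
// Turns free-form user input into a boolean-mode expression in which
// every word is required ("+word").
function toBooleanQuery(string $input): string
{
    // drop characters that act as operators in boolean mode
    $clean = preg_replace('/[+\-<>()~*"@]+/', ' ', $input);
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map(fn ($w) => '+' . $w, $words));
}

// The ranking query, with ? placeholders for the prepared statement.
function searchSql(): string
{
    return "SELECT page, title,
                0.6 * MATCH(title) AGAINST(? IN BOOLEAN MODE) +
                0.9 * MATCH(content) AGAINST(? IN BOOLEAN MODE) AS `rank`
            FROM search_pages
            WHERE MATCH(title) AGAINST(? IN BOOLEAN MODE)
               OR MATCH(content) AGAINST(? IN BOOLEAN MODE)
            ORDER BY `rank` DESC";
}

// Runs the search and returns the matching rows as associative arrays.
// Assumes $db is an open mysqli connection to the right database.
function searchPages(mysqli $db, string $userInput): array
{
    $terms = toBooleanQuery($userInput);
    if ($terms === '') {
        return []; // nothing to search for
    }
    $stmt = $db->prepare(searchSql());
    // the same search string fills all four placeholders
    $stmt->bind_param('ssss', $terms, $terms, $terms, $terms);
    $stmt->execute();
    return $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
}
```

A results page would then loop over the rows returned by searchPages() and print a link to each page together with its title.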
Conclusions
This is a super-quick search engine implementation. It cannot be used for big sites with tons of pages, nor can it provide advanced features like keyword suggestion. But it really takes about 90 lines of code to implement, if you know what I mean. I’m using it on a 20-page website, and it works surprisingly well. And the search results are incredibly good, too.
It could certainly be useful to some of you. Or, at least, it might have helped you understand the superpowers that MySQL has when it comes to full-text searches.
I hope to get some feedback/suggestions about this one. :)

