TODO
----

 General development directions

* Support for more databases.
* Support for more transport protocols.
* More APIs, e.g. a Java class with libudmsearch support.
* Support for huge databases with hundreds or thousands of millions of documents.
* Make it more manageable, i.e. administration tools, etc.


  Below are things that can be implemented sometime in the future.
They are given in no particular order. If you want to change the order of
their development, please ask on general@mnogosearch.org.


Search quality and results presentation
---------------------------------------
* Click rank
* Administrator-defined dynamic site priority:
	- approved sites which should be displayed at the top of results;
	- disapproved sites (e.g. for abuse) which should not be displayed.
* Take word context into account: <b>, <font size="xx">, <big> and so on.
* Optional automatic URL limit by SERVER_NAME variable.
* "Exclude" limits, for example "search through everything except
  a given site": ue=http://esite/
* Fuzzy search for accented letters, for example Cyrillic "io" and "ie".
* Regex search
* Rank URLs with long pathnames lower than direct hits on, say, a domain
name with no directory path.


Indexing related stuff
----------------------
* Detect clones at the site level. Currently this is implemented at the page
level only. The idea is to detect that a site being indexed is a mirror of
another site after indexing only several pages, without having to index all
of its pages.
* SPAM clearance.
* Fix the indexer becoming slow when ServerTable is big. This is caused
by a full sequential scan. Add an in-memory cache for the ServerTable part.
* FTP digest ls-lR.gz support. For example, ftp://ftp.chg.ru/ls-lR.gz
* Make it possible for external parsers to return converted content 
together with headers like Content-Type, Title and so on.
* Exclude autoincrement mode for 'url' table. We have to use CRC32 mode
  since it is much faster for indexing and probably would take less space.


Charset related stuff
---------------------
* Remove the "ForceIISCharset1251 yes/no" command. Replace it with an
enhanced "CharsetByServer <charset> <regexp> [<regexp>...]"
command.
* Support for stateful character sets: UTF-7, Asian ISO-2022-XX
and others. They will not be used as a LocalCharset because
they take too much space; however, the indexer should be able to
index them, and the search frontend should be able to use them as
a BrowserCharset.


Misc
----
* Smart search results cache cleaning after reindexing.
* Make it possible to set table names in indexer.conf and search.htm
* There was a discussion about word separators back in January; see 
http://www.mail-archive.com/udmsearch%40web.izhcom.ru/msg00200.html.
* Learn about Dublin Core, a simple set of standard metadata for web pages.
  http://www.searchtools.com/related/metadata.html#dc
* Add curl library support.
* Rewrite mirroring functions. Make it possible to optionally store the
whole document, not only MaxDocSize.


Portability and code quality
----------------------------
Remove warnings on various platforms. Currently it builds without
warnings on Linux and FreeBSD with these CFLAGS:

-Wall 
-Wconversion 
-Wshadow  
-Wpointer-arith 
-Wcast-qual 
-Wcast-align 
-Wwrite-strings  
-Waggregate-return  
-Wstrict-prototypes  
-Wmissing-prototypes 
-Wmissing-declarations 
-Wredundant-decls 
-Wnested-externs 
-Wlong-long 
-Winline

However, compilers on some other platforms do produce warnings,
for example about mixed signed/unsigned chars with the NetBSD Alpha compiler.
Please report those warnings to general@mnogosearch.org!


Documentation
-------------
* Constantly improve it!
* PDF version.



Things that will most likely be done in 3.3 (in no particular order)
--------------------------------------------------------------------

1. Better relevancy
   - DONE: separate word enumeration for each section
     and add number_of_words_in_this_section into coord, i.e. 
     "section + position_inside_section + number_of_words_in_this_section"
     instead of
     "section + position_inside_document"
     Note, number_of_words_in_this_section doesn't need to be exact,
     it can be approximate, to save space.
   - DONE: Use number_of_words_in_this_section in relevancy formula,
     i.e. be close to the classic TF*IDF rank algorithm.
   - better "body" capacity (get rid of the "64K words in body" limit)
     and more sections (get rid of the "256 sections" limit).
     It can be done using dynamic encoding, e.g.
     128 sections with 256*256*256 words plus
     128*256 sections with 256*256 words.
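
The dynamic encoding idea above can be sketched as follows. This is only
an illustration in Python under stated assumptions: a 32-bit coord, with
the top bit selecting between a wide form (7-bit section, 24-bit position)
and a narrow form (15-bit section, 16-bit position). The bit layout and
the names encode_coord/decode_coord are hypothetical and are not taken
from the mnoGoSearch sources.

```python
# Hypothetical 32-bit layout; the real coord format may differ.
SHORT_SECTIONS = 128            # sections 0..127: up to 256*256*256 positions
LONG_SECTIONS = 128 * 256       # next 32768 sections: up to 256*256 positions
MAX_SECTION = SHORT_SECTIONS + LONG_SECTIONS - 1


def encode_coord(section, position):
    """Pack (section, position) into one unsigned 32-bit integer."""
    if 0 <= section < SHORT_SECTIONS:
        if not (0 <= position < (1 << 24)):
            raise ValueError("position does not fit in 24 bits")
        # Top bit 0, then 7 bits of section, then 24 bits of position.
        return (section << 24) | position
    if SHORT_SECTIONS <= section <= MAX_SECTION:
        if not (0 <= position < (1 << 16)):
            raise ValueError("position does not fit in 16 bits")
        # Top bit 1, then 15 bits of (section - 128), then 16 bits of position.
        return (1 << 31) | ((section - SHORT_SECTIONS) << 16) | position
    raise ValueError("section out of range")


def decode_coord(coord):
    """Inverse of encode_coord: return (section, position)."""
    if coord & (1 << 31):
        section = ((coord >> 16) & 0x7FFF) + SHORT_SECTIONS
        return section, coord & 0xFFFF
    return coord >> 24, coord & 0xFFFFFF


print(decode_coord(encode_coord(5, 70000)))
print(decode_coord(encode_coord(300, 1234)))
```

With this layout a large section such as "body" can be given one of the
first 128 section numbers, while many small sections share the narrow form.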

2. Cluster
   - Res2XML (built-in XML template)
   - DONE: XML2Res (to parse built-in XML template)
   - DONE: DBAddr http://hostname/path/to/searchxml.cgi
   - Make it possible to run search.cgi as an HTTPD server
   - Site enumerating without having to talk to each cluster node
     (e.g. crc48 or crc56, with direct encoding for short names)
   - Clone detection at search time
   - Configurable distribution type: by site_id, by seed, etc.

3. Extend SQL drivers to use prepared statements in sql.c
   - Prepare/Bind/Exec for MySQL
     (using mysql_escape_string or hex notation for 4.0,
      or using PS API for 4.1 and later)
   - Prepare/Bind/Exec for PgSQL
   - Prepare/Bind/Exec for Interbase
   - Prepare/Bind/Exec for SQLite3
   - Prepare/Bind/Exec for CTLib
   - Modify sql.c to use Prepare/Bind/Exec for all databases
   - DONE: DBMode=blob for Interbase 
   - DBMode=blob for sqlite3?
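
For readers unfamiliar with the Prepare/Bind/Exec pattern named above,
here is a minimal sketch using Python's standard sqlite3 module rather
than sql.c. The table layout is a simplified stand-in, not the real
mnoGoSearch schema. The point is that bound values travel separately
from the SQL text, so no escaping step is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table; the real "url" table has more columns.
conn.execute("CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT)")

# Prepare: the statement text contains placeholders only, no data.
insert_sql = "INSERT INTO url (rec_id, url) VALUES (?, ?)"

# Bind + Exec: each tuple is bound to the placeholders and executed.
# Quotes inside the data are stored as-is, without any escaping.
rows = [
    (1, "http://site/o'reilly.html"),
    (2, "http://site/a?b=1&c='x'"),
]
conn.executemany(insert_sql, rows)

stored = [r[0] for r in conn.execute("SELECT url FROM url ORDER BY rec_id")]
print(stored)
```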

4. DBType=myinnodb (and maybe for other handler types)
   - MySQL scripts with Engine=InnoDB
   - true transactional code in sql.c, instead of LOCK TABLE.

5. More concurrent "indexer" safety
   - test with concurrent indexers with all databases
   - set isolation levels or lock tables when running
     "indexer -Eblob" to prevent concurrent indexers from updating
     tables (especially table "bdicti"),
     which would give an inconsistent result in table "bdict".
     
     Oracle: 
             SELECT FOR UPDATE
             LOCK TABLE t1 IN {ROW SHARE|SHARE|EXCLUSIVE} MODE
             SET TRANSACTION
     
     DB2:    
             SELECT .. FOR {READ ONLY|FETCH ONLY|UPDATE [OF column [, column]*]}
             LOCK TABLE t1 IN {SHARE|EXCLUSIVE} MODE
             SET TRANSACTION
     
     PostgreSQL:
             SELECT FOR UPDATE
             LOCK [ TABLE ] name [, ...] [ IN lockmode MODE ] [ NOWAIT ]
               lockmode ::= ACCESS SHARE | ROW SHARE | ROW EXCLUSIVE
                           | SHARE UPDATE EXCLUSIVE | SHARE
                           | SHARE ROW EXCLUSIVE | EXCLUSIVE | ACCESS EXCLUSIVE
             SET TRANSACTION ISOLATION LEVEL
     
     MSSQL:
             SET TRANSACTION ISOLATION LEVEL
             Hints in SELECT statement: UPDLOCK, XLOCK, TABLOCK
             SELECT...table_name (TABLOCK)  - share mode 
             SELECT...table_name (TABLOCK REPEATABLEREAD) - exclusive mode
             SELECT...table_name (TABLOCKX) - lock until the end of trans
             
             SELECT FOR UPDATE allowed only for DECLARE CURSOR.
             An exclusive lock can be placed on a SQL Server table with
             the SELECT..table_name (TABLOCKX) statement.
             This statement requests an exclusive lock on a table.
             It is used to prevent others from reading or updating
             the table and is held until the end of the command or transaction.
             It is similar in function to the Oracle
             LOCK TABLE..IN EXCLUSIVE MODE statement.
             
     Sybase:
             SELECT FOR UPDATE
             LOCK TABLE table-name IN { SHARE | EXCLUSIVE } MODE
             sa_locks - Displays all locks in the database.
             
             FOR UPDATE can not be used in a SELECT which is not part of the
             declaration of a cursor or which is not inside a stored procedure.
     
     Mimer:
             SELECT FOR UPDATE - is not allowed for a read-only cursor
             
             SET TRANSACTION ISOLATION LEVEL

6. More multithread safety
   - test with multiple threads
   - better robots.txt locking
     Currently all threads wait for a single thread
     to fetch a robots.txt file, regardless of host name.
     This can be fixed by implementing a shared array
     of "robots.txt currently being fetched" entries.
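
A minimal sketch of the per-host idea, in Python for illustration only.
The class name RobotsRegistry and the fake_fetch helper are hypothetical.
The shared "currently being fetched" structure is modelled as a dict
mapping host name to an event, so threads wait only for their own host.

```python
import threading
import time


class RobotsRegistry:
    """Lets one thread per host fetch robots.txt; others wait for that host only."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._rules = {}      # host -> fetched robots.txt body
        self._in_flight = {}  # host -> Event, set when the fetch finishes

    def get(self, host):
        with self._lock:
            if host in self._rules:
                return self._rules[host]
            done = self._in_flight.get(host)
            owner = done is None
            if owner:
                done = threading.Event()
                self._in_flight[host] = done
        if not owner:
            done.wait()                    # blocks only on this host
            with self._lock:
                return self._rules[host]
        body = None                        # stored as-is if the fetch fails
        try:
            body = self._fetch(host)       # network I/O outside the lock
            return body
        finally:
            with self._lock:
                self._rules[host] = body
                del self._in_flight[host]
            done.set()


calls = []


def fake_fetch(host):
    # Stand-in for a real HTTP request.
    calls.append(host)
    time.sleep(0.05)
    return "User-agent: *\nDisallow: /private/"


registry = RobotsRegistry(fake_fetch)
hosts = ["a.example", "b.example"] * 4
workers = [threading.Thread(target=registry.get, args=(h,)) for h in hosts]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(calls))
```

Eight threads ask for two hosts, and each host is fetched once. The
fetch for a.example does not block threads interested in b.example.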

7. DBMode=blob improvements
   - RENAME TABLE for more databases
     MSSQL, Sybase:
       [EXEC] sp_rename t1,t2
       SELECT * INTO t1 FROM t2 WHERE 1=0; - copy structure (without indexes)
     Oracle:
       CREATE TABLE t2 AS SELECT field FROM t1 WHERE 1=0; -- does not copy idx
       ALTER TABLE t1 RENAME TO t2;
       RENAME t1 TO t2
     PostgreSQL:
       ALTER TABLE t1 RENAME TO t2;
     DB2:
       CREATE TABLE t1 LIKE t2; -- does not copy indexes
       RENAME TABLE t1 TO t2
     Mimer, Interbase: do not seem to have table rename.
   - Check a possibility to use VIEWs for those databases
     not supporting RENAME
   - Partial incremental "indexer -Eblob"
   - Configurable choice to run partial or full
     "indexer -Eblob", depending on amount
     of new data collected.
   - Put information from "url" into "bdict" table ???
   - Put information from "urlinfo" ???

8. Database consistency check (and maybe repair) tools
   - e.g. report (and/or remove) all bdicti/urlinfo records
   which don't have corresponding url records.
   - don't put lost url records during an "indexer -Eblob" run,
   generate warnings if lost records are found.
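
The report/remove step could look roughly like the sketch below, written
in Python with sqlite3 for illustration. The column names (url.rec_id,
urlinfo.url_id, bdicti.url_id) and the reduced table layouts are
assumptions made for the example, and the helpers orphans and
remove_orphans are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed layouts; only the key columns matter here.
conn.executescript("""
    CREATE TABLE url     (rec_id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE urlinfo (url_id INTEGER, sname TEXT, sval TEXT);
    CREATE TABLE bdicti  (url_id INTEGER, state INTEGER);
    INSERT INTO url     VALUES (1, 'http://site/');
    INSERT INTO urlinfo VALUES (1, 'title', 'Home');
    INSERT INTO urlinfo VALUES (7, 'title', 'Gone');
    INSERT INTO bdicti  VALUES (1, 1);
    INSERT INTO bdicti  VALUES (9, 1);
""")

CHECKED_TABLES = ("urlinfo", "bdicti")


def orphans(conn, table):
    """Return url_id values in `table` that have no matching url record."""
    if table not in CHECKED_TABLES:      # table name is never user input
        raise ValueError(table)
    sql = ("SELECT DISTINCT t.url_id FROM %s t "
           "LEFT JOIN url u ON u.rec_id = t.url_id "
           "WHERE u.rec_id IS NULL ORDER BY t.url_id" % table)
    return [row[0] for row in conn.execute(sql)]


def remove_orphans(conn, table):
    """Delete the records that orphans() would report."""
    if table not in CHECKED_TABLES:
        raise ValueError(table)
    conn.execute("DELETE FROM %s WHERE url_id NOT IN "
                 "(SELECT rec_id FROM url)" % table)


for name in CHECKED_TABLES:
    print(name, "lost url_ids:", orphans(conn, name))
```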

9. Source code and packaging improvements
   (see some more info added by svoj in TODO.ru)
   - more separate files (e.g. break utils.c)
   - dynamically loadable database modules
   - build statically linked (platform independent) and
     dynamically linked (distribution-specific) RPMs,
     FreeBSD packages and so on.
     Gentoo: http://www.mnogosearch.org/board/message.php?id=17992
     Solaris SPARC: http://www.mnogosearch.org/board/message.php?id=17955

10. mnoGoSearch benchmark suite
   - tiny (~1000 documents)
   - medium (~10000 documents)
   - huge (~1000000 documents)
   - cluster with huge databases on several machines

11. Windows version
   - Unix compatible indexer.conf
       - UdmEnvWrite()  (can be done by a Unix developer)
       - GUI for all missing important commands
       - GUI for "extra" (i.e. not so important) commands 
   - package prepared plugins, for example for ispell or external parsers,
     to reduce manual actions required from user.

12. API improvements (PHP, ASP, Perl) ???
   - Stabilize and document C API.
   - Put module code into the main tree,
     add --with-php, --with-perl, and so on, options to configure.
   - Add "PHP via COM" frontend example (Windows)
   - From Yannick LE NY: there is an update at
     http://pecl.php.net/package/mnogosearch.
     This update includes:
     - Initial PECL release
     - fix compiler warnings and errors on 64bit platform
     - #34705 (php bugs), disable udm_clear_search_limits when used with mnogosearch 3.2+
       *this is a required backward compatibility break*


13. Better internationalization (from Yannick LE NY)
     http://www.mnogosearch.org/board/message.php?id=17948
   - add i18n templates 
   - use gettext to i18n the indexer binary help and messages.
   - Korean frequency dictionaries ???
     See this thread:
     http://www.mnogosearch.org/board/message.php?id=17984
     http://www.mnogosearch.org/board/message.php?id=18219
   - Character set for FTP requests, replies, file listings.
     http://www.mnogosearch.org/board/message.php?id=17992

14. Documentation
   - Full step-by-step instructions on how to install and configure mnoGoSearch

15. Support for internationalized domain names:
    http://en.wikipedia.org/wiki/Internationalized_domain_name
    Maybe using GNU IDN Library: Libidn http://www.gnu.org/software/libidn/

16. Misc:
   - Single char problem: <font>W</font>ord
     http://www.mnogosearch.org/board/message.php?id=17998


Things that will be done before 3.3.0 release
---------------------------------------------
- Add "did you mean?" support into cluster.
  Fix the crash that occurs when word suggestions are generated
  for a search done from a cluster node:
  it tries to query a non-existent SQL table.
  The "did you mean" suggestion should be included in the
  XML search response if no documents were found.
- Fix stemming plug-in to work with the latest MySQL-5.1 versions
  (both in 3.2.x and 3.3.0)
- Complete DBMode=rawblob
- Test with Perl and PHP modules - add a way to cover 
  Perl and PHP modules by "msearch-test" tests.
- Change MinCoordFactor and MaxCoordFactor to work per section,
  not per document.
- NumSections autodetection from wf
  and from secno specifiers, e.g. "body:a b c".
- DBMode=blob for sqlite3
- Find the best combination of the default values for all score commands.
- Link all score commands in the manual,
  add a new section "commands affecting score".
- Add indexer -Esql -e"SELECT xxx FROM"
- Move processing of DateFactor and RelevancyFactor to searchtool.c
  Make sure they're documented.
- Move processing of UserScore/UserScoreFactor to searchtool.c
  Make sure they're documented.
- List all DB software that mnoGoSearch has been tested with
- Add cluster documentation:
  * How does it work?
  * "Merge" cluster type
  * "Distributed" cluster type
  * Quick install notes
  * Performance
  * List cluster limitations:
    + only "body" and "title" sections
    + cluster of clusters does not fully work


Documentation TODO for 3.3.0 release
------------------------------------
These ChangeLog items must be put into relevant manual sections:

* Cluster support was added. A typical cluster consists of several
  database machines and a single front-end machine. The front-end
  machine receives HTTP requests from a user's browser and forwards
  search queries to the database machines over HTTP. It receives
  back a limited number of top search results from every database
  machine (in a simple XML format based on the OpenSearch
  specifications), then parses and merges the results and displays
  them ordered by score, applying an HTML template. This approach
  distributes the operations with high CPU and hard disk consumption
  between the database machines in parallel, leaving the simple merge
  and HTML template processing functions to the front-end
  machine. As of version 3.3.0, mnoGoSearch allows joining up to 256
  database machines into a single cluster.
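
  The merge step on the front-end machine can be sketched as follows.
  This is an illustration in Python, not the actual implementation: XML
  parsing is omitted, each node's reply is assumed to be already parsed
  into (score, url) pairs sorted by descending score, and the function
  name merge_top is hypothetical.

```python
import heapq
from itertools import islice


def merge_top(node_results, limit):
    """Merge per-node result lists (each sorted by score, best first)."""
    merged = heapq.merge(*node_results, key=lambda hit: hit[0], reverse=True)
    # The merge is lazy, so only the first `limit` hits are ever produced.
    return list(islice(merged, limit))


# Stand-ins for the parsed replies of three database machines.
node_a = [(0.93, "http://a.example/1"), (0.40, "http://a.example/2")]
node_b = [(0.88, "http://b.example/1"), (0.75, "http://b.example/2")]
node_c = [(0.61, "http://c.example/1")]

top = merge_top([node_a, node_b, node_c], limit=3)
for score, url in top:
    print(score, url)
```

  Because every node returns its results already sorted, the front end
  only performs a cheap k-way merge, which matches the division of work
  described above.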

  node.xml-dist, an XML template for a cluster database machine,
  is now installed into the /etc directory.

  The "DBAddr http://hostname/search.cgi/node.xml" search.htm command
  was added, to specify the URL of a cluster database machine
  interface with XML format.

  "DBAddr file:///path/to/node.xml" search.htm command was added, to
  specify a static XML search response. This is mostly for test purposes.

  Two cluster types were implemented: a merge cluster, which joins
  results from several independent databases, each created by its
  own indexer.conf, and a distributed cluster, created by a
  single indexer.conf, where indexer automatically distributes the
  search index between database machines.

  The default distribution type was changed from "remainder" to
  "quotient". Thus, for an indexer.conf with three DBAddr commands,
  distribution is done as follows:

      o URLs with seed 0..85 go to the first DBAddr
      o URLs with seed 86..170 go to the second DBAddr
      o URLs with seed 171..255 go to the third DBAddr
  This distribution style simplifies manual redistribution of an
  existing clustered database when adding a new DBAddr (i.e. a new
  database machine). Future releases will provide an automatic tool
  for redistribution when adding and deleting machines in an
  existing cluster, as well as more configuration commands to
  control distribution.
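
  The two distribution styles can be compared with a short sketch
  (Python, for illustration). The formula seed * N / 256 is an
  assumption: it was chosen because it gives 0..85 for the first DBAddr
  and 171..255 for the third, as listed above. It is not taken from the
  source code, and the function names are hypothetical.

```python
def quotient_node(seed, ndb):
    """Assumed quotient distribution: contiguous seed ranges per DBAddr."""
    return seed * ndb // 256


def remainder_node(seed, ndb):
    """Remainder distribution: neighbouring seeds go to different DBAddrs."""
    return seed % ndb


def ranges(ndb):
    """First and last seed handled by each DBAddr under quotient distribution."""
    out = []
    for db in range(ndb):
        seeds = [s for s in range(256) if quotient_node(s, ndb) == db]
        out.append((seeds[0], seeds[-1]))
    return out


print(ranges(3))  # [(0, 85), (86, 170), (171, 255)]
print([remainder_node(s, 3) for s in range(6)])  # [0, 1, 2, 0, 1, 2]
```

  With quotient distribution each machine owns one contiguous block of
  seeds, so adding a machine only shifts block boundaries. With remainder
  distribution the seeds of every machine are interleaved.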

  See also http://www.mnogosearch.org/board/message.php?id=18766

* Fixed that indexer didn't work with MySQL-5.1.15-GPL.
