com.ibm.icu.text
Class DictionaryBasedBreakIterator
All Implemented Interfaces: Cloneable
public class DictionaryBasedBreakIterator
A subclass of RuleBasedBreakIterator_Old that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
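The second pass described above can be pictured with a toy example. The sketch below is not ICU's implementation: the class name `DictionarySplitSketch`, the method `dictionaryBreaks`, and the greedy longest-match strategy are all assumptions made for illustration, and the dictionary is a plain in-memory `Set` rather than ICU's serialized binary format. It takes a range of text that the rule-based pass left undivided and derives extra break offsets by repeatedly matching known words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of ICU4J.
class DictionarySplitSketch {

    /**
     * Returns break offsets inside text[start, end), found by greedy
     * longest-match lookup against a set of known words. The range end
     * is always the final break. Characters that start no known word
     * are skipped without producing a break.
     */
    static List<Integer> dictionaryBreaks(String text, int start, int end,
                                          Set<String> dictionary) {
        List<Integer> breaks = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int matchEnd = -1;
            // Try the longest candidate first, then shrink.
            for (int candidate = end; candidate > pos; candidate--) {
                if (dictionary.contains(text.substring(pos, candidate))) {
                    matchEnd = candidate;
                    break;
                }
            }
            if (matchEnd != -1) {
                pos = matchEnd;
                breaks.add(pos);
            } else {
                pos++; // no known word starts here
            }
        }
        if (breaks.isEmpty() || breaks.get(breaks.size() - 1) != end) {
            breaks.add(end);
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> words = Set.of("the", "cat", "sat");
        // A run of letters with no spaces, as the rule-based pass might leave it.
        System.out.println(dictionaryBreaks("thecatsat", 0, 9, words)); // [3, 6, 9]
    }
}
```

Greedy matching is the simplest strategy to show; a production segmenter generally needs some form of backtracking, because the longest word at one position can leave an unmatchable remainder.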
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old,
but adds one more special substitution name: _dictionary_. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in _dictionary_, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
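As a rough sketch of the "two or more characters in a row" condition, the code below scans a string and reports every range that would qualify for a dictionary pass. The names `DictionaryRunSketch` and `dictionaryRuns` are hypothetical, and the predicate stands in for membership in the dictionary character set; the real iterator makes this decision while it walks the text, not in a separate scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not part of ICU4J.
class DictionaryRunSketch {

    /**
     * Returns [start, end) pairs for every run of two or more consecutive
     * characters accepted by isDictionaryChar. Works on UTF-16 code units,
     * which is a simplification.
     */
    static List<int[]> dictionaryRuns(String text, IntPredicate isDictionaryChar) {
        List<int[]> runs = new ArrayList<>();
        int runStart = -1;
        // Iterate one position past the end so a trailing run is closed.
        for (int i = 0; i <= text.length(); i++) {
            boolean inSet = i < text.length()
                    && isDictionaryChar.test(text.charAt(i));
            if (inSet) {
                if (runStart < 0) {
                    runStart = i;
                }
            } else if (runStart >= 0) {
                if (i - runStart >= 2) {
                    runs.add(new int[] {runStart, i});
                }
                runStart = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Letters stand in for "characters included in the dictionary set".
        for (int[] run : dictionaryRuns("ab cde f", Character::isLetter)) {
            System.out.println(run[0] + ".." + run[1]);
        }
    }
}
```

With letters as the stand-in set, `"ab cde f"` yields the ranges 0..2 and 3..6; the single letter at the end is too short to qualify.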
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary
file. It uses Class.getResource() to locate the dictionary file. The
dictionary file is in a serialized binary format. We have a very primitive (and
slow) BuildDictionaryFile utility for creating dictionary files, but aren't
currently making it public. Contact us for help.
Nested Class Summary
protected class | DictionaryBasedBreakIterator.Builder | The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator_Old, but extends it with extra logic to handle the DICTIONARY_VAR token.
Method Summary
int | first() | Sets the current iteration position to the beginning of the text.
int | following(int offset) | Sets the current iteration position to the first boundary position after the specified position.
protected int | handleNext() | This is the implementation function for next().
int | last() | Sets the current iteration position to the end of the text.
protected int | lookupCategory(char c) | Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder | makeBuilder() | Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int | preceding(int offset) | Sets the current iteration position to the last boundary position before the specified position.
int | previous() | Advances the iterator one step backwards.
void | setText(CharacterIterator newText)
void | writeTablesToFile(FileOutputStream file, boolean littleEndian)
Methods inherited from class RuleBasedBreakIterator_Old
checkOffset, clone, current, debugDumpTables, debugPrintln, equals, first, following, getRuleStatus, getRuleStatusVec, getText, handleNext, handlePrevious, hashCode, isBoundary, last, lookupBackwardState, lookupCategory, lookupState, makeBuilder, next, next, preceding, previous, setText, toString, writeSwappedInt, writeSwappedShort, writeTablesToFile
Methods inherited from class RuleBasedBreakIterator
clone, current, equals, first, following, getInstanceFromCompiledRules, getRuleStatus, getRuleStatusVec, getText, hashCode, isBoundary, last, next, next, preceding, previous, setText, toString
Methods inherited from class BreakIterator
clone, current, first, following, getAvailableLocales, getAvailableULocales, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getText, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, last, next, next, preceding, previous, registerInstance, registerInstance, setText, setText, unregister
DictionaryBasedBreakIterator
public DictionaryBasedBreakIterator(String description,
InputStream dictionaryStream)
throws IOException
Constructs a DictionaryBasedBreakIterator.
Parameters:
description - Same as the description parameter on RuleBasedBreakIterator_Old,
except for the special meaning of DICTIONARY_VAR. This parameter is just
passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream - The stream containing the dictionary data.
first
public int first()
Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
Overrides: first in class RuleBasedBreakIterator_Old
Returns: The offset of the beginning of the text.
following
public int following(int offset)
Sets the current iteration position to the first boundary position after
the specified position.
Overrides: following in class RuleBasedBreakIterator_Old
Parameters: offset - The position to begin searching forward from.
Returns: The position of the first boundary after "offset".
last
public int last()
Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
Overrides: last in class RuleBasedBreakIterator_Old
Returns: The text's past-the-end offset.
preceding
public int preceding(int offset)
Sets the current iteration position to the last boundary position
before the specified position.
Overrides: preceding in class RuleBasedBreakIterator_Old
Parameters: offset - The position to begin searching backward from.
Returns: The position of the last boundary before "offset".
previous
public int previous()
Moves the iterator backward by one boundary.
Overrides: previous in class RuleBasedBreakIterator_Old
Returns: The position of the last boundary before the
current iteration position.
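The navigation methods documented above follow the general BreakIterator contract, so their behavior can be tried without a dictionary file. The sketch below uses the JDK's `java.text.BreakIterator` as a stand-in, on the assumption that its `first`, `following`, `last`, `previous`, and `preceding` methods behave like the ones described here; it does not exercise DictionaryBasedBreakIterator itself, and the class name `BoundaryNavigationSketch` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Stand-in demonstration using the JDK word iterator, not ICU4J.
class BoundaryNavigationSketch {

    /** Returns {first, following(2), last, previous, preceding(6)} for the text. */
    static int[] navigate(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int first = it.first();          // beginning of the text
        int following = it.following(2); // first boundary after offset 2
        int last = it.last();            // past-the-end offset
        int previous = it.previous();    // one boundary back from the end
        int preceding = it.preceding(6); // last boundary before offset 6
        return new int[] {first, following, last, previous, preceding};
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" fall at 0, 5, 6 and 11.
        System.out.println(Arrays.toString(navigate("Hello world")));
        // prints [0, 5, 11, 6, 5]
    }
}
```

Note that each call both returns a boundary and moves the current iteration position, which is why `previous()` called right after `last()` returns the boundary just before the end of the text.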
Copyright (c) 2006 IBM Corporation and others.