There is only one truth. It is the source.
July 25, 2011
NLTK can detect sentence boundaries in text, but installing it can be difficult.
As of this writing (July 25, 2011), installing NLTK using pip on Mac OS X (10.6.8) downloads a precompiled egg that does not work:
    $ pip install nltk
    Downloading/unpacking nltk
    Downloading nltk-2.0.1rc1.macosx-10.6-x86_64.tar.gz (1.9Mb): 1.9Mb downloaded
    Running setup.py egg_info for package nltk
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/Users/enricob/Desktop/project/build/nltk/setup.py'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/Users/enricob/Desktop/project/build/nltk/setup.py'
    ----------------------------------------
    Command python setup.py egg_info failed with error code 1
    Storing complete log in /Users/enricob/.pip/pip.log
The solution is to install from source:
    $ curl -O http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc1.tar.gz#md5=e29ff1c55a1d015b84c4480bb47fd9a1
    $ pip install nltk-2.0.1rc1.tar.gz

NB: Find the latest version of NLTK on PyPI: http://pypi.python.org/pypi?%3Aaction=search&term=nltk&submit=search
For sentence breaking, NLTK needs a data file called Punkt Tokenizer Models. To get it, you can either use NLTK's interactive downloader or request the package directly by its identifier; both run from the Python shell.
(Alternative 1) The interactive downloader:
    In [1]: import nltk
    In [2]: nltk.download()
    showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

A Tkinter window should now appear with controls for choosing packages. In the "Models" tab, choose "Punkt Tokenizer Models" and click Download.
(Alternative 2) Downloading Punkt Tokenizer Models directly ('punkt' is the identifier):
    In [1]: import nltk
    In [2]: nltk.download('punkt')
    [nltk_data] Downloading package 'punkt' to /Users/enricob/nltk_data...
    [nltk_data] Unzipping tokenizers/punkt.zip.
    Out[2]: True

NB: You can find the identifiers of NLTK data files in the server index: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
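If the download has to happen from a script rather than an interactive shell, it may help to fetch the data only when it is missing. The following is a minimal sketch, not part of NLTK itself: `ensure_nltk_data` and its `find`/`download` parameters are hypothetical names chosen for illustration. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` returning a truthy value on success; the resource path `tokenizers/punkt` is taken from the unzip message above and may differ in other NLTK versions.

```python
def ensure_nltk_data(resource_path, package_id, find=None, download=None):
    """Make sure an NLTK data package is installed, downloading it if needed.

    `find` and `download` default to NLTK's own functions; they are
    parameters only so the logic can be exercised without NLTK or a
    network connection (a hypothetical convenience, not an NLTK API).
    Returns True if the resource is available when the call finishes.
    """
    if find is None or download is None:
        import nltk  # imported lazily so this module loads without NLTK
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is absent
        find(resource_path)
    except LookupError:
        # Fetch the package by its identifier, e.g. 'punkt'
        return bool(download(package_id))
    return True
```

Calling `ensure_nltk_data('tokenizers/punkt', 'punkt')` once at the start of a script should then behave like Alternative 2, but skip the download when the data is already present.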
Break some sentences:
    In [1]: from nltk import tokenize
    In [2]: tokenize.sent_tokenize('This memo provides a mechanism whereby messages conforming to the MIME specifications [RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049] can convey presentational information. It specifies the "Content-Disposition" header field, which is optional and valid for any MIME entity ("message" or "body part"). Two values for this header field are described in this memo; one for the ordinary linear presentation of the body part, and another to facilitate the use of mail to transfer files. It is expected that more values will be defined in the future, and procedures are defined for extending this set of values.')
    Out[2]:
    ['This memo provides a mechanism whereby messages conforming to the MIME specifications [RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049] can convey presentational information.',
     'It specifies the "Content-Disposition" header field, which is optional and valid for any MIME entity ("message" or "body part").',
     'Two values for this header field are described in this memo; one for the ordinary linear presentation of the body part, and another to facilitate the use of mail to transfer files.',
     'It is expected that more values will be defined in the future, and procedures are defined for extending this set of values.']
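Outside the interactive shell, the same call can be wrapped in a small helper. This is only a sketch under stated assumptions: `split_into_sentences` and its `tokenizer` parameter are hypothetical names, and collapsing whitespace before tokenizing is a choice made here so that hard-wrapped text (such as an RFC) is handled as running prose, not something NLTK requires. By default it uses `nltk.tokenize.sent_tokenize`, which needs the Punkt data described above.

```python
def split_into_sentences(text, tokenizer=None):
    """Return the sentences of `text` as a list of stripped strings.

    `tokenizer` is any callable mapping a string to a list of sentences;
    it defaults to NLTK's sent_tokenize (hypothetical parameter, added so
    the helper can be tried without NLTK installed).
    """
    if tokenizer is None:
        # Default to NLTK's Punkt-based splitter; requires the 'punkt' data
        from nltk.tokenize import sent_tokenize as tokenizer
    # Collapse runs of whitespace, including hard line wraps, into single spaces
    normalized = " ".join(text.split())
    return [sentence.strip() for sentence in tokenizer(normalized)]
```

With NLTK and the Punkt data installed, something like `for s in split_into_sentences(open('memo.txt').read()): print(s)` would print one sentence per line; `memo.txt` is a placeholder file name.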