There is only one truth. It is the source.

Sentence breaking with NLTK

July 25, 2011

NLTK (the Natural Language Toolkit) can detect sentence boundaries in text. Getting it installed, however, can be awkward.

As of this writing (July 25, 2011), installing NLTK using pip on Mac OS X (10.6.8) downloads a precompiled egg that does not work:

$ pip install nltk
Downloading/unpacking nltk
  Downloading nltk-2.0.1rc1.macosx-10.6-x86_64.tar.gz (1.9Mb): 1.9Mb downloaded
  Running setup.py egg_info for package nltk
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/Users/enricob/Desktop/project/build/nltk/setup.py'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 14, in <module>

IOError: [Errno 2] No such file or directory: '/Users/enricob/Desktop/project/build/nltk/setup.py'

----------------------------------------
Command python setup.py egg_info failed with error code 1
Storing complete log in /Users/enricob/.pip/pip.log

The solution is to install from source:

$ curl -O http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc1.tar.gz#md5=e29ff1c55a1d015b84c4480bb47fd9a1
$ pip install nltk-2.0.1rc1.tar.gz

NB: You can find the latest version of NLTK on PyPI: http://pypi.python.org/pypi?%3Aaction=search&term=nltk&submit=search
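To confirm that the source install actually produced an importable package, a quick check along these lines may help. This is only a sketch: the helper name is_installed is made up for illustration, and it assumes a Python recent enough to ship importlib.util.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nltk"):
    import nltk
    # nltk.__version__ reports the installed release
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK is not importable; the install did not succeed")
```

If the second message appears, check that pip installed into the same Python environment you are running.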

For sentence breaking, NLTK needs a data package called Punkt Tokenizer Models. To get it, either use NLTK's interactive downloader or request the package directly by its identifier from the Python shell.

(Alternative 1) The interactive downloader:

In [1]: import nltk

In [2]: nltk.download()
showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

A Tkinter window should now appear with controls for choosing packages. In the "Models" tab, choose "Punkt Tokenizer Models" and click Download.

(Alternative 2) Downloading Punkt Tokenizer Models directly ('punkt' is the identifier):

In [1]: import nltk

In [2]: nltk.download('punkt')
[nltk_data] Downloading package 'punkt' to /Users/enricob/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Out[2]: True

NB: You can find the identifiers of NLTK data packages in the server index: http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
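In a script it can be handy to download the model only when it is missing. Below is a minimal sketch of that idea. The function name ensure_punkt and its find/download parameters are hypothetical (they exist so the logic can be tried with stand-ins). The resource path 'tokenizers/punkt' is taken from the unzip message above, and newer NLTK releases may expect a different resource name.

```python
def ensure_punkt(find=None, download=None):
    """Download the Punkt models if they are not already on disk.

    Returns True if a download was triggered, False if the data was found.
    """
    if find is None or download is None:
        # Imported lazily so the function can be exercised with stand-ins
        import nltk
        find = find or nltk.data.find          # raises LookupError if missing
        download = download or nltk.download
    try:
        # Path assumed from the "Unzipping tokenizers/punkt.zip" message
        find("tokenizers/punkt")
    except LookupError:
        download("punkt")
        return True
    return False
```

Calling ensure_punkt() with no arguments before tokenizing should make a script work on a fresh machine, at the cost of a network request the first time.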

Break some sentences:

In [1]: from nltk import tokenize

In [2]: tokenize.sent_tokenize('This memo provides a mechanism whereby messages conforming to the MIME specifications [RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049] can convey presentational information.  It specifies the "Content-Disposition" header field, which is optional and valid for any MIME entity ("message" or "body part").  Two values for this header field are described in this memo; one for the ordinary linear presentation of the body part, and another to facilitate the use of mail to transfer files.  It is expected that more values will be defined in the future, and procedures are defined for extending this set of values.')
Out[2]:
['This memo provides a mechanism whereby messages conforming to the MIME specifications [RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049] can convey presentational information.',
 'It specifies the "Content-Disposition" header field, which is optional and valid for any MIME entity ("message" or "body part").',
 'Two values for this header field are described in this memo; one for the ordinary linear presentation of the body part, and another to facilitate the use of mail to transfer files.',
 'It is expected that more values will be defined in the future, and procedures are defined for extending this set of values.']
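Outside the interactive shell, the same call can be wrapped in a small helper that writes one sentence per line. This is a sketch rather than a definitive recipe: the name sentences_per_line and its tokenize parameter are made up for illustration, and by default it assumes NLTK and the Punkt data are installed as described above.

```python
def sentences_per_line(text, tokenize=None):
    """Return `text` with each detected sentence on its own line."""
    if tokenize is None:
        # Default to NLTK's sentence tokenizer, imported lazily
        from nltk.tokenize import sent_tokenize as tokenize
    # Collapse whitespace inside each sentence (e.g. hard-wrapped lines)
    return "\n".join(" ".join(s.split()) for s in tokenize(text))
```

For example, print(sentences_per_line(open('memo.txt').read())) would print each sentence of a hypothetical memo.txt on its own line.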