Thursday, October 4, 2012

Which is the best freely available sentence breaker API http://www.quora.com/Which-is-the-best-freely-available-sentence-breaker-API


Which is the best freely available sentence breaker API?

I have some unstructured text that looks like someone took notes while on the phone. Sometimes sentences end with newlines, at other times it is a period, sometimes a colon, and most of the sentences have all words capitalized. Which sentence breaker API available online for free would be a good choice? I am interested in those which don't need any training from my end.
 

2 Answers

Nitin Madnani, Wrote a doctoral dissertation on NLP.
2 votes by Sameer Gupta and Anon User
I assume by "Sentence Breaker" you mean a Sentence Boundary Detector (SBD). Although several SBDs have been described in the NLP literature, I can only find three that are freely available online:
  1. Punkt: An unsupervised SBD that ships with NLTK and is quite simple to use. See Section 6 in http://nltk.googlecode.com/svn/t....
  2. mxTerminator: A supervised SBD trained using a maximum entropy classifier. You can find it at http://sites.google.com/site/adw....
  3. Splitta: A supervised SBD trained using SVMs and/or Naive Bayes on the same training data as mxTerminator. According to the paper[1], this SBD now represents the state of the art on English newswire text. You can find this at http://code.google.com/p/splitta/. The README in this project is quite a useful read.
However, both mxTerminator and Splitta are designed to work with well-formed newswire text, and given your description of the input data you have, they are not likely to work well. You can try Punkt, but I don't think it will work perfectly either. You might need to use an NLTK RegexpTokenizer if the variation in the data is systematic enough. Or train your own Splitta/mxTerminator models if you can find/create gold standard data.
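
To make the regular-expression suggestion concrete, here is a minimal sketch of a rule-based splitter for note-style text like the kind described in the question. It uses only Python's standard `re` module so it runs without any downloads. The boundary pattern, the function name `split_sentences`, and the sample text are all assumptions made for illustration, not something taken from the tools above. Roughly the same pattern could be passed to NLTK's `RegexpTokenizer` with `gaps=True`, which treats the pattern as the separator between tokens.

```python
import re

# Assumed boundary rule (hypothetical, tune it to your data):
#   - one or more newlines, with any surrounding whitespace, or
#   - whitespace that directly follows a period, colon, question mark
#     or exclamation mark.
# Capitalization is ignored on purpose, since most words in the notes
# are capitalized and so it carries no signal.
BOUNDARY = re.compile(r"\s*\n+\s*|(?<=[.:?!])\s+")


def split_sentences(text):
    """Split text at the assumed boundaries and drop empty pieces."""
    pieces = BOUNDARY.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    notes = (
        "Called Client About Renewal. They Want A Quote: "
        "Send It By Friday\nFollow Up Next Week"
    )
    for sentence in split_sentences(notes):
        print(sentence)
```

A rule like this will also split after abbreviations such as "U.S." or "Dr.", which is exactly the problem the trained detectors are meant to solve, so it is only worth using when the notes follow a fairly regular pattern.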

References: 
[1] "Sentence Boundary Detection and the Problem with the U.S." Dan Gillick, NAACL 2009.
  

Anon User
NLTK sentence detector:- The NLTK sentence detector (http://nltk.googlecode.com/svn/t...) is based on the paper "Unsupervised Multilingual Sentence Boundary Detection". You can quickly test its sentence detection functionality at http://text-processing.com/
LingPipe:- LingPipe also has a sentence detector. You can check out the LingPipe tutorial on sentence detection (http://alias-i.com/lingpipe/demo...).
GATE (http://gate.ac.uk/):- It has the standard ANNIE sentence splitter and a RegEx sentence splitter. ANNIE is GATE's information extraction system.
OpenNLP (http://opennlp.apache.org):- OpenNLP is an Apache NLP toolkit. It also provides sentence detection functionality. You can check the OpenNLP documentation (http://opennlp.apache.org/docume...) for that.
  
