Wednesday, June 20, 2012

Word collocations in Python

http://stackoverflow.com/questions/4128583/how-to-find-collocations-in-text-python

How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. Python has a built-in function, bigrams, that returns word pairs.

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

What remains is to find the bigrams that occur more often than the frequencies of their individual words would predict. Any ideas on how to put that into code?

You would have to define "more often". Do you mean statistical significance? – Björn Pollex Nov 8 '10 at 22:12
Python has no such builtin, nor anything by that name in the standard library. – Glenn Maynard Nov 8 '10 at 22:17
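
One common way to make "more often than the individual word frequencies would predict" concrete is pointwise mutual information (PMI), which compares a pair's observed probability with the product of its words' probabilities. The following is only a minimal standard-library sketch of that idea, not the approach of anyone in this thread. The function name pmi_bigrams, the min_count cutoff, and the sample sentence are all assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_bigrams(words, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often a pair occurs with how often it would occur
    if its two words were independent. min_count is an assumed cutoff
    that drops rare pairs, whose PMI estimates are unreliable.
    """
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_first = word_counts[first] / n_words
        p_second = word_counts[second] / n_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample text, already lowercased and split into words.
words = "new york is big and new york is busy but york is old".split()
ranked = pmi_bigrams(words)
for pair, score in ranked:
    print(pair, round(score, 2))
```

A positive score means the pair turns up more often than independence would suggest. On real text the cutoff matters a great deal, because PMI tends to overrate pairs that were seen only once or twice.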

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are none in this small segment, but here goes:

>>> text.collocations(num=20)
Building collocations list
Can it work on Unicode text? I got an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128) – Gusto Nov 9 '10 at 23:03
Unicode works fine for most operations. nltk.Text may have issues because it is just a helper class written for teaching linguistics students, and it sometimes gets tripped up. It's mainly for demonstration purposes. – Tim McNamara Nov 10 '10 at 18:10

Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.

from itertools import izip
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)
count = {}
for bigram in izip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print sorted(((c, b) for b, c in count.iteritems()), reverse=True)

(words_iter is introduced to avoid copying the whole list of words, as izip(words, words[1:]) would.)
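
The code above is Python 2 (izip, iteritems, and the print statement). As a rough Python 3 sketch of the same idea, the built-in zip is already lazy, and itertools.islice can skip the first word without copying the list. The helper name count_bigrams is hypothetical, and the sketch assumes words is a list or other re-iterable sequence rather than a one-shot generator.

```python
from collections import Counter
from itertools import islice


def count_bigrams(words):
    """Count adjacent word pairs without copying the word list."""
    # islice(words, 1, None) lazily skips the first word, playing the
    # same role as the advanced words_iter iterator in the answer above.
    return Counter(zip(words, islice(words, 1, None)))


words = ["more", "is", "said", "than", "done", "is", "said"]
counts = count_bigrams(words)
# Pairs with their counts, most frequent first.
print(counts.most_common())
```

Counter.most_common() replaces the manual sort over (count, bigram) tuples, although it returns (bigram, count) pairs rather than (count, bigram).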

Good work, but your code serves a different purpose: I just need the collocations (without counts or anything similar). In the end I need to return the top 10 collocations (collocations[:10]) and their total number using len(collocations). – Gusto Nov 8 '10 at 22:52
You did not really define what you want. Maybe give some example output for some example input. – Sven Marnach Nov 8 '10 at 22:54
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq = Counter(zip(words, nextword))
print(freq)
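
The request in the comments for the top 10 collocations plus their total number could be met from a Counter like this one. The sketch below is one possible reading, not a definitive answer. It assumes that "collocation" simply means a bigram seen at least min_count times, and the function name top_bigrams, its parameters, and the extended word list are made up for illustration.

```python
from collections import Counter


def top_bigrams(words, n=10, min_count=2):
    """Return the n most frequent repeated bigrams and how many repeat."""
    freq = Counter(zip(words, words[1:]))
    # Assumption: a bigram is a collocation candidate only if it occurs
    # at least min_count times.
    repeated = [pair for pair, count in freq.most_common() if count >= min_count]
    return repeated[:n], len(repeated)


# Hypothetical input with some repeated pairs.
words = ['more', 'is', 'said', 'than', 'done', 'is', 'said', 'than', 'meant']
collocations, total = top_bigrams(words)
print(collocations, total)
```

Raw counts favor pairs of very common words, so a frequency cutoff alone is a crude filter compared with a proper association measure.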