Twinpeppers

Question

How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. Python has a built-in function, bigrams, that returns word pairs.

  >>> bigrams(['more', 'is', 'said', 'than', 'done'])
  [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
  >>>

What's left is to find the bigrams that occur more often than the frequencies of their individual words would suggest. Any ideas how to put this into code?

You would have to define "more often". Do you mean statistical significance?
Python has no such builtin, nor anything by that name in the standard library.
Use the NLTK library for this: nltk.googlecode.com/svn/trunk/doc/api/…
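
One possible reading of "more often", in the spirit of the comment above, is pointwise mutual information (PMI): compare how often a pair is observed with how often it would appear if its two words were independent. The sketch below is a minimal plain-Python illustration. The function name pmi_scores and the min_count cutoff are assumptions made for the example, not something the question specifies.

```python
import math
from collections import Counter


def pmi_scores(words, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI = log2(P(w1, w2) / (P(w1) * P(w2))). Pairs seen fewer than
    `min_count` times are skipped, because PMI overrates rare pairs.
    """
    word_freq = Counter(words)
    pair_freq = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (w1, w2), count in pair_freq.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = word_freq[w1] / n_words
        p_w2 = word_freq[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


words = ["more", "is", "said", "than", "done", "is", "said"]
scores = pmi_scores(words)
# Highest-scoring pairs first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A positive score means the pair shows up more often than independence would predict. On real text the cutoff would probably need to be higher than 2.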

Tim McNamara · Answer 1 · 2010-11-08 23:24:14Z

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

  >>> import nltk
  >>> def tokenize(sentences):
  ...     for sent in nltk.sent_tokenize(sentences.lower()):
  ...         for word in nltk.word_tokenize(sent):
  ...             yield word
  ...
  >>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
  <Text: mary had a little lamb ....>
  >>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are no collocations in this small segment, but here goes:

  >>> text.collocations(num=20)
  Building collocations list

Is it able to work on Unicode text? I got an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128)
Unicode works fine for most operations. nltk.Text may have issues, because it's just a helper class written for teaching linguistics students, and it sometimes trips up. It's mainly for demonstration purposes.
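
Since nltk.Text is mainly a teaching helper, a sketch that uses nltk.collocations.BigramCollocationFinder directly may be more useful. It assumes NLTK is installed and that the input is already a list of tokens, so no tokenizer data is needed. The frequency cutoff of 2 and the choice of PMI as the association measure are illustrative assumptions, not recommendations from the answer.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Pre-tokenized, lowercased input (sample words taken from this thread).
words = ["more", "is", "said", "than", "done", "is", "said"]

finder = BigramCollocationFinder.from_words(words)
# Ignore bigrams seen fewer than 2 times (cutoff chosen for illustration).
finder.apply_freq_filter(2)

# Rank the remaining bigrams by pointwise mutual information, keep up to 10.
measures = BigramAssocMeasures()
top = finder.nbest(measures.pmi, 10)
print(top)
```

The result is a plain list of word pairs, so slicing it and taking len() of it work as usual.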

Sven Marnach · Answer 2 · 2010-11-08 22:36:55Z

Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.

  words = ["more", "is", "said", "than", "done", "is", "said"]
  # Second iterator over the same list, advanced by one word.
  words_iter = iter(words)
  next(words_iter, None)
  count = {}
  for bigram in zip(words, words_iter):
      count[bigram] = count.get(bigram, 0) + 1
  # (count, bigram) tuples, highest count first.
  print(sorted(((c, b) for b, c in count.items()), reverse=True))

(words_iter is introduced to avoid copying the whole list of words, as you would in zip(words, words[1:]).)

Good work, but your code serves another purpose: I just need the collocations (without counts or similar). In the end I need to return the top 10 collocations (collocations[:10]) and their total number, using len(collocations).
You did not define clearly what you actually want. Maybe give some example output for some example input.

Tony Veijalainen · Answer 3 · 2010-11-08 23:09:50Z

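  from collections import Counter
  words = ['more', 'is', 'said', 'than', 'done']
  # A second iterator over the same list, advanced by one word.
  nextword = iter(words)
  next(nextword, None)
  # Pair each word with its successor and count the pairs.
  freq = Counter(zip(words, nextword))
  print(freq)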
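
The Counter above also covers what the asker describes in the comments: the ten most frequent pairs without their counts, plus the total number of distinct pairs. A minimal sketch follows. It ranks by raw frequency, which is an assumption, since the thread never settles on a measure. It also uses a slightly longer sample list so that one pair repeats.

```python
from collections import Counter

words = ["more", "is", "said", "than", "done", "is", "said"]

# Pair each word with its successor and count the pairs.
freq = Counter(zip(words, words[1:]))

# most_common() sorts by count, highest first; drop the counts.
collocations = [pair for pair, _count in freq.most_common()]

print(collocations[:10])  # up to ten most frequent pairs
print(len(collocations))  # number of distinct pairs
```

Raw frequency favors pairs of common words, so a stop-word filter or an association measure would likely be needed on real text.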

         col B         col C                  col D
  row 1  x <- scan()   y <- scan(, what="")
  row 2  1             Tommy
  row 3  2             Timmy
  row 4  3             Missy
  row 5  4             Mandy
  row 6  23            Mikey
  row 7
  etc...

Saturday, June 30, 2012

Data analysis with R http://www.inp.nsk.su/~baldin/DataAnalysis/index.html

Twitter text mining with R http://jeffreybreen.wordpress.com/

Tuesday, June 26, 2012

Exchanging data between R and MS Windows apps (Excel, etc) http://rwiki.sciviews.org/doku.php?id=tips:data-io:ms_windows

Text Files

MS Excel, Access, other applications

Small amount of data

rectangular data sets

Single row or column vectors

Large amount of data

Using RODBC Package

Named Ranges

Entire Worksheets

Caution

Directly Reading Excel Files

Usage example

Remarks

Caution

Download/Updates

The National Idea of Russia (six-volume edition), 25.06.2012 http://rusrand.ru/public/public_501.html

A little about R http://r-analytics.blogspot.com/

Wednesday, June 20, 2012

Word collocations in Python http://stackoverflow.com/questions/4128583/how-to-find-collocations-in-text-python

4 Answers
