Saturday, June 30, 2012

Data analysis with R http://www.inp.nsk.su/~baldin/DataAnalysis/index.html


PAW and ROOT

Twitter text mining with R http://jeffreybreen.wordpress.com/


slides from my R tutorial on Twitter text mining #rstats

Update: thanks to eagle-eyed Carl Howe for noticing a slightly out-of-date version of the score.sentiment() function in the deck. Missing was handling for NA values from match(). The deck has been updated and the code is reproduced here for convenience:
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)

    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array ("a") of scores back, so we use
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, pos.words, neg.words) {

        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}


Tuesday, June 26, 2012

Exchanging data between R and MS Windows apps (Excel, etc) http://rwiki.sciviews.org/doku.php?id=tips:data-io:ms_windows

The following are considerations when deciding which approach to take when transferring data between Excel and R:

  1. what platform are you using?
  2. do you have access to Excel or only to the spreadsheet file itself? If you have access to Excel is it on the same machine as R?
  3. is it an Excel 2003 spreadsheet (.xls) or Excel 2007 spreadsheet (.xlsx)?
  4. is this a one time transfer of a particular data set or will you be transferring numerous similar spreadsheets?
  5. is the spreadsheet located on your computer or does it have to be fetched from the internet or some other place?
  6. dates have different representations in Excel and in R. See Microsoft Knowledge Base article 214330 and R manual page for Dates.
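
Item 6 can be made concrete with a short sketch. Excel for Windows stores a date as a serial day number in its default 1900 date system, while R's Date class counts days from 1970-01-01. The Python snippet below is for illustration only: `excel_serial_to_date` and `days_since_unix_epoch` are hypothetical helpers, not part of any package named on this page, and the conversion assumes the 1900 date system and serial numbers of 61 or more (that is, dates after Excel's fictitious 29 February 1900). The sample serials are the ones used in the xlsReadWrite usage example further down this page.

```python
from datetime import date, timedelta

# Day zero of Excel's 1900 date system, shifted by one day to absorb
# Excel's fictitious 29 February 1900 (valid for serial numbers >= 61).
EXCEL_EPOCH = date(1899, 12, 30)
UNIX_EPOCH = date(1970, 1, 1)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial day number to a date."""
    if serial < 61:
        raise ValueError("serials below 61 are affected by the 1900 leap-year quirk")
    return EXCEL_EPOCH + timedelta(days=int(serial))


def days_since_unix_epoch(d):
    """Day count since 1970-01-01, the origin R's Date class uses."""
    return (d - UNIX_EPOCH).days


for serial in (39202, 39198, 39199):
    d = excel_serial_to_date(serial)
    print(serial, d.isoformat(), days_since_unix_epoch(d))
```

The two epochs differ by a fixed number of days, so once the date system of the workbook is known the conversion is a plain subtraction; the hard part in practice is knowing whether a numeric column was meant to be a date at all.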

Here are some R packages and approaches for transferring data between Excel spreadsheets and R:

Windows only. Excel must be installed:

  • clipboard. One-time transfer of an Excel spreadsheet via the Windows clipboard, with Windows, R and Excel all on the same machine. See R FAQ 2.3.
  • RDCOMClient or rcom. These two packages are listed together because they are very similar. Either of them provides an interface to the Windows COM facilities, allowing one to read and write Excel 2003 and Excel 2007 spreadsheets on Windows. They are very flexible but require detailed programming. (1) RDCOMClient is available from the omegahat RDCOMClient page. An example of using RDCOMClient to create a small spreadsheet on the fly can be found here, an example of listing all the sheet names in an Excel workbook is shown here, and an example of exporting a data frame into a spreadsheet is shown here. (2) rcom is available on CRAN but depends on statconnDCOM (is that right?), which is available on the statconnDCOM site. statconnDCOM has restrictions on commercial use.
  • RExcel is an Excel add-in that allows two way communication between Excel and R. Excel 2002, 2003, 2007 and 2010 all work. Whereas RDCOMClient and rcom are used from within R to access Excel, the user interacts with RExcel from within Excel to access R. RExcel uses rcom and statconnDCOM. The latter has restrictions on commercial use.

Windows only. Excel not needed:

  • xlsReadWrite. Can read and write Excel 2003 spreadsheets on Windows. Does not require Excel itself. It handles dates in the Excel spreadsheet well (is that right?). Be sure to read the SystemRequirements. Both free and commercial versions of xlsReadWrite exist. There is a forum for discussion and questions related to this package here. Additional information on xlsReadWrite is available here and here.
    library(xlsReadWrite)
    DF1 <- read.xls("test.xls") # read 1st sheet
    DF2 <- read.xls("test.xls", sheet = 2) # read 2nd sheet
    test1 <- read.xls("test.xls", sheet = "test1") # read sheet test1

Windows/Mac/Linux. Excel not needed:

  • XLConnect. This package can read and write Excel 2003 and 2007 spreadsheets on all platforms. It does not require Excel itself. It requires that Java be installed on the machine. (xlsx below is also a Java based package for Excel on CRAN.) If you are using 32 bit R make sure you are also using 32 bit Java, and if you are using 64 bit R make sure you are also using 64 bit Java.
  • dataframes2xls. Can write Excel 2003 spreadsheets. Does not require Excel itself. Uses python program.
    library(dataframes2xls)
    df1 <- data.frame(c1=1:2, c2=3:4, c3=5:6)
    df2 <- data.frame(c21=c(10.10101010101,20, 3), c22=c(50E50,60, 3) )
    outFile <- 'df12.xls'
    write.xls(c(df1,df2), outFile)
  • gdata. read.xls in this package can read Excel 2003 and Excel 2007 spreadsheets on all platforms. Can read data off the internet (transparently downloading and converting it). Does not need Excel itself. Can bypass all rows prior to a row with a given string match. Has utilities for getting the number and names of sheets within an Excel workbook. gdata also has several additional functions, xls2csv, xls2tsv, xls2tab and xls2sep, which create the same intermediate file as read.xls but do not read it back, so that the user can then read it in any way desired. Uses perl program.
    library(gdata)
    # read sheet off net ignoring rows prior to the first row having a cell containing the word State
    crime.url <- "http://www.jrsainfo.org/jabg/state_data2/Tribal_Data00.xls"
    crime <- read.xls(crime.url, pattern = "State")

Some caveats when using read.xls in gdata: (1) read.xls uses a perl program that produces an intermediate file with comma separated values (csv) and with quoted fields. The perl program escapes the quotes in the input data with backslashes by default. read.xls then reads that intermediate file with read.table. Unfortunately, read.table does not understand backslash escaped quotes as representing quotes and just interprets these as a backslash followed by a quote. Thus if your data values contain quotes use read.xls(..., quote = "") and fix up the data in R. If your data values contain quotes and commas but no tabs then use read.xls(..., quote = "", method = "tab"). If these approaches do not work use one of the additional functions xls2csv, etc. mentioned previously and read in the intermediate file yourself. (2) xls files are read using the formatted precision (if they have been formatted in Excel) whereas xlsx files are read using the full underlying precision. (3) A bug was recently found in which read.xls adds a space to the end of the last field. Usually this is harmless but occasionally it causes a problem. Until it is fixed a workaround is shown here. (4) As there were problems with gdata version 2.7.1, be sure that you use gdata version 2.7.2 or later.

  • RExcelXML. This package can read Excel 2007 spreadsheets. It does not require Excel itself. See RExcelXML on Omegahat.
  • RODBC. Can read Excel 2003 and Excel 2007 on all platforms and write and append to spreadsheets on Windows. Supports reading named ranges. The ODBC driver uses a strange encoding for special characters in sheet names, so if sheet names do have special characters use RODBC's ability to list sheet names to find out what it thinks the names are. Some limitations on Windows are: (1) if you list the sheet names (see example below) they are returned in alphabetical order, so there is no way to find out which sheet name corresponds to the first sheet in an Excel workbook with multiple sheets; (2) one cannot read more than 255 columns using RODBC. These limitations may be limitations of the ODBC driver rather than of RODBC itself, so it might be possible to avoid them with a different ODBC driver. ODBC may be challenging to set up on non-Windows platforms, but on Windows it is very easy as it's just an ordinary R package install. On 64 bit Windows ensure you are using 64 bit tools (R, RODBC) or 32 bit tools (R, RODBC) but not a mixture.
    library(RODBC)
    # the comments below relate to RODBC used with the Excel 2003 ODBC driver on Windows Vista
    con <- odbcConnectExcel("test.xls")
    # list sheet names and other info in alphabetical order -- NOT order that the sheets appear in the workbook
    sqlTables(con)
    DF <- sqlFetch(con, "test1") # get sheet called test1
    # read named range MyData
    MyData <- sqlQuery(con, "select * from MyData", na.strings = "NA", as.is = TRUE)
    close(con)
  • WriteXLS can write Excel 2003 spreadsheets. Does not require Excel itself. Uses perl program. Be sure to read the INSTALL instructions that come with WriteXLS. It uses perl packages that need to be built for the specific version of perl you are using so installation may be challenging.
    library(WriteXLS)
    df1 <- data.frame(c1=1:2, c2=3:4, c3=5:6)
    df2 <- data.frame(c21=c(10.10101010101,20, 3), c22=c(50E50,60, 3) )
    outFile <- 'df12.xls'
    # WriteXLS takes the names of the data frames as character strings
    # (write.xls belongs to dataframes2xls, not to WriteXLS)
    WriteXLS(c("df1", "df2"), outFile)

    # another example
    iris.split <- split(iris, iris$Species)
    WriteXLS("iris.split", "iris_split.xls")
  • xlsx. Can read and write .xlsx spreadsheets (Excel 2007) and .xls spreadsheets (Excel 97/2000/XP/2003) on all platforms. Does not need Excel itself. Uses java program. If you are using 32 bit R make sure you are also using 32 bit Java, and if you are using 64 bit R make sure you are also using 64 bit Java. If you get a Java heap space message indicating that it is out of memory see this post.
    library(xlsx)

    # read sheets
    names(getSheets(loadWorkbook("test.xlsx"))) # list sheet names
    DF <- read.xlsx("test.xlsx", 1) # read first sheet
    test1 <- read.xlsx("test.xlsx", sheetName = "test1") # read sheet named test1

    # write sheets (based on post by Don MacQueen on r-help)
    df1 <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6)
    df2 <- data.frame(c21 = c(10.10101010101,20, 3), c22 = c(50E50, 60, 3) )
    outFile <- 'df12.xls'
    wb <- createWorkbook()
    sh1 <- createSheet(wb,'sheet1')
    addDataFrame(df1,sh1)
    sh2 <- createSheet(wb,'sheet2')
    addDataFrame(df2,sh2)
    saveWorkbook(wb,outFile)

Also see the R Data Import/Export manual (http://cran.r-project.org/doc/manuals/R-data.html) and search http://search.r-project.org, for example from within R:

  RSiteSearch("Excel")

Text Files

The remaining portion of this page is adapted from Paul Johnson 2005/09/25 with permission by Nick Drew 2006/04/18

Much of the remaining info on this page is outdated and probably should be deleted.

Read import_table to get some ideas about how to bring data into R from a text file.

MS Excel, Access, other applications

Most commonly, people seem to want to import Microsoft Excel spreadsheets. Be sure to prepare your data in Excel so that the names of the variables are at the top of each column of data, and you have numbers or NA filled in for all cells (although this last part is not always necessary, as noted in some of the examples below).

Small amount of data

rectangular data sets

Perhaps the quickest way to import a 'small' amount of data from almost any Windows application (MS Excel spreadsheet, MS Access database query or table or even a delimited text file) is to select the text (including column headings) or the rows (in MS Access table or query) with the mouse and copy it to the clipboard (ctrl-c). Then type the following command at the R prompt:

myDF <- read.delim("clipboard")

Your data are now saved in an object called myDF. Inspect your data before using them. The following example shows how to go in the other direction: how to get a 'small' amount of data out of R into Excel.

## export 'iris' data to clipboard
write.table(iris, "clipboard", sep = "\t", col.names = NA)
## then open up MS Excel and paste (ctrl-v) in iris data

1) Date values may not work as expected using the above approaches.
2) I don't know what the size limit is for the Windows clipboard but be aware that there is a limit to the amount of data the clipboard can hold. However, the above methods work relatively well for 'small' data sets that have a few hundred cases or less.
3) This will probably not work if the commands are run from an editor such as R-editor or Tinn-R. Those programs use the clipboard to carry the command from the editor to the R console, which temporarily displaces the table data that had just been copied from Excel (or any other program).

Single row or column vectors

A single row or column vector of data often needs to be imported into R to perform simple calculations (like those you would normally do in a spreadsheet), to graph, or to use as input to a function. What follows are some examples of how to get data from Excel into R for these purposes.

  • Scan in a numeric column vector – Suppose your data are NUMERIC and organized vertically in your spreadsheet like col B in the example table below.

(Your spreadsheet might look like this.)

        col B         col C                  col D
row 1   x <- scan()   y <- scan(, what="")
row 2   1             Tommy
row 3   2             Timmy
row 4   3             Missy
row 5   4             Mandy
row 6   23            Mikey
row 7
etc...

With your mouse select from row 1, col B to row 7, col B in your spreadsheet (be sure to include the blank cell in row 7) and paste (Ctrl-V) into R. Now you have an object in R called 'x' with the values 1, 2, 3, 4, and 23. Now you can use 'x' for whatever purpose you were planning.

  • Scan in a character column vector – Suppose your data are CHARACTER and organized vertically in your spreadsheet like col C in the example table above. This works the same as the previous example; just be sure to include the argument what = "".

Large amount of data

The above methods work fine when you have a few hundred cases and a limited number of columns. When your data set has grown beyond those limits, though, there are better and safer methods for getting your data into R using spreadsheets and databases.

For reading data from Microsoft Access, see microsoft_access.

Some of these methods are described below for Excel, but recall that Excel has a limit on the size of the worksheet. The maximum worksheet size for Excel 2000 is 65,536 rows by 256 columns. The maximum worksheet size for Excel 12 (expected release in 2007) will be 1,048,576 rows by 16,384 columns. If your data exceed Excel's limits, you may need to use Access or other relational database applications.

Using RODBC Package

I have not tested the following approach in applications other than Excel and Access, but I think these can be modified and used for non-MS applications.

Named Ranges

The safest approach is to define a named range in Excel (2000) by selecting Name » Define from the Insert menu. "Name" & "Define" the range of data using the dialog box. Save your Excel workbook. Let's say I named my range of data "MyData" and saved the Excel file as "Test.xls". Use the following code to read the data into R using the RODBC package.

library(RODBC)
MyExcelData <- sqlQuery(odbcConnectExcel("Test.xls"),
                        "select * from MyData", na.strings = "NA", as.is = TRUE)
odbcCloseAll()

Entire Worksheets

Use the following code to import all of a worksheet called "Sheet1". The hazard with this approach is that any and all data in that worksheet will be copied in, including data that are hidden or that you otherwise were not intending to bring in.

library(RODBC)
MyExcelData <- sqlFetch(odbcConnectExcel("Test.xls"),
                        sqtable = "Sheet1", na.strings = "NA", as.is = TRUE)
odbcCloseAll()

Caution

Excel 2003 (and earlier?) uses the first 0-16 rows to guess the data type. Consider a column of international postal codes where the first 20 rows contain 50010 and the next two rows contain 500A1 and 500E1. The value '500A1' is likely to be interpreted as a missing value, and the value '500E1' may be interpreted as a numeric value in exponential format. More information can be found here: http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/.

— Nick Drew 2006/04/19 07:48
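
The effect described in the caution above can be imitated in a few lines. The Python sketch below is only a simulation under stated assumptions, not the actual algorithm of Excel or of any ODBC driver: `looks_numeric`, `guess_column_type` and `read_column` are hypothetical helpers, the sample size of 16 rows is taken from the text above, and Python's `float()` stands in for the numeric parser.

```python
def looks_numeric(cell):
    """True if float() can parse the cell, e.g. '50010' or '500E1'."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def guess_column_type(cells, sample_rows=16):
    """Guess 'numeric' or 'text' from the first sample_rows cells only."""
    sample = cells[:sample_rows]
    return "numeric" if all(looks_numeric(c) for c in sample) else "text"


def read_column(cells, sample_rows=16):
    """Read a column using the guessed type; unparsable cells become None."""
    if guess_column_type(cells, sample_rows) == "text":
        return list(cells)
    return [float(c) if looks_numeric(c) else None for c in cells]


# 20 ordinary postal codes followed by two codes that contain letters
postal_codes = ["50010"] * 20 + ["500A1", "500E1"]

# Guessing from the first 16 rows: '500A1' is lost and '500E1' is parsed
# as a number in exponential notation
values = read_column(postal_codes)
print(values[-2:])

# Sampling every row makes the guess 'text', so both codes survive intact
print(read_column(postal_codes, sample_rows=len(postal_codes))[-2:])
```

Whatever tool does the reading, the practical remedy is the same: state the column type explicitly instead of relying on a guess made from the first few rows.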

Directly Reading Excel Files

There are several alternatives to read xls (Excel 97 through 2003) or xlsx (Excel 2007) files directly:

  1. read.xls in the gdata package (which in turn calls Perl code to do the real work so it works on all platforms, does not require ODBC or Excel, can specify file or URL, can skip all rows prior to specified regular expression, only works with xls files)
  2. RODBC package. Uses the ODBC database interface to access Excel spreadsheets. (This may work with both xlsx and xls files – check this.)
  3. RExcel. This is an Excel add-in which lets you select ranges in Excel and transfer them to R from an Excel menu. Excel and R are accessible at the same time, so one can immediately use the transferred data in R. Data can be transferred as matrices or as dataframes. RExcel is installed by the CRAN package RExcelInstaller. It needs further packages (rcom, rscproxy, and the statconnDCOM server) which can be installed as part of the installation of RExcel.
  4. rcom/RDCOMClient. These two packages are very similar and provide customized access to Excel spreadsheets using the Windows COM interface. They require detailed programming and knowledge of Excel's COM interface but are very flexible. They require that Excel be on the computer. (They may work with both xlsx and xls files – check this.)
  5. RExcelXML. This package (from www.omegahat.org/RExcelXML and the repository www.omegahat.org/R) can read .xlsx files directly and provides high- and low-level functions for accessing the cells, sheets and workbooks.
  6. The xlsReadWrite package which reads and writes Excel files directly on Windows. xlsReadWrite works on the .xls file without using ActiveX, ODBC, Perl, or Excel. Only works with xls files. The following example shows the use of xlsReadWrite.
  7. The XLConnect package which writes xls/xlsx workbooks on all platforms.
Usage example

library( xlsReadWrite )

### create some test^H^H^H^Hbikedata
tdat <- data.frame( Price = c( 6399, 3699, 2499 ),
                    Amount = c( 2, 3, 1 ),
                    Date = c( 39202, 39198, 39199 ),
                    row.names = c( "Pro machine", "Road racer", "Streetfire" ) )

### write
write.xls( tdat, "bikes.xls" )

### read and check

# read as data.frame
bikes1 <- read.xls( file = "bikes.xls" )
if (!identical( tdat, bikes1 )) stop( "oops, not good..." )

# read as data.frame (custom colnames, date as iso-string)
bikes2 <- read.xls( file = "bikes.xls", colNames = c( "", "CHF", "Number", "Date" ),
                    from = 2, colClasses = c( "numeric", "numeric", "isodate" ) )
if (!all( tdat$Date == isoStrToDateTime( bikes2$Date ) )) stop( "oops, not good..." )

# read as matrix
bikes3 <- read.xls( file = "bikes.xls", type = "double" )
if (!identical( as.matrix( tdat ), bikes3 )) stop( "oops, not good..." )

Remarks

xlsReadWrite has some non-standard aspects, hence consider the following remarks:

  • Our own code is free (GPLv2), but xlsReadWrite contains 3rd party code which we may only distribute in binary form. If you want to compile the package for yourself you need a license for that code.
  • In the help files we mention a more feature-rich pro version (online help, brochure). It is a rewrite, and being a small company we decided to ask people to support our effort if more advanced features are wanted or needed. That said, the free version works just fine (see testimonials).
  • The low level code has been written in Pascal (Delphi).
Caution

xlsReadWrite has the same problems reading columns of mixed-data as mentioned in the "Caution" section above. Type guessing for data.frame variables works like this: max. 16 rows will be considered and the first non-empty cell value will determine the type. Example: a numeric value in the 1st row determines the type (numeric). Now a string value in the 2nd row which cannot be converted to a number will be given back as a NA.

Solution: specify a colClasses argument and explicitly decide if you want numbers or characters. [In the pro version you can also read (an excerpt of) a single column and check the needed type for yourself. Note: the above example would work well with the pro version as the guessing algorithm considers all 16 rows (but it would fail also if the character value were on row 17 or more...)].

Download/Updates

xlsReadWrite is available on CRAN or from our website. Minor updates will only be uploaded to our website.

— Hans-Peter Suter 2007/04/30 23:33

The National Idea of Russia (НАЦИОНАЛЬНАЯ ИДЕЯ РОССИИ). Six-volume set, 25.06.2012 http://rusrand.ru/public/public_501.html

The National Idea of Russia. In 6 vols. Vol. I. Moscow: Научный эксперт, 2012. 752 pp. Н 35
UDC 316.334.3:321(066) BBK 60.032.61 Н 35 ISBN 978-5-91290-116-4

Author council: Якунин В.И., Сулакшин С.С., Багдасарян В.Э., Вилисов М.В., Кара-Мурза С.Г., Лексин В.Н.

The following took part in the research and in writing the monograph: Аверков В.В., Ахметзянова И.Р., Багдасарян В.Э., Бахтизин А.Р., Белобородов И.И., Белов П.Г., Буянова Е.Э., Васюкова Д.А., Венедиктов Д.Д., Вилисов М.В., Воробьева О.Д., Глигич-Золотарева М.В., Гундаров И.А., Данилина Т.А., Деева М.В., Дерин С.В., Дмитриев А.В., Журавлев Д.А., Кара-Мурза С.Г., Каримова Г.Г., Клейнхоф А.Э., Клюев Н.Н., Колесник И.Ю., Кондаков А.В., Куренкова Е.А., Куропаткина О.В., Лексин В.Н., Леонова О.Г., Липский И.А., Макурина Л.А., Малков С.Ю., Малчинов А.С., Манько В.Л., Маслова А.Н., Метлик И.В., Мчедлова М.М., Нетесова М.С., Орлов И.Б., Пантелеев С.Ю., Петренко А.И., Погорелко М.Ю., Репин И.В., Сазонова Е.С., Сафонова Ю.А., Сивков К.В., Симонов В.В., Скуратов Ю.И., Смирнов В.С., Строганова С.М., Сулакшин С.С., Сундиев И.Ю., Тимченко А.В., Фролов Д.Б., Фурсов А.И.

The question of Russia's national idea has a long history. It is as important for the country as the question of the meaning of life is for each person. Without an answer to it, goals, values, vital energy and success become barely tangible and hard to attain. Contemporary Russia faces this question especially acutely in two respects. First, in the outside world: what is Russia, why does it exist in world history, and what is its present contribution to the development of the world? This question is akin to the search for the "Russian idea" that preoccupied many Russian thinkers, beginning with Dostoevsky, Solovyov and Berdyaev. Second, in order to speak of the meaning of life, there must be life! In order to speak of Russia, its mission and its destiny, Russia must exist! It even turns out that this is not "second" but "first"! In the twentieth century Russia as a state was destroyed twice. Today much points to another threat of the same magnitude.

In this collective monograph, the specific character of Russia as a civilization, a state, a country and a human community is analyzed on the basis of a multidisciplinary scientific approach and of logical-philosophical and mathematical modeling of the country's success, taken as an indicator of its viability. The link between the quality of concrete multifactor state governance, public activity and the success of the country as a whole is shown. Specific keys to Russia's success are identified, which differ from those of other state-civilizations. It is shown that the country's current socio-economic and political model is poorly compatible with the viability of Russia.

"Model of the country" and "success of the country" are introduced as basic categories in a scientifically formalized space of objective functions and of a set of independent parameters of state governance and managerial choice. A link is established between the Basic Law (Constitution) of Russia, which programs the country's development, and its actual achievements and challenges. On the basis of the study, the authors propose a scholarly draft of a new Constitution of Russia, a Doctrine of the Security and Development of Russia, and, derived from these basic documents, a system of legal acts and of institutional, socio-economic, regional, financial, foreign-policy and humanitarian principles for organizing life in Russia.

CONTENTS

VOLUME I
Introductory chapter
Intro.1. Statement of the problem
Intro.2. Russia's national idea in the history of thought
Intro.3. Central methodology of the study
Intro.4. Fate and action (the transcendent challenge)
Intro.5. An all-Russian focus group: a national motto for Russia
Part I. Viability of the country: content, history, factors
Introduction I
Chapter 1. Country, state, statehood
Chapter 2. The condition and distinctive features of Russian statehood

VOLUME II
Chapter 2. The condition and distinctive features of Russian statehood (continued)
Chapter 3. Civilization and the viability of the country (a theoretical and methodological model)
Chapter 4. History of the struggle against Russian statehood

VOLUME III
Part II. Threats to the viability of the country
Introduction II
Chapter 5. Non-forcible methods of destroying Russian statehood
Chapter 6. Degradation of Russia's population
Chapter 7. The problem of holding the territory

VOLUME IV
Chapter 7. The problem of holding the territory (continued)
Chapter 8. Quality and competence of state governance
Chapter 9. Clan mechanisms

VOLUME V
Part III. What is to be done? A program of action
Chapter 10. An algorithmized response to challenges, and managerial acts as the basis for forming a Program of action of the state and society
Chapter 11. Territorial integrity
Chapter 12. Preserving the people: a demographic program of action
Chapter 13. Humanitarian construction
Chapter 14. State building

VOLUME VI
Chapter 15. Economic development
Chapter 16. Russia in the world
Chapter 17. Inevitable healing reforms and the future of Russia
Chapter 18. Maximizing the viability of the country as the basis
Chapter 19. A scholarly draft of a new Constitution of Russia and the Doctrine of the Security and Development of Russia: the basic documents programming the transition to a model of a viable Russia

For civil servants, researchers, lecturers, postgraduates and students.

A little about R http://r-analytics.blogspot.com/



Wednesday, June 20, 2012

Word collocations Python http://stackoverflow.com/questions/4128583/how-to-find-collocations-in-text-python

How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. python has built-in func bigrams that returns word pairs.

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?

You would have to define more often. Do you mean statistic significance? – Björn Pollex Nov 8 '10 at 22:12
Python has no such builtin, nor anything by that name in the standard library. – Glenn Maynard Nov 8 '10 at 22:17

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

  >>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are none in this small segment, but here goes:

  >>> text.collocations(num=20)
Building collocations list
is it able to work on unicode text? I got an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128) – Gusto Nov 9 '10 at 23:03
Unicode works fine for most operations. nltk.Text may have issues, because it's just a helper class written for teaching linguistics students - and gets caught sometimes. It's mainly for demonstration purposes. – Tim McNamara Nov 10 '10 at 18:10

Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.

# Python 2 (izip, iteritems and the print statement are Python 2 only)
from itertools import izip
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)
count = {}
for bigram in izip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print sorted(((c, b) for b, c in count.iteritems()), reverse=True)

(words_iter is introduced to avoid copying the whole list of words, as you would do in izip(words, words[1:]).)

good work but your code is for another purpose - i just need collocations (without any count or similar). in the end i will need to return the top 10 colloc-s (collocations[:10]) and the total number of them using len(collocations) – Gusto Nov 8 '10 at 22:52
You actually did not define well what you actually want. Maybe give some example output for some example input. – Sven Marnach Nov 8 '10 at 22:54
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq = Counter(zip(words, nextword))
print(freq)
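
The counts above rank bigrams by raw frequency, whereas the question asks for pairs that occur more often than the frequencies of the individual words would suggest. One common way to express that is pointwise mutual information (PMI). The sketch below is one possible reading of the question, not the method of any answer above; `top_collocations`, its `min_count` threshold and the sample sentence are assumptions made for illustration, and only the standard library is used.

```python
import math
from collections import Counter


def top_collocations(words, n=10, min_count=2):
    """Return up to n bigrams ranked by pointwise mutual information.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from raw counts. Bigrams seen fewer than min_count times
    are skipped, because PMI overrates rare pairs.
    """
    word_freq = Counter(words)
    bigram_freq = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(n_words - 1, 1)

    scored = []
    for (a, b), count in bigram_freq.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = word_freq[a] / n_words
        p_b = word_freq[b] / n_words
        scored.append((math.log2(p_ab / (p_a * p_b)), (a, b)))

    # highest PMI first
    scored.sort(reverse=True)
    return [bigram for _, bigram in scored[:n]]


text = ("new york is big and new york is busy "
        "but the city is not the only big city").split()
collocations = top_collocations(text)
print(collocations[:10], len(collocations))
```

The function returns a plain list, so the top ten are `collocations[:10]` and the total is `len(collocations)`, which is the shape asked for in the comments. For real corpora, nltk.collocations.BigramCollocationFinder mentioned in the first answer is the more complete tool.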