Exchanging data between R and MS Windows apps (Excel, etc)
The following are considerations when deciding which approach to take when transferring data between Excel and R:
- what platform are you using?
- do you have access to Excel or only to the spreadsheet file itself? If you have access to Excel is it on the same machine as R?
- is it an Excel 2003 spreadsheet (.xls) or Excel 2007 spreadsheet (.xlsx)?
- is this a one time transfer of a particular data set or will you be transferring numerous similar spreadsheets?
- is the spreadsheet located on your computer or does it have to be fetched from the internet or some other place?
- dates have different representations in Excel and in R. See Microsoft Knowledge Base article 214330 and R manual page for Dates.
Here are some R packages and approaches for transferring data between Excel spreadsheets and :
Windows only. Excel must be installed:
- clipboard. One time transfer of Excel spreadsheet using Windows, R and Excel all on the same machine via Windows clipboard. See R FAQ 2.3 .
- RDCOMClient or rcom. These two packages are listed together as they are very similar. Either of these provide interfaces to Windows COM facilities allowing one to read and write Excel 2003 and Excel 2007 spreadsheets on Windows. They are very flexible but require detailed programming. (1) RDCOMClient is available from the omegahat RDCOMClient page. An example of using RDCOMClient to create a small spreadsheet on the fly can be found here and an example of listing all the sheet names in an Excel workbook using RDCOMClient is shownhere and an example of using RDCOMClient to export a data frame into a spreadsheet is shown here. (2) rcom is available on CRAN but depends on statconnDCOM (is that right?) which is available on the statconnDCOM site. statconnDCOM has restrictions on commercial use.
- RExcel is an Excel add-in that allows two way communication between Excel and R. Excel 2002, 2003, 2007 and 2010 all work. Whereas RDCOMClient and rcom are used from within R to access Excel, the user interacts with RExcel from within Excel to access R. RExcel uses rcom and statconnDCOM. The latter has restrictions on commercial use.
Windows only. Excel not needed:
- xlsReadWrite. Can read and write Excel 2003 spreadsheets on Windows. Does not require Excel itself. It handles dates in the Excel spreadsheet well (is that right?) Be sure to read the SystemRequirements. Both free and commercial versions of xlsReadWrite exist. There is a forum for discussion and questions related to this package here. Additional information on xlsReadWrite is available here and here .
library(xlsReadWrite) DF1 <- read.xls("test.xls") # read 1st sheet DF2 <- read.xls("test.xls", sheet = 2) # read 2nd sheet test1 <- read.xls("test.xls", sheet = "test1") # read sheet test1
Windows/Mac/Linux. Excel not needed:
- XLConnect. This package can read and write Excel 2003 and 2007 spreadsheets on all platforms. It does not require Excel itself. It requires that Java be installed on the machine. (xlsx below is also a Java based package for Excel on CRAN.) If you are using 32 bit R make sure you are also using 32 bit Java and if you re using 64 bit R make sure you are also using 64 bit Java.
- dataframes2xls. Can write Excel 2003 spreadsheets. Does not require Excel itself. Uses python program.
library(dataframes2xls) df1 <- data.frame(c1=1:2, c2=3:4, c3=5:6) df2 <- data.frame(c21=c(10.10101010101,20, 3), c22=c(50E50,60, 3) ) outFile <- 'df12.xls' write.xls(c(df1,df2), outFile)
- gdata. read.xls in this package can read Excel 2003 and Excel 2007 spreadsheets on all platforms. Can read data off the internet (transparently downloading and converting it). Does not need Excel itself. Can bypass all rows prior to a row with a given string match. Has utilities for getting the number and names of sheets within an Excel workbook. gdata also has several addditional functions xls2csv, xls2tsv, xls2tab and xls2sep which create the same intermediate file as read.xls but do not read it back so that the user can then read it in any way desired. Uses perl program.
library(gdata) # read sheet off net ignoring rows prior to the first row having a cell containing the word State crime.url <- "http://www.jrsainfo.org/jabg/state_data2/Tribal_Data00.xls" crime <- read.xls(crime.url, pattern = "State")
Some caveats when using read.xls in gdata: (1) read.xls uses a perl program that produces an intermediate file with comma separated values (csv) and with quoted fields. The perl program escapes the quotes in the input data with backslashes by default. It then reads that intermediate file with read.table. Unfortunately, read.table does not understand backslash escaped quotes as representing quotes and just interprets these as a backslash followed by a quote. Thus if your data values contain quotes use read.xls(..., quote = "") and fix up the data in R. If your data values contain quotes and commas but no tabs then use read.xls(..., quote = "", method = "tab"). If these approaches do not work use one of the additional functions xls2csv, etc. mentioned previously and read in the intermediate file yourself. (2) xls files are read using the formatted precision (if they have been formatted in Excel) whereas xlsx files are read using the full underlying precision. (3) A bug was recently found in which read.xls adds a space to the end of the last field. Usually this is harmless but occasionally it causes a problem. Until its fixed a workaround is shown here. (4) As there were problems with gdata version 2.7.1 be sure that you use gdata version 2.7.2 or later.
- RExcelXML. This package can read Excel 2007 spreadsheets. It does not require Excel itself. See RExcelXML on Omegahat .
- RODBC. Can read Excel 2003 and Excel 2007 on all platforms and write and append to spreadsheets on Windows. Supports reading named ranges. ODBC driver uses a strange encoding for special characters in sheet names so if sheet names do have special characters use RODBC's ability to list sheet names to find out what it thinks the names are. Some limitations on Windows are (1) if you list the sheet names in Windows (see example below) they are returned in alphabetical order so there is no way to find out which sheet name corresponds to the first sheet in an Excel workbook with multiple sheets. (This is likely a limitation of the ODBC driver and not of RODBC itself.) (2) one cannot read more than 255 columns using RODBC. These limitations may be limitations of the ODBC driver rather than RODBC itself so it might be possible to avoid them with a different ODBC driver. ODBC may be challenging to set up on non-Windows platforms but on Windows it is very easy as its just an ordinary R package install. On 64 bit Windows ensure you are using 64 bit tools (R, RODBC) or 32 bit tools (R, RODBC) but not a mixture.
library(RODBC) # the comments below relate to RODBC used with the Excel 2003 ODBC driver on Windows Vista con <- odbcConnectExcel("test.xls") # list sheet names and other info in alphabetical order -- NOT order that the sheets appear in the workbook sqlTables(con) DF <- sqlFetch(con, "test1") # get sheet called test1 # read named range MyData MyData <- sqlQuery(con, "select * from MyData", na.strings = "NA", as.is = TRUE) close(con)
- WriteXLS can write Excel 2003 spreadsheets. Does not require Excel itself. Uses perl program. Be sure to read the INSTALL instructions that come with WriteXLS. It uses perl packages that need to be built for the specific version of perl you are using so installation may be challenging.
library(WriteXLS) df1 <- data.frame(c1=1:2, c2=3:4, c3=5:6) df2 <- data.frame(c21=c(10.10101010101,20, 3), c22=c(50E50,60, 3) ) outFile <- 'df12.xls' write.xls(c(df1,df2), outFile) # another example iris.split <- split(iris, iris$Species) WriteXLS("iris.split", "iris_split.xls")
- xlsx. Can read and write .xlsx spreadsheets (Excel 2007) and .xls spreadsheets (Excel 97/2000/XP/2003) on all platforms. Does not need Excel itself. Uses java program. If you are using 32 bit R make sure you are also using 32 bit Java and if you re using 64 bit R make sure you are also using 64 bit Java. If you get a Java heap space message indicating that it is out of memory see this post.
library(xlsx) # read sheets names(getSheets(loadWorkbook("test.xlsx"))) # list sheet names DF <- read.xlsx("test.xlsx", 1) # read first sheet test1 <- read.xlsx("test.xlsx", sheetName = "test1") # read sheet named test1 # write sheets (based on post by Don MacQueen on r-help) df1 <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6) df2 <- data.frame(c21 = c(10.10101010101,20, 3), c22 = c(50E50, 60, 3) ) outFile <- 'df12.xls' wb <- createWorkbook() sh1 <- createSheet(wb,'sheet1') addDataFrame(df1,sh1) sh2 <- createSheet(wb,'sheet2') addDataFrame(df2,sh2) saveWorkbook(wb,outFile)
Also see the Data Import/Export manual (http://cran.r-project.org/doc/manuals/R-data.html) and search http://search.r-project.org
RSiteSearch("Excel")
.
Text Files
The remaining portion of this page is adapted from Paul Johnson 2005/09/25 with permission by Nick Drew 2006/04/18
Much of the remaining info on this page is outdated and probably should be deleted.
Read import_table to get some ideas about how to bring data into R from a text file.
MS Excel, Access, other applications
Most commonly, people seem to want to import Microsoft Excel spreadsheets. Be sure to prepare your data in Excel so that the names of the variables are at the top of each column of data, and you have numbers or NA filled in for all cells (although this last part is not always necessary as noted in the some of the examples below.)
Small amount of data
rectangular data sets
Perhaps the quickest way to import a 'small' amount of data from almost any Windows application (MS Excel spreadsheet, MS Access database query or table or even a delimited text file) is to select the text (including column headings) or the rows (in MS Access table or query) with the mouse and copy it to the clipboard (ctrl-c). Then type the following command at the prompt:
myDF <- read.delim("clipboard")
Your data are now saved in an object called myDF. Inspect your data before using. The following example demonstrates shows how to go the other direction – how to get a 'small' amount of data out of into Excel.
## export 'iris' data to clipboard write.table(iris, "clipboard", sep = "\t", col.names = NA) ## then open up MS Excel and paste (ctrl-v) in iris data
1) Date values may not work as expected using the above approaches.
2) I don't know what the size limit is for the Windows clipboard but be aware that there is a limit to the amount of data the clipboard can hold. However, the above methods work relatively well for 'small' data sets that have a few hundred cases or less.
3)Will probably not work if running the commands from an editor such as R-editor or Tinn-R. Those programs use the clipboard to carry their command from the editor to the R console. That temporarily displaces the Excel (or any other program's table) data that had just been copied.
Single row or column vectors
Often times a single row or column vector of data needs to be imported into to perform simple calculations (like those you would normally do in a spreadsheet), to graph, or to use as input to a function. What follows are some examples of how to get data from Excel into for these purposes.
- Scan in a numeric column vector – Suppose your data are NUMERIC and organized vertically in your spreadsheet like col b in the example table below.
(Your spreadsheet might look like this.)
col B | col C | col D. | |
---|---|---|---|
row 1 | x <- scan() | y <- scan(, what="") | |
row 2 | 1 | Tommy | |
row 3 | 2 | Timmy | |
row 4 | 3 | Missy | |
row 5 | 4 | Mandy | |
row 6 | 23 | Mikey | |
row 7 | |||
etc... |
With your mouse select from row 1, col B
to row 7, col B
in your spreadsheet (be sure to include the blank cell in row 7) and paste (Ctrl-V) into . Now you have an object in called 'x' with the values 1, 2, 3, 4, and 23. Now you can use 'x' for whatever purpose you were planning.
- Scan in a character column vector – Suppose your data are CHARACTER and organized vertically in your spreadsheet like col c in the example table above. This works the same as the previous example, just be sure to include the argument called what = ""
Large amount of data
The above methods work fine when you have a few hundred cases and limited number of columns. When you data set has grown beyond those limits though, there are better and safer methods for getting your data into using spreadsheets and databases.
For reading data from Microsoft Access, see microsoft_access.
Some of these methods are described below for Excel, but recall that Excel has a limit on the size of the worksheet. The maximum worksheet size for Excel 2000 is 65,536 rows by 256 columns. The maximum worksheet size for Excel 12 (expected release in 2007) will be 1,048,576 rows by 16,384 columns. If your data exceed Excel's limits, you may need to use Access or other relational database applications.
Using RODBC Package
I have not tested the following approach in applications other than Excel and Access, but I think these can be modified and used for non-MS applications.
Named Ranges
The safest approach is to define a named range in Excel (2000) by selecting Name » Define from the Insert menu. "Name" & "Define" the range of data using the dialog box. Save your Excel workbook. Let's say I Named my range of data by calling it "MyData" and saved the Excel file as "Test.xls". Use the following code to read the data into using the RODBC package.
library(RODBC) MyExcelData <- sqlQuery(odbcConnectExcel("Test.xls"), "select * from MyData", na.strings = "NA", as.is = T) odbcCloseAll()
Entire Worksheets
Use the following code to import in all of worksheet called "Sheet 1". The hazard with this approach is that any and all data in that worksheet will be copied in, this includes data that are hidden or that you otherwise were not intending to bring in.
library(RODBC) MyExcelData <- sqlFetch(odbcConnectExcel("Test.xls"), sqtable = "Sheet1", na.strings = "NA", as.is = T) odbcCloseAll()
Caution
Excel 2003 (and earlier?) use the first 0-16 rows to guess the data type. Consider a column of international postal codes where the first 20 rows contain 50010 and the next two rows contain 500A1 and 500E1. The value of '500A1' is likely to be interpreted as a missing value and the value of '500E1' may be interpreted as a numeric value that is in exponential format. More information can be found here: http://www.dicks-blog.com/archives/2004/06/03/external-data-mixed-data-types/.
— Nick Drew 2006/04/19 07:48
Directly Reading Excel Files
There are several alternatives to read xls (Excel 97 through 2003) or xlsx (Excel 2007) files directly:
- read.xls in the gdata package (which in turn calls Perl code to do the real work so it works on all platforms, does not require ODBC or Excel, can specify file or URL, can skip all rows prior to specified regular expression, only works with xls files)
- RODBC package. Uses ODBC data base interface to access Excel spredsheets. (This may work with either xlsx and xls files – check this.)
- RExcel. This is an Excel add-in which allows to select ranges in Excel and transfer them to R from an Excel menu. Excel and R are accessible at the same time, so one can immediately use the transferred data in R. Data can be transferred as matrices or as dataframes. RExcel is installed by the CRAN package RExcelInstaller. It needs further packages (rcom, rscproxy, and the statconnDCOM server) which can be installed as part of the installation of RExcel.
- rcom/RDCOMClient. These two packages are very similar and provide customized access to Excel spreadsheets using the Windows COM interface. They require detailed programming and knowledge of Excel's COM interface but are very flexible. They require that Excel be on the computer. They may work with xlsx and xls files – check this.)
- RExcelXML. This package (from www.omegahat.org/RExcelXML and the repository www.omegahat.org/R) can read .xlsx files directly and provides high- and low-level functions for accessing the cells, sheets and workbooks.
- The xlsReadWrite package which reads and writes Excel files directly on Windows. xlsReadWrite works on the .xls file without using ActiveX, ODBC, Perl, or Excel. Only works with xls files. The following example shows the use of xlsReadWrite.
- The XLConnect package which writes xls/xlsx workbooks on all platforms.
Usage example
library( xlsReadWrite ) ### create some test^H^H^H^Hbikedata tdat <- data.frame( Price = c( 6399, 3699, 2499 ), Amount = c( 2, 3, 1 ), Date = c( 39202, 39198, 39199 ), row.names = c( "Pro machine", "Road racer", "Streetfire" ) ) ### write write.xls( tdat, "bikes.xls" ) ### read and check # read as data.frame bikes1 <- read.xls( file = "bikes.xls" ) if (!identical( tdat, bikes1 )) stop( "oops, not good..." ) # read as data.frame (custom colnames, date as iso-string) bikes2 <- read.xls( file = "bikes.xls", colNames = c( "", "CHF", "Number", "Date" ), from = 2, colClasses = c( "numeric", "numeric", "isodate" ) ) if (!all( tdat$Date == isoStrToDateTime( bikes2$Date ) )) stop( "oops, not good..." ) # read as matrix bikes3 <- read.xls( file = "bikes.xls", type = "double" ) if (!identical( as.matrix( tdat ), bikes3 )) stop( "oops, not good..." )
Remarks
xlsReadWrite has some non-standard aspects, hence consider the following remarks:
- Our own code is free (GPLv2), but xlsReadWrite contains 3rd party code which we may only distribute in binary form. If you want to compile the package for yourself you need a license for that code.
- In the help files we mention a more feature rich pro version (online help, brochure). It is a rewrite and being a small company we decided to ask people to support our effort if more advanced features are wanted/needed. This said, the free version works just fine (see testimonials).
- The low level code has been written in Pascal (Delphi).
Caution
xlsReadWrite has the same problems reading columns of mixed-data as mentioned in the "Caution" section above. Type guessing for data.frame variables works like this: max. 16 rows will be considered and the first non-empty cell value will determine the type. Example: a numeric value in the 1st row determines the type (numeric). Now a string value in the 2nd row which cannot be converted to a number will be given back as a NA.
Solution: specify a colClasses argument and explicitly decide if you want numbers or characters. [In the pro version you can also read (an excerpt of) a single column and check the needed type for yourself. Note: the above example would work well with the pro version as the guessing algorithm considers all 16 rows (but it would fail also if the character value were on row 17 or more...)].
Download/Updates
xlsReadWrite is available on CRAN or from our website. Minor updates will only be uploaded to our website.
— Hans-Peter Suter 2007/04/30 23:33
No comments:
Post a Comment