How to Use R Software for Data Analysis

I get it: using R for data analysis can be intimidating. All that code, vector and matrix manipulation, graphing, and so on can make it hard to know where to start if you’ve never done it before. I went through the same thing when I started working as an R programmer! Plus, installing a package or calling a function that doesn’t do what you expect can be frustrating.

Do you want to learn how to use R software for data analysis, or about data analysis using R projects? Check out our latest whitepaper, which discusses an approach that lets end users turn R into a powerful tool for processing large data sets. It’s free and filled with valuable information about the future of data analysis.

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
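
As a rough illustration of a few of the techniques listed above, the following sketch fits a linear model, runs a classical test, and clusters observations. It uses only functions and data sets that ship with R (cars, sleep, iris); the choice of data sets and of three clusters is an assumption made for the example, not something prescribed by R.

```r
# Linear modelling: stopping distance as a function of speed ('cars' data).
fit <- lm(dist ~ speed, data = cars)
coef(fit)                      # intercept and slope

# Classical statistical test: Welch two-sample t-test on the 'sleep' data.
tt <- t.test(extra ~ group, data = sleep)
tt$p.value

# Clustering: k-means on the four numeric columns of 'iris'.
set.seed(1)                    # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster, iris$Species)
```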

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
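
A minimal sketch of a plot annotated with mathematical notation is shown here. It relies on base graphics and the expression() mechanism; writing to a temporary PDF file is an assumption made so the example also runs without a display.

```r
# Write the plot to a temporary PDF so the example works without a screen.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 6, height = 4)

x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l",
     xlab = expression(x),
     ylab = expression(phi(x)),
     main = expression(paste("Standard normal density: ",
                             phi(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2})))

dev.off()   # close the device so the file is written to disk
```

Every default used here (line type, margins, axis ticks) can be overridden through arguments to plot() or through par().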

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
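
Several of the facilities in the list above can be sketched in a few lines: matrix operators, a conditional, a loop, and a user-defined recursive function. The name factorial_rec is hypothetical and chosen only for this illustration.

```r
# Matrix operators: element-wise (*) versus matrix product (%*%).
A <- matrix(1:4, nrow = 2)   # filled column-wise: rows are (1, 3) and (2, 4)
B <- diag(2)                 # 2 x 2 identity matrix
A * A                        # element-wise squares
A %*% B                      # matrix product; multiplying by the identity gives A
t(A)                         # transpose

# A user-defined recursive function with a conditional.
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)
}

# A loop that collects results into a vector.
results <- numeric(5)
for (i in 1:5) {
  results[i] <- factorial_rec(i)
}
results   # 1 2 6 24 120
```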

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.
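
As a small sketch of both points, the code below adds a new function and then prints the source of a built-in one. The name coef_var is hypothetical; sd() and mean() are standard R functions.

```r
# Extend R by defining a new function: the coefficient of variation.
coef_var <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv <- coef_var(c(2, 4, 4, 4, 5, 5, 7, 9))
cv

# Typing a function's name without parentheses prints its definition,
# which for many functions is ordinary R code that can be read directly.
sd
```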

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.
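
The following sketch shows one way to see which packages came with the R installation and how a package is attached. The install.packages() call is left commented out because it needs network access; the package name used there is only an example taken from later in this article.

```r
# Packages that ship with R itself and carry priority "base".
base_pkgs <- rownames(installed.packages(priority = "base"))
base_pkgs

# Attach an installed package for the current session.
library(stats)

# Install a contributed package from CRAN (requires an internet connection).
# install.packages("Rcmdr")
```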

R has its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both on-line in a number of formats and in hardcopy.

R’s popularity

R is now one of the most popular analytics tools in the industry. By some measures it has surpassed SAS in usage, and many companies that could easily afford SAS now choose R as their data analytics tool. Over the years, R has become a lot more robust: it handles large data sets much better than it did even a decade ago. It has also become a lot more versatile.

Between April 2015 and April 2016, 1,800 new packages were introduced in R. The total number of R packages is now over 8,000. There are some concerns about the sheer number of packages, but this growth has certainly added a lot to R’s capabilities. R also integrates very well with many Big Data platforms, which has contributed to its success.

What file types are typically associated with R?

  • *.r – An R script.
  • *.rmd – An R Markdown file.
  • *.rnw – An R Sweave file.
  • *.rds – A file containing a single R object; can be created using saveRDS(), and loaded using readRDS().
  • *.rdata – A file containing one or more R objects or workspaces; can be created using save(), and loaded using load().
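
The difference between the last two file types can be sketched as follows. The file names and the objects a and b are assumptions made for this example; the files are written to a temporary directory.

```r
tmp_dir    <- tempdir()
rds_path   <- file.path(tmp_dir, "cars.rds")
rdata_path <- file.path(tmp_dir, "objects.RData")

# .rds: exactly one object; the name is chosen when reading it back.
saveRDS(cars, rds_path)
cars_copy <- readRDS(rds_path)

# .RData: one or more named objects; load() restores them under
# the names they had when they were saved.
a <- 1:3
b <- "hello"
save(a, b, file = rdata_path)
rm(a, b)
loaded <- load(rdata_path)   # returns the names of the restored objects
loaded
```

Because readRDS() returns the object rather than restoring it under a fixed name, it cannot silently overwrite an existing variable, which is one reason some users prefer it for single objects.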

What is R? How do I use it?

R is an alternative to traditional statistical packages such as SPSS, SAS, and Stata. It is an extensible, open-source language and computing environment for Windows, Macintosh, UNIX, and Linux platforms. Users may freely distribute, study, change, and improve the software under the Free Software Foundation’s GNU General Public License. R is a free implementation of the S programming language, which was originally created and distributed by Bell Labs, and most code written in S will run successfully in the R environment. R performs a wide variety of basic to advanced statistical and graphical techniques at no cost to the user. These advantages over other statistical software encourage the growing use of R in cutting-edge social science research.

Can I use R without having to learn the details of the R language?

Yes, at least for the basics. A number of “front ends” have been constructed to make it easier for users to interact with the R statistical computing environment. For example, a graphical user interface (or “GUI”) allows the analyst to carry out data analysis tasks by selecting items from menus and lists, rather than entering commands.

One such GUI is the R Commander, written by John Fox. The R Commander is accessed by installing and loading the Rcmdr package within R. The R Commander provides an easy-to-use, menu-based system for loading data into R, manipulating data values, performing statistical analyses, creating graphical displays, and carrying out diagnostic tests on statistical models. Documentation for the R Commander is available on John Fox’s website and in the following paper:

Fox, John. 2005. “The R Commander: A Basic-Statistics Graphical User Interface to R.” Journal of Statistical Software 14(9).

There are several other GUI systems, in addition to the R Commander, for interacting with R.

The advantage provided by the R Commander or another GUI is that the user does not need to learn a language in order to carry out his or her analysis. Instead, each step is taken by making one or more selections from a menu of available options. The disadvantage of interacting with the R environment through a GUI is that the course of the analysis is limited to those actions that have been programmed into the GUI. Thus, one could argue that using a GUI removes much of the flexibility that is inherent in the R environment.

In order to overcome the preceding limitation, the R Commander and most other GUIs allow the user to employ both methods of interacting with the environment within a single R session. For example, one could invoke the R Commander, and use its GUI to read the contents of an external file and create an R data frame. For many types of analyses, other features of the R Commander could be used to estimate model parameters, construct graphical displays, and so on. But, if the user wanted to carry out a task that is not available in the R Commander (e.g., a multidimensional scaling analysis), then the data frame created in the GUI could still be treated like any other currently defined R object (say as an argument to a function or the target of an assignment) on the R command line. In this manner, a user could exploit the advantages of both the GUI and the command-line interface.
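
As a sketch of that last step, suppose a data frame is already defined in the session. Here a subset of the built-in mtcars data stands in for a data frame created through the GUI, and the name df is hypothetical. A classical multidimensional scaling analysis can then be run from the command line with functions from the stats package:

```r
# Stand-in for a data frame that was read in through the GUI.
df <- mtcars[, c("mpg", "hp", "wt")]

# Classical (metric) multidimensional scaling on standardized variables.
d   <- dist(scale(df))       # Euclidean distances between the rows
mds <- cmdscale(d, k = 2)    # coordinates in two dimensions

head(mds)
```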

Where can I obtain R?

Installation files for Windows, Mac, and Linux can be found at the website for the Comprehensive R Archive Network, http://cran.r-project.org/. The site also contains documentation for downloading and installing the software on different operating systems. There is no cost for downloading and using R.

Where can I find more information on R?

Books

Braun, W. and Murdoch, D. (2007). A First Course in Statistical Programming with R. Cambridge, UK: Cambridge University Press.

Chambers, J. M. (1998). Programming with Data: A Guide to the S Language. Murray Hill, NJ: Bell Laboratories.

Dalgaard, P. (2008). Introductory Statistics with R (2nd edition). New York: Springer.

Everitt, B., and Hothorn, T. (2006). A Handbook of Statistical Analyses Using R. Boca Raton, FL: Chapman & Hall/CRC.

Faraway, J. J. (2005). Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC.

Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Boca Raton, FL: Chapman & Hall/CRC.

Fox, J. (2002). An R and S-Plus Companion to Applied Regression. Thousand Oaks, CA: Sage Publications.

Muenchen, R. A. (2009). R for SAS and SPSS Users. Springer Series in Statistics and Computing. New York: Springer.

Murrell, P. (2005). R Graphics. Boca Raton, FL: Chapman & Hall/CRC.

Pinheiro, J. C. and Bates, D. M. (2004). Mixed Effects Models in S and S-Plus. New York: Springer.

Spector, P. (2008). Data Manipulation with R. New York: Springer.

Venables, W. N., and Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth Edition. New York: Springer.

Zuur, A. F., Ieno, E. N., and Meesters, E. H. W. G. (2009). A Beginner’s Guide to R. Use R! New York: Springer.

Conclusion:

If you want to learn about data analysis using R, or how to use R software for data analysis, the resources listed above are a good place to start. R is an open source programming language for statistical and graphical analysis that is supported by the R Foundation for Statistical Computing.
