What are the basics of R programming?


   
R is an open-source, high-level programming language and software environment for statistical computing and data analysis. As a domain-specific language, it is well suited to data modeling, graphical representation, and reporting, so statisticians and data miners can manipulate and present data in a compelling way. To start programming, you need to install R on your system (R is an interpreted language, so what you install is an interpreter and console rather than a compiler). To install R, go to the R Project website and choose your preferred mirror. Then you can install the second component, RStudio, which runs on top of R. RStudio is a GUI (Graphical User Interface) that makes working with R more comfortable. It lets you carry out basic tasks through a point-and-click interface without code, such as opening or saving a script, importing and exporting .csv files, managing packages, and using the help features. You can also install R Commander, a GUI that focuses on statistics and graphics, with the command install.packages('Rcmdr').

RStudio has the script section in the upper left and the console section in the lower left, where the actual calculation takes place. Code is sent from the script section to the console. You can write code directly in the console, but it is harder to edit and correct there. The upper-right window has the environment section, which lists all the objects created in the session. In the lower right, the files section helps you store files or import them into RStudio. The plots section displays the graphs you create. The packages tab lists all the packages installed on your machine; a set of standard packages is supplied with the R distribution. The help tab gives you the user manuals and the documentation for each function, which will save you plenty of time. The viewer is unlikely to be used at the beginner level. In short, R is an environment within which statistical techniques are implemented and extended by packages.

Setting up a Script: Click the + sign in RStudio to create a script. The .R extension tells you that the file is an R script. With the script created, coding starts with creating objects. Create an object called "myFirstObject" as a vector holding the numbers 5 through 10. There are 3 important things to observe. First, the code is sent from the script window to the console window, where the calculation takes place. Second, the new object is not shown in the console: there is no result and no output. Third, the new object has been created in the environment, so R now recognizes an object called myFirstObject. The colon (:) indicates a series of numbers, so 5:10 means 5, 6, 7, 8, 9, 10. If you type just myFirstObject, you will get the output in the console. This is one way to display the data. A more convenient way is to look in the environment, where you can see a simple vector of length six. You can also plot those 6 integers against their index, which is the position from 1 to 6 in the vector.
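
The steps described above can be sketched in a few lines. The text does not show the exact code, so the assignment operator <- used here is an assumption (it is the conventional choice in R); the object name comes from the text.

```r
# Create a vector holding the integers 5 through 10.
# The colon operator builds the sequence; nothing is printed by an assignment.
myFirstObject <- 5:10

# Typing the object name on its own displays the values in the console.
myFirstObject
#> [1]  5  6  7  8  9 10

# The vector has six elements, matching what the environment pane shows.
length(myFirstObject)

# Plot the six values against their index positions 1 to 6.
plot(myFirstObject)
```

Running the lines one at a time from the script window makes it easy to see which of them produce console output and which only change the environment.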

Key Features of R: R has a large number of functions and object classes already defined, which keeps data analysis straightforward. Your task is to find the proper function and supply the correct specification to it in the form of arguments. For example, plotting a histogram is as simple as the command 'hist(x)'. Your responsibility therefore shifts from building a function from scratch to choosing a prepackaged function that fits your needs. R keeps growing in popularity and adapts quickly to new developments. It performs the following tasks:
    * Data entry, including direct data entry
    * Data preprocessing such as cleaning, changing, deleting, or filtering data
    * Statistical analysis, including modeling, machine learning, and prediction
    * Data simulations of varying complexity
    * Data visualizations, up to complex graphs
    * Web scraping, for example for Twitter analytics
    * Data visualization frameworks for website integration
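
As a small illustration of the prepackaged-function idea, the sketch below calls hist(x) on simulated data. The sample itself (200 normally distributed values with mean 50 and standard deviation 10) is a hypothetical stand-in and is not taken from the text.

```r
# Simulate a hypothetical sample; set.seed() makes the draw reproducible.
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

# One call draws the histogram; the bin details are returned invisibly.
h <- hist(x)

# The bin counts add up to the number of observations.
sum(h$counts)

# Arguments tweak the result without any custom plotting code.
hist(x, breaks = 20, main = "Simulated sample", xlab = "Value")
```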

R is a community-based project, and base R is extended by thousands of add-on packages contributed by its ever-growing community. Packages are easy to manage in the packages tab. The system library comes with the installation and forms the base of R. User libraries are downloaded from the web. Using one is a two-step process: you download the package from a repository, and then you activate it. Once you install the required package, it is available in the user library, and you just tick it in the packages tab to activate it. R is widely regarded as comparable, and in many respects superior, to proprietary statistics packages.
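
The same two steps can be carried out from the console instead of the packages tab. In this sketch the install line is commented out because it needs an internet connection, and the stats package is used for the activation step only because it ships with every R installation.

```r
# Step 1 (once per machine): download a package from a repository.
# install.packages("Rcmdr")

# Step 2 (once per session): activate an installed package.
library(stats)

# search() lists what is currently attached; activated packages
# appear with the prefix "package:".
search()
"package:stats" %in% search()

# Names of all packages installed in the system and user libraries.
head(rownames(installed.packages()))
```

Ticking a package in the packages tab runs library() for you, so both routes end in the same state.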

Coding with R: For coding, you need to understand objects in R. An object is a collection of data. It can be a whole dataset loaded into R, the result of a calculation, or a part of a dataset with specific traits. Objects can have different properties and different classes, and the class of an object determines what you can and cannot do with it. The classic object class is 'data.frame'. It is similar to an Excel sheet, with columns for variables and rows for observations, and many functions in R work with data frames. The simplest object class in R is a vector, which is a collection of values of the same class. The stock prices for 20 days in a single Excel column are comparable to a vector. If you add the dates in another column, you get something comparable to the time series class, ts. The following commands are used frequently in R:
    * The ls() command lists all the objects created in the session.
    * You can remove an object from the environment with the command rm("objectname").
    * The predefined seq() function generates a sequence of values from its arguments. For example, seq(from=3, length=3, by=3) generates the values [1] 3 6 9.
    * The paste() function concatenates strings. Anything you feed into the function is turned into a vector of characters. For example, paste("XYZ", 1:2) produces "XYZ 1" and "XYZ 2".
    * Each value in a vector can be identified by its index position, which is its observation number. Take the integer vector x = 4:20, which has 17 index positions. The command which(x == 12) returns the index position of the value 12, which is 9. The value at a given index position is obtained with square brackets, for example x[9].
    * Functions in R: R treats functions as objects, so whenever you create a function, it appears in the environment. Functions carry out calculations for you. A basic function is defined by assigning function(arguments) { body } to a name; calling it with suitable arguments returns the computed value, for example an output of 100.
    * Loops in R: Loops allow operations to be repeated. R has for and while loops: a for loop repeats an operation once for each element of a vector, and a while loop repeats it as long as a condition holds. The syntax of the for loop is "for(name in vector) {commands}".
    * Datasets: Various datasets come with base R or with add-on packages, and they allow you to try out new features of R. You can look at the package "datasets" in RStudio, which gives an alphabetical list of the available data sets. Before you use one of the datasets, check its help page, which describes the variables and the dimensions. The following commands can be applied to datasets:
        - head(dataset) and tail(dataset) give the first and last six rows of observations.
        - summary(dataset) provides basic statistics such as minimum, maximum, median, mean, and quartiles for each variable.
        - plot(dataset) draws scatter plots for the variables in the dataset. A histogram is useful for a single variable: hist(dataset$variable) gives an idea of how the values of that variable are distributed. Visual impressions are a valuable source of insight into your data.
    * Data Frames: You can see the variables in a data frame with the command "head(dataset)". To extract a single column, use the $ sign, which tells R to take column X from data frame Y. For example, if you want the sum of a column in the mtcars dataset, you specify it as sum(mtcars$wt). If you work with a single data frame, you can instead attach the dataset to your environment so that R knows which data frame the variables belong to. The "attach(dataset)" command attaches the dataset; afterwards you state the variable without the $ sign, as in "sum(wt)". You can remove the dataset from the environment again with the command "detach(dataset)". A specific value in the dataset is addressed by the position of the row and the position of the variable, as in "mtcars[3,6]". The output of the head command tells you the position of each variable. The concatenate function c() lets you pass a vector of rows, as in "mtcars[c(2,3,4),6]", which returns the values at index positions 2, 3, and 4 of variable number 6.
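
The commands from the list above can be tried together in one short session. This is a sketch rather than the author's original code: the function name square and the variable total are hypothetical, and seq() is written with the full argument name length.out where the text abbreviates it to length.

```r
# Sequences and strings
seq(from = 3, length.out = 3, by = 3)   # 3 6 9
paste("XYZ", 1:2)                       # "XYZ 1" "XYZ 2"

# Index positions in a vector
x <- 4:20
length(x)            # 17 index positions
pos <- which(x == 12)
pos                  # 9
x[pos]               # 12

# A function is an object too (hypothetical example)
square <- function(n) {
  n^2
}
square(10)           # 100

# A for loop that adds up the numbers 1 to 4
total <- 0
for (i in 1:4) {
  total <- total + i
}
total                # 10

# List the objects created so far, then remove one of them
ls()
rm("total")

# Built-in dataset: first rows, last rows, summary statistics
head(mtcars)
tail(mtcars)
summary(mtcars)

# Column access with $, and by row and column position
sum(mtcars$wt)
mtcars[3, 6]            # row 3, variable 6 (wt)
mtcars[c(2, 3, 4), 6]   # rows 2 to 4 of variable 6

# attach() makes the columns visible by name until detach()
attach(mtcars)
sum(wt)
detach(mtcars)
```

Many R users prefer the explicit mtcars$wt form over attach(), because attached columns can be confused with other objects of the same name.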

Data Visualization in R: R has an extensive collection of plotting functions and add-on packages. Standard plots such as the histogram and the scatterplot each come with more than 10 functions and arguments that you can use to tweak the plot. Starting R programming with plotting is a good route for visually oriented people. There are 3 different systems for producing graphs in R:
  1. R-Base
  2. Lattice
  3. ggplot2

R-Base is the default way of visualizing data. It has several functions for different plot types, such as plot, hist, barplot, and boxplot. These give a quick idea of your data, but they are not primarily made for polished graphs. Lattice is appropriate for scientific publications. Its syntax is similar to R-Base, but it has different characteristics: it can produce plot matrices that place several plots on one page for comparison. ggplot2 is a widely used project by Hadley Wickham. It is a more advanced tool to code with, and it can create all the standard plot types.
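
To compare the three systems, the sketch below draws the same scatterplot of mtcars (weight against miles per gallon) with each of them. The choice of dataset and variables is an assumption made for illustration. The lattice and ggplot2 parts are guarded with requireNamespace() so the script still runs if either package is not installed.

```r
# 1. R-Base: one function call, arguments tweak the plot.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# 2. Lattice: formula syntax; "| factor(cyl)" draws one panel
#    per cylinder count on the same page.
if (requireNamespace("lattice", quietly = TRUE)) {
  library(lattice)
  print(xyplot(mpg ~ wt | factor(cyl), data = mtcars))
}

# 3. ggplot2: the plot is built up in layers.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()
  print(p)
}
```

Lattice and ggplot2 plots are objects, so inside a block such as if () they have to be printed explicitly to be drawn.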
