How does data mining work with R?

R data mining

Data mining with the statistics program R is particularly suitable for evaluating large amounts of data. It raises the following questions:

  • How do you load data into R?
  • How do you get an overview of extensive data in R?
  • How do you merge data in R?
  • How does R support data preprocessing, e.g. data cleansing or big data analytics?
  • How do the statistics functions scale in R data mining?

Our statisticians will be happy to help you if you need support in using R in the context of data mining. Simply use our contact form for a free consultation & a non-binding offer - or give us a call.

How do you load data into R?

The data for data mining with R can come from many kinds of sources: files on your computer, a database, or the Internet. To document the original data, it makes sense to first download a file from the Internet to your computer and only then read it into R, e.g. into a matrix or a data frame.

The download is done with the function download.file(url = "http://address.com/file.txt", destfile = "file.txt") for a file, or through a connection con. The data is then read from the file into a variable in R. This is done with a different function depending on the format of the file.

  • Text file: x
  • Zipped gz file: dat
  • Table: read.table("file.txt")
  • HTML file: htmlTreeParse("http://…", useInternal = TRUE) or readLines()
  • CSV file: read.csv("file.csv")
  • Xlsx file: read.xlsx("name.xlsx", sheetIndex = 1, colIndex =, rowIndex =) (the xlsx package is required for this)
  • XML file: xmlTreeParse("http://address.com/file.xml", useInternal = TRUE) (with the XML package)

These functions accept a large number of arguments, which are best looked up in the corresponding help text, e.g. with ?read.table. For example, sep = "," defines the comma as the separator between values, colClasses defines the data types of the columns, nrows and row.names set the number and names of the rows, na.strings specifies the character string that signals missing values, and skip specifies the number of lines to be skipped at the beginning of the file.
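
The arguments mentioned above can be sketched with a small self-contained example. The file contents, the column names and the "n/a" marker for missing values are made up for illustration; the file is written to a temporary location first so that the call can be run as is.

```r
# Write a small sample file (contents are invented for this sketch).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "exported 2020-01-01",     # a leading line that should be skipped
  "id,temperature,label",    # header line
  "1,-10,cold",
  "2,n/a,unknown",           # "n/a" marks a missing value
  "3,10,mild",
  "4,20,warm"
), path)

dat <- read.table(
  path,
  header     = TRUE,                                  # column names come from the file
  sep        = ",",                                   # comma as separator
  skip       = 1,                                     # skip the first line of the file
  colClasses = c("integer", "numeric", "character"),  # data type of each column
  na.strings = "n/a",                                 # string that signals a missing value
  nrows      = 3                                      # read at most three data rows
)

print(dat)
```

Only the first three data rows are read, and the "n/a" entry arrives in R as NA.
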
Basic knowledge of SQL is required to access a database. First, connect to the database and keep the connection in a variable, here called database. When you are finished, it is essential to close the connection with dbDisconnect(database).
While you have access to the database, you can execute SQL commands on it to read out data, but also to change it: dbGetQuery(database, "select count(*) from table") or dbSendQuery(database, "select * from … where variable between 1 and 3").
You can list the tables with tables <- dbListTables(database) and read a table with dbReadTable(database, "table name"). More information on using SQL in R can be found here.
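
As an illustration of this workflow, here is a minimal sketch. It assumes that the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real database server; the table name measurements and its contents are hypothetical. The guard at the top simply skips the example if the packages are missing.

```r
# Assumption: DBI and RSQLite are installed. An in-memory SQLite database
# stands in for a real database; table name and values are made up.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  database <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Create a small hypothetical table to query
  DBI::dbWriteTable(database, "measurements",
                    data.frame(id = 1:5, variable = c(0, 1, 2, 3, 4)))

  tables   <- DBI::dbListTables(database)   # names of all tables
  n_rows   <- DBI::dbGetQuery(database,
                              "select count(*) as n from measurements")
  selected <- DBI::dbGetQuery(database,
                              "select * from measurements where variable between 1 and 3")
  whole    <- DBI::dbReadTable(database, "measurements")   # the complete table

  # Always close the connection when finished
  DBI::dbDisconnect(database)

  print(tables)
  print(selected)
}
```
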

How do you get an overview of extensive data in R data mining?

If your data is available as a table or data frame, you can get a quick overview with the following commands:

  • The command dim(t) shows you the number of rows and columns in a table t, length(x) the length of a vector x. You can use these commands to quickly check whether rows or columns have been lost when reading or processing data.
  • The head(t) command displays the column headings and the first rows of table t (six by default) so that you can preview the data. str(t) shows the structure of the table: the name and data type of each column together with its first few values. names(t) only shows the column headings.
  • summary(t) calculates the minimum and maximum values, the mean, the median and the first and third quartile for each numeric column of a table t.
  • table(x) determines the frequencies of the values of a vector x.
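
These overview commands can be tried on a small made-up data frame. In this sketch the table is called tbl rather than t, purely to avoid masking R's built-in transpose function t(); the column names and values are invented.

```r
# A small invented data frame to explore
tbl <- data.frame(
  temperature = c(-10, 0, 10, 20, 30, -10, 0, 10),
  group       = c("a", "b", "a", "b", "a", "b", "a", "b")
)

dim(tbl)               # number of rows and columns
length(tbl$group)      # length of a single vector (column)
head(tbl)              # column headings and the first six rows
str(tbl)               # name, type and first values of each column
names(tbl)             # column headings only
summary(tbl)           # min, quartiles, median, mean, max for numeric columns

counts <- table(tbl$group)   # frequency of each value
print(counts)
```
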

A quick plot (graphic representation) also provides an overview of the data. Two standard R functions for this are plot(data, type) and text(150, 600, "sample text"); the latter writes a label at the given coordinates of an existing plot. The input for plot() is either the data as a data frame or x and y as separate arguments, each a numeric vector. The type argument can be "p" (points), "l" (lines), "b" (both) or "h" (histogram-like vertical lines; a true histogram is drawn with hist()). Colors can be specified with the argument col = "red".
The base graphics package and the ggplot2 package offer additional graphics functions.
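
A minimal sketch of such a quick plot follows. The values and the label position are invented, and the plot is written to a temporary PDF file (an assumption made here so that the example also runs outside an interactive session); in an interactive session the pdf() and dev.off() lines can simply be left out.

```r
# Invented example values
x <- c(-10, 0, 10, 20, 30)
y <- c(14.1, 15.6, 17.8, 19.1, 20.5)

out <- tempfile(fileext = ".pdf")
pdf(out)                                   # open a file-based graphics device

plot(x, y, type = "b", col = "red",        # points connected by lines, in red
     xlab = "temperature", ylab = "a")
text(0, 19, "sample text")                 # label at x = 0, y = 19

dev.off()                                  # close the device and write the file
```
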

Using R correctly in the context of data mining: How do you bring data together?

Data that is available as two matrices x and y is combined into a single matrix with the command rbind(x, y) or cbind(x, y). The r stands for rows and the c for columns. The function rbind is used when the two tables have the same columns and their rows are to be written one below the other; cbind is used when the two tables have the same rows (e.g. belonging to the same data record from a measurement) and their columns are to be placed side by side. For example: there were two measurement rounds in which 3 different physical quantities a, b and c were measured at 5 different temperatures. The temperatures used are in a matrix temp, the data of the two measurement rounds in the matrices data1 and data2. The matrix temp looks like this, for example:

  temperature
1         -10
2           0
3          10
4          20
5          30

The matrices data1 and data2 have this form:

       a    b     c
1  14.1  8.9  17.0
2  15.6  9.0  16.4
3  17.8  9.1  16.0
4  19.1  9.2  14.8
5  20.5  9.3  10.6

Now you would first add the "temperature" column to both matrices, e.g. with data1 <- cbind(temp, data1) and data2 <- cbind(temp, data2):

  temperature     a    b     c
1         -10  14.1  8.9  17.0
2           0  15.6  9.0  16.4
3          10  17.8  9.1  16.0
4          20  19.1  9.2  14.8
5          30  20.5  9.3  10.6

Then you append the two matrices to each other, one below the other, with rbind(data1, data2):

    temperature     a    b     c
 1          -10  14.1  8.9  17.0
 2            0  15.6  9.0  16.4
 3           10  17.8  9.1  16.0
 4           20  19.1  9.2  14.8
 5           30  20.5  9.3  10.6
 6          -10  14.0  8.9  16.5
 7            0  15.6  9.0  15.8
 8           10  17.6  9.1  15.2
 9           20  19.0  9.2  14.1
10           30  20.3  9.3   9.7

Of course, the same also works with significantly larger amounts of data. To be sure that the merge worked, you should check the dimensions of the matrices after each step with the dim() function. In the example, data1 and data2 initially had the dimensions 5 × 3, after adding the temperature column 5 × 4, and the combined matrix has the dimensions 10 × 4. With the head() function you can check the first rows of the table (six by default); with the View() function, which is part of base R, the matrix opens in a separate window and can be scrolled there. Especially with large amounts of data, it is not advisable to display the matrix in the console by simply entering data1: the display is confusing and cuts off part of the data.
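
The example can be reproduced in a few lines. The sketch below builds the three matrices from the values shown in the tables and checks the dimensions after each step; the name combined for the result is chosen here for illustration.

```r
# matrix() fills column by column, so the values are listed per column.
temp <- matrix(c(-10, 0, 10, 20, 30), ncol = 1,
               dimnames = list(NULL, "temperature"))

data1 <- matrix(c(14.1, 15.6, 17.8, 19.1, 20.5,    # a
                  8.9,  9.0,  9.1,  9.2,  9.3,     # b
                  17.0, 16.4, 16.0, 14.8, 10.6),   # c
                ncol = 3, dimnames = list(NULL, c("a", "b", "c")))

data2 <- matrix(c(14.0, 15.6, 17.6, 19.0, 20.3,    # a
                  8.9,  9.0,  9.1,  9.2,  9.3,     # b
                  16.5, 15.8, 15.2, 14.1, 9.7),    # c
                ncol = 3, dimnames = list(NULL, c("a", "b", "c")))

dim(data1)                       # 5 rows, 3 columns

data1 <- cbind(temp, data1)      # add the temperature column
data2 <- cbind(temp, data2)
dim(data1)                       # 5 rows, 4 columns

combined <- rbind(data1, data2)  # stack the two measurement rounds
dim(combined)                    # 10 rows, 4 columns
head(combined)                   # preview of the first rows
```
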
The merge function is another way of merging data records. It can also merge data if row n of one table does not belong to the same data record as row n of the other table. One or more key columns are used for the merge, on the basis of which data records that belong together are recognized:
total <- merge(data1, data2, by = "ID")
total <- merge(data1, data2, by = c("ID", "Country"))
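
A minimal sketch with two invented data frames shows how the key columns work. The columns x and y and all values are made up; note that the rows of the second table are deliberately in a different order and do not all have a partner in the first table.

```r
# Two invented tables that share the key columns ID and Country
data1 <- data.frame(ID = c(1, 2, 3),
                    Country = c("DE", "FR", "IT"),
                    x = c(10, 20, 30))
data2 <- data.frame(ID = c(3, 1, 4),
                    Country = c("IT", "DE", "ES"),
                    y = c("c", "a", "d"))

# One key: rows are matched on ID only. By default only IDs present in
# both tables are kept, and the non-key column Country appears twice
# (as Country.x and Country.y).
total <- merge(data1, data2, by = "ID")
print(total)

# Two keys: rows must agree in both ID and Country.
total2 <- merge(data1, data2, by = c("ID", "Country"))
print(total2)
```

Rows without a partner can be kept as well by passing all = TRUE (or all.x / all.y) to merge().
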
With large amounts of data, data tables are often used instead of data frames. They are similar, but are processed faster. The data.table package is required for this.
The read functions have corresponding write functions such as write.table(), with which the data generated in R can be written back to a file and thus saved.
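
Writing and re-reading a table can be sketched as follows. The data frame result and the tab-separated format are assumptions chosen for illustration; the file goes to a temporary location.

```r
# An invented result table
result <- data.frame(temperature = c(-10, 0, 10),
                     a = c(14.1, 15.6, 17.8))

path <- tempfile(fileext = ".txt")

# Write without row names and without quotes, separated by tabs
write.table(result, file = path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read the file back in to check that nothing was lost
restored <- read.table(path, header = TRUE, sep = "\t")
dim(restored)
print(restored)
```
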

How does R support data preprocessing, e.g. data cleansing?

An important prerequisite for data mining is clean data, which is usually available as a table. Data cleansing ensures that every row corresponds to one data record and every column to one variable. How missing data is handled must also be clarified. The later evaluations will probably not need all of the data, but only part of it, e.g. only certain rows (data records) or certain columns (variables) of the table.
With colnames(m) <- c("var1", "var2", ...) you can assign column headings to your data table. Rows and columns of the table can be extracted using their number, e.g. the n-th row with table[n, ] and the m-th column with table[, m]. If the columns of a data frame have names, you can also select a column with table$name. More complex selections are also possible, for example by defining conditions: m[m$var1 %in% c("a", "b"), ] selects those rows of a data frame m for which var1 is either "a" or "b". With m[, var1 == 15] you select those columns for which var1 is equal to 15; here var1 must be a vector with one entry per column. And m[which(m$var2 > 8), ] extracts all rows for which var2 is greater than 8; which(m$var2 > 8) is a vector with the corresponding row numbers.
For data preprocessing in particular, there are some packages that extend R with additional functions. These include the packages plyr, dplyr and tidyr. You can also find more R functions here.
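
The base R selection methods described in this section can be sketched on a small invented data frame. All values are made up, and the helper vector is_num is introduced here only to illustrate selecting columns with a logical vector.

```r
# An invented table, first without meaningful column names
m <- data.frame(c("a", "b", "c", "a"),
                c(5, 9, 12, 8),
                c(1.5, 2.5, 3.5, 4.5))
colnames(m) <- c("var1", "var2", "var3")   # assign column headings

m[2, ]        # the second row
m[, 3]        # the third column
m$var2        # a column selected by name

# Rows for which var1 is either "a" or "b"
ab_rows <- m[m$var1 %in% c("a", "b"), ]

# Row numbers and rows for which var2 is greater than 8
idx  <- which(m$var2 > 8)
big  <- m[idx, ]

# Columns selected with a logical vector (one entry per column)
is_num  <- sapply(m, is.numeric)
numeric_part <- m[, is_num]

print(ab_rows)
print(big)
print(numeric_part)
```
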

How do the statistics functions scale in data mining with R?

After these steps have been carried out, the statistical evaluations follow. The statistical functions of R are described here. At this point we are interested in how well these functions scale for data mining with R.
In principle, R can also process large amounts of data. However, data mining often involves such large amounts of data that the computing times on a single computer eventually become too long. Because R can be extended through packages, there are solutions for this as well. For example, HP has developed an open source package called HP Distributed R, with which an R program can run on several servers in parallel. In this way, evaluations are possible that would overwhelm a single computer. The Programming with Big Data in R (pbdR) project is developing further packages for the efficient processing of big data. R is also supported as a programming language by Apache Spark™, an open source framework for processing big data on computer clusters.
