Geo-Spatial Analytics with R

I got couple of questions about what is R. Well, R is freely available language and environment for statistical computing and graphics. R provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering. R is a free software. Here the word free is used in the same sense as freedom. R can be developed, modified and distributed freely, provided one abides by the GNU General Public License. You may consult the R project homepage for further information.

Being Open source and free, lot of interesting development has happened in R, that help us to perform data analysis quickly. R has many excellent library that can put most costly commercial products to shame. R is powerful, though not very easy to learn. There are too many choices in R and one need to devote considerable time to understand what works best for a particular situation.

In this blog, we will see how to read data from a RDBMS – MySQL. Perform some amount of data manipulation to generate reports. We will also present data in Geo-spatial reports in addition to  standard charts.

For our tutorial, we use month-wise sales data at dealer level. We also have address of dealers in our database that we will leverage to generate sales data at district level. We have this data in a relational database – MySQL. R has library to extract data from most standard data bases. Here to access this data we need to first connect R with the database. This require RMySQL package, which is currently not installed in my machine. I need to get this package. So I give a command in R:

install.packages(“RMySQL”)

It gives me a list of server to choose form. I select the closest server in India, about a minute later I get confirmation that the package is downloaded but it has failed with an error. A quick look in the terminal window tells me that it could not find mysqlclient-dev library. This error is kind of expected in a Linux system. Unlike windows, linux machines compile the code in local machine. I have not installed mysql client programs in my machine, all I have is a server. Naturally RMySQL installation fails. Error message gives me the way out is to install libmysqlclient-dev or  libmariadbclient-dev. I install this library with in Ubuntu:

sudo apt-get install libmysqlclient-dev

and then again try installing RMySQL. This time installation succeeds without any problem.

Note: There are some alternate remedy suggested like installing r-client-dev of libmariadbclient-dev. These solutions do not work, at least in my Ubuntu 14.10 machine. So stick to the suggestion pointed in error message.

To connect to MySQL one needs userid and password. While these can be given in script, it is actually a foolish idea to do so. It is quite easy to share program script with someone with userid and password, I would never take such chance and prefer to use a configuration file to keep the connection recipe secure. The file where these information is kept should be named .my.cnf and placed in the home directory. Here is how you set up .my.cnf file (or ‘c:/my.cnf’ under Windows)
# this is a comment
[client]
user = dj
host = localhost
[rs-dbi]
database = s-data
[lasers]
user = opto
database = opto
password = pure-light
host = merced
[iptraffic]
host = data
database = iptraffic
This file should be readeable only by you. Inside each section, MySQL parameters may be specified one per line (e.g., user = opto). MySQL always considers default options from the [client] group for connecting to a server. To override or add additional options, R MySQL combines default options from the [rs-dbi] group, but you may specifiy you own group in the dbConnect call to tailor your environment. Note that to override options, you must place your group after the [client] group in configuration file.
For instance, if you define a group, say, [iptraffic], then instead of including all these parameters in the call to dbConnect, you simply supply the name of the group, e.g., dbConnect(mgr, group = “iptraffic”).
library(DBI)
# Connect to my-db as defined in ~/.my.cnf
con <- dbConnect(RMySQL::MySQL(), group = "my-db")

dbListTables(con)
dbWriteTable(con, "mtcars", mtcars)
dbListTables(con)

dbListFields(con, "mtcars")
dbReadTable(con, "mtcars")

# You can fetch all results:
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
dbFetch(res)
dbClearResult(res)

# Or a chunk at a time
res <- dbSendQuery(con, "SELECT * FROM mtcars WHERE cyl = 4")
while(!dbHasCompleted(res)){
  chunk <- dbFetch(res, n = 5)
  print(nrow(chunk))
}
# Clear the result
dbClearResult(res)

# Disconnect from the database
dbDisconnect(con)

MySQL configuration file

Instead of specifying a username and password in calls to dbConnect(), it’s better to set up a MySQL configuration file that names the databases that you connect to most commonly. This file should live in~/.my.cnf and look like:

[database_name]
option1=value1
option2=value2

If you want to run the examples, you’ll need to set the proper options in the [rs-dbi] group of any MySQL option file, such as /etc/my.cnf or the .my.cnf file in your home directory. For a default single user install of MySQL, the following code should work:

[rs-dbi]
database=test
user=root
password=

This entry was posted in Analytics, spatial data. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *