SQL Server 2016 R Services: executing R code examples

Curious about running your fist R script inside SQL Server and want to learn how it looks like? Unfortunately, it’s not a mix of R and T-SQL code, because R commands will be executed in a separate engine, which is installed with SQL Server. Therefore, sp_execute_external_script procedure is introduced with a bunch of parameters.

To give an illustration of what I mean, let’s look at the required parameters @language and @script. First can now take the only value – R. In the future Microsoft is planning to support more languages, for example, they talked about Python and Julia at PASS Summit 2015. The second parameter is used to pass an R script to it. Let’s look at the first very simple example, which will return information about internal iris dataset, that is available in R by default. R language comes with some data integrated inside it in the form of datasets you can play with. It’s very convenient and makes it easier to start experiments.

The output information tells us, that iris dataset has 150 rows and 5 columns. It shows the name of the columns, its datatypes as well as some values.

All other parameters are optional, but probably you’ll use them actively, especially those, that are responsible for data exchange between SQL Server and R environments. For instance, you can send some data into R using and ordinary SELECT statement and pass it to @input_data_1 parameter. Inside R script, this data will be available in the form of a dataset with the default name of InputDataSet, if you haven’t changed it using @input_data_1_name parameter. Now, only one input dataset is supported. Another key thing to remember is that if you want to return some data back from R to the output of sp_execute_external_script procedure, you should put it inside R script into a dataset with the default name OutputDataSet, or any other in case you define its name in @output_data_1_name parameter. For instance, in the second example, I’m giving a simple SELECT statement as an input with the default name to R script and returning it back.

If you don’t want to use the default names, or you already have a script with other names and don’t want to touch it, you can define the names of input and output datasets.

The result of the execution will be the same.

You should have already noticed, that the output has no column names. This is because I used with result sets undefined option. With this in mind, you can also define the returned dataset in order to better control the results.

Furthermore, there is an option to send just a couple of variables using @params argument, where you define them, and later just pass the values there. Use output clause if you want to return some values.

That is all you need to start with executing basic R scripts inside SQL Server. In the next posts, I will show how to control parallel execution, work with large data volumes and some data type considerations you should know when working with R.

SQL Server 2016 R Services: offline installation and adding additional R packages

Let’s discover, how to install R Services with SQL Server 2016. First, you need to check the respective component it during the installation process. In the early CTP versions it was called the Advanced Analytics Extensions, but in RTM it was renamed to R Services (In-Database).

If your server is connected to the Internet, the installation process will be easy and straightforward, all the required components will be automatically downloaded and installed. But, the real world is different, and most of the servers don’t have a direct Internet connection because of the security requirements. In that case, you still should run SQL Server installation and do the same choice as I mentioned earlier, but on the next step, you’ll see the following window where you should agree with the R Open installation.

Click Accept, then Next. The installation program will ask you for the directory, where R Open and R Server distributives are located as well as the links, where you can download them.

If you don’t have these distributives, go download them from the computer, where there is the Internet connection and then put somewhere, so that the installation program will have access to this folder. After it click Next and wait for the installation to finish. As you see, the process is quite easy too. Now it’s time to activate the R Services after SQL Server 2016 installation. You can do it with sp_configure command.

Attention! You need to restart SQL Server instance. After it, you could run the basic R scripts inside you SQL Server, but as we learned earlier, the real power of R language is in its extensions, called packages. So, in the future, you’ll need some of them too. The packages installation is described in the official documentation, but I suggest you one more way, which is more convenient, I believe.

Go to the C:\Program files\MSSQL13.MSSQLSERVER\R_SERVICES\bin or C:\Program files\MSSQL13.<instanceName>\R_SERVICES\bin\x64 folder depending, whether you installed the default of named instance and run R.exe with administrative privileges (it is required to have write permissions the packages directory).

You’ll see common R language console, where you can enter any R command.

Let’s see, how to install the very popular ggplot2 packages, which is often used to create rich graphics. Again, if your server is connected to the Internet, you just need to run the following command.

If not, the process could be somewhat complicated. First of all, you need to download the package and all of its requirements from the Internet. To do that, we need to go to the CRAN repository. For the ggplot2 package the URL will be https://cran.r-project.org/web/packages/ggplot2/index.html.

On the bottom, you see the link for the Windows binary ZIP archive for this package, and on top in Depends and Imports sections the list of required packages, which should be also downloaded. And of course, the required packages for the required too. So, you have to follow the links recursively to download the complete set. In this moment there are 13 of them, in the future, this number could change. I put them in C:\temp directory on my server, but you can choose any other.

Not I need to create an index file for downloaded packages, the PACKAGES file. This could be completed using the following commands in the R console.

 

Now we are ready to start the installation process. Don’t forget to define the directory with downloaded packages.

After successful installation, we can check it by trying to load the package.

If you forget or miss to download any required package, you will see an error on this step with the name of absent package. Don’t give up, just go, download it too, put in the same folder, create the new PACKAGES file and repeat the ggplot2 installation. Hope, that this article will help you to save some time with offline installation and help to start using R Services faster. In the next post, we’ll see how to execute R scripts inside SQL Server.

SQL Server 2016 R Services: introduction

This article is a short introduction to R programming language and a new R Services feature in SQL Server 2016. I hope, it will be a good starting point for those, who haven’t heard about it at all. R is a software environment and programming language for statistical computing, predictive analytics, and rich graphics. R can be easily extended through user-created packages, thousands of which are available in the CRAN (The Comprehensive R Archive Network), the central repository for all R binaries and third-party packages. Installing and updating packages from this source is an easy and straightforward process, which you can do just by one command or through graphical interface.

R is an open source project and is freely available for various operating systems, which is one of the main reasons why it has become substantially popular among statisticians and data scientists in recent years for data analysis, prototyping, and even software development. R community is huge and you can find an answer to almost any question or problem over the Internet, tons of scripts and examples are available for learning, downloading and modifying. Of course, such popularity has caused the inclusion of R language support to some Microsoft products and services.

On April 6th, 2015 Microsoft announced the acquisition of Revolution Analytics company, which has made R enterprise-ready by offering such products as R Open and R Server, which can scale and run R scripts number of times faster than an open source version by providing optimized versions of popular R built-in functions, multi-threaded processing and loading data to memory by chunks as needed.

Without a doubt, R Services in SQL Server 2016 is one of the biggest results of this unity. It brings R language support to one of the most popular RDBMS in the world, allowing to implement advanced analytics scenarios inside database engine and process large amounts of data using Revolution Analytics technologies. It’s possible to extend T-SQL code with more powerful R functions for data processing and predictive analytics. Moreover, you can now easily create rich graphics visualizations by using R capabilities and use these images inside your applications and reports.
In the next posts of this series, I will start to discover this new feature for you step by step and try to show its pros and cons, from the installation nuances up to processing large volumes of data in parallel. Hope, all this information will help you learn R Services and start using it.

SQL Server Query Plans: Startup Expression Predicate

I’ve already posted about predicates in query plans, but here is one more: Startup Expression Predicate. Again, it’s better to illustrate its behavior by example. Let’s create a small table with one clustered index and put some data into it.

Now, look at the following query. It should calculate the amount of rows in the table, but only if @return_anything flag is on.

The query is returning about 70 thousand rows and does a certain amount of logical reads.

Table ‘startup_expression_predicate_example’. Scan count 1, logical reads 10042, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 15 ms, elapsed time = 17 ms.

Let’s look at the query plan now.


We see the Filter operator with Startup Expression Predicate before Clustered Index Seek. The SQL Server fires the whole subtree to the left from the Filter operator only when its expression is true. Therefore, if we try to execute the same query with @return_anything equals to zero, it won’t even touch the table.

Execution statistics is quite simple.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 0 ms.

You can encounter this query plan in different situation, don’t be afraid of it, it’s a great optimization that make execution simpler by skipping the whole parts of the query plans if they are not needed.

SQL Server Query Plans: Predicates vs. Seek Predicates

If you have ever written any query with a predicate, you should have noticed, that, at first, the optimizer is trying to find the smallest indexes, and then try to push predicate down to the get data operators to minimize the amount of logical and physical reads. To illustrate this behavior, let’s look at the example below. I will create a small table with just two columns, put some data into it and create a clustered index on the datetime field, where the values are between January 1st 2016 and January 10th 2016.

Now we are ready to query this data, and we’ll start with a small example just to count the number of rows between two dates. Before executing our queries we also should enable execution time and IO statistics collection.

In my case, the query is returning about 70,000 rows, which is quite good because data distribution is uniform and we are searching for 7 of 10 days. The executions statistics tells, that it has spent about 16 milliseconds CPU time and about ten thousand logical reads. Index properties say that it has 14288 pages in total and 10 thousand is about 70% of it. So, everything looks perfect.

Table ‘seek_predicate_example’. Scan count 1, logical reads 10045, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 16 ms, elapsed time = 22 ms.

Moreover, let’s look at the query plan and Clustered Index Seek operator in particular.

The values we are searching are shown as Seek Predicates. It means that index is used and the query plan is utilizing it for seeking through the data.

Now, let’s make a query more complicated and try to confuse the optimizer by adding the third value.

It’s time to check the query plan now.

The new parameter is now in Predicate section, not in Seek Predicate. If you look at its value, it’s higher than the first one and limiting the amount of data to select from 70% to 30% approximately. And the query now returning about 30 thousand rows. But let us check the execution statistics.

Table ‘seek_predicate_example’. Scan count 1, logical reads 10045, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 16 ms, elapsed time = 22 ms.

Surprisingly or not, it’s the same as for the first query. Why? Because SQL Server cannot use too much parameters for the Seek Predicates to use index: one lower value and one upper. The third parameter goes to just Predicates section and will be used after getting rows from the index to further filter them. That is the main difference between Seek Predicates and Predicates.

But how can we force SQL Server to use the better expression? The answer is to make it simple and not to confuse optimizer. If you’re sending too many parameters to the query and can minimize them, do it, help the optimizer. For example, let’s rewrite our query like this and check the execution statistics again.

And the execution statistics. The query plan is the same, as for the very first example. You can check it by yourself.

Table ‘seek_predicate_example’. Scan count 1, logical reads 4303, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 0 ms, elapsed time = 7 ms.

Now, you see the difference, the number of logical reads is lower and the query is faster.

You can come across this situation not only when using a complex predicate expression but in many other situations. Therefore, always check, whether your indexes are really used for searching, not just for reading all data and filter it afterwards.

MAXDOP and Resource Governor in SQL Server

Many of you at least should have heard about Max Degree of Parallelism setting, which exists in SQL Server from the early versions. By default, its value is zero, which means that SQL Server determines by itself the optimal number of threads for your queries. However, let us go a little bit further and delve into mastering and controlling parallelism inside SQL Server Database Engine.

In this post, I’m not going to discuss, how to calculate the best MAXDOP value, but just discover, how queries behave depending on it. It could be changed either by SQL Server Management Studio Server Properties windows or by T-SQL sp_configure command.



For example, if set to one, it forces all the query plans to be serial, with no parallelism at all, like in the example below.


Don’t do this in production, if you are not 100% sure or you just read an article somewhere over the Internet, telling you that it’ll help to eliminate CXPACKET wait “problem”.

However, users and developers are still able to control the parallelism for selected queries by using MAXDOP query hint. They can override the server option easily just by defining the needed value.


Moreover, if we look at this query through sys.dm_exec_query_profiles DMV while it’s running, we’ll see, that it consumes 10 threads.


Thus, everyone can go over the server limitation set by DBA. Actually, this is normal behavior, because it’s not the limitation option. It just allows you quickly setting up the best option for all queries and then using MAXDOP hint tune the selected ones with other value.

But what can you do to guard your system from being overloaded by someone’s heavy uncontrolled parallel queries? If you really want to restrict the parallelism MAXDOP values, you should use Resource Governor. To read more about it and learn how to set up look here. I will just show its capabilities regarding the controlling parallelism.


Every workload group inside it has its own Degree of Parallelism option. Again, by default, it equals to zero, which means, that it’s not working. But if set up in 5, for example, it will limit all the queries, that are going to run with the degree of parallelism higher that this value. Therefore, the previous example with MAXDOP 10 option will be limited to five threads only.


Hence, the Resource Governor is the only feature, that should be used for controlling and limiting the degree of parallelism for different groups of users, while MAXDOP server option and hint are just for setting the optimal value for all and selected queries correspondingly.