Sunday, 3 March 2013

A Lap Around HDInsight

I’m currently working with Microsoft’s Patterns and Practices team, researching and documenting best practices guidance for big data analysis with HDInsight. For those of you who may not know, HDInsight is Microsoft’s distribution of Hadoop – an open source platform for analysis of huge volumes of complex data; and as a database guy with a background in BI, I’m obviously interested in figuring out the best ways to use HDInsight, and how it integrates with the existing SQL Server technologies that I’m used to working with. The results of our work will be published soon, but in the meantime, I thought it might be useful to share some basic steps to help you get started with HDInsight.

Before we start however, it might be useful to explain a bit about HDInsight, Hadoop, and Big Data. If you’re involved in database development or BI, you can hardly have missed the growing buzz about Big Data. There seem to be a lot of definitions being bandied around (and products being advertised), and it can all be very confusing – so, here’s my take on it.

  • “Big Data” is a catch-all term that’s popularly used to describe data that you want to analyze, but is difficult to process with traditional database technologies. The difficulty is usually caused by one or more of the so-called three V’s:
    • Volume: The sheer amount of data makes analysis with a traditional relational database impractical.
    • Variety: The data is in difficult to process formats, and may be completely unstructured.
    • Velocity: The data is generated at a rapid pace and must be processed as close to real-time as possible.
  • Hadoop is an Apache open source project for a big data analysis solution that is based on distributing the data processing across multiple server nodes in a cluster. The distributed processing uses a technique pioneered by those bright folks at Google, called Map/Reduce. The data in a Hadoop cluster is stored on a distributed file system called HDFS, which enables each node to work on a subset of the data in parallel (the Map phase), and then the results from each node are consolidated into a single result (the Reduce phase). You can write the Map/Reduce code to process your data in Java (which is the native language for Hadoop), or in other languages supported by the Hadoop distribution you’re using (such as JavaScript, Python, or C#). In addition to the core Hadoop project, there are a number of other projects that add value to a Hadoop cluster. In particular, a project named Pig provides an abstraction layer over Map/Reduce that enables you to write and execute data processing workflow steps in a language named Pig Latin, which are then translated to Map/Reduce jobs by the Pig interpreter. Another significant Hadoop related project is Hive, which provides an abstraction over Map/Reduce in which data in HDFS files can be represented as tables and queried with a SQL-like language named HiveQL.
  • HDInsight is the brand-name of the Microsoft distribution of Hadoop. It’s an open source distribution, based on the same Hadoop core as other distributions, and available as an on-premise application for Windows Server, or as a cloud service hosted in Windows Azure.

Provisioning a Cluster

The first thing you need to do is to provision an HDInsight cluster, and this is your first major decision point. You can opt to install a local, on-premise HDInsight cluster on Windows Server via the Microsoft Web Platform Installer, or you can create a cloud-based cluster on Windows Azure. At the time of writing, HDInsight is available as a Windows Azure preview service at

To create an HDInsight cluster, you need to enable preview services in your Windows Azure subscription, and subscribe to the HDInsight preview. It usually takes a few days for your subscription activation to arrive by email, and then you can return to the Windows Azure portal to provision your cluster. The Windows Azure-based HDInsight service uses a Windows Azure blob store to host the HDFS file system structures used by the cluster, so you need to create a Storage account for the cluster. The storage account must be co-located with the HDInsight cluster nodes, which in the current preview means that you must create your storage account in the East US data center. For this example, I’ve created a storage account named graemesplace:


After you’ve created your storage account, you can create an HDInsight cluster. The Quick Create option lets you specify a cluster name, choose from a range of pre-configured cluster sizes, specify a password for the default Admin account, and match the storage account you created earlier to the cluster. You can also use the Custom Create option to have more control over the cluster, for example specifying a user name other than Admin. I’ve used the Quick Create option to create a cluster named graemesplace as shown here:


The DNS name you assign will determine the address for the cluster in the domain. I’ve specified “graemesplace”, so the address of my cluster is

The provisioning process takes several minutes, and when your cluster is ready you can view it in the Windows Azure portal as shown here:


At the bottom of the HDInsight page in the portal, there are links to connect to the cluster (via a remote desktop session) and to manage the cluster. This second option is accomplished through a dashboard page, which is shown here:


Obtaining and Loading Data for Analysis

So now I have an HDInsight cluster, I’m ready to analyze some data. Being British, I am of course obsessed with the weather, so I plan to do some analysis of historic weather in my home town of Dunfermline, Scotland. To obtain this data, I’m going to use the UK Met Office Weather dataset that is available in the Windows Azure Datamarket, as shown here. If you haven’t explored the Azure Datamarket, it’s a one-stop shop for data from government organizations and private commercial data publishers, and includes a pretty eclectic mix of datasets that you can purchase and download. Of course, like any true Scotsman, I’ve chosen the Met office weather data because it’s free!


The Met office data includes weather data from all over the UK, but since I’m only interested in weather statistics for Dunfermline, I’ll filter my search by the site code field (which uniquely identifies each weather station). I can look up the site code for Dunfermline in the Site dataset as shown here.


Now I can filter the data from the DailyForecast dataset based on the ID for the Dunfermline site as shown here.


Note the URL for the currently expressed query. This is a REST-based URI that can be used to download the data from a client application. Note also that you can click Show to display a secure primary account key, which is used to authenticate requests from client applications. One such client application is Excel, in which you can download data from the Windows Azure Datamarket as shown here.


To download the data, I just specify the URL and account key from the query page in the Azure Datamarket site, and (after a while) the data appears in the worksheet. I can then remove any columns I don’t need as shown here.


Note that the data includes multiple measurements for each day (taken at different times). To analyze the data properly, I plan to use HDInsight to calculate an average figure for each day. I’m happy that I have the columns I need, so I just remove the headings and save the data as a comma-delimited text file to be uploaded to my HDInsight cluster.

Since the storage for the HDInsight cluster is actually a Windows Azure storage volume (ASV), I can upload the data using any tool that can connect to the Azure blob store. However, to keep things simple, I’m going to use interactive console in the HDInsight dashboard and use the fs.put() JavaScript command to display a dialog box that enables you to  upload the data file as shown here:


You can also use the interactive console to navigate the HDFS file system using UNIX-style commands prefixed with a “#” character. For example, to display a directory listing you can type #ls. HDFS has a root folder (indicated by “/”), and each user has a home folder under the /user folder, so my home folder on the cluster is /user/admin. By default, your home folder is the current folder when you connect; and as a shortcut, you can use a “.” character to indicate your home folder, so I can view the contents of a subfolder named weather in my home folder by using either of the following commands:

#ls /user/admin/weather

#ls ./weather

So now I’ve uploaded the comma-delimited text file containing the weather data to ./weather/data.txt, and I’m ready to process it.

Understanding Map/Reduce

Fundamentally, all processing on HDInsight takes the form of a Map/Reduce job. Each node performs a Map job on a subset of the data, and then a Reduce job is performed by a designated node to consolidate the results of the Map jobs and produce a single output file. Natively, Map/Reduce jobs are implemented in Java and compiled to .jar executable, which are uploaded and executed as jobs in the cluster. However, you can also use other languages to implement Map/Reduce code. I’m not a Java programmer (indeed, some would say I’m not a programmer!), so I’ve limited myself to creating a simple example Map/Reduce program in JavaScript as shown here:


The code consists of a map function and a reduce function. The map function is run on all cluster nodes, each node submitting a subset of the source data one line at a time in the value parameter. The map code creates an array from the line of data, using a comma as a delimiter, and writes a key/value pair consisting of the date and temperature fields from the source data to the context object.

The reduce function operates on each distinct key generated from the map operations, and iterates through the values associated with that key. The key value is the date, and the values are the temperature readings for that date. The code adds all of the temperature values, and then divides them by the number of readings for that date to calculate an average value. The consolidated results for each key are then written as a key/value pair, so we should end up with a single date and average temperature for each date value in the source data.

The HDInsight dashboard provides a handy Create Job tile that you can use to run a Map/Reduce job that has been implemented as a Java .jar executable. However, since I’ve chosen to implement my code in JavaScript, I’ll need another way to execute it. The easiest way to run the job is to upload the JavaScript file (in this case to ./weather/getdailytemp.js) and  use Pig to call it, which I can do by entering the following command in the interactive JavaScript console:

pig.from(“/user/admin/weather/data.txt”).mapReduce(“/user/admin/weather/getdailytemp.js”, date, avgtemp:long).to(“/user/admin/weather/avgdailytemp”);

This command tells Pig to run a Map/Reduce job on the weather data by using the JavaScript file I’ve uploaded and produce results with two columns: a date column (which in the absence of an explicit data type will be a chararray) and an avgtemp column (which is declared as a long integer). The results are then stored as a file in the ./weather/avgdailytemp folder.

When the job has finished (which can take some time), viewing the contents of the output directory reveals an output file (in this case named part-m-00000) and a file named _SUCCESS (which is simply a flag to indicate the outcome of the job).


You can use the #cat command to view the contents of the results file as shown here:


Using Pig

While I could write custom Map/Reduce code for all of the data processing my analysis needs, it would require considerably more programming skills than I currently have, and would be extremely time-consuming. Fortunately, I can use Pig to do more than just run my own Map/Reduce code; I can also use it to create data processing workflows in a high-level language called Pig Latin, and have Pig translate those workflows to Map/Reduce code for me.

You can execute Pig Latin statements interactively in a command line shell called (I kid you not) Grunt, or you can create a text file that includes a sequence of Pig Latin commands to be executed as a script. The easiest way to get to the Grunt shell is to use the Connect link in the Windows Azure portal or the Remote Desktop shortcut in the HDInsight dashboard to open a remote desktop session with the cluster (logging in with the credentials you specified when provisioning the cluster), and then using the Hadoop Command Line shortcut on the desktop to run pig from the c:\apps\dist\pig-0.9.3-SNAPSHOT\bin folder as shown here.


When using Pig to process data, typically you execute a sequence of Pig Latin commands, each of which defines a relation. You can think of a relation as being a result set, often tabular in structure. The first command often uses the Pig Latin LOAD statement to populate a relation from a data file, and subsequent commands create new relations by performing operations on the relations created by previous commands. For example, I can use the following sequence of Pig Latin commands to process the weather data in ./weather/data.txt:

Forecasts = LOAD './weather/data.txt' USING PigStorage(',') AS (day, date, temp:long, wind:long);
GroupedForecasts = GROUP Forecasts BY date;
GroupedAvgs = FOREACH GroupedForecasts GENERATE group, AVG(Forecasts.temp) AS avgtemp, AVG(Forecasts.wind) AS avgwind;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) as date, avgtemp, avgwind;

The first command loads the comma-delimited weather data to create a relation named Forecasts with four columns (day, date, temp, and wind).

The second command creates a new relation named GroupedForecasts that contains the Forecasts relation grouped by date.

The third command creates a relation named GroupedAvgs that calculates the average temp and wind values for each group in GroupedForecasts.

The fourth command creates a relation named AvgWeather, which flattens the date group in the GroupedAvgs relation to create a row for each date with date, avgtemp and avgwind columns.

Pig doesn’t actually generate any map/Reduce jobs until you call either a DUMP statement (to display the contents of a relation in the console) or a STORE statement (to save a relation as a file). For example, executing the command DUMP AvgWeather after the four commands above would create the necessary Map/Reduce jobs to process the workflow, and display the results in the console window. The command STORE AvgWeather INTO ‘./weather/dailyaverages’ would store results in the ./weather/dailyaverages folder.

To run the commands as a script, I simply save them in a file named ProcessWeather.pig and run them from the command line using the pig program, as shown here:


The output from the Pig script is similar to that of the Map/Reduce job I ran earlier. When the script has completed, the ./weather/dailyaverages folder contains a file named part-r-00000, a _SUCCESS file, and a _logs folder. using #cat to view the part-r-00000 file shows the results of the processing:


Using Hive

Until now, all of the data processing we’ve done with Map/Reduce code and Pig Latin has generated results in text files. You can view the text files in the interactive console or download them to a client application such as Excel, but it would be good to be able just to consume results from HDInsight the same was as you do from a relational database – by executing queries against tables. That’s exactly what Hive is designed to let you do.

With Hive, you can create tables that get their data from files in HDFS folders. The important thing to understand here is that Hive tables don’t contain data; but rather they provide a metadata layer that projects a schema onto data in underlying files. What does that mean? Well, Hive (in common with Pig and all Map/Reduce code) uses an approach called schema on read, in which a table is associated with a folder in the HDFS folder structure (actually, a table could be partitioned across multiple folders, but let’s not get ahead of ourselves!). When you query a table, the data in any files found in the related folder is read, and only then are the columns and data types in the table definition used to apply structure to the data. You can create a Hive table based on an empty folder, and then copy files to that folder (either source files or the results of Map/Reduce or Pig processing). When you query the table, a Map/Reduce job is generated to apply the table schema to the data in the files you have placed in the folder.

For example, I could create a table for average weather readings by using the following HiveQL statement:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)

If you’re used to working with SQL Server, or any other SQL-based database, this should look pretty familiar. However, because the data will actually be retrieved at query-time from underlying files, there are a few additional options that you typically don’t find in relational databases. For example, you can specify the delimiter you want Hive to use when mapping data in the files to columns in the table by including a ROW FORMAT DELIMITED clause as shown in the following screenshot of the Hive interactive console. In this case, the fields are tab-delimited (which matches the output generated by the Pig script I ran earlier):


The folder that this table maps to is automatically created in the /hive/warehouse folder in the HDFS file system. If you prefer not to use the default location, you can specify an alternative location as shown in the following code:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

The ability to use a non-default location is useful if you want processes outside of Hive to be able to access the data files, or if the files already exist and you just want to create a table over them. However, you should be aware that by default, Hive treats all tables as INTERNAL tables, which means that it tightly binds the lifetime of the folder to that of the table. Put more simply, if you delete a table by using the DROP TABLE command, the folder associated with the table (along with all the files it contains) is deleted. If you want the folder to exist independently of the table, you must create it as an EXTERNAL table as shown here:

(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

After you have created a table, you can populate it by simply adding files to the folder with which the table is associated. You can do this by using the #cp (copy), #mv (move), #copyFromLocal, or #moveFromLocal HDFS commands in the interactive JavaScript console, or by using the LOAD HiveQL command in the interactive Hive window as shown here:


This code loads data from the part-r-00000 file generated by the Pig script earlier by moving (not copying) the part-r-00000 file from the dailyaverages folder created by the Pig script earlier to the avgweather folder for the table.

Now that the folder associated with the table contains some data, you can query the table to project its schema onto the data, as shown here:


Using Hive from Excel

One of the main benefits of Hive is that HDInsight includes an ODBC driver for Hive that enables client applications to connect to HDInsight and execute HiveQL queries, just like they can with SQL Server (or any other ODBC-compliant database server).  To install the ODBC driver, download click the Downloads tile on the HDInsight dashboard and download the appropriate version for your system (32-bit or 64-bit). Note that the driver download also installs an add-in for Excel, so be sure to choose the CPU architecture that matches your Office installation (and remember that you can install 32-bit Office on 64-bit Windows – I’m just saying!):


After you’ve installed the driver, you can use it from Excel (and other ODBC-aware tools like SQL Server Integration Services, SQL Server Analysis Services (in tabular mode), and SQL Server Reporting Services). Installing the driver also installs a Hive Pane add-in for Excel, which makes it easy for users to connect to HDInsight and create HiveQL queries. However, I experienced a few issues when using this with Excel 2013 (and it seems I’m not alone), so rather than try to create connections explicitly in Excel, I recommend you create a system data source name (DSN) and use that – for some reason it seems a little more stable in this pre-release build!

To create a DSN, go to Administrative Tools and open Data Sources (ODBC). Then create a DSN that uses the Hive ODBC driver and references your HDInsight cluster, as shown here:


Now you’re ready to connect to Hive on your HDInsight cluster from Excel. For example, if you’re using Office Professional Plus 2013, you can use the PowerPivot add-in to connect to an ODBC source as shown here:


Then select the DSN you created earlier and enter the required credentials:


You can use the Table Import Wizard to select a Hive table, or enter a HiveQL query. In this case, I’ve just selected the avgweather table I created earlier:


After the table is imported, you can edit it in the data model, for example to add calculated columns based on DAX expressions or change data types. In this case, I’ve changed the data type of the weatherdate column to Date.


Now that the table has been imported into the data model of the workbook, I can use it to analyze and visualize the data it contains. For example, I can create a Power view report that shows the temperature and wind measurements as a line chart across the entire time period contained in the table.


I’ve only really scratched the surface of HDInsight in this article, so be sure to keep an eye on the Patterns and Practices site for more detailed guidance. In the meantime, if I’ve inspired you to experiment with HDInsight for yourself, the best place to start is