Hadoop Tutorial 2.2 -- Running C++ Programs on Hadoop
From DftWiki
--D. Thiebaut 12:46, 13 April 2010 (UTC)
|
|
Running the WordCount program in C++
![]() | | |
|
The reference for this is the C++ wordcount presented in the Hadoop Wiki [1]. Submitting C++ compiled code is supported by Hadoop through a pipes API, the syntax of which is documented here. | ||
![]() | ![]() |
The Files
You need 3 files to run the wordCount example:
- a C++ file containing the map and reduce functions,
- a data file containing some text, such as Ulysses, and
- a Makefile to compile the C++ file.
wordcount.cpp
- The wordcount program is shown below
- It contains two classes, one for the map, one for the reduce
- It makes use of several Hadoop classes, one of which contains useful methods for converting from tuples to other types: StringUtils.
#include <algorithm> #include <limits> #include <string> #include "stdint.h" // <--- to prevent uint64_t errors! #include "hadoop/Pipes.hh" #include "hadoop/TemplateFactory.hh" #include "hadoop/StringUtils.hh" using namespace std; class WordCountMapper : public HadoopPipes::Mapper { public: // constructor: does nothing WordCountMapper( HadoopPipes::TaskContext& context ) { } // map function: receives a line, outputs (word,"1") // to reducer. void map( HadoopPipes::MapContext& context ) { //--- get line of text --- string line = context.getInputValue(); //--- split it into words --- vector< string > words = HadoopUtils::splitString( line, " " ); //--- emit each word tuple (word, "1" ) --- for ( unsigned int i=0; i < words.size(); i++ ) { context.emit( words[i], HadoopUtils::toString( 1 ) ); } } }; class WordCountReducer : public HadoopPipes::Reducer { public: // constructor: does nothing WordCountReducer(HadoopPipes::TaskContext& context) { } // reduce function void reduce( HadoopPipes::ReduceContext& context ) { int count = 0; //--- get all tuples with the same key, and count their numbers --- while ( context.nextValue() ) { count += HadoopUtils::toInt( context.getInputValue() ); } //--- emit (word, count) --- context.emit(context.getInputKey(), HadoopUtils::toString( count )); } }; int main(int argc, char *argv[]) { return HadoopPipes::runTask(HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() ); }
Makefile
Before you create the Makefile, you need to figure out whether your computer hosts a 32-bit processor or a 64-bit processor, and pick the right library. To find this out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.
Once you have this information create the Makefile (make sure to spell it with an uppercase M):
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
wordcount: wordcount.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
- Note: Users have reported that in some cases the command above returns errors, and that adding -lssl will help get rid of the error. Thanks for the tip! --D. Thiebaut 08:16, 25 February 2011 (EST)
Data File
- We'll assume that you have some large text files already in HDFS, in a directory called dft1.
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
sudo apt-get install g++
- Compile the code:
make wordcount
- and fix any errors you're getting.
- Copy the executable file (wordcount) to the bin directory in HDFS:
hadoop dfs -mkdir bin (Note: it should already exist!) hadoop dfs -put wordcount bin/wordcount
- Run the program!
hadoop pipes -D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input dft1 -output dft1-out \
-program bin/wordcount
- Verify that you have gotten the right output:
hadoop dfs -text dft1-out/part-00000 "Come 1 "Defects," 1 "I 1 "Information 1 "J" 1 "Plain 2 ... zodiacal 2 zoe)_ 1 zones: 1 zoo. 1 zoological 1 zouave's 1 zrads, 2 zrads. 1
![]() | | |
| ||
![]() | ![]() |
![]() | | |
| ||
![]() | ![]() |
Computing the Maximum Temperature in NCDC Data-Files
![]() | | |
|
This is taken directly from Tom White's Hadoop, the Definitive Guide [2]. It fixes a bug in the book that prevents the compiling of the example code given on page 36. | ||
![]() | ![]() |
The Files
You need 3 files to run the maxTemperature example:
- a C++ file containing the map and reduce functions,
- a data file containing some temperature data such as found at the National Climatic Data Cener (NCDC), and
- a Makefile to compile the C++ file.
Max_temperature.cpp
#include <algorithm> #include <limits> #include <string> #include "stdint.h" // <-- this is missing from the book #include "hadoop/Pipes.hh" #include "hadoop/TemplateFactory.hh" #include "hadoop/StringUtils.hh" using namespace std; class MaxTemperatureMapper : public HadoopPipes::Mapper { public: MaxTemperatureMapper(HadoopPipes::TaskContext& context) { } void map(HadoopPipes::MapContext& context) { string line = context.getInputValue(); string year = line.substr(15, 4); string airTemperature = line.substr(87, 5); string q = line.substr(92, 1); if (airTemperature != "+9999" && (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) { context.emit(year, airTemperature); } } }; class MapTemperatureReducer : public HadoopPipes::Reducer { public: MapTemperatureReducer(HadoopPipes::TaskContext& context) { } void reduce(HadoopPipes::ReduceContext& context) { int maxValue = -10000; while (context.nextValue()) { maxValue = max(maxValue, HadoopUtils::toInt(context.getInputValue())); } context.emit(context.getInputKey(), HadoopUtils::toString(maxValue)); } }; int main(int argc, char *argv[]) { return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper, MapTemperatureReducer>()); }
Makefile
Create a Make file with the following entries. Note that you need to figure out if your computer hosts a 32-bit processor or a 64-bit processor, and pick the right library. To find this out, run the following command:
uname -a
To which the OS responds:
Linux hadoop6 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686 GNU/Linux
The i686 indicates a 32-bit machine, for which you need to use the Linux-i386-32 library. Anything with 64 indicates the other type, for which you use the Linux-amd64-64 library.
CC = g++
HADOOP_INSTALL = /home/hadoop/hadoop
PLATFORM = Linux-i386-32
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
max_temperature: max_temperature.cpp
$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
-lhadooputils -lpthread -g -O2 -o $@
Data File
- Create a file called sample.txt which will contain sample temperature data from the NCDC.
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
- Put the data file in HDFS:
hadoop dfs -mkdir ncdc hadoop dfs -put sample.txt ncdc
Compiling and Running
- You need a C++ compiler. GNU g++ is probably the best choice. Check that it is installed (by typing g++ at the prompt). If it is not installed yet, install it!
sudo apt-get install g++
- Compile the code:
make max_temperature
- and fix any errors you're getting.
- Copy the executable file (max_temperature) to a bin directory in HDFS:
hadoop dfs -mkdir bin hadoop dfs -put max_temperature bin/max_temperature
- Run the program!
hadoop pipes -D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input ncdc/sample.txt -output ncdc-out \
-program bin/max_temperature
- Verify that you have gotten the right output:
hadoop dfs -text ncdc-out/part-00000 1949 111 1950 22
- Congrats, you have computed the maximum of 5 recorded temperatures for 2 different years!
Speed Test: Java vs Python vs C++
Data Set #1: 6 books
- The wordcount program in native Java, in Python streaming mode and in C++ pipes mode is run on 6 books from the Gutenberg project:
-rw-r--r-- 2 hadoop supergroup 393995 2010-03-24 06:30 /user/hadoop/dft6/ambroseBierce.txt -rw-r--r-- 2 hadoop supergroup 674762 2010-03-24 06:30 /user/hadoop/dft6/arthurThompson.txt -rw-r--r-- 2 hadoop supergroup 590099 2010-03-24 06:30 /user/hadoop/dft6/conanDoyle.txt -rw-r--r-- 2 hadoop supergroup 1945731 2010-03-24 06:30 /user/hadoop/dft6/encyclopediaBritanica.txt -rw-r--r-- 2 hadoop supergroup 343694 2010-03-24 06:30 /user/hadoop/dft6/sunzi.txt -rw-r--r-- 2 hadoop supergroup 1573044 2010-03-23 21:16 /user/hadoop/dft6/ulysses.txt
- The successful execution of both versions yield these execution times (real time):
| Method | Real Time (seconds) |
Ratio to Java |
|---|---|---|
| Java | 19.2 | 1 |
| C++ | 40.2 | 0.48 |
| Python | 23.4 | 0.82 |
Data Set #2: 4 180-MB files
- Same idea as above, but with much larger files that will be split into several splits
-rw-r--r-- 2 hadoop supergroup 187789938 2010-04-05 15:58 /user/hadoop/wikipages/block/all_00.xml -rw-r--r-- 2 hadoop supergroup 192918963 2010-04-05 16:14 /user/hadoop/wikipages/block/all_01.xml -rw-r--r-- 2 hadoop supergroup 198549500 2010-04-05 16:20 /user/hadoop/wikipages/block/all_03.xml -rw-r--r-- 2 hadoop supergroup 191317937 2010-04-05 16:21 /user/hadoop/wikipages/block/all_04.xml
- Execution times (real time):
| Method | Real Time (seconds) |
Ratio to Java |
|---|---|---|
| Java | 2 min 15.7 | 1.0 |
| C++ | 5 min 26 | 0.416 |
| Python | 12 min 46.5 | 0.177 |
MapReduce vs Hadoop
The Terabyte Sort benchmark gives a good estimate of the difference between the two.
In May 2009, Yahoo! reported in its Yahoo! Developer Network Blog sorting 1 TB of dat in 62 seconds on 3800 computers.
Google reported in its 11/2008 blogspot sorting 1 TB of data on MapReduce in 68 seconds, on 1000 computers.
| Company | Platform | Number of Computers | 1 TB Sort Execution Time |
|---|---|---|---|
| MapReduce | 1000 | 68 sec | |
| Yahoo! | Hadoop | 3800 | 62 sec |
So, roughly, given the number of computers, and assuming linear speedup, MapReduce is about 4 times faster than Hadoop.
References
- ↑ Hadoop Wiki, C/C++ MapReduce Code & build
- ↑ Tim White, Hadoop, the definitive guide, O'Reilly Media, June 2009, ISBN 0596521979






