Secondary sort is one of the most important features of the Hadoop MapReduce framework. By default, the output of a MapReduce job is sorted by key, but the values are not sorted. In this tutorial I am going to show you an example of secondary sort; before implementing it, it is important to understand custom data types and custom partitioners in the Hadoop MapReduce framework.
What is Secondary Sort?
Secondary sorting in the MapReduce framework is a technique to sort the values that reach the reducer, unlike the default behaviour, where the output of a MapReduce job is sorted only by the mapper/reducer key. With secondary sort, the values can be ordered either ascending or descending.
In this tutorial I will show you an implementation of secondary sort. We will process a sample product review dataset with the following columns:
reviewerID productID reviewText overAllRating reviewTime
And here is a sample of the product review dataset -
31852 B002GZGI4E highly recommend 4 1252800000
31922 B002GZGI4E not as expected 3 1252800987
32122 B002GZGI4E hat was annoying 4 3252800210
32121 B002GZGI4E i 'm not sure " 3 2252800210
12390 B002R0FABA it was worth a shot 3 2252800000
31852 B002R0FABA highly recommend 5 1252800000
31922 B002R0FABA not as expected 1 1252700120
Here we will sort the product review dataset by reviewerID and rating: if a reviewer has rated multiple products, the ratings should come out in descending order, i.e. the highest rating of a reviewer comes first.
Expected output-
productId=B002R0FABA, reviewerId=12390, reviewTxt=it was worth a shot, rating=3
productId=A002R0XOPQ, reviewerId=12890, reviewTxt=not as expected, rating=2
productId=B002R0XOPQ, reviewerId=31234, reviewTxt=overall ok, rating=3
productId=B002R0FABA, reviewerId=31852, reviewTxt=highly recommend, rating=5
productId=B002GZGI4E, reviewerId=31852, reviewTxt=highly recommend, rating=4
productId=B002GZGI4E, reviewerId=31922, reviewTxt=not as expected, rating=3
productId=B002R0FABA, reviewerId=31922, reviewTxt=not as expected, rating=1
productId=B002GZGI3E, reviewerId=32121, reviewTxt=i 'm not sure ", rating=3
productId=B002GZGI4E, reviewerId=32122, reviewTxt=hat was annoying, rating=4
We know that the output of a MapReduce program is sorted by the mapper key, but the requirement above cannot be fulfilled by that default behaviour: the values emitted by the mapper must be sorted in descending order of rating, and that is exactly what secondary sorting gives us. To implement secondary sort we need to combine a few utilities; below are the key classes to be implemented, followed by a short plain-Java sketch of the compound ordering we are after.
- Custom key class for the mapper (CustomKey.java)
- Custom partitioner(CustomPartitioner.java)
- Custom key comparator(KeyComparator.java)
- Custom group comparator(GroupComparator.java)
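Before diving into Hadoop, here is a minimal, self-contained Java sketch (hypothetical, not part of the project) of that compound ordering: primary key (reviewerID) ascending, secondary key (rating) descending.

import java.util.Arrays;
import java.util.Comparator;

public class SecondarySortIdea {
    public static void main(String[] args) {
        // (reviewerID, rating) pairs, in arbitrary input order.
        Integer[][] keys = { { 31922, 3 }, { 31852, 4 }, { 31852, 5 }, { 31922, 1 } };
        Arrays.sort(keys, new Comparator<Integer[]>() {
            @Override
            public int compare(Integer[] a, Integer[] b) {
                int cmp = a[0].compareTo(b[0]); // primary: reviewerID ascending
                return cmp != 0 ? cmp : b[1].compareTo(a[1]); // secondary: rating descending
            }
        });
        for (Integer[] k : keys) {
            System.out.println("reviewer=" + k[0] + ", rating=" + k[1]);
        }
        // Prints 31852/5, 31852/4, 31922/3, 31922/1 - each reviewer's highest rating first.
    }
}

In MapReduce we cannot simply call Arrays.sort; the partitioner, sort comparator and grouping comparator below split this one comparison across the shuffle phase.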
Tools and technologies we are using to solve this use case -
- Java 7
- Eclipse Mars
- Hadoop 2.7.1
- Maven 3.3
- Ubuntu 14(Linux OS)
Below are the classes we will implement in this project:
- CustomKey.java
- CustomPartitioner.java
- KeyComparator.java
- GroupComparator.java
- ProductReviewVO.java
- ProductMapper.java
- ProductReducer.java
- ReviewDriver.java
Step 1. Create Maven Project
Go to File Menu then New->Maven Project, and provide the required details.
Step 2. Edit pom.xml
Double click on your project's pom.xml file; it will look like this, with very limited information.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.javamakeuse.bd.poc</groupId>
  <artifactId>SecondarySorting</artifactId>
  <version>1.0</version>
  <name>SecondarySorting</name>
  <description>SecondarySorting Example in MapReduce</description>
</project>
Now edit this pom.xml file and add the Hadoop dependencies. Below is the complete pom.xml file; just copy and paste, it will work.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.javamakeuse.bd.poc</groupId>
  <artifactId>SecondarySorting</artifactId>
  <version>1.0</version>
  <name>SecondarySorting</name>
  <description>SecondarySorting Example in MapReduce</description>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Step 3. CustomKey.java
CustomKey is a custom data type used as the key in the mapper and reducer phases; it is a combination of reviewerID and rating.
package com.javamakeuse.bd.poc;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class CustomKey implements WritableComparable<CustomKey> {

    private Integer reviewerID;
    private Integer rating;

    public Integer getReviewerID() {
        return reviewerID;
    }

    public void setReviewerID(Integer reviewerID) {
        this.reviewerID = reviewerID;
    }

    public Integer getRating() {
        return rating;
    }

    public void setRating(Integer rating) {
        this.rating = rating;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(reviewerID);
        out.writeInt(rating);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        reviewerID = in.readInt();
        rating = in.readInt();
    }

    @Override
    public int compareTo(CustomKey o) {
        int comparedValue = reviewerID.compareTo(o.reviewerID);
        if (comparedValue != 0) {
            return comparedValue;
        }
        return rating.compareTo(o.getRating());
    }
}
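Note that compareTo defines the key's natural order: reviewerID first, then rating ascending (Step 6 will override the rating direction at shuffle time). A quick hypothetical check, not part of the project:

package com.javamakeuse.bd.poc;

public class CustomKeyDemo {
    public static void main(String[] args) {
        CustomKey a = new CustomKey();
        a.setReviewerID(31852);
        a.setRating(4);
        CustomKey b = new CustomKey();
        b.setReviewerID(31852);
        b.setRating(5);
        // Same reviewer, so compareTo falls through to the rating comparison:
        System.out.println(a.compareTo(b)); // negative value: rating 4 sorts before 5
    }
}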
Step 4. ProductReviewVO.java
Custom data type used for the value in the MapReduce program; the output of both the mapper and the reducer is a ProductReviewVO object.
package com.javamakeuse.bd.poc;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ProductReviewVO implements Writable {

    private String productId;
    private Integer reviewerId;
    private String reviewTxt;
    private int rating;

    public String getProductId() {
        return productId;
    }

    public void setProductId(String productId) {
        this.productId = productId;
    }

    public Integer getReviewerId() {
        return reviewerId;
    }

    public void setReviewerId(Integer reviewerId) {
        this.reviewerId = reviewerId;
    }

    public String getReviewTxt() {
        return reviewTxt;
    }

    public void setReviewTxt(String reviewTxt) {
        this.reviewTxt = reviewTxt;
    }

    public int getRating() {
        return rating;
    }

    public void setRating(int rating) {
        this.rating = rating;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(productId);
        out.writeInt(reviewerId);
        out.writeUTF(reviewTxt);
        out.writeInt(rating);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        productId = in.readUTF();
        reviewerId = in.readInt();
        reviewTxt = in.readUTF();
        rating = in.readInt();
    }

    @Override
    public String toString() {
        return "ProductReviewVO [productId=" + productId + ", reviewerId=" + reviewerId + ", reviewTxt=" + reviewTxt
                + ", rating=" + rating + "]";
    }
}
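To see what Writable buys us, here is a small hypothetical round-trip (not part of the project): write() serializes the fields to bytes and readFields() restores them, which is exactly what the framework does when moving these objects between mapper and reducer.

package com.javamakeuse.bd.poc;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class ProductReviewVODemo {
    public static void main(String[] args) throws Exception {
        ProductReviewVO in = new ProductReviewVO();
        in.setProductId("B002GZGI4E");
        in.setReviewerId(31852);
        in.setReviewTxt("highly recommend");
        in.setRating(4);

        // Serialize the object the way the MapReduce framework would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bytes));

        // Deserialize into a fresh instance and confirm the fields survived.
        ProductReviewVO out = new ProductReviewVO();
        out.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(out); // same field values as 'in'
    }
}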
Step 5. CustomPartitioner.java
Because our mapper key is a custom key combining reviewerID and rating, we have to implement a custom partitioner that partitions on reviewerID alone, so that all records of a reviewer land on the same reducer. The default HashPartitioner hashes the whole key and would not solve our problem in all cases.
package com.javamakeuse.bd.poc;

import org.apache.hadoop.mapreduce.Partitioner;

// Note: the value type parameter must match the mapper's output value type,
// which is ProductReviewVO in this project.
public class CustomPartitioner extends Partitioner<CustomKey, ProductReviewVO> {

    @Override
    public int getPartition(CustomKey key, ProductReviewVO value, int numPartitions) {
        // Partition on reviewerID only; multiply by 127 to perform some mixing.
        return Math.abs(key.getReviewerID() * 127) % numPartitions;
    }
}
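A hypothetical sanity check (not part of the project) that two records from the same reviewer, whatever their rating, land in the same partition:

package com.javamakeuse.bd.poc;

public class PartitionerDemo {
    public static void main(String[] args) {
        CustomPartitioner partitioner = new CustomPartitioner();
        CustomKey k1 = new CustomKey();
        k1.setReviewerID(31852);
        k1.setRating(4);
        CustomKey k2 = new CustomKey();
        k2.setReviewerID(31852);
        k2.setRating(5);
        // Ratings differ, reviewerID is the same -> identical partition number.
        System.out.println(partitioner.getPartition(k1, null, 4));
        System.out.println(partitioner.getPartition(k2, null, 4));
    }
}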
Step 6. KeyComparator.java
Custom implementation of WritableComparator used as the sort comparator: within a reviewer, it orders the keys by rating in descending order.
package com.javamakeuse.bd.poc;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class KeyComparator extends WritableComparator {

    protected KeyComparator() {
        super(CustomKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CustomKey ip1 = (CustomKey) w1;
        CustomKey ip2 = (CustomKey) w2;
        int cmp = ip1.getReviewerID().compareTo(ip2.getReviewerID());
        if (cmp != 0) {
            return cmp;
        }
        return ip2.getRating().compareTo(ip1.getRating()); // reversed: descending rating
    }
}
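The only difference from CustomKey's natural order is the reversed rating comparison; a hypothetical contrast (not part of the project, placed in the same package because the constructor is protected):

package com.javamakeuse.bd.poc;

public class KeyComparatorDemo {
    public static void main(String[] args) {
        KeyComparator comparator = new KeyComparator();
        CustomKey high = new CustomKey();
        high.setReviewerID(31852);
        high.setRating(5);
        CustomKey low = new CustomKey();
        low.setReviewerID(31852);
        low.setRating(4);
        // Natural order (compareTo) puts the lower rating first...
        System.out.println(low.compareTo(high)); // negative
        // ...but the shuffle-time comparator puts the higher rating first.
        System.out.println(comparator.compare(high, low)); // negative
    }
}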
Step 7. GroupComparator.java
Groups the records reaching the reducer by reviewerID only, so that one reduce() call receives all of a reviewer's reviews even though their keys differ in rating.
package com.javamakeuse.bd.poc;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class GroupComparator extends WritableComparator {

    protected GroupComparator() {
        super(CustomKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CustomKey ip1 = (CustomKey) w1;
        CustomKey ip2 = (CustomKey) w2;
        return ip1.getReviewerID().compareTo(ip2.getReviewerID());
    }
}
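Grouping is the subtle part of secondary sort: two keys that differ only in rating compare as equal here, which is what tells the framework to feed their values into a single reduce() invocation. A hypothetical check (not part of the project):

package com.javamakeuse.bd.poc;

public class GroupComparatorDemo {
    public static void main(String[] args) {
        GroupComparator grouper = new GroupComparator();
        CustomKey k1 = new CustomKey();
        k1.setReviewerID(31922);
        k1.setRating(3);
        CustomKey k2 = new CustomKey();
        k2.setReviewerID(31922);
        k2.setRating(1);
        // 0 means "same group": both records go to one reduce() call.
        System.out.println(grouper.compare(k1, k2)); // 0
    }
}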
Step 8. ProductMapper.java
Mapper class to process the product review dataset and prepare the key/value pairs for the reducer.
package com.javamakeuse.bd.poc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProductMapper extends Mapper<LongWritable, Text, CustomKey, ProductReviewVO> {

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, CustomKey, ProductReviewVO>.Context context)
            throws IOException, InterruptedException {
        String[] columns = value.toString().split("\\t");
        if (columns.length > 3) {
            ProductReviewVO productReviewVO = new ProductReviewVO();
            productReviewVO.setReviewerId(Integer.parseInt(columns[0]));
            productReviewVO.setProductId(columns[1]);
            productReviewVO.setReviewTxt(columns[2]);
            productReviewVO.setRating(Integer.parseInt(columns[3]));
            CustomKey customKey = new CustomKey();
            customKey.setReviewerID(productReviewVO.getReviewerId());
            customKey.setRating(productReviewVO.getRating());
            context.write(customKey, productReviewVO);
        }
    }
}
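The mapper assumes tab-separated input. A hypothetical standalone trace (not part of the project) of how one record is split into the composite key and the value:

public class MapperInputDemo {
    public static void main(String[] args) {
        // One tab-separated line of product_review.data:
        String line = "31852\tB002GZGI4E\thighly recommend\t4\t1252800000";
        String[] columns = line.split("\\t");
        // columns[0]=reviewerID, [1]=productID, [2]=reviewText, [3]=rating
        System.out.println("key   = (" + columns[0] + ", " + columns[3] + ")");
        System.out.println("value = " + columns[1] + " / " + columns[2]);
    }
}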
Step 9. ProductReducer.java
Reducer class to process the mapper output and generate the final output of the MapReduce program. Thanks to the comparators above, the values arrive already sorted by rating in descending order, so the reducer simply writes them out.
package com.javamakeuse.bd.poc;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class ProductReducer extends Reducer<CustomKey, ProductReviewVO, NullWritable, ProductReviewVO> {

    NullWritable nullKey = NullWritable.get();

    @Override
    protected void reduce(CustomKey key, Iterable<ProductReviewVO> values,
            Reducer<CustomKey, ProductReviewVO, NullWritable, ProductReviewVO>.Context context)
            throws IOException, InterruptedException {
        for (ProductReviewVO productReviewVO : values) {
            context.write(nullKey, productReviewVO);
        }
    }
}
Step 10. ReviewDriver.java
Driver class to configure and execute the MapReduce job.
package com.javamakeuse.bd.poc;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReviewDriver extends Configured implements Tool {

    public static void main(String[] args) {
        try {
            int status = ToolRunner.run(new ReviewDriver(), args);
            System.exit(status);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input1> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        // Use getConf() so the generic options parsed by ToolRunner take effect.
        Job job = Job.getInstance(getConf());
        job.setJarByClass(ReviewDriver.class);
        job.setJobName("ProductReview");
        // input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(ProductMapper.class);
        job.setReducerClass(ProductReducer.class);
        job.setPartitionerClass(CustomPartitioner.class);
        job.setSortComparatorClass(KeyComparator.class);
        job.setGroupingComparatorClass(GroupComparator.class);
        job.setMapOutputKeyClass(CustomKey.class);
        job.setMapOutputValueClass(ProductReviewVO.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(ProductReviewVO.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Done. Next, run the program - you can also run it from Eclipse; below are the steps to run it from the terminal.
Step 11. Steps to execute SecondarySorting project
i. Start the Hadoop components - open your terminal and type:
subodh@subodh-Inspiron-3520:~/software$ start-dfs.sh
subodh@subodh-Inspiron-3520:~/software$ start-yarn.sh
ii. Verify that Hadoop started with the jps command.
subodh@subodh-Inspiron-3520:~/software$ jps
8385 NameNode
8547 DataNode
5701 org.eclipse.equinox.launcher_1.3.100.v20150511-1540.jar
9446 Jps
8918 ResourceManager
9054 NodeManager
8751 SecondaryNameNode
You can also verify with the web UI using the "http://localhost:50070/explorer.html#/" URL.
iii. Create the input folder on HDFS using the below command.
subodh@subodh-Inspiron-3520:~/software$ hadoop fs -mkdir /input
The above command creates an input folder on HDFS; you can verify it using the web UI or the hadoop fs -ls / command. Now it is time to move the input file we need to process. Below is the command to copy the product_review.data input file into the input folder on HDFS.
subodh@subodh-Inspiron-3520:~$ hadoop fs -copyFromLocal /home/subodh/programs/input/product_review.data /input
Note - the product_review.data dataset is available inside this project's source code; you can download it from this project's download link.
Step 12. Create & Execute jar file
We are almost done. Now create the jar file of the SecondarySorting source code; you can build it using Eclipse or with the mvn package command.
To execute the SecondarySorting-1.0.jar file, use the below command:
hadoop jar /home/subodh/SecondarySorting-1.0.jar com.javamakeuse.bd.poc.ReviewDriver /input/product_review.data /output
This will print the log below and create an output folder containing the output of the SecondarySorting project.
16/04/06 23:01:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/06 23:01:55 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/04/06 23:01:55 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/04/06 23:01:56 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/04/06 23:01:56 INFO input.FileInputFormat: Total input paths to process : 1
16/04/06 23:01:56 INFO mapreduce.JobSubmitter: number of splits:1
16/04/06 23:01:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1432274046_0001
16/04/06 23:01:56 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/04/06 23:01:56 INFO mapreduce.Job: Running job: job_local1432274046_0001
16/04/06 23:01:56 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/04/06 23:01:56 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/04/06 23:01:56 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/04/06 23:01:56 INFO mapred.LocalJobRunner: Waiting for map tasks
16/04/06 23:01:56 INFO mapred.LocalJobRunner: Starting task: attempt_local1432274046_0001_m_000000_0
16/04/06 23:01:56 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/04/06 23:01:56 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/04/06 23:01:56 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/input/product_review.data:0+416
16/04/06 23:01:56 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/04/06 23:01:56 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/04/06 23:01:56 INFO mapred.MapTask: soft limit at 83886080
16/04/06 23:01:56 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/04/06 23:01:56 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/04/06 23:01:56 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/04/06 23:01:56 INFO mapred.LocalJobRunner:
16/04/06 23:01:56 INFO mapred.MapTask: Starting flush of map output
16/04/06 23:01:56 INFO mapred.MapTask: Spilling map output
16/04/06 23:01:56 INFO mapred.MapTask: bufstart = 0; bufend = 407; bufvoid = 104857600
16/04/06 23:01:56 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214364(104857456); length = 33/6553600
16/04/06 23:01:56 INFO mapred.MapTask: Finished spill 0
16/04/06 23:01:56 INFO mapred.Task: Task:attempt_local1432274046_0001_m_000000_0 is done. And is in the process of committing
16/04/06 23:01:56 INFO mapred.LocalJobRunner: map
16/04/06 23:01:56 INFO mapred.Task: Task 'attempt_local1432274046_0001_m_000000_0' done.
16/04/06 23:01:56 INFO mapred.LocalJobRunner: Finishing task: attempt_local1432274046_0001_m_000000_0
16/04/06 23:01:56 INFO mapred.LocalJobRunner: map task executor complete.
16/04/06 23:01:56 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/04/06 23:01:56 INFO mapred.LocalJobRunner: Starting task: attempt_local1432274046_0001_r_000000_0
16/04/06 23:01:56 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/04/06 23:01:56 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/04/06 23:01:56 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@42c355f5
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/04/06 23:01:56 INFO reduce.EventFetcher: attempt_local1432274046_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/04/06 23:01:56 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1432274046_0001_m_000000_0 decomp: 427 len: 431 to MEMORY
16/04/06 23:01:56 INFO reduce.InMemoryMapOutput: Read 427 bytes from map-output for attempt_local1432274046_0001_m_000000_0
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 427, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->427
16/04/06 23:01:56 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/04/06 23:01:56 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/04/06 23:01:56 INFO mapred.Merger: Merging 1 sorted segments
16/04/06 23:01:56 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 417 bytes
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: Merged 1 segments, 427 bytes to disk to satisfy reduce memory limit
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: Merging 1 files, 431 bytes from disk
16/04/06 23:01:56 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/04/06 23:01:56 INFO mapred.Merger: Merging 1 sorted segments
16/04/06 23:01:56 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 417 bytes
16/04/06 23:01:56 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/04/06 23:01:56 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/04/06 23:01:57 INFO mapreduce.Job: Job job_local1432274046_0001 running in uber mode : false
16/04/06 23:01:57 INFO mapreduce.Job: map 100% reduce 0%
16/04/06 23:01:57 INFO mapred.Task: Task:attempt_local1432274046_0001_r_000000_0 is done. And is in the process of committing
16/04/06 23:01:57 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/04/06 23:01:57 INFO mapred.Task: Task attempt_local1432274046_0001_r_000000_0 is allowed to commit now
16/04/06 23:01:57 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1432274046_0001_r_000000_0' to hdfs://localhost:9000/output/_temporary/0/task_local1432274046_0001_r_000000
16/04/06 23:01:57 INFO mapred.LocalJobRunner: reduce > reduce
16/04/06 23:01:57 INFO mapred.Task: Task 'attempt_local1432274046_0001_r_000000_0' done.
16/04/06 23:01:57 INFO mapred.LocalJobRunner: Finishing task: attempt_local1432274046_0001_r_000000_0
16/04/06 23:01:57 INFO mapred.LocalJobRunner: reduce task executor complete.
16/04/06 23:01:58 INFO mapreduce.Job: map 100% reduce 100%
16/04/06 23:01:58 INFO mapreduce.Job: Job job_local1432274046_0001 completed successfully
16/04/06 23:01:58 INFO mapreduce.Job: Counters: 35
	File System Counters
		FILE: Number of bytes read=21604
		FILE: Number of bytes written=579171
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=832
		HDFS: Number of bytes written=848
		HDFS: Number of read operations=13
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Map-Reduce Framework
		Map input records=9
		Map output records=9
		Map output bytes=407
		Map output materialized bytes=431
		Input split bytes=112
		Combine input records=0
		Combine output records=0
		Reduce input groups=7
		Reduce shuffle bytes=431
		Reduce input records=9
		Reduce output records=9
		Spilled Records=18
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=5
		Total committed heap usage (bytes)=496500736
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=416
	File Output Format Counters
		Bytes Written=848
Step 13. Verify the output
Once the job completes, verify on HDFS that the records are grouped by reviewerID with each reviewer's ratings in descending order, as sketched below - and that's it!
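A minimal way to inspect the result from the terminal (assuming the /output path used above and Hadoop's default part-file naming):

subodh@subodh-Inspiron-3520:~$ hadoop fs -ls /output
subodh@subodh-Inspiron-3520:~$ hadoop fs -cat /output/part-r-00000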
Download the complete example from here Source Code