Thursday, January 5, 2012

Custom File Output in Hadoop

I had been curious about the best way to control text output from a Hadoop job and decided it was best to create my own output format class. By simply extending the default FileOutputFormat class you can control all aspects of the output.

The first step is to create the output format class. You need to be aware of the types emitted by the reduce function. In my case, the reducer sends a Text key and a list of integers associated with it, so the generic type parameters need to be defined accordingly (a sketch of a matching reducer follows the class below). The simple class looks like this:


import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
    @Override
    public RecordWriter<Text, List<IntWritable>> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // get the output directory configured for the job
        Path path = FileOutputFormat.getOutputPath(context);
        // build the full path: the output directory plus our filename
        Path fullPath = new Path(path, "result.txt");

        // create the file in the file system
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(fullPath, context);

        // create our record writer with the new file
        return new MyCustomRecordWriter(fileOut);
    }
}
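
For context, here's a minimal sketch of a reducer whose output types line up with this format class. The input types and the aggregation logic here are assumptions for illustration; only the output types, Text and List<IntWritable>, are dictated by MyTextOutputFormat:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, List<IntWritable>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // collect the values for this key into a list; Hadoop reuses the
        // IntWritable instance, so copy each value into a fresh object
        List<IntWritable> list = new ArrayList<IntWritable>();
        for (IntWritable value : values) {
            list.add(new IntWritable(value.get()));
        }
        // this key/list pair is what gets handed to our record writer
        context.write(key, list);
    }
}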


After figuring out the full path and creating our output file, we then need to create an instance of the actual record writer to pass back to the reduce task. That class allows us to actually write to the file however we want. Again, the generic type parameters need to match the types coming from the reducer. My class looks like this:

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
    private final DataOutputStream out;

    public MyCustomRecordWriter(DataOutputStream stream) throws IOException {
        out = stream;
        // write a header line at the top of the file
        out.writeBytes("results:\r\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        // close our file
        out.close();
    }

    @Override
    public void write(Text key, List<IntWritable> values) throws IOException, InterruptedException {
        // write out our key
        out.writeBytes(key.toString() + ": ");
        // loop through all values associated with our key and write them with commas between
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                out.writeBytes(",");
            }
            out.writeBytes(String.valueOf(values.get(i)));
        }
        out.writeBytes("\r\n");
    }
}


Finally, we need to tell our job about our output format and the output path before running it.


job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));
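
For completeness, here's a rough sketch of a full driver, assuming the Hadoop 1.x mapreduce API that was current at the time and hypothetical MyDriver/MyMapper classes; only the four lines above are specific to the custom format:

Configuration conf = new Configuration();
Job job = new Job(conf, "custom output example");
job.setJarByClass(MyDriver.class); // MyDriver is a hypothetical driver class

job.setMapperClass(MyMapper.class); // MyMapper is a hypothetical mapper
job.setReducerClass(MyReducer.class);

// map output types, assumed to match the reducer sketch above
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

// final output types and our custom format
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);

// the format writes a single fixed filename, so run one reduce task
// to keep multiple tasks from colliding on result.txt
job.setNumReduceTasks(1);

FileInputFormat.addInputPath(job, new Path("/home/hadoop/in")); // input path is an assumption
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));

System.exit(job.waitForCompletion(true) ? 0 : 1);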



And that's it. Two simple classes give us all the control we need. For reference, given the header written in the constructor and the format used in write(), the result.txt in the output directory would look something like this (the keys and values here are made up):
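
results:
cat: 3,7,12
dog: 1,4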
