Thursday, January 5, 2012

Custom File Output in Hadoop

I had been curious about the best way to control text output from a Hadoop job and decided it was best to create my own output format class. By simply extending the default FileOutputFormat class you can control all aspects of the output.

The first step is to create the output format class. You need to be aware of the types emitted by the reduce function. In my case, the reducer sends a Text key and a list of integers associated with it, so the generic type parameters need to be defined accordingly (a sketch of a matching reducer follows the class below). The simple class looks like this:


import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyTextOutputFormat extends FileOutputFormat<Text, List<IntWritable>> {
    @Override
    public RecordWriter<Text, List<IntWritable>> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // get the output directory configured for the job
        Path path = FileOutputFormat.getOutputPath(context);
        // build the full path: the output directory plus our filename
        Path fullPath = new Path(path, "result.txt");

        // create the file in the file system
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(fullPath, context);

        // create our record writer with the new file
        return new MyCustomRecordWriter(fileOut);
    }
}
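
For context, here's a minimal sketch of a reducer whose output types line up with this format class. The input types and the aggregation logic here are assumptions for illustration; only the output types, Text and List<IntWritable>, are dictated by MyTextOutputFormat:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, List<IntWritable>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // collect the values for this key into a list; Hadoop reuses the
        // IntWritable instance, so copy each value into a fresh object
        List<IntWritable> list = new ArrayList<IntWritable>();
        for (IntWritable value : values) {
            list.add(new IntWritable(value.get()));
        }
        // this key/list pair is what gets handed to our record writer
        context.write(key, list);
    }
}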


After figuring out the full path and creating our output file, we then need to create an instance of the actual record writer to pass back to the reduce task. That class allows us to actually write to the file however we want. Again, the generic type parameters need to match the types coming from the reducer. My class looks like this:

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MyCustomRecordWriter extends RecordWriter<Text, List<IntWritable>> {
    private final DataOutputStream out;

    public MyCustomRecordWriter(DataOutputStream stream) throws IOException {
        out = stream;
        // write a header line at the top of the file
        out.writeBytes("results:\r\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        // close our file
        out.close();
    }

    @Override
    public void write(Text key, List<IntWritable> values) throws IOException, InterruptedException {
        // write out our key
        out.writeBytes(key.toString() + ": ");
        // loop through all values associated with our key and write them with commas between
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                out.writeBytes(",");
            }
            out.writeBytes(String.valueOf(values.get(i)));
        }
        out.writeBytes("\r\n");
    }
}


Finally, we need to tell our job about our output format and the output path before running it.


job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));
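
For completeness, here's a rough sketch of a full driver, assuming the Hadoop 1.x mapreduce API that was current at the time and hypothetical MyDriver/MyMapper classes; only the four lines above are specific to the custom format:

Configuration conf = new Configuration();
Job job = new Job(conf, "custom output example");
job.setJarByClass(MyDriver.class); // MyDriver is a hypothetical driver class

job.setMapperClass(MyMapper.class); // MyMapper is a hypothetical mapper
job.setReducerClass(MyReducer.class);

// map output types, assumed to match the reducer sketch above
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

// final output types and our custom format
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ArrayList.class);
job.setOutputFormatClass(MyTextOutputFormat.class);

// the format writes a single fixed filename, so run one reduce task
// to keep multiple tasks from colliding on result.txt
job.setNumReduceTasks(1);

FileInputFormat.addInputPath(job, new Path("/home/hadoop/in")); // input path is an assumption
FileOutputFormat.setOutputPath(job, new Path("/home/hadoop/out"));

System.exit(job.waitForCompletion(true) ? 0 : 1);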



And that's it. Two simple classes give us all the control we need. For reference, given the header written in the constructor and the format used in write(), the result.txt in the output directory would look something like this (the keys and values here are made up):
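
results:
cat: 3,7,12
dog: 1,4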
