Hi, (I'm assuming 1.0~ MR here)
On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <[email protected]> wrote:
> Classes implementing InputFormat implement
> public List<InputSplit> getSplits(JobContext job) which a List if
> InputSplits. for FileInputFormat the Splits have Path.start and End
>
> 1) When is this method called and on which JVM on Which Machine and is it
> called only once?

Called only on the client, i.e. your "hadoop jar" JVM. Called only once.

> 2) Do the number of Map task correspond to the number of splits returned by
> getSplits?

Yes, number of split objects == number of mappers.

> 3) InputFormat implements a method
> RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext
> context ). Is this executed within the JVM of the Mapper on the slave
> machine and does the RecordReader run within that JVM

RecordReaders are not created on the client-side JVM. RecordReaders are
created in the Map task JVMs, and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)

If you speak of the LineRecordReader, each map() call simply receives one
line, i.e. the text up to the next \n, and within a split the lines arrive
in file order. It is not language-aware, so it does not understand the
meaning of hyphens, etc. You can implement a custom reader to do this
however - there should be no problems so long as your logic ensures that no
data is read twice across multiple maps.

--
Harsh J
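
P.S. To illustrate the "no duplicate reads" point, here is a rough stand-alone sketch in plain Java. It is not Hadoop code: the class SplitLineDemo and its methods splits() and readSplit() are hypothetical names made up for this example. It assumes the convention (as I understand the 1.0-era LineRecordReader) that a line belongs to the split containing its first byte, so a reader whose split does not start at offset 0 skips the partial line at its start, and every reader finishes its last line even if that runs past the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone simulation; none of this is Hadoop's actual API.
final class SplitLineDemo {

    // Cut [0, length) into byte ranges {start, end} of at most splitSize
    // bytes, ignoring line boundaries (roughly what a client-side
    // getSplits() does for a file).
    static List<long[]> splits(long length, long splitSize) {
        List<long[]> out = new ArrayList<>();
        for (long start = 0; start < length; start += splitSize) {
            out.add(new long[] {start, Math.min(start + splitSize, length)});
        }
        return out;
    }

    // Return the lines owned by the split [start, end): every line whose
    // first byte lies inside the range, read to its end even past `end`.
    static List<String> readSplit(byte[] data, long start, long end) {
        int pos = (int) start;
        if (start != 0) {
            // Back up one byte and skip through the next '\n', so a line that
            // began in the previous split is left to that split's reader.
            pos = (int) start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // One "map task" per split; print which lines each one would see.
        for (long[] s : splits(data.length, 7)) {
            System.out.println(s[0] + "-" + s[1] + " " + readSplit(data, s[0], s[1]));
        }
    }
}
```

With 7-byte splits over that 17-byte input, the first reader gets "alpha" and "beta" (the latter straddles the boundary), the second gets "gamma", and the third gets nothing. A custom reader that joins hyphenated words would need an equally unambiguous ownership rule for a word that spans two splits.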
