Hi Jay,

This may be off-topic to you, but I feel it's related: use Avro DataFiles. There's Python support already available, as well as several other languages.
On Tue, Sep 25, 2012 at 10:57 PM, Jay Vyas <[email protected]> wrote:
> Hi guys!
>
> I'm trying to read some Hadoop-outputted Thrift files in plain old Java
> (without using SequenceFile.Reader). The reason for this is that I
>
> (1) want to understand the sequence file format better and
> (2) would like to be able to port this code to a language which doesn't
> have robust Hadoop sequence file I/O / Thrift support (Python).
>
> So, before reading forward, if anyone has:
>
> 1) Some general hints on how to create a sequence file with
> Thrift-encoded key/values in Python
> 2) Some tips on the generic approach for reading a SequenceFile (the
> comments in the SequenceFile header seem to be a bit underspecified)
>
> I'd appreciate it!
>
> Now, here is my adventure into Thrift/HDFS sequence file I/O.
>
> I've written a simple stub which, I think, should be the start of a
> sequence file reader (it just tries to skip the header and get straight
> to the data), but it doesn't handle compression:
>
> http://pastebin.com/vyfgjML9
>
> This code appears to fail with the cryptic error "don't know what
> type: 15", which comes from a case statement that attempts to determine
> what type of Thrift record is being read in:
>
> private byte getTType(byte type) throws TProtocolException {
>   switch ((byte)(type & 0x0f)) {
>     case TType.STOP:
>       return TType.STOP;
>     case Types.BOOLEAN_FALSE:
>     case Types.BOOLEAN_TRUE:
>       return TType.BOOL;
>     ........
>     case Types.STRUCT:
>       return TType.STRUCT;
>     default:
>       throw new TProtocolException("don't know what type: " +
>           (byte)(type & 0x0f));
>   }
> }
>
> Upon further investigation, I have found that the Configuration object
> is (of course) heavily utilized by the SequenceFile reader, in
> particular to determine the codec.
> That corroborates my hypothesis that the data needs to be decompressed
> or decoded before it can be deserialized by Thrift.
>
> So, I guess what I'm assuming is missing here is that I don't know how
> to manually reproduce the codec/gzip logic inside SequenceFile.Reader in
> plain old Java (i.e., without cheating and using the SequenceFile.Reader
> class that is configured in our MapReduce source code).
>
> With my end goal being to read the file in Python, I think it would be
> nice to be able to read the SequenceFile in Java first and use that as a
> template (since I know that my Thrift objects and serialization work
> correctly in my current Java codebase when read through the
> SequenceFile.Reader API).
>
> Any suggestions on how I can distill the logic of the SequenceFile.Reader
> class into a simplified version specific to my data, so that I can start
> porting it into a Python script capable of scanning a few real
> SequenceFiles off of HDFS, would be much appreciated!
>
> In general, what are the core steps for doing I/O with sequence files
> that are compressed and/or serialized in different formats? Do we
> decompress first and then deserialize, or do both at the same time?
> Thanks!
>
> PS: I've added an issue on GitHub here,
> https://github.com/matteobertozzi/Hadoop/issues/5, for a Python
> SequenceFile reader. If I get some helpful hints on this thread maybe I
> can directly implement an example on matteobertozzi's Python Hadoop
> trunk.
>
> --
> Jay Vyas
> MMSB/UCHC

--
Harsh J
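On the "core steps" question above: the reader deals with the container first (header, sync markers, codec decompression of the record bytes) and only then hands plain bytes to the serializer, so for a record-compressed file you decompress each value blob and *then* Thrift-deserialize it. As a starting point for the Python port, here is a rough sketch, not an authoritative implementation, of parsing a version-6, uncompressed SequenceFile: the header is "SEQ", a version byte, vint-length-prefixed key/value class names, two compression booleans (a codec class name follows only if compression is on), a metadata map, and a 16-byte sync marker; each record is then a 4-byte record length, 4-byte key length, key bytes, value bytes, with a length of -1 escaping a sync marker. The demo bytes at the bottom are synthetic, not real Hadoop output:

```python
import io
import struct

SYNC_ESCAPE = -1  # a record length of -1 marks an embedded sync marker

def read_vint(f):
    """Decode a Hadoop WritableUtils-style vint (first byte encodes size)."""
    first = struct.unpack(">b", f.read(1))[0]
    if first >= -112:
        return first                     # small values fit in one byte
    negative = first < -120
    length = (-119 - first) if negative else (-111 - first)
    value = 0
    for b in f.read(length):
        value = (value << 8) | b
    return ~value if negative else value

def read_text(f):
    """A Hadoop Text: vint byte-length followed by UTF-8 bytes."""
    return f.read(read_vint(f)).decode("utf-8")

def read_header(f):
    assert f.read(3) == b"SEQ", "not a SequenceFile"
    version = f.read(1)[0]               # this sketch assumes version 6
    key_class = read_text(f)
    value_class = read_text(f)
    compressed = f.read(1) != b"\x00"
    block_compressed = f.read(1) != b"\x00"
    codec = read_text(f) if compressed else None
    n_meta = struct.unpack(">i", f.read(4))[0]
    metadata = {read_text(f): read_text(f) for _ in range(n_meta)}
    sync = f.read(16)                    # per-file random sync marker
    return {"version": version, "key_class": key_class,
            "value_class": value_class, "compressed": compressed,
            "block_compressed": block_compressed, "codec": codec,
            "metadata": metadata, "sync": sync}

def read_records(f, header):
    """Yield raw (key_bytes, value_bytes) pairs for an uncompressed file.
    The value bytes are what you would hand to a Thrift deserializer;
    a record-compressed file would need codec decompression here first."""
    while True:
        raw = f.read(4)
        if len(raw) < 4:
            return
        record_len = struct.unpack(">i", raw)[0]
        if record_len == SYNC_ESCAPE:
            assert f.read(16) == header["sync"], "sync marker mismatch"
            continue
        key_len = struct.unpack(">i", f.read(4))[0]
        yield f.read(key_len), f.read(record_len - key_len)

# Demo: build a tiny synthetic uncompressed file in memory, then read it.
def _text(s):
    b = s.encode("utf-8")
    return bytes([len(b)]) + b           # single-byte vint; names are short

sync = b"\x00" * 16
header_bytes = (b"SEQ" + bytes([6])
                + _text("org.apache.hadoop.io.BytesWritable") * 2
                + b"\x00\x00"            # compress=false, blockCompress=false
                + struct.pack(">i", 0)   # no metadata entries
                + sync)
key, value = b"k1", b"thrift-bytes-here"
record = struct.pack(">ii", len(key) + len(value), len(key)) + key + value

f = io.BytesIO(header_bytes + record)
header = read_header(f)
for k, v in read_records(f, header):
    print(k, v)                          # b'k1' b'thrift-bytes-here'
```

For block-compressed files the layout differs again (whole blocks of keys and values are compressed together), so a real port should branch on the two header booleans before touching the record stream.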
