You can ship the module along with a symlink and have Python auto-import it, since the task's working directory ("." ) is effectively on sys.path. I can imagine that helping you get Pydoop on a cluster without Pydoop on all nodes (or other libs).
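The trick described above can be sketched as follows: a module shipped with streaming's `-files` option lands in the task's current working directory, and Python resolves a plain `import` against the script's own directory, so the mapper can use it without the library being installed cluster-wide. A minimal local simulation of that layout (the module name `mylib.py` and its contents are hypothetical, purely for illustration):

```python
# Simulate a streaming task's working directory: a module shipped via
# "-files mylib.py" appears next to the mapper script and is importable.
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()  # stands in for the task's cwd

# The "shipped" module; in a real job, -files would place it here.
with open(os.path.join(workdir, "mylib.py"), "w") as f:
    f.write("def tag(line):\n    return '1. ' + line\n")

# A mapper script launched from workdir sees workdir on sys.path
# automatically; we add it explicitly since this demo runs elsewhere.
sys.path.insert(0, workdir)
import mylib

print(mylib.tag("hello"))  # -> 1. hello
```

This simulates only the import mechanics; whether a given library (such as Pydoop, with its compiled extensions) works when shipped this way depends on its native dependencies being present on the nodes.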
On Thu, Jul 12, 2012 at 11:08 PM, Connell, Chuck <[email protected]> wrote:
> Thanks yet again. Since my goal is to run an existing Python program, as is,
> under MR, it looks like I need the os.system(copy-local-to-hdfs) technique.
>
> Chuck
>
> -----Original Message-----
> From: Harsh J [mailto:[email protected]]
> Sent: Thursday, July 12, 2012 1:15 PM
> To: [email protected]
> Subject: Re: Extra output files from mapper ?
>
> Unfortunately Python does not recognize hdfs:// URIs. It isn't a standard
> like HTTP is, so to say, at least not yet :)
>
> You can instead use Pydoop's HDFS API, though:
> http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#hdfs-api
> The Pydoop authors are pretty active and do releases from time to time.
> See the open() method in the API, and use it with the write flag (Pythonic
> style).
>
> On Thu, Jul 12, 2012 at 9:31 PM, Connell, Chuck <[email protected]> wrote:
>> Thank you. I will try that.
>>
>> A related question... Shouldn't I just be able to create HDFS files directly
>> from a Python open statement, when running within MR, like this? It does not
>> seem to work as intended.
>>
>> outfile1 = open("hdfs://localhost/tmp/out1.txt", 'w')
>>
>> Chuck
>>
>> -----Original Message-----
>> From: Harsh J [mailto:[email protected]]
>> Sent: Thursday, July 12, 2012 10:58 AM
>> To: [email protected]
>> Subject: Re: Extra output files from mapper ?
>>
>> Chuck,
>>
>> Note that regular file opens from within an MR program (be it streaming
>> or Java) will create files on the local file system of the node the
>> task executed on.
>>
>> Hence, at the end of your script, move them to HDFS after closing them.
>> Something like:
>>
>> os.system("hadoop fs -put outfile1.txt /path/on/hdfs/file.txt")
>>
>> (Or via a Python lib API for HDFS.)
>>
>> On Thu, Jul 12, 2012 at 8:08 PM, Connell, Chuck <[email protected]> wrote:
>>> Here is a test case...
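The copy-local-to-HDFS technique settled on above can be sketched with subprocess rather than os.system, which avoids shell quoting problems if a path contains spaces. The helper below only builds the argv list (the paths are illustrative); in a real task you would hand the list to subprocess.call, which requires the hadoop CLI on the node's PATH:

```python
import subprocess

def hdfs_put_cmd(local_path, hdfs_path):
    """Build the argv for copying a task-local output file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

cmd = hdfs_put_cmd("outfile1.txt", "/path/on/hdfs/file.txt")
print(" ".join(cmd))  # -> hadoop fs -put outfile1.txt /path/on/hdfs/file.txt

# At the end of the mapper, after closing the file:
# status = subprocess.call(cmd)
# if status != 0:
#     raise RuntimeError("hadoop fs -put failed")
```

Passing a list (not a single string) means no shell is involved, so arguments are delivered to the hadoop command exactly as written.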
>>>
>>> The Python code (file_io.py) that I want to run as a map-only job is below.
>>> It takes one input file (not stdin) and creates two output files (not
>>> stdout).
>>>
>>> #!/usr/bin/env python
>>>
>>> import sys
>>>
>>> infile = open(sys.argv[1], 'r')
>>> outfile1 = open(sys.argv[2], 'w')
>>> outfile2 = open(sys.argv[3], 'w')
>>>
>>> for line in infile:
>>>     sys.stdout.write(line)  # just to verify that infile is being read correctly
>>>     outfile1.write("1. " + line)
>>>     outfile2.write("2. " + line)
>>>
>>> But since MapReduce streaming likes to use stdio, I put my job in a
>>> Python wrapper (file_io_wrap.py):
>>>
>>> #!/usr/bin/env python
>>>
>>> import sys
>>> from subprocess import call
>>>
>>> # Eat input stream on stdin.
>>> line = sys.stdin.readline()
>>> while line:
>>>     line = sys.stdin.readline()
>>>
>>> # Call the real program.
>>> status = call(["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])
>>>
>>> # Write to stdout.
>>> if status == 0:
>>>     sys.stdout.write("Success.")
>>> else:
>>>     sys.stdout.write("Subprocess call failed.")
>>>
>>> Finally, I call the streaming job from this shell script:
>>>
>>> #!/bin/bash
>>>
>>> # Find the latest streaming jar.
>>> STREAM="hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar"
>>>
>>> # The input file should explicitly use hdfs: to avoid confusion with a local file.
>>> # The output dir should not exist.
>>> # The mapper and reducer should explicitly state "python XXX.py"
>>> # rather than just "XXX.py".
>>>
>>> $STREAM \
>>>     -files "hdfs://localhost/tmp/input/in1.txt#in1.txt" \
>>>     -files "hdfs://localhost/tmp/out1.txt#out1.txt" \
>>>     -files "hdfs://localhost/tmp/out2.txt#out2.txt" \
>>>     -file file_io_wrap.py \
>>>     -file file_io.py \
>>>     -input "hdfs://localhost/tmp/input/empty.txt" \
>>>     -mapper "python file_io_wrap.py" \
>>>     -reducer NONE \
>>>     -output /tmp/output20
>>>
>>> The result is that the whole job runs correctly and the input file is
>>> read correctly. I can see a copy of the input file in part-0000. But
>>> the output files (out1.txt and out2.txt) are nowhere to be found. I
>>> suspect they were created somewhere, but where? And how can I control
>>> where they are created?
>>>
>>> Thank you,
>>>
>>> Chuck Connell
>>> Nuance R&D Data Team
>>> Burlington, MA
>>>
>>> From: Connell, Chuck [mailto:[email protected]]
>>> Sent: Wednesday, July 11, 2012 4:48 PM
>>> To: [email protected]
>>> Subject: Extra output files from mapper ?
>>>
>>> I am using MapReduce streaming with Python code. It works fine for
>>> basic stdin and stdout.
>>>
>>> But I have a mapper-only application that also emits some other
>>> output files. So in addition to stdout, the program also creates
>>> files named output1.txt and output2.txt. My code seems to be running
>>> correctly, and I suspect the proper output files are being created
>>> somewhere, but I cannot find them after the job finishes.
>>>
>>> I tried using the -files option to create a link to the location I
>>> want the file, but no luck. I tried using some of the -jobconf
>>> options to change the various working directories, but no luck.
>>>
>>> Thank you.
>>>
>>> Chuck Connell
>>> Nuance R&D Data Team
>>> Burlington, MA

--
Harsh J
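Putting the thread's answer together: the side files are created in the task's local working directory on whichever node ran the mapper, so the wrapper should drain stdin, run the real program against task-local names, and then ship the outputs to HDFS before exiting. The sketch below factors the upload step out as a parameter so the logic can be exercised without a Hadoop install; the default `ship_with_hadoop` is a stand-in using the `hadoop fs -put` command from the thread, and the destination directory in the comments is illustrative:

```python
import subprocess

def ship_with_hadoop(local_path, dest):
    # Real upload used inside a task (requires the hadoop CLI on PATH).
    return subprocess.call(["hadoop", "fs", "-put", local_path, dest])

def run_and_ship(argv, outputs, dest_dir, ship_fn=ship_with_hadoop):
    """Run the real program, then move its side output files to dest_dir.

    argv     -- command for the real program, e.g. ["python", "file_io.py", ...]
    outputs  -- task-local files the program creates
    dest_dir -- target directory (an HDFS path in a real job)
    """
    status = subprocess.call(argv)
    if status != 0:
        return status
    for path in outputs:
        ship_fn(path, dest_dir)
    return 0

# A revised file_io_wrap.py mapper would then do something like:
#
#     sys.stdin.read()  # drain the (empty) streaming input
#     status = run_and_ship(
#         ["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"],
#         ["out1.txt", "out2.txt"],
#         "/tmp/output20_side",  # illustrative HDFS directory
#     )
#     sys.stdout.write("Success." if status == 0 else "Subprocess call failed.")
```

Because the upload happens inside the task, each mapper attempt must write to a distinct HDFS name (or directory) to avoid collisions when tasks are retried or run speculatively; that caveat is not addressed in the thread but follows from how task attempts work.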
