You can ship the module along with a symlink and have Python auto-import it, since the task's working directory ("." ) is effectively on sys.path. I can imagine that helping you get Pydoop on a cluster without Pydoop on all nodes (or other libs).
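The trick described above can be sketched as follows: a module shipped with streaming's `-files` option lands in the task's current working directory, and Python resolves a plain `import` against the script's own directory, so the mapper can use it without the library being installed cluster-wide. A minimal local simulation of that layout (the module name `mylib.py` and its contents are hypothetical, purely for illustration):

```python
# Simulate a streaming task's working directory: a module shipped via
# "-files mylib.py" appears next to the mapper script and is importable.
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()  # stands in for the task's cwd

# The "shipped" module; in a real job, -files would place it here.
with open(os.path.join(workdir, "mylib.py"), "w") as f:
    f.write("def tag(line):\n    return '1. ' + line\n")

# A mapper script launched from workdir sees workdir on sys.path
# automatically; we add it explicitly since this demo runs elsewhere.
sys.path.insert(0, workdir)
import mylib

print(mylib.tag("hello"))  # -> 1. hello
```

This simulates only the import mechanics; whether a given library (such as Pydoop, with its compiled extensions) works when shipped this way depends on its native dependencies being present on the nodes.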
On Thu, Jul 12, 2012 at 11:08 PM, Connell, Chuck <[email protected]> wrote:
> Thanks yet again. Since my goal is to run an existing Python program, as is,
> under MR, it looks like I need the os.system(copy-local-to-hdfs) technique.
>
> Chuck
>
> -----Original Message-----
> From: Harsh J [mailto:[email protected]]
> Sent: Thursday, July 12, 2012 1:15 PM
> To: [email protected]
> Subject: Re: Extra output files from mapper ?
>
> Unfortunately Python does not recognize hdfs:// URIs. It isn't a standard
> like HTTP is, so to say, at least not yet :)
>
> You can instead use Pydoop's HDFS API, though:
> http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#hdfs-api
> The Pydoop authors are pretty active and do releases from time to time.
> See the open() method in the API, and use it with the write flag (Pythonic
> style).
>
> On Thu, Jul 12, 2012 at 9:31 PM, Connell, Chuck <[email protected]> wrote:
>> Thank you. I will try that.
>>
>> A related question... Shouldn't I just be able to create HDFS files directly
>> from a Python open statement, when running within MR, like this? It does not
>> seem to work as intended.
>>
>> outfile1 = open("hdfs://localhost/tmp/out1.txt", 'w')
>>
>> Chuck
>>
>> -----Original Message-----
>> From: Harsh J [mailto:[email protected]]
>> Sent: Thursday, July 12, 2012 10:58 AM
>> To: [email protected]
>> Subject: Re: Extra output files from mapper ?
>>
>> Chuck,
>>
>> Note that regular file opens from within an MR program (be it streaming
>> or Java) will create files on the local file system of the node the
>> task executed on.
>>
>> Hence, at the end of your script, move them to HDFS after closing them.
>> Something like:
>>
>> os.system("hadoop fs -put outfile1.txt /path/on/hdfs/file.txt")
>>
>> (Or via a Python lib API for HDFS.)
>>
>> On Thu, Jul 12, 2012 at 8:08 PM, Connell, Chuck <[email protected]> wrote:
>>> Here is a test case...
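The copy-local-to-HDFS technique settled on above can be sketched with subprocess rather than os.system, which avoids shell quoting problems if a path contains spaces. The helper below only builds the argv list (the paths are illustrative); in a real task you would hand the list to subprocess.call, which requires the hadoop CLI on the node's PATH:

```python
import subprocess

def hdfs_put_cmd(local_path, hdfs_path):
    """Build the argv for copying a task-local output file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

cmd = hdfs_put_cmd("outfile1.txt", "/path/on/hdfs/file.txt")
print(" ".join(cmd))  # -> hadoop fs -put outfile1.txt /path/on/hdfs/file.txt

# At the end of the mapper, after closing the file:
# status = subprocess.call(cmd)
# if status != 0:
#     raise RuntimeError("hadoop fs -put failed")
```

Passing a list (not a single string) means no shell is involved, so arguments are delivered to the hadoop command exactly as written.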
>>>
>>> The Python code (file_io.py) that I want to run as a map-only job is below.
>>> It takes one input file (not stdin) and creates two output files (not
>>> stdout).
>>>
>>> #!/usr/bin/env python
>>>
>>> import sys
>>>
>>> infile = open(sys.argv[1], 'r')
>>> outfile1 = open(sys.argv[2], 'w')
>>> outfile2 = open(sys.argv[3], 'w')
>>>
>>> for line in infile:
>>>     sys.stdout.write(line)  # just to verify that infile is being read correctly
>>>     outfile1.write("1. " + line)
>>>     outfile2.write("2. " + line)
>>>
>>> But since MapReduce streaming likes to use stdio, I put my job in a
>>> Python wrapper (file_io_wrap.py):
>>>
>>> #!/usr/bin/env python
>>>
>>> import sys
>>> from subprocess import call
>>>
>>> # Eat input stream on stdin.
>>> line = sys.stdin.readline()
>>> while line:
>>>     line = sys.stdin.readline()
>>>
>>> # Call the real program.
>>> status = call(["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"])
>>>
>>> # Write to stdout.
>>> if status == 0:
>>>     sys.stdout.write("Success.")
>>> else:
>>>     sys.stdout.write("Subprocess call failed.")
>>>
>>> Finally, I call the streaming job from this shell script:
>>>
>>> #!/bin/bash
>>>
>>> # Find the latest streaming jar.
>>> STREAM="hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar"
>>>
>>> # The input file should explicitly use hdfs: to avoid confusion with a local file.
>>> # The output dir should not exist.
>>> # The mapper and reducer should explicitly state "python XXX.py"
>>> # rather than just "XXX.py".
>>>
>>> $STREAM \
>>>     -files "hdfs://localhost/tmp/input/in1.txt#in1.txt" \
>>>     -files "hdfs://localhost/tmp/out1.txt#out1.txt" \
>>>     -files "hdfs://localhost/tmp/out2.txt#out2.txt" \
>>>     -file file_io_wrap.py \
>>>     -file file_io.py \
>>>     -input "hdfs://localhost/tmp/input/empty.txt" \
>>>     -mapper "python file_io_wrap.py" \
>>>     -reducer NONE \
>>>     -output /tmp/output20
>>>
>>> The result is that the whole job runs correctly and the input file is
>>> read correctly. I can see a copy of the input file in part-0000. But
>>> the output files (out1.txt and out2.txt) are nowhere to be found. I
>>> suspect they were created somewhere, but where? And how can I control
>>> where they are created?
>>>
>>> Thank you,
>>>
>>> Chuck Connell
>>> Nuance R&D Data Team
>>> Burlington, MA
>>>
>>> From: Connell, Chuck [mailto:[email protected]]
>>> Sent: Wednesday, July 11, 2012 4:48 PM
>>> To: [email protected]
>>> Subject: Extra output files from mapper ?
>>>
>>> I am using MapReduce streaming with Python code. It works fine for
>>> basic stdin and stdout.
>>>
>>> But I have a mapper-only application that also emits some other
>>> output files. So in addition to stdout, the program also creates
>>> files named output1.txt and output2.txt. My code seems to be running
>>> correctly, and I suspect the proper output files are being created
>>> somewhere, but I cannot find them after the job finishes.
>>>
>>> I tried using the -files option to create a link to the location I
>>> want the file, but no luck. I tried using some of the -jobconf
>>> options to change the various working directories, but no luck.
>>>
>>> Thank you.
>>>
>>> Chuck Connell
>>> Nuance R&D Data Team
>>> Burlington, MA

--
Harsh J
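Putting the thread's answer together: the side files are created in the task's local working directory on whichever node ran the mapper, so the wrapper should drain stdin, run the real program against task-local names, and then ship the outputs to HDFS before exiting. The sketch below factors the upload step out as a parameter so the logic can be exercised without a Hadoop install; the default `ship_with_hadoop` is a stand-in using the `hadoop fs -put` command from the thread, and the destination directory in the comments is illustrative:

```python
import subprocess

def ship_with_hadoop(local_path, dest):
    # Real upload used inside a task (requires the hadoop CLI on PATH).
    return subprocess.call(["hadoop", "fs", "-put", local_path, dest])

def run_and_ship(argv, outputs, dest_dir, ship_fn=ship_with_hadoop):
    """Run the real program, then move its side output files to dest_dir.

    argv     -- command for the real program, e.g. ["python", "file_io.py", ...]
    outputs  -- task-local files the program creates
    dest_dir -- target directory (an HDFS path in a real job)
    """
    status = subprocess.call(argv)
    if status != 0:
        return status
    for path in outputs:
        ship_fn(path, dest_dir)
    return 0

# A revised file_io_wrap.py mapper would then do something like:
#
#     sys.stdin.read()  # drain the (empty) streaming input
#     status = run_and_ship(
#         ["python", "file_io.py", "in1.txt", "out1.txt", "out2.txt"],
#         ["out1.txt", "out2.txt"],
#         "/tmp/output20_side",  # illustrative HDFS directory
#     )
#     sys.stdout.write("Success." if status == 0 else "Subprocess call failed.")
```

Because the upload happens inside the task, each mapper attempt must write to a distinct HDFS name (or directory) to avoid collisions when tasks are retried or run speculatively; that caveat is not addressed in the thread but follows from how task attempts work.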
