pydoop.mapreduce.simulator — Hadoop Simulator API
This module provides basic, stand-alone Hadoop simulators for debugging support.
class pydoop.mapreduce.simulator.HadoopSimulatorLocal(factory, logger=None, loglevel=50, context_cls=None, avro_input=None, avro_output=None, avro_output_key_schema=None, avro_output_value_schema=None)

Simulates the invocation of program components in a Hadoop workflow.
    from my_mr_app import Factory

    hs = HadoopSimulatorLocal(Factory())
    job_conf = {...}  # job configuration properties, as a dict
    hs.run(fin, fout, job_conf)  # fin, fout: open input/output file objects
    counters = hs.get_counters()
run(file_in, file_out, job_conf, num_reducers=1, input_split='')

Run the simulator as configured by job_conf, with num_reducers reducers. If file_in is not None, simulate the behavior of Hadoop's TextLineReader, creating a record for each line in file_in. Otherwise, assume that the factory argument given to the constructor defines a RecordReader, and that job_conf provides a suitable InputSplit. Similarly, if file_out is None, assume that factory defines a RecordWriter with appropriate parameters in job_conf.
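As a concrete illustration, here is a minimal sketch of what a factory such as the my_mr_app one above might contain, assuming Pydoop's MapReduce API (pydoop.mapreduce.api.Mapper/Reducer and pydoop.mapreduce.pipes.Factory); the class and file names are hypothetical:

    # Hypothetical word-count factory built on Pydoop's MapReduce API
    # (a sketch, not part of the simulator interface).
    from pydoop.mapreduce.api import Mapper, Reducer
    from pydoop.mapreduce.pipes import Factory

    class WordCountMapper(Mapper):
        def map(self, context):
            # One record per input line; emit a count for every word.
            for word in context.value.split():
                context.emit(word, 1)

    class WordCountReducer(Reducer):
        def reduce(self, context):
            # Sum the partial counts gathered for each word.
            context.emit(context.key, sum(context.values))

    hs = HadoopSimulatorLocal(Factory(WordCountMapper,
                                      reducer_class=WordCountReducer))
    with open('input.txt') as fin, open('output.txt', 'w') as fout:
        hs.run(fin, fout, {}, num_reducers=1)
    counters = hs.get_counters()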
class pydoop.mapreduce.simulator.HadoopSimulatorNetwork(program=None, logger=None, loglevel=50, sleep_delta=3, context_cls=None, avro_input=None, avro_output=None, avro_output_key_schema=None, avro_output_value_schema=None)

Simulates the invocation of program components in a Hadoop workflow using network connections to communicate with a user-provided pipes program.
    import os
    import logging
    # InputSplit and logger are assumed to be defined elsewhere in the program.

    program_name = '../wordcount/bin/wordcount_full.py'
    data_in = '../input/alice.txt'
    output_dir = './output'
    data_in_path = os.path.realpath(data_in)
    data_in_uri = 'file://' + data_in_path
    data_in_size = os.stat(data_in_path).st_size
    os.makedirs(output_dir)
    output_dir_uri = 'file://' + os.path.realpath(output_dir)
    conf = {
        "mapred.job.name": "wordcount",
        "mapred.work.output.dir": output_dir_uri,
        "mapred.task.partition": "0",
    }
    input_split = InputSplit.to_string(data_in_uri, 0, data_in_size)
    hsn = HadoopSimulatorNetwork(program=program_name, logger=logger,
                                 loglevel=logging.INFO)
    hsn.run(None, None, conf, input_split=input_split)
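Since the simulator and the pipes program communicate over a network connection, the program runs end-to-end as a separate process; with the configuration above, its results should end up under output_dir, as set via mapred.work.output.dir.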
The Pydoop application program will be launched sleep_delta seconds after framework initialization.
run(file_in, file_out, job_conf, num_reducers=1, input_split='')

Run the program through the simulated Hadoop infrastructure, piping the contents of file_in to the program similarly to what Hadoop's TextInputFormat does. Setting file_in to None implies that the program is expected to get its data from its own RecordReader, using the provided input_split. Analogously, the final results will be written to file_out unless it is set to None, in which case the program is expected to have a RecordWriter.
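For instance, a minimal sketch of the text-piping mode, reusing hsn and conf from the example above (the file names are hypothetical):

    # Pipe a text file through the program and collect its output locally.
    with open('alice.txt') as fin, open('wc_out.txt', 'w') as fout:
        hsn.run(fin, fout, conf, num_reducers=1)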