pydoop.mapreduce.api — MapReduce API
The MapReduce API allows you to write the components of a MapReduce application.

The basic MapReduce components (Mapper, Reducer, RecordReader, etc.) are provided as abstract classes that must be subclassed by the developer, providing implementations for all methods called by the framework.
class pydoop.mapreduce.api.Closable

    close()
        Called after the object has finished its job.
        Overriding this method is not required.
class pydoop.mapreduce.api.Context

    Context objects are used for communication between the framework and the MapReduce application. These objects are instantiated by the framework and passed to user methods as parameters:

        class Mapper(api.Mapper):

            def map(self, context):
                key, value = context.key, context.value
                ...
                context.emit(new_key, new_value)
    emit(key, value)
        Emit a key/value pair to the framework.

    get_counter(group, name)
        Get a Counter from the framework.

        Parameters:
            group (str) – the counter group
            name (str) – the counter name

        The counter can be updated via increment_counter().
    key
        Input key.

    set_status(status)
        Set the current status.

        Parameters:
            status (str) – a description of the current status

    value
        Input value.
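The following sketch (not part of the reference) shows how a mapper might use the context facilities described above; it assumes that increment_counter takes the counter object and an integer amount, as suggested by the description of get_counter():

    import pydoop.mapreduce.api as api

    class LineCountMapper(api.Mapper):

        def __init__(self, context):
            super(LineCountMapper, self).__init__(context)
            # group and name are plain strings; the framework returns a Counter handle
            self.line_counter = context.get_counter("STATS", "NUM_LINES")

        def map(self, context):
            # assumed signature: increment_counter(counter, amount)
            context.increment_counter(self.line_counter, 1)
            context.set_status("processing key %r" % (context.key,))
            context.emit(context.key, context.value)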
class pydoop.mapreduce.api.Counter(counter_id)

    An interface to the Hadoop counters infrastructure.

    Counter objects are instantiated and directly manipulated by the framework; users get and update them via the Context interface.
class pydoop.mapreduce.api.Factory

    Creates MapReduce application components.

    The classes to use for each component must be specified as arguments to the constructor.

    create_combiner(context)
        Create a combiner object.
        Return the new combiner, or None if one is not needed.

    create_partitioner(context)
        Create a partitioner object.
        Return the new partitioner, or None if the default partitioner should be used.
class pydoop.mapreduce.api.JobConf(values)

    Configuration properties assigned to this job.

    JobConf objects are instantiated by the framework and support the same interface as dictionaries, plus a few methods that perform automatic type conversion:

        >>> jc['a']
        '1'
        >>> jc.get_int('a')
        1
    Warning: for the most part, a JobConf object behaves like a dict. For backwards compatibility, however, there are two important exceptions:

    get(k[, d]) → D[k] if k in D, else d. d defaults to None.

    get_bool(key, default=None)
        Same as dict.get(), but the value is converted to a bool.

        The boolean value is considered, respectively, True or False if the string is equal, ignoring case, to 'true' or 'false'.
    get_float(key, default=None)
        Same as dict.get(), but the value is converted to a float.

    get_int(key, default=None)
        Same as dict.get(), but the value is converted to an int.
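As an illustrative sketch only: the property names below are made up, and the JobConf instance is assumed to be reachable from the task context (e.g. as context.job_conf, as in typical Pydoop programs):

    def read_conf(jc):
        # jc is a JobConf instance; missing keys fall back to the given default
        n_features = jc.get_int("myapp.n.features", 10)
        threshold = jc.get_float("myapp.threshold", 0.5)
        verbose = jc.get_bool("myapp.verbose", False)
        label = jc.get("myapp.label", "none")  # plain dict-style get, no conversion
        return n_features, threshold, verbose, label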
class pydoop.mapreduce.api.MapContext

    The context given to the mapper.

    get_input_split(raw=False)
        Get the current input split.

        If raw is False (the default), return an InputSplit object; if it is True, return a byte string (the split as sent via the downlink, before deserialization).

    get_input_value_class()
        Return the type of the input value.

    input_key_class
        Return the type of the input key.

    input_split
        The current input split as an InputSplit object.
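A minimal sketch of a mapper that inspects its split via the MapContext (assuming, as in typical map tasks, that the context passed to the mapper's constructor is a MapContext):

    import pydoop.mapreduce.api as api

    class SplitAwareMapper(api.Mapper):

        def __init__(self, context):
            super(SplitAwareMapper, self).__init__(context)
            split = context.input_split  # InputSplit: filename, offset, length
            context.set_status("split: %s [%d, %d)" % (
                split.filename, split.offset, split.offset + split.length))

        def map(self, context):
            context.emit(context.key, context.value)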
class pydoop.mapreduce.api.Mapper(context)

    Maps input key/value pairs to a set of intermediate key/value pairs.

    map(context)
        Called once for each key/value pair in the input split. Applications must override this, emitting an output key/value pair through the context.

        Parameters:
            context (MapContext) – the context object passed by the framework, used to get the input key/value pair and emit the output key/value pair.
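For example, a word count mapper might look like the following sketch (assuming plain text input, where each value is a line of text):

    import pydoop.mapreduce.api as api

    class WordCountMapper(api.Mapper):

        def map(self, context):
            # emit one (word, 1) pair per word in the current line
            for word in context.value.split():
                context.emit(word, 1)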
class pydoop.mapreduce.api.Partitioner(context)

    Controls the partitioning of intermediate keys output by the Mapper. The key (or a subset of it) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

    partition(key, num_of_reduces)
        Get the partition number for key given the total number of partitions, i.e., the number of reduce tasks for the job. Applications must override this.

        Parameters:
            key (str) – the key to be partitioned
            num_of_reduces (int) – the total number of partitions

        Return type: int

        Returns: the partition number for key.
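A sketch of a hash-based partitioner; it uses a stable checksum rather than the builtin hash(), which is salted per process and would therefore scatter equal keys across reducers:

    import zlib

    import pydoop.mapreduce.api as api

    class CRC32Partitioner(api.Partitioner):

        def partition(self, key, num_of_reduces):
            if isinstance(key, str):
                key = key.encode("utf-8")
            # stable across processes, unlike the builtin hash()
            return (zlib.crc32(key) & 0xffffffff) % num_of_reduces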
exception pydoop.mapreduce.api.PydoopError
class pydoop.mapreduce.api.RecordReader(context=None)

    Breaks the data into key/value pairs for input to the Mapper.

    get_progress()
        The current progress of the record reader through its data.

        Return type: float

        Returns: the fraction of data read up to now, as a float between 0 and 1.

    next()
        Called by the framework to provide a key/value pair to the Mapper. Applications must override this, making sure it raises StopIteration when there are no more records to process.

        Return type: tuple

        Returns: a tuple of two elements: respectively, the key and the value (as strings).
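A simplified line-oriented reader, as a sketch only: it opens the split's file with the local open() to stay self-contained (a real application would typically go through HDFS) and it does not handle records straddling split boundaries:

    import pydoop.mapreduce.api as api

    class LocalLineReader(api.RecordReader):

        def __init__(self, context):
            super(LocalLineReader, self).__init__(context)
            self.split = context.input_split
            self.f = open(self.split.filename, "rb")
            self.f.seek(self.split.offset)
            self.bytes_read = 0

        def close(self):
            self.f.close()

        def next(self):
            if self.bytes_read > self.split.length:
                raise StopIteration
            line = self.f.readline()
            if not line:  # end of file
                raise StopIteration
            key = self.split.offset + self.bytes_read
            self.bytes_read += len(line)
            return str(key), line.decode("utf-8")

        def get_progress(self):
            if not self.split.length:
                return 1.0
            return min(self.bytes_read / float(self.split.length), 1.0)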
class pydoop.mapreduce.api.RecordWriter(context=None)

    Writes the output key/value pairs to an output file.

class pydoop.mapreduce.api.ReduceContext

    The context given to the reducer.
class pydoop.mapreduce.api.Reducer(context=None)

    Reduces a set of intermediate values which share a key to a (possibly) smaller set of values.

    reduce(context)
        Called once for each key. Applications must override this, emitting an output key/value pair through the context.

        Parameters:
            context (ReduceContext) – the context object passed by the framework, used to get the input key and corresponding set of values and emit the output key/value pair.
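For example, a word count reducer might be sketched as follows (it assumes the reduce context exposes the grouped values as context.values, as in typical Pydoop programs):

    import pydoop.mapreduce.api as api

    class WordCountReducer(api.Reducer):

        def reduce(self, context):
            # context.values is assumed to iterate over all values for context.key
            context.emit(context.key, sum(context.values))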
class pydoop.mapreduce.pipes.InputSplit(data)

    Represents the data to be processed by an individual Mapper.

    Typically, it presents a byte-oriented view of the input, and it is the responsibility of the RecordReader to convert this to a record-oriented view.

    The InputSplit is a logical representation of the actual dataset chunk, expressed through the filename, offset and length attributes.

    InputSplit objects are instantiated by the framework and accessed via MapContext.input_split.
pydoop.mapreduce.pipes.run_task(factory, port=None, istream=None, ostream=None, private_encoding=True, context_class=<class 'pydoop.mapreduce.pipes.TaskContext'>, cmd_file=None, fast_combiner=False, auto_serialize=True)

    Run the assigned task in the framework.

    Return type: bool

    Returns: True if the task succeeded.
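Putting it together, a complete word count program might look like the following sketch; the Factory constructor is assumed to take the mapper class followed by optional component classes (e.g. reducer_class), which is how it is used in Pydoop's examples:

    import pydoop.mapreduce.api as api
    import pydoop.mapreduce.pipes as pipes

    class Mapper(api.Mapper):

        def map(self, context):
            for word in context.value.split():
                context.emit(word, 1)

    class Reducer(api.Reducer):

        def reduce(self, context):
            context.emit(context.key, sum(context.values))

    def __main__():
        # entry point conventionally named __main__ in Pydoop programs
        pipes.run_task(pipes.Factory(Mapper, reducer_class=Reducer))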