pydoop.hadut — Hadoop shell interaction

The hadut module provides access to some of the functionality available via the Hadoop shell.

class pydoop.hadut.PipesRunner(prefix=None, logger=None)

    Allows you to set up and run pipes jobs, optionally automating a few
    common tasks.

    Parameters:
        * prefix (str) – if specified, it must be a writable directory path
          that all nodes can see (the latter could be an issue if the local
          file system is used rather than HDFS)
        * logger (logging.Logger) – optional logger

    If prefix is set, the runner object will create a working directory with
    that prefix and use it to store the job's input and output; the intended
    use is for quick application testing. If it is not set, you must call
    set_output() with an HDFS path as its argument, and put will be ignored
    in your call to set_input(). In any event, the launcher script will be
    placed in the output directory's parent (this has to be writable for the
    job to succeed). See the usage sketch after the method list below.

    clean()

        Remove the working directory, if any.

    collect_output(out_file=None)

        Run collect_output() on the job's output directory.

    run(**kwargs)

        Run the pipes job. Keyword arguments are passed to run_pipes().

    set_exe(pipes_code)

        Dump launcher code to the distributed file system.

    set_input(input_, put=False)

        Set the input path for the job. If put is True, copy (local) input_
        to the working directory.

    set_output(output)

        Set the output path for the job. Optional if the runner has been
        instantiated with a prefix.
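
    A minimal usage sketch, assuming a running Hadoop cluster; the launcher
    source file and input file names below are hypothetical:

    >>> from pydoop.hadut import PipesRunner
    >>> runner = PipesRunner(prefix='/tmp/pydoop_test')  # visible to all nodes
    >>> with open('wc_launcher.py') as f:  # hypothetical pipes launcher source
    ...     pipes_code = f.read()
    >>> runner.set_input('input.txt', put=True)  # copy local input to the working dir
    >>> runner.set_exe(pipes_code)
    >>> runner.run()
    >>> print(runner.collect_output())
    >>> runner.clean()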

class pydoop.hadut.PydoopScriptRunner(prefix=None, logger=None)

    Specialization of PipesRunner to support the setup and running of pydoop
    script jobs.

    run(script, more_args=None, pydoop_exe='/usr/bin/pydoop')

        Run the pipes job. Keyword arguments are passed to run_pipes().
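
    A sketch of the intended workflow (the script file name is hypothetical;
    the runner is expected to generate the pipes launcher from the script, so
    no set_exe() call is needed):

    >>> from pydoop.hadut import PydoopScriptRunner
    >>> runner = PydoopScriptRunner(prefix='/tmp/pydoop_script_test')
    >>> runner.set_input('input.txt', put=True)
    >>> runner.run('wordcount.py')
    >>> print(runner.collect_output())
    >>> runner.clean()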

exception pydoop.hadut.RunCmdError(returncode, cmd, output=None)

    This exception is raised by run_cmd() and all functions that make use of
    it to indicate that the call failed (returned non-zero).
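
    For instance (the path is hypothetical; as described under
    run_tool_cmd(), the exception message carries the command's captured
    standard error):

    >>> from pydoop.hadut import dfs, RunCmdError
    >>> try:
    ...     dfs(['-ls', '/no/such/path'])
    ... except RunCmdError as e:
    ...     print(e)  # captured stderr of the failed command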

pydoop.hadut.collect_output(mr_out_dir, out_file=None)

    Return all MapReduce output in mr_out_dir.

    Append the output to out_file if provided. Otherwise, return the result
    as a single string (it is the caller's responsibility to ensure that the
    amount of data retrieved fits into memory).
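
    For example, with a hypothetical job output directory:

    >>> from pydoop.hadut import collect_output
    >>> collect_output('wc_output', out_file='local_results.txt')  # append to a local file
    >>> text = collect_output('wc_output')  # or get everything as one string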

pydoop.hadut.dfs(args=None, properties=None, hadoop_conf_dir=None)

    Run the Hadoop file system shell.

    All arguments are passed to run_class().
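
    For example (paths are hypothetical; since run_class() buffers streams
    by default, the shell's output is returned as a string):

    >>> from pydoop.hadut import dfs
    >>> listing = dfs(['-ls', '/user'])  # same as: hadoop fs -ls /user
    >>> dfs(['-rm', '-r', 'wc_output'])  # same as: hadoop fs -rm -r wc_output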

pydoop.hadut.find_jar(jar_name, root_path=None)

    Look for the named jar in:

        1. root_path, if specified
        2. the working directory – PWD
        3. ${PWD}/build
        4. /usr/share/java

    Return the full path of the jar if found; else return None.
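
    For example, with a hypothetical jar name:

    >>> from pydoop.hadut import find_jar
    >>> path = find_jar('hadoop-streaming.jar')
    >>> if path is None:
    ...     raise RuntimeError('jar not found')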

pydoop.hadut.get_num_nodes(properties=None, hadoop_conf_dir=None, offline=False)

    Get the number of task trackers in the Hadoop cluster.

    All arguments are passed to get_task_trackers().

pydoop.hadut.get_task_trackers(properties=None, hadoop_conf_dir=None, offline=False)

    Get the list of task trackers in the Hadoop cluster.

    Each element in the returned list is in the (host, port) format. All
    arguments are passed to run_class().

    If offline is True, try getting the list of task trackers from the
    slaves file in Hadoop's configuration directory (no attempt is made to
    contact the Hadoop daemons). In this case, ports are set to 0.
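
    For example (host names are hypothetical):

    >>> from pydoop.hadut import get_task_trackers, get_num_nodes
    >>> get_task_trackers(offline=True)  # e.g., [('node1', 0), ('node2', 0)]
    >>> get_num_nodes(offline=True)  # e.g., 2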

pydoop.hadut.path_exists(path, properties=None, hadoop_conf_dir=None)

    Return True if path exists in the default HDFS.

    Keyword arguments are passed to dfs().

    This function does the same thing as hdfs.path.exists(), but it uses a
    wrapper for the Hadoop shell rather than the hdfs extension.
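
    For example, with a hypothetical path:

    >>> from pydoop.hadut import path_exists, dfs
    >>> if not path_exists('wc_input'):
    ...     dfs(['-mkdir', 'wc_input'])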

pydoop.hadut.run_class(class_name, args=None, properties=None, classpath=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

    Run a Java class with Hadoop (equivalent of running hadoop <class_name>
    from the command line).

    Additional HADOOP_CLASSPATH elements can be provided via classpath
    (either as a non-string sequence where each element is a classpath
    element or as a ':'-separated string). Other arguments are passed to
    run_cmd().

    >>> cls = 'org.apache.hadoop.fs.FsShell'
    >>> try:
    ...     out = run_class(cls, args=['-test', '-e', 'file:/tmp'])
    ... except RunCmdError:
    ...     tmp_exists = False
    ... else:
    ...     tmp_exists = True

    Note: HADOOP_CLASSPATH makes dependencies available only on the client
    side. If you are running a MapReduce application, use
    args=['-libjars', 'jar1,jar2,...'] to make them available to the server
    side as well.

pydoop.hadut.run_jar(jar_name, more_args=None, properties=None, hadoop_conf_dir=None, keep_streams=True)

    Run a jar on Hadoop (hadoop jar command).

    All arguments are passed to run_cmd() (args = [jar_name] + more_args).
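
    For example, with a hypothetical jar path:

    >>> from pydoop.hadut import run_jar
    >>> out = run_jar('hadoop-mapreduce-examples.jar',
    ...               more_args=['pi', '2', '100'])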

pydoop.hadut.run_pipes(executable, input_path, output_path, more_args=None, properties=None, force_pydoop_submitter=False, hadoop_conf_dir=None, logger=None, keep_streams=False)

    Run a pipes command.

    more_args (after setting input/output path) and properties are passed to
    run_cmd().

    If not specified otherwise, this function sets the properties
    hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter to
    "true".

    This function works around a bug in Hadoop pipes that affects versions
    of Hadoop with security when the local file system is used as the
    default FS (no HDFS); see
    https://issues.apache.org/jira/browse/MAPREDUCE-4000. In those set-ups,
    the function uses Pydoop's own pipes submitter application. You can
    force the use of Pydoop's submitter by passing
    force_pydoop_submitter=True.
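
    A sketch, assuming the pipes executable and the input directory are
    already on HDFS (all names are hypothetical):

    >>> from pydoop.hadut import run_pipes
    >>> run_pipes('bin/wordcount', 'wc_input', 'wc_output',
    ...           properties={'mapred.reduce.tasks': '2'})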

pydoop.hadut.run_tool_cmd(tool, cmd, args=None, properties=None, hadoop_conf_dir=None, logger=None, keep_streams=True)

    Run a Hadoop command.

    If keep_streams is set to True (the default), the stdout and stderr of
    the command will be buffered in memory. If the command succeeds, the
    former will be returned; if it fails, a RunCmdError will be raised with
    the latter as the message. This mode is appropriate for short-running
    commands whose "result" is represented by their standard output (e.g.,
    "dfsadmin", ["-safemode", "get"]).

    If keep_streams is set to False, the command will write directly to the
    stdout and stderr of the calling process, and the return value will be
    empty. This mode is appropriate for long-running commands that do not
    write their "real" output to stdout (such as pipes).

    >>> hadoop_classpath = run_cmd('classpath')
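
    Conversely, a long-running command can stream its output live; the
    sketch below assumes tool names the hadoop executable:

    >>> from pydoop.hadut import run_tool_cmd
    >>> safemode = run_tool_cmd('hadoop', 'dfsadmin', args=['-safemode', 'get'])
    >>> run_tool_cmd('hadoop', 'balancer', keep_streams=False)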