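Since the code above is missing the import statements and other pieces needed to compile it, here is a more complete version that takes the path to a dict file on the command line and dumps its contents.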
dumpdict.java:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class DumpDict {
    public static void main(String[] args) {
        // Fail with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length != 1) {
            System.err.println("usage: java DumpDict {path to dict}");
            System.exit(1);
        }
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader read = new SequenceFile.Reader(fs, new Path(args[0]), conf);
            try {
                // next(key, value): the term is read as the key, its integer index as the value.
                Text text = new Text();
                IntWritable dicKey = new IntWritable();
                // HashMap dictionaryMap = new HashMap();
                while (read.next(text, dicKey)) {
                    // dictionaryMap.put(Integer.parseInt(dicKey.toString()), text.toString());
                    System.out.println(dicKey.get() + " " + text.toString());
                }
            } finally {
                // Close the reader even if reading fails part way through.
                read.close();
            }
        } catch (IOException e) {
            System.err.println(e.toString());
        }
    }
}
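The commented-out HashMap lines hint at loading the dictionary into memory rather than printing it. As a minimal sketch of that idea which runs without Hadoop on the classpath, the following parses the "index term" lines that DumpDict prints back into a map. DictLines and its parse method are hypothetical names introduced here for illustration, and it assumes one entry per line with a single space between the index and the term.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout or Hadoop.
class DictLines {
    // Parses lines of the form "<index> <term>" (the format DumpDict prints)
    // into an index -> term map. Lines that contain no space are skipped.
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> dictionaryMap = new HashMap<>();
        for (String line : lines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // not an "<index> <term>" line
            }
            // Everything before the first space is the index, the rest is the term.
            int index = Integer.parseInt(line.substring(0, space));
            String term = line.substring(space + 1);
            dictionaryMap.put(index, term);
        }
        return dictionaryMap;
    }

    public static void main(String[] args) {
        Map<Integer, String> dict = parse(Arrays.asList("0 apple", "1 banana"));
        System.out.println(dict.get(1));
    }
}
```

Splitting on the first space only means a term that itself contains spaces survives intact. A line whose first field is not a number will make Integer.parseInt throw a NumberFormatException, which seemed preferable to silently dropping it.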
I found it necessary to tell java explicitly where all the jar files are:
export CLASSPATH=".:$(find /path/to/mahout /usr/share/java -name '*.jar' | paste -sd: -)${CLASSPATH:+:$CLASSPATH}"
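The find command above boils down to: collect every .jar file under a few root directories and join the paths with ':' after a leading '.'. For anyone who wants that logic spelled out, or has to do it somewhere without find, here is a rough sketch in Java. ClasspathBuilder and build are hypothetical names, and the roots are whatever directories you pass on the command line.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Mahout or Hadoop.
class ClasspathBuilder {
    // Returns "." followed by every *.jar file found under the given roots,
    // joined with the platform's path separator (':' on Linux and macOS).
    static String build(List<Path> roots) throws IOException {
        List<String> entries = new ArrayList<>();
        entries.add("."); // current directory, so DumpDict.class is found
        for (Path root : roots) {
            // Files.walk visits the root and everything beneath it.
            try (Stream<Path> files = Files.walk(root)) {
                entries.addAll(files
                        .filter(Files::isRegularFile)
                        .map(Path::toString)
                        .filter(name -> name.endsWith(".jar"))
                        .sorted()
                        .collect(Collectors.toList()));
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) throws IOException {
        List<Path> roots = new ArrayList<>();
        for (String arg : args) {
            roots.add(Paths.get(arg));
        }
        System.out.println(build(roots));
    }
}
```

Run it with the same directories you would hand to find, for example java ClasspathBuilder /path/to/mahout /usr/share/java, and use the printed string as the value of CLASSPATH. Unlike the shell version, this sketch does not append an existing $CLASSPATH.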
Compile like this:
javac dumpdict.java
Run like this:
java -cp .:$CLASSPATH DumpDict {path to dict}
(This is probably overkill for people who use Java regularly, but for those of us who don't use it often, it may save some time.)