hadoop - Reading ORC files with MapReduce

Question

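I am trying to read SNAPPY-compressed ORC files with MapReduce. My intent is simply to use an identity mapper, essentially to merge small files. However, I keep getting a NullPointerException when I do so. I can see from the logs that the schema is being inferred; do I also need to set the schema for the mapper's output file?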

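import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.orc.mapreduce.OrcInputFormat;
import org.apache.orc.mapreduce.OrcOutputFormat;

public class Test {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "test");

        job.setJarByClass(Test.class);
        // The base Mapper passes every (key, value) pair through unchanged.
        job.setMapperClass(Mapper.class);
        // Job copies the Configuration, so set properties on the job's own copy.
        job.getConfiguration().set("orc.compress", "SNAPPY");
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Writable.class);
        job.setInputFormatClass(OrcInputFormat.class);
        job.setOutputFormatClass(OrcOutputFormat.class);
        // Map-only job: no reducers.
        job.setNumReduceTasks(0);

        String source = args[0];
        String target = args[1];

        FileInputFormat.addInputPath(job, new Path(source));
        FileOutputFormat.setOutputPath(job, new Path(target));

        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}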

Error: java.lang.NullPointerException
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:178)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:559)
    at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:55)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:644)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
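
For reference, this is a minimal sketch of what setting the output schema explicitly might look like. It assumes that the writer in org.apache.orc.mapreduce.OrcOutputFormat takes its schema from the orc.mapred.output.schema property, and that the NullPointerException comes from that property being unset; I have not confirmed either against the ORC version in use here. The class name MergeOrc and the schema string struct<id:int,name:string> are hypothetical placeholders, and the real schema of the input files would have to be substituted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.orc.mapred.OrcStruct;
import org.apache.orc.mapreduce.OrcInputFormat;
import org.apache.orc.mapreduce.OrcOutputFormat;

public class MergeOrc {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical schema: replace with the actual schema of the input files.
        conf.set("orc.mapred.output.schema", "struct<id:int,name:string>");
        conf.set("orc.compress", "SNAPPY");

        // All properties are set before the job is created, because Job copies conf.
        Job job = Job.getInstance(conf, "merge-orc");
        job.setJarByClass(MergeOrc.class);

        // The base Mapper passes each (NullWritable, OrcStruct) record through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(OrcInputFormat.class);
        job.setOutputFormatClass(OrcOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(OrcStruct.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The sketch needs the Hadoop MapReduce client and orc-mapreduce jars on the classpath, and it takes the input and output paths as its two arguments. Note that a map-only job writes one output file per map task, so how much merging actually happens depends on how the input is split.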
