Writing ORC Data Files with Java

2026-04-29 Other FAQs

Overview

On Windows, a Java program authenticates to a TDH cluster with Kerberos and writes data to HDFS in ORC format, so that ORC tables in Quark/Inceptor can read the files directly.

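Prerequisites

Item	Requirement
JDK	1.8 (verified with 1.8.0_391)
Network	The Windows machine must be able to reach the HDFS NameNode port (default 8020)
hosts	The local `C:\Windows\System32\drivers\etc\hosts` file must map the cluster node hostnames
TDH server	SSH access to any cluster node is required in order to copy the dependency files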

Procedure

Step 1: Configure the local hosts file

Add the cluster node mappings to C:\Windows\System32\drivers\etc\hosts on the Windows machine:

172.18.131.171 kv1
172.18.131.172 kv2
172.18.131.173 kv3
172.18.131.174 kv4

Note: the node hostnames must match the cluster's actual configuration; check /etc/hosts on a cluster server to confirm them.
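Before moving on, it can help to confirm that the new hosts entries resolve and that the NameNode port accepts connections. The following is a minimal sketch using only the JDK. The class name ClusterReachability, its method names, and the 2-second timeout are illustrative assumptions, not something shipped with TDH; the hostnames and port 8020 are the ones used in this article. Only nodes that actually run a NameNode are expected to accept connections on that port, so a false port result for the other nodes is not by itself an error.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Illustrative helper (hypothetical name), JDK-only, compatible with Java 8.
final class ClusterReachability {

    /** Returns true if the host name resolves through the hosts file or DNS. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Unresolvable host, refused or timed-out connection, or invalid port number
            return false;
        }
    }

    public static void main(String[] args) {
        String[] hosts = {"kv1", "kv2", "kv3", "kv4"}; // hostnames from the hosts entries above
        int nameNodePort = 8020;                        // default NameNode port from the prerequisites
        for (String host : hosts) {
            System.out.println(host
                    + " resolves=" + resolves(host)
                    + " port" + nameNodePort + "=" + canConnect(host, nameNodePort, 2000));
        }
    }
}
```

If a host reports resolves=false, the hosts file entry is the first thing to recheck; if it resolves but the NameNode host cannot be reached on the port, a firewall or routing problem is the more likely cause.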

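Step 2: Copy the dependency files from the server

Create the following directory structure on the Windows machine:

C:\Users\<username>\tdh-libs\
├── conf\          ← HDFS configuration files
└── jars\          ← dependency JARs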

2.1 Copy the configuration files

Download the following files from the /etc/hdfs1/conf/ directory on the TDH server into the local conf\ directory:

File	Source path
`core-site.xml`	`/etc/hdfs1/conf/core-site.xml`
`hdfs-site.xml`	`/etc/hdfs1/conf/hdfs-site.xml`
`krb5.conf`	`/etc/hdfs1/conf/krb5.conf`
`hdfs.keytab`	`/etc/hdfs1/conf/hdfs.keytab`
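A missing or misnamed file in conf\ usually only shows up later as a confusing authentication or connection error, so a quick presence check can save time. The sketch below is a hypothetical helper (the class name ConfDirCheck and the default directory under the user's home are assumptions); the four file names are the ones listed in the table above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper (hypothetical name), JDK-only, compatible with Java 8.
final class ConfDirCheck {

    // The four files this article copies from /etc/hdfs1/conf/
    static final List<String> REQUIRED =
            Arrays.asList("core-site.xml", "hdfs-site.xml", "krb5.conf", "hdfs.keytab");

    /** Returns the required file names that are absent or unreadable in confDir. */
    static List<String> missingFiles(Path confDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            Path file = confDir.resolve(name);
            if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default: <user home>\tdh-libs\conf; pass another directory as the first argument
        Path confDir = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("user.home"), "tdh-libs", "conf");
        List<String> missing = missingFiles(confDir);
        System.out.println(missing.isEmpty()
                ? "All required files found in " + confDir
                : "Missing in " + confDir + ": " + missing);
    }
}
```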

2.2 Copy the dependency JARs

Download the following JARs from the TDH Client directory on the TDH server into the local jars\ directory:

Core JARs (required):

JAR file	Server path
`hadoop-common-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/common/`
`inceptor-exec-8.43.2.jar`	`/root/TDH-Client/inceptor/lib/`
`hadoop-hdfs-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/hdfs/`
`hadoop-hdfs-client-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/hdfs/`
`hadoop-mapreduce-client-core-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/mapreduce/`
`hadoop-mapreduce-client-common-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/mapreduce/`
`hadoop-auth-transwarp-9.3.3.jar`	`/root/TDH-Client/hadoop/hadoop/share/common/lib/`

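Shortcut: instead of downloading the JARs one by one, copy everything matched by the following three paths on the server into the local jars\ directory: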

/root/TDH-Client/hadoop/hadoop/share/common/lib/*.jar

/root/TDH-Client/hadoop/hadoop/share/hdfs/lib/*.jar

/root/TDH-Client/inceptor/lib/inceptor-exec-8.43.2.jar
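Because the version suffixes (9.3.3, 8.43.2) may differ on another cluster, a version-agnostic check of the local jars\ directory can be useful after copying. This is a sketch under the assumption that the files keep the `<artifact>-<version>.jar` naming seen in the table of core JARs; the class name CoreJarCheck and the prefix list derived from that table are illustrative.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper (hypothetical name), JDK-only, compatible with Java 8.
final class CoreJarCheck {

    // Artifact prefixes taken from the core JAR table, with the version suffix removed
    static final String[] ARTIFACT_PREFIXES = {
        "hadoop-common-transwarp-",
        "inceptor-exec-",
        "hadoop-hdfs-transwarp-",
        "hadoop-hdfs-client-transwarp-",
        "hadoop-mapreduce-client-core-transwarp-",
        "hadoop-mapreduce-client-common-transwarp-",
        "hadoop-auth-transwarp-"
    };

    /** Maps each prefix to the first matching JAR file name in jarDir, or null if none matches. */
    static Map<String, String> findCoreJars(Path jarDir) throws IOException {
        Map<String, String> found = new LinkedHashMap<>();
        for (String prefix : ARTIFACT_PREFIXES) {
            found.put(prefix, null);
        }
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(jarDir, "*.jar")) {
            for (Path jar : jars) {
                String name = jar.getFileName().toString();
                for (String prefix : ARTIFACT_PREFIXES) {
                    if (name.startsWith(prefix) && found.get(prefix) == null) {
                        found.put(prefix, name);
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: <user home>\tdh-libs\jars; pass another directory as the first argument
        Path jarDir = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("user.home"), "tdh-libs", "jars");
        for (Map.Entry<String, String> e : findCoreJars(jarDir).entrySet()) {
            System.out.println(e.getKey() + "* -> " + (e.getValue() == null ? "MISSING" : e.getValue()));
        }
    }
}
```

The printed file names also show which versions were actually copied, which helps when the cluster's TDH Client differs from the versions listed here.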

Step 3: Write the Java code

Create HdfsOrcWriter.java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;
import java.util.List;

public class HdfsOrcWriter {
    public static void main(String[] args) throws Exception {
        // Configuration directory: first command-line argument, otherwise a local default (adjust to your own path)
        String confDir = args.length > 0 ? args[0] : "C:\\Users\\17171\\tdh-libs\\conf";

        Configuration conf = new Configuration();
        conf.addResource(new Path(new java.io.File(confDir, "core-site.xml").getAbsolutePath()));
        conf.addResource(new Path(new java.io.File(confDir, "hdfs-site.xml").getAbsolutePath()));
        // Disable OAuth2 (the TDH Guardian plugin does not work in a local client environment)
        conf.set("hadoop.security.authentication.oauth2.enabled", "false");
        conf.set("hadoop.security.authentication.web.oauth2.enabled", "false");

        // Kerberos authentication
        org.apache.hadoop.security.UserGroupInformation.setConfiguration(conf);
        String keytabPath = new java.io.File(confDir, "hdfs.keytab").getAbsolutePath();
        org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab("hdfs/kv1@KTDH", keytabPath);

        FileSystem fs = FileSystem.get(conf);

        // HDFS output path
        Path hdfsDir = new Path("/user/hdfs/orc_test");
        fs.mkdirs(hdfsDir);
        Path hdfsOrcPath = new Path(hdfsDir, "output.orc");
        if (fs.exists(hdfsOrcPath)) {
            fs.delete(hdfsOrcPath, false);
        }

        // Define the schema: id INT, name STRING, age INT
        List<String> colNames = new ArrayList<>();
        colNames.add("id");
        colNames.add("name");
        colNames.add("age");

        List<ObjectInspector> colOIs = new ArrayList<>();
        colOIs.add(PrimitiveObjectInspectorFactory.writableIntObjectInspector);
        colOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        colOIs.add(PrimitiveObjectInspectorFactory.writableIntObjectInspector);

        StructObjectInspector soi = ObjectInspectorFactory.getStandardStructObjectInspector(colNames, colOIs);

        // Create the ORC writer
        Writer orcWriter = OrcFile.createWriter(
            hdfsOrcPath,
            OrcFile.writerOptions(conf)
                .inspector(soi)
                .stripeSize(67108864)      // 64 MiB per stripe
                .bufferSize(262144)        // 256 KiB buffer
                .rowIndexStride(10000)     // one row-index entry every 10,000 rows
                .fileSystem(fs)
        );

        // Write the rows
        Object[][] data = {
            {new IntWritable(1), new Text("zhangsan"), new IntWritable(20)},
            {new IntWritable(2), new Text("lisi"), new IntWritable(25)},
            {new IntWritable(3), new Text("wangwu"), new IntWritable(30)},
            {new IntWritable(4), new Text("zhaoliu"), new IntWritable(28)},
            {new IntWritable(5), new Text("tianqi"), new IntWritable(22)}
        };
        for (Object[] row : data) {
            orcWriter.addRow(row);
        }

        orcWriter.close();
        fs.close();
        System.out.println("ORC file written to HDFS successfully!");
        System.out.println("Path: " + hdfsOrcPath.toString());
    }
}

Note: the principal passed to loginUserFromKeytab (hdfs/kv1@KTDH) must be changed to match the actual cluster configuration. To list the principals contained in the keytab, run klist -kt /etc/hdfs1/conf/hdfs.keytab on the server.
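A principal such as hdfs/kv1@KTDH follows the standard Kerberos form primary/instance@REALM: here the primary is the service name, the instance is the host, and the realm typically has to match one defined in krb5.conf. As a small illustration of that structure, the sketch below splits a principal string into its parts so a wrong value can be spotted before attempting the login. The class PrincipalParts is hypothetical and does only string parsing; it does not contact the KDC or read the keytab.

```java
// Illustrative parser (hypothetical class), JDK-only, compatible with Java 8.
final class PrincipalParts {
    final String primary;   // service or user name, e.g. "hdfs"
    final String instance;  // usually the host name, e.g. "kv1"; empty if absent
    final String realm;     // Kerberos realm, e.g. "KTDH"

    private PrincipalParts(String primary, String instance, String realm) {
        this.primary = primary;
        this.instance = instance;
        this.realm = realm;
    }

    /** Splits "primary[/instance]@REALM"; throws IllegalArgumentException if the shape is wrong. */
    static PrincipalParts parse(String principal) {
        int at = principal.lastIndexOf('@');
        if (at <= 0 || at == principal.length() - 1) {
            throw new IllegalArgumentException("Expected primary[/instance]@REALM: " + principal);
        }
        String name = principal.substring(0, at);
        int slash = name.indexOf('/');
        String primary = slash < 0 ? name : name.substring(0, slash);
        String instance = slash < 0 ? "" : name.substring(slash + 1);
        if (primary.isEmpty()) {
            throw new IllegalArgumentException("Empty primary in principal: " + principal);
        }
        return new PrincipalParts(primary, instance, principal.substring(at + 1));
    }

    public static void main(String[] args) {
        // Defaults to the principal used in this article
        PrincipalParts p = parse(args.length > 0 ? args[0] : "hdfs/kv1@KTDH");
        System.out.println("primary=" + p.primary + " instance=" + p.instance + " realm=" + p.realm);
    }
}
```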

Step 4: Compile and run

Build the classpath

Add every JAR in the jars\ directory, plus the conf\ directory itself, to the classpath:

$jarDir = "C:\Users\<username>\tdh-libs\jars"
$confDir = "C:\Users\<username>\tdh-libs\conf"
$jars = Get-ChildItem -Path $jarDir -Filter "*.jar" | Select-Object -ExpandProperty FullName
$cp = ($jars -join ";") + ";" + $confDir

Compile

javac -cp $cp HdfsOrcWriter.java

Run

java -cp $cp `
  "-Djava.security.krb5.conf=$confDir\krb5.conf" `
  "-Dhadoop.home.dir=C:\Users\<username>\tdh-libs" `
  HdfsOrcWriter $confDir

JVM option	Description
`-Djava.security.krb5.conf`	Path to the Kerberos configuration file
`-Dhadoop.home.dir`	Hadoop home directory (avoids the warning on Windows that WINUTILS cannot be found)
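A -D option that is mistyped or mangled by the shell does not produce an error of its own; the JVM simply starts without the property. One way to catch this early is to check both properties at program start. The sketch below is a hypothetical preflight (the class name JvmOptionCheck and the method problemWith are assumptions) that could be run with the same -D options as HdfsOrcWriter.

```java
import java.io.File;

// Illustrative preflight (hypothetical name), JDK-only, compatible with Java 8.
final class JvmOptionCheck {

    /**
     * Returns a description of what is wrong with the system property,
     * or null if it is set and points at an existing file or directory.
     */
    static String problemWith(String key, boolean expectDirectory) {
        String value = System.getProperty(key);
        if (value == null || value.isEmpty()) {
            return key + " is not set";
        }
        File target = new File(value);
        boolean ok = expectDirectory ? target.isDirectory() : target.isFile();
        return ok ? null : key + " does not point at an existing "
                + (expectDirectory ? "directory" : "file") + ": " + value;
    }

    public static void main(String[] args) {
        String[] problems = {
            problemWith("java.security.krb5.conf", false), // must be a file
            problemWith("hadoop.home.dir", true)           // must be a directory
        };
        boolean allGood = true;
        for (String problem : problems) {
            if (problem != null) {
                System.out.println("WARNING: " + problem);
                allGood = false;
            }
        }
        if (allGood) {
            System.out.println("Both JVM options are set and point at existing paths.");
        }
    }
}
```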

On success the program prints:

ORC file written to HDFS successfully!
Path: /user/hdfs/orc_test/output.orc

Step 5: Verify the result

5.1 Create an ORC table and query it through Beeline

Run the following on the TDH server:

source /root/TDH-Client/init.sh n n
beeline -u "jdbc:hive2://:10000/default" -n hive -p <password> \
  -e "CREATE EXTERNAL TABLE IF NOT EXISTS test_orc_table(
        id INT,
        name STRING,
        age INT
      ) STORED AS ORC
      LOCATION '/user/hdfs/orc_test';

      SELECT * FROM test_orc_table;"

Query result:

+-----+-----------+------+
| id  |   name    | age  |
+-----+-----------+------+
| 1   | zhangsan  | 20   |
| 2   | lisi      | 25   |
| 3   | wangwu    | 30   |
| 4   | zhaoliu   | 28   |
| 5   | tianqi    | 22   |
+-----+-----------+------+

5.2 Inspect the HDFS file directly

kinit -kt /etc/hdfs1/conf/hdfs.keytab hdfs/kv1@KTDH
hdfs dfs -ls -h /user/hdfs/orc_test/