
We have a small but critical Hadoop-HAWQ cluster, and we have created external tables on it that point to files in Hadoop.

Environment:

Product version: (HAWQ 1.3.0.2 build 14421) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2

What we tried:

When we try to read data from the external table, e.g.

test=# select count(*) from EXT_TAB ;

we get the following error:

ERROR: data line too long. likely due to invalid csv data (seg0 slice1 SEG0.HOSTNAME.COM:40000 pid=447247)
DETAIL: External table trcd_stg0, line 12059 of pxf://hostname/tmp/def_rcd/?profile=HdfsTextSimple: "2012-08-06 00:00:00.0^2012-08-06 00:00:00.0^6552^2016-01-09 03:15:43.427^0005567^COMPLAINTS ..."

Additional information:

The DDL of the external table is:

CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0
(
    "DispDt" DATE,
    "InvoiceDt" DATE,
    "ID" INTEGER,
    time timestamp without time zone,
    "Customer" CHAR(7),
    "CustomerName" CHARACTER VARYING(30),
    "MasterAccount" CHAR(7),
    "MasterAccName" CHAR(30),
    "SalesOrder" CHAR(6),
    "SalesOrderLine" NUMERIC(4, 0),
    "OrderStatus" CHAR(200),
    "MStockCode" CHAR(30),
    "MStockDes" CHARACTER VARYING(500),
    "MWarehouse" CHAR(200),
    "MOrderQty" NUMERIC(10, 3),
    "MShipQty" NUMERIC(10, 3),
    "MBackOrderQty" NUMERIC(10, 3),
    "MUnitCost" NUMERIC(15, 5),
    "MPrice" NUMERIC(15, 5),
    "MProductClass" CHAR(200),
    "Salesperson" CHAR(200),
    "CustomerPoNumber" CHAR(30),
    "OrderDate" DATE,
    "ReqShipDate" DATE,
    "DispatchesMade" CHAR(1),
    "NumDispatches" NUMERIC(4, 0),
    "OrderValue" NUMERIC(26, 8),
    "BOValue" NUMERIC(26, 8),
    "OrdQtyInEaches" NUMERIC(21, 9),
    "BOQtyInEaches" NUMERIC(21, 9),
    "DispQty" NUMERIC(38, 3),
    "DispQtyInEaches" NUMERIC(38, 9),
    "CustomerClass" CHAR(200),
    "MLineShipDate" DATE
)
LOCATION (
    'pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple'
)
FORMAT 'CSV' (delimiter '^' null '' escape '"' quote '"')
ENCODING 'UTF8';

Any help would be greatly appreciated.


2 Answers


Based on the source code: https://github.com/apache/incubator-hawq/blob/e48a07b0d8a5c8d41d2d4aaaa70254867b11ee11/src/backend/commands/copy.c

the error occurs when cstate->line_buf.len >= gp_max_csv_line_length is true. According to: http://hawq.docs.pivotal.io/docs-hawq/guc_config-gp_max_csv_line_length.html

the default maximum length of a CSV data line is 1048576 bytes. Have you checked the length of your CSV lines and tried increasing the value of this setting?
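
For reference, a minimal sketch of checking and raising the limit. The SHOW command works from any psql session; the gpconfig call for changing the server parameter is an assumption, since the exact administration tool depends on the HAWQ release, so verify against the documentation for your version:

# Check the current maximum CSV line length (in bytes) from psql
psql -d test -c "SHOW gp_max_csv_line_length;"

# Raise the limit to 4 MB cluster-wide and reload the configuration.
# (Assumption: gpconfig is the right tool for this HAWQ release; check your docs.)
gpconfig -c gp_max_csv_line_length -v 4194304
gpstop -u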

answered 2016-01-11T20:20:04.017

Check whether the number of columns in line 12059 matches the number of delimited fields. If some rows get combined during parsing, you can exceed the maximum line length; this is usually caused by bad data. To count the fields in a single line:

echo "$LINE" | awk -F '^' '{total = total + NF} END {print total}'
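
To scan the whole file instead of a single line, you can count the '^'-delimited fields on every record straight from HDFS; the DDL above defines 34 columns, so any line that does not produce 34 fields is suspect. A sketch, assuming the data under /tmp/def_rcd/ is plain (uncompressed) text:

# Report every line whose field count differs from the 34 columns in the DDL
hdfs dfs -cat /tmp/def_rcd/* | awk -F '^' 'NF != 34 {print "line " NR ": " NF " fields"}'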

answered 2016-01-11T20:27:35.677