Hive load CSV with commas in quoted fields

HadoopHbaseHiveHdfsDelimiter

Hadoop Problem Overview


I am trying to load a CSV file into a Hive table like so:

CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;    


The csv is delimited by an comma (,) and looks like this:

1, "some text, with comma in it", 123, "more text"

This will return corrupt data since there is a ',' in the first string.
Is there a way to set an text delimiter or make Hive ignore the ',' in strings?

I can't change the delimiter of the csv since it gets pulled from an external source.

Hadoop Solutions


Solution 1 - Hadoop

If you can re-create or parse your input data, you can specify an escape character for the CREATE TABLE:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

Will accept this line as 4 fields

1,some text\, with comma in it,123,more text

Solution 2 - Hadoop

The problem is that Hive doesn't handle quoted texts. You either need to pre-process the data by changing the delimiter between the fields (e.g: with a Hadoop-streaming job) or you can also give a try to use a custom CSV SerDe which uses OpenCSV to parse the files.

Solution 3 - Hadoop

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)

Solution 4 - Hadoop

keep the delimiter in single quotes it will work.

ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

This will work

Solution 5 - Hadoop

Add a backward slash in FIELDS TERMINATED BY ';'

For Example:

CREATE  TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS 
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag 
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;

I have tested it, and it worked.

Solution 6 - Hadoop

ORG.APACHE.HADOOP.HIVE.SERDE2.OPENCSVSERDE Serde worked for me. My delimiter was '|' and one of the columns is enclosed in double quotes.

Query:

CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'ORG.APACHE.HADOOP.HIVE.SERDE2.OPENCSVSERDE'
WITH SERDEPROPERTIES (
     "SEPARATORCHAR" = "|",
     "QUOTECHAR"     = "\"",
     "ESCAPECHAR"    = "\""
)    
STORED AS TEXTFILE location '/user/abc/csv_folder';

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMartijn LenderinkView Question on Stackoverflow
Solution 1 - HadooplibjackView Answer on Stackoverflow
Solution 2 - HadoopLorand BendigView Answer on Stackoverflow
Solution 3 - HadoopwrschneiderView Answer on Stackoverflow
Solution 4 - HadoopsuyashView Answer on Stackoverflow
Solution 5 - HadoopMantej SinghView Answer on Stackoverflow
Solution 6 - Hadoopvivesh saladiView Answer on Stackoverflow