Splitting a csv file with quotes as text-delimiter using String.split()


Java Problem Overview


I have a comma separated file with many lines similar to one below.

Sachin,,M,"Maths,Science,English",Need to improve in these subjects.

Quotes are used to escape the comma delimiter inside a field that holds multiple values.

How do I split the above line on the comma delimiter using String.split(), if that is possible at all?

Java Solutions


Solution 1 - Java

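import java.util.Arrays;

public class Main {
	public static void main(String[] args) {
		String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
		// Split on a comma only if an even number of quotes follows it, i.e. the comma is outside quotes
		String[] splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
		System.out.println(Arrays.toString(splitted));
	}
}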

Output:

[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
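
One caveat worth checking before relying on this: String.split(regex) drops trailing empty strings, so a line that ends with an empty field loses its last column. The sketch below reuses the same expression, compiles it once so it can be applied to many lines, and passes a negative limit to keep trailing empty fields. The class name SplitKeepEmpty, the method splitLine and the sample line ending in a comma are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitKeepEmpty {
	// Same expression as in the solution above, compiled once for reuse
	private static final Pattern COMMA_OUTSIDE_QUOTES =
			Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

	static String[] splitLine(String line) {
		// A negative limit tells split() to keep trailing empty fields
		return COMMA_OUTSIDE_QUOTES.split(line, -1);
	}

	public static void main(String[] args) {
		// Hypothetical line whose last field is empty
		String line = "Sachin,,M,\"Maths,Science,English\",";

		// Default behaviour: the trailing empty field is dropped (4 fields)
		String[] dropped = line.split(COMMA_OUTSIDE_QUOTES.pattern());
		System.out.println(dropped.length + " " + Arrays.toString(dropped));

		// Negative limit: the trailing empty field is kept (5 fields)
		String[] kept = splitLine(line);
		System.out.println(kept.length + " " + Arrays.toString(kept));
	}
}
```

Whether the trailing empty field matters depends on the file; if every row must have the same number of columns, the negative limit is probably the safer default.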

Solution 2 - Java

Since your requirements are not that complex, a custom method can be used instead; in the test below it ran over 20 times faster and produced the same result. The exact speed-up varies with the data size and the number of rows parsed, and more complicated problems may still call for regular expressions.

import java.util.ArrayList;
import java.util.Arrays;

public class SplitTest {

	public static void main(String[] args) {

		String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
		String[] splitted = null;

		// Measure regular expression
		long startTime = System.nanoTime();
		for (int i = 0; i < 10; i++)
			splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
		long endTime = System.nanoTime();

		System.out.println("Took: " + (endTime - startTime));
		System.out.println(Arrays.toString(splitted));
		System.out.println("");

		ArrayList<String> sw = null;
		// Measure custom method
		startTime = System.nanoTime();
		for (int i = 0; i < 10; i++)
			sw = customSplitSpecific(s);
		endTime = System.nanoTime();

		System.out.println("Took: " + (endTime - startTime));
		System.out.println(sw);
	}

	public static ArrayList<String> customSplitSpecific(String s)
	{
		ArrayList<String> words = new ArrayList<String>();
		// True while the current position is not inside a quoted field
		boolean outsideQuotes = true;
		int start = 0;
		// Visit every character, including the last one, so a trailing comma is handled too
		for (int i = 0; i < s.length(); i++)
		{
			if (s.charAt(i) == ',' && outsideQuotes)
			{
				words.add(s.substring(start, i));
				start = i + 1;
			}
			else if (s.charAt(i) == '"')
				outsideQuotes = !outsideQuotes;
		}
		// Add the last field (everything after the final split point)
		words.add(s.substring(start));
		return words;
	}

}

On my own computer this produces:

Took: 6651100
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]

Took: 224179
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]

Solution 3 - Java

If your strings are all well-formed it is possible with the following regular expression:

String[] res = str.split(",(?=([^\"]|\"[^\"]*\")*$)");

The expression ensures that a split occurs only at commas which are followed by an even (or zero) number of quotes (and thus not inside such quotes).
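
To make the rule concrete, here is a small sketch that applies the same "even number of quotes after the comma" test by hand and compares the result with the regular expression. The class QuoteParityDemo and its methods are illustrative names only, and the repeated counting makes it quadratic, so it is meant as an explanation of the lookahead rather than as a replacement for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuoteParityDemo {
	// Counts the double quotes from index 'from' to the end of the string
	static int quotesAfter(String s, int from) {
		int count = 0;
		for (int i = from; i < s.length(); i++) {
			if (s.charAt(i) == '"') {
				count++;
			}
		}
		return count;
	}

	// Splits at every comma that is followed by an even number of quotes
	static List<String> splitByParity(String s) {
		List<String> fields = new ArrayList<>();
		int start = 0;
		for (int i = 0; i < s.length(); i++) {
			if (s.charAt(i) == ',' && quotesAfter(s, i + 1) % 2 == 0) {
				fields.add(s.substring(start, i));
				start = i + 1;
			}
		}
		fields.add(s.substring(start));
		return fields;
	}

	public static void main(String[] args) {
		String str = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
		List<String> byHand = splitByParity(str);
		List<String> byRegex = Arrays.asList(str.split(",(?=([^\"]|\"[^\"]*\")*$)"));
		System.out.println(byHand);
		System.out.println(byRegex);
		// For this sample line both approaches give the same five fields
		System.out.println(byHand.equals(byRegex));
	}
}
```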

Nevertheless, it may be easier to use a simple non-regex parser.

Solution 4 - Java

When working with a CSV string, keep the following points in mind.

  1. Every tuple in a row either starts with a quote (") or does not. a) If it starts with a quote, it must be the value of a particular column. b) If it starts without a quote, it must be a header. Example: 'Header1,Header2,Header3,"value1","value2","value3"'. Here Header1, Header2 and Header3 are column names and the rest are values.

  2. The main point to remember when splitting is to check that each split was done properly. a) Take a split value and count the quotes in it (the count must be even). b) If the count is odd, append the next split value. c) Repeat steps a and b until the quote count is even.
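
The check-and-append procedure could be sketched as follows. This is only one possible reading of the steps: it splits on every comma with a plain String.split(), then glues pieces back together while the quote count is odd. The names SplitAndRejoin, countQuotes and split are hypothetical, and keeping an unbalanced remainder as a single field is an assumption, since the answer does not say what to do with malformed input.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndRejoin {
	// Counts the double quotes in a piece
	static int countQuotes(String s) {
		int count = 0;
		for (int i = 0; i < s.length(); i++) {
			if (s.charAt(i) == '"') {
				count++;
			}
		}
		return count;
	}

	static List<String> split(String line) {
		List<String> fields = new ArrayList<>();
		StringBuilder pending = null; // pieces collected so far whose quotes are not balanced yet
		// The -1 limit keeps trailing empty pieces
		for (String piece : line.split(",", -1)) {
			if (pending == null) {
				if (countQuotes(piece) % 2 == 0) {
					fields.add(piece); // even count: the split was fine
				} else {
					pending = new StringBuilder(piece); // odd count: the comma was inside quotes
				}
			} else {
				// Put back the comma that split() removed and append the next piece
				pending.append(',').append(piece);
				if (countQuotes(piece) % 2 == 1) {
					// An odd piece makes the running total even again, so the field is complete
					fields.add(pending.toString());
					pending = null;
				}
			}
		}
		if (pending != null) {
			// Assumption: unbalanced quotes at the end of the line are kept as one field
			fields.add(pending.toString());
		}
		return fields;
	}

	public static void main(String[] args) {
		String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
		System.out.println(split(s));
	}
}
```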

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | FarSh018 | View Question on Stackoverflow
Solution 1 - Java | Achintya Jha | View Answer on Stackoverflow
Solution 2 - Java | Menelaos | View Answer on Stackoverflow
Solution 3 - Java | Howard | View Answer on Stackoverflow
Solution 4 - Java | Shiva Prasadh | View Answer on Stackoverflow