String.split() method - add a "quote character" argument to improve delimiter recognition


#1

Hey everyone!

Coming from a Python background, I was really surprised to see that the best way to split a string whose delimiter also appears inside quotation marks involves an absurd amount of regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)".toRegex())

Pandas handles this nicely with a ‘quotechar’ argument: delimiters that appear between a pair of the specified quote characters are ignored, so the quoted text comes through as a single field. What are your thoughts?

Example, where mystring is a single string obtained by, say, reading one line from a .csv file:

val mystring = "Hi there, \"What is your opinion on, say, delimiters?\", such as this."
println(mystring.split(",", quotechar = '"'))  // proposed argument, not in the standard library today

output:

("Hi there", "What is your opinion on, say, delimiters?", "such as this.")

instead of

("Hi there", "What is your opinion on", "say", "delimiters?", "such as this.")
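
To make the request concrete, here is a minimal sketch of what such a function might do, written as an extension. The name splitQuoted and its parameters are hypothetical, not an existing standard library API. It assumes the simplest possible rules: quotes never appear inside a quoted value, the quote characters are dropped from the result, and each field is trimmed to match the output above.

```kotlin
// Hypothetical extension for illustration; not part of the Kotlin standard library.
// Assumptions: no escaped quotes, quote characters are dropped, fields are trimmed.
fun String.splitQuoted(delimiter: Char = ',', quoteChar: Char = '"'): List<String> {
    val fields = mutableListOf<String>()
    val current = StringBuilder()
    var inQuotes = false
    for (ch in this) {
        when {
            // Toggle the quoted state and discard the quote character itself.
            ch == quoteChar -> inQuotes = !inQuotes
            // A delimiter outside quotes ends the current field.
            ch == delimiter && !inQuotes -> {
                fields += current.toString().trim()
                current.setLength(0)
            }
            // Everything else, including delimiters inside quotes, is kept.
            else -> current.append(ch)
        }
    }
    fields += current.toString().trim() // last field has no trailing delimiter
    return fields
}

fun main() {
    val mystring = "Hi there, \"What is your opinion on, say, delimiters?\", such as this."
    // Print one field per line so commas inside a field are easy to see.
    mystring.splitQuoted().forEach(::println)
}
```

With the example string this should give the three fields listed in the first output above. Note that trimming also strips leading and trailing whitespace inside a quoted value, which is one of the simplifications in this sketch.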


#2

I suspect that this would open up a can of worms…

The problem is that CSV isn’t a single, well-specified standard; everyone implements it in a different way.

For example: Should it allow the quote character to be included within the quoted string? If so, should it expect a doubled quote character, or one preceded by an escape character? If the latter, which character (e.g. backslash)? If there’s no closing quote on a line, should it treat that as an invalid line, or as part of a multi-line value? Should it allow values that aren’t quoted? Is whitespace allowed around delimiters, and should it be stripped (as in your example)? And so on…
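
To illustrate just the first of those questions, here is a rough sketch of a splitter that has to be told which escape convention to expect. Everything in it (QuoteEscape, splitLine, the parameter names) is made up for the illustration, and it ignores the other questions entirely: no multi-line values, no whitespace stripping, no validation of unbalanced quotes.

```kotlin
// Hypothetical helper for illustration only; names and behavior are assumptions.
enum class QuoteEscape { DOUBLED, BACKSLASH }

fun splitLine(
    line: String,
    escape: QuoteEscape,
    delimiter: Char = ',',
    quote: Char = '"'
): List<String> {
    val fields = mutableListOf<String>()
    val current = StringBuilder()
    var inQuotes = false
    var i = 0
    while (i < line.length) {
        val ch = line[i]
        val next = line.getOrNull(i + 1)
        when {
            // Backslash convention: \" inside quotes is a literal quote.
            inQuotes && escape == QuoteEscape.BACKSLASH && ch == '\\' && next == quote -> {
                current.append(quote)
                i++ // skip the escaped quote
            }
            // Doubled convention: "" inside quotes is a literal quote.
            inQuotes && escape == QuoteEscape.DOUBLED && ch == quote && next == quote -> {
                current.append(quote)
                i++ // skip the second quote of the pair
            }
            ch == quote -> inQuotes = !inQuotes
            ch == delimiter && !inQuotes -> {
                fields += current.toString()
                current.setLength(0)
            }
            else -> current.append(ch)
        }
        i++
    }
    fields += current.toString()
    return fields
}

fun main() {
    val doubled = "a,\"say \"\"hi\"\", ok\",b"      // a,"say ""hi"", ok",b
    val backslash = "a,\"say \\\"hi\\\", ok\",b"    // a,"say \"hi\", ok",b

    // Each input read with its own convention: middle field is  say "hi", ok
    println(splitLine(doubled, QuoteEscape.DOUBLED))
    println(splitLine(backslash, QuoteEscape.BACKSLASH))

    // Same input read with the wrong convention: no error, but the middle
    // field silently loses its quotes and keeps stray backslashes.
    println(splitLine(backslash, QuoteEscape.DOUBLED))
}
```

The last call is the worrying case: the wrong guess still returns three fields, so nothing signals that the data was misread.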

There are whole libraries devoted to parsing CSV-style data, with lots of configuration options and workarounds!

I suspect it’s far too much to do properly in a single extension method. (And a simplified version would fail in subtle ways.) It could be considered for a utility class, perhaps. But I’m not sure it would be used widely enough to justify that. What do other people think?


#3

Yeah, Apache Commons CSV is probably the most well-known…