February 13, 2011

ANTLR or Scala?

I like ANTLR and wanted to write a nice clean CSV grammar. I didn't like the one on the ANTLR wiki (it eats leading/trailing whitespace in fields, which is no good per RFC 4180) and didn't find any others via Google. So I wrote one; it's here.

As I was about to hack it up into something usable as a library, I happened to be perusing some Scala docs and was reminded of Scala's built-in parser combinators. I gave them a shot, even though there are other implementations out there and the existing docs/examples are pretty rough. I have to say that after some effort the result is pretty clean and, at 23 lines (if you don't count the "parseCSV" convenience methods), far shorter than my ANTLR grammar:
import scala.util.parsing.combinator.RegexParsers

trait CSVParser extends RegexParsers {

    import scala.util.parsing.input.CharSequenceReader
    import scala.util.parsing.input.StreamReader
    import java.io._

    override def skipWhitespace = false

    protected def records = repsep(record, """\r?\n|\r""".r)

    protected def record: Parser[List[String]] = repsep(field, ",".r)

    protected def field: Parser[String] = quoted_field | unquoted_field

    protected def quoted_field: Parser[String] = """"(""|[^"])*"""".r ^^ {
        s: String => s.substring(1, s.length()-1).replaceAll("\"\"", "\"")
    }

    protected def unquoted_field: Parser[String] = """[^,"\r\n]*""".r

    /**
     * Returns the CSV records, each a list of field strings, or None if the input could not be parsed.
     * Note that empty input still parses, yielding a single record containing one empty field.
     */
    def parseCSV(reader: scala.util.parsing.input.Reader[Elem]): Option[List[List[String]]] = {
        parseAll(records, reader) match {
            case Success(result, _) => Some(result)
            case _ => None
        }
    }

    def parseCSV(input: CharSequence): Option[List[List[String]]] = parseCSV(new CharSequenceReader(input))

    def parseCSV(input: InputStream): Option[List[List[String]]] = parseCSV(StreamReader(new InputStreamReader(input)))

    def parseCSV(reader: Reader): Option[List[List[String]]] = parseCSV(StreamReader(reader))

    def parseCSV(file: File): Option[List[List[String]]] = {
        val fr = new java.io.FileReader(file)
        try {
            parseCSV(fr)
        } finally {
            fr.close()
        }
    }

}
Even with the convenience wrapper methods, coming in at 48 lines for something readily usable is pretty impressive. I really do like ANTLR, but at this point I can't justify the effort of hacking up my grammar to get to something usable (in Java).
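
To give a feel for what "readily usable" means, here is a minimal sketch that condenses the grammar above into an object and runs it on two inputs. The names `CSVDemo`, `parse` and `Main` are made up for this illustration and are not part of the code above. It also assumes the parser combinators are on the classpath; on current Scala versions they ship as the separate scala-parser-combinators module rather than with the standard library.

```scala
import scala.util.parsing.combinator.RegexParsers

// Condensed restatement of the grammar above, wrapped in an object so it can
// be called directly. CSVDemo and parse are illustrative names only.
object CSVDemo extends RegexParsers {
  // Whitespace is significant in CSV, so never skip it.
  override def skipWhitespace = false

  def records: Parser[List[List[String]]] = repsep(record, """\r?\n|\r""".r)
  def record: Parser[List[String]] = repsep(field, ",")
  def field: Parser[String] = quotedField | unquotedField

  // Strip the surrounding quotes, then collapse each doubled quote to one.
  def quotedField: Parser[String] =
    """"(""|[^"])*"""".r ^^ (s => s.substring(1, s.length - 1).replace("\"\"", "\""))

  def unquotedField: Parser[String] = """[^,"\r\n]*""".r

  // Some(records) when the whole input parses, None otherwise.
  def parse(input: CharSequence): Option[List[List[String]]] =
    parseAll(records, input) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}

object Main {
  def main(args: Array[String]): Unit = {
    // Spaces around "b" are kept, and "" inside a quoted field becomes ".
    println(CSVDemo.parse("a, b ,\"say \"\"hi\"\"\"\r\n1,2,3"))
    // An unterminated quoted field makes the whole parse fail.
    println(CSVDemo.parse("a,\"oops"))
  }
}
```

The first call should yield two records, with the middle field of the first record still carrying its surrounding spaces, which is the RFC 4180 behavior I was after. The second call should come back as None, since the quote is never closed.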

Scala 1, ANTLR 0.