Friday, April 25, 2008

Best Practices of Well-Formed Data?

Update 2008/05/06: It occurs to me now that this works for me due to my dependence on Western (American English) style characters: http://benfry.com/writing/archives/82... My concern is readability by humans, and for me that means American English. Unicode is a superset, so I have modified my post accordingly.

I have come to realize that there are exactly three (3) formats for well-formed data which suit all of my interoperability needs:

  1. XML
  2. Unicode standard object notation
    1. (i.e., script object notation, i.e., SON, i.e., Object.property as text, e.g., JSON)
  3. Unicode standard delimited notation
    1. (e.g., comma-separated values)

Each of these has upsides and downsides.

Use XML when you want to be the most versatile with your data. Besides being ubiquitous, XML serializes and deserializes quickly and easily, represents object-oriented (OO) data clearly, and can be validated automatically. XML presents itself well to both people and software. Especially use XML when you want to run advanced searches on data or transform the data from one domain-specific language (DSL) to another. DOM and XPath queries do lose their utility as data sets get large, but these situations can be avoided, and parsing remains fast even on files many megabytes in size. Batch processing gigabytes of data is not a job for XML. (Use a relational database for that.)
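For illustration only, here is a minimal sketch of that parse/query/serialize round trip, written in Python because its standard library ships an XML parser; the <orders> document and its field names are made up for this example.

    # A minimal sketch using Python's standard library; the <orders>
    # document and its field names are invented for illustration.
    import xml.etree.ElementTree as ET

    xml_text = """
    <orders>
      <order id="1001"><customer>Ada</customer><total>19.95</total></order>
      <order id="1002"><customer>Bob</customer><total>5.00</total></order>
    </orders>
    """

    root = ET.fromstring(xml_text)          # deserialize (parse)
    for order in root.findall(".//order"):  # simple XPath-style query
        print(order.get("id"), order.findtext("customer"), order.findtext("total"))

    # serialize back out to text
    print(ET.tostring(root, encoding="unicode"))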

Use SON when all you want to do is serialize/deserialize data to/from business objects or standard data structures (e.g., arrays, maps). JavaScript Object Notation, or JSON, is super convenient for JavaScript apps, and JSON "liblets" (very small libraries, usually only one file) are easily found and maintained alongside your own code. SON also represents OO data clearly and, given a pretty-print function, presents itself well to both people and software. But there is currently no way to validate a SON, and not everything can handle one. JSON doesn't work with languages that aren't JavaScript*, although there's nothing about JSON that is proprietary to JavaScript, so it's just a matter of time before that changes. JSON is just formatted text. Unicode-based SONs may one day be the XML killer, but that time is a long way off.
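Again as a sketch only, here is the kind of serialize/deserialize round trip I mean, using Python's built-in json module; the invoice structure is invented for illustration.

    # A minimal sketch of SON-style round-tripping with Python's built-in
    # json module; the "invoice" structure is made up for illustration.
    import json

    invoice = {"id": 1001, "customer": "Ada", "lines": [{"sku": "X1", "qty": 2}]}

    text = json.dumps(invoice, indent=2)   # serialize (pretty-printed)
    restored = json.loads(text)            # deserialize back to dicts/lists

    assert restored == invoice
    print(text)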

Use CSVs when you want maximum throughput and can assume the data is simple to interpret. CSVs work well when moving very large amounts of data over a wire and/or into and out of relational databases. CSV data does not represent OO in any way, it does not present itself well to people, and there is no standard way to validate it automatically. CSV can be super efficient, but it is error-prone, can be difficult to debug, and is not nearly as versatile as the other two formats.
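Here is a minimal sketch of that flat, streaming style of use, with Python's built-in csv module; the file name and column layout are assumptions for illustration.

    # A minimal sketch using Python's built-in csv module; the file name
    # and column layout are assumptions for illustration.
    import csv

    # Write some rows (no OO structure, just flat records).
    with open("orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "customer", "total"])   # header row
        writer.writerow([1001, "Ada", 19.95])
        writer.writerow([1002, "Bob", 5.00])

    # Read them back one row at a time (cheap even for very large files).
    with open("orders.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["id"], row["customer"], row["total"])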

A good data service should support all three of these whenever and wherever it is feasible. And here is the obvious point that I do not want to escape me: these three formats are all any data service ever needs to provide. Any service that supports each of them will be "universally" (practically) interoperable. If all data services supported exactly these three, then the whole Web would be our oyster.
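To make the point concrete, here is a hypothetical sketch of a service handing back the same record in all three formats; the render_order function and its fields are my own invention, not any particular service's API.

    # A hypothetical sketch of serving one record in all three formats;
    # render_order and its fields are invented for illustration.
    import csv, io, json
    import xml.etree.ElementTree as ET

    def render_order(order: dict, fmt: str) -> str:
        if fmt == "json":
            return json.dumps(order)
        if fmt == "xml":
            root = ET.Element("order")
            for key, value in order.items():
                ET.SubElement(root, key).text = str(value)
            return ET.tostring(root, encoding="unicode")
        if fmt == "csv":
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=order.keys())
            writer.writeheader()
            writer.writerow(order)
            return buf.getvalue()
        raise ValueError("unsupported format: " + fmt)

    order = {"id": 1001, "customer": "Ada", "total": 19.95}
    for fmt in ("xml", "json", "csv"):
        print(render_order(order, fmt))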

*Update: Over at JSON.org you can find ports for every language under the sun. What I actually meant is that JSON support isn't native in the main platforms. Hopefully it will be built into all of these languages in the future.
