Friday, April 25, 2008

Best Practices of Well-Formed Data?

Update 2008/05/06: It occurs to me now that this works for me due to my dependence on Western (American English) characters: http://benfry.com/writing/archives/82... My concern is readability by humans, and for me that means American English. Unicode is a superset of ASCII, so I have modified my post accordingly.

I have come to realize that there are exactly three (3) formats for well-formed data which suit all of my interoperability needs:

  1. XML
  2. Unicode standard object notation
    1. (i.e., script object notation, i.e., SON, i.e., Object.property as text, ex. JSON)
  3. Unicode standard delimited notation
    1. (ex., comma-separated values)
Each of these has upsides and downsides.

Use XML when you want to be the most versatile with your data. Besides being ubiquitous, XML serializes and deserializes quickly and easily, represents OO clearly, and can be easily, automatically validated. XML presents itself well for both people and software. Especially use XML when you want to do advanced searches on data or transform the data from one domain-specific language (DSL) to another. DOM and XPath queries do lose their utility as data sets get large, but these situations can be avoided and parsing remains fast even on files many megabytes in size. Batch processing gigs of data is not a job for XML. (Use a relational database for that.)
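For what it's worth, here is a minimal sketch of serializing and deserializing a small record with Python's standard library parser (the "person" element and its fields are made-up names, just for illustration):

import xml.etree.ElementTree as ET

# Serialize a simple business object (a plain dict here) to XML.
person = {"name": "Ada", "role": "engineer"}
root = ET.Element("person")
for key, value in person.items():
    child = ET.SubElement(root, key)
    child.text = value
xml_text = ET.tostring(root, encoding="unicode")

# Deserialize it back into a plain dictionary.
restored = {child.tag: child.text for child in ET.fromstring(xml_text)}
assert restored == person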

Use SON when all you want to do is serialize/deserialize data to/from business objects or standard data structures (ex., arrays, maps). JavaScript Object Notation, or JSON, is super convenient for JavaScript apps, and JSON "liblets" (very small libraries, usually only 1 file) are easily found and maintained with your own code. SON also represents OO clearly and, given a pretty-print function, presents itself well to both people and software. But there is currently no way to validate a SON. Not everything can handle a SON either. JSON doesn't work with languages that aren't JavaScript*, although there's nothing about JSON that is proprietary to JavaScript, so it's just a matter of time. JSON is just formatted text. Unicode-based SONs may one day be the XML killer, but that time is a long way off.
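The same round trip for a SON, again as a rough sketch, with Python's built-in json module standing in for one of those liblets (the "order" record is invented for the example):

import json

# Serialize standard data structures (maps, arrays) to JSON text.
order = {"id": 42, "items": ["widget", "gadget"], "shipped": False}
text = json.dumps(order, indent=2)  # pretty-printed for human readers

# Deserialize the text straight back into the same structures.
restored = json.loads(text)
assert restored == order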

Use CSVs when you want to be super fast on throughput and can assume the data is simple to interpret. CSVs work well when moving very large amounts of data over a wire and/or into and out of relational databases. CSV data does not represent OO in any way, it does not present itself well to people, and there is no standard way to automatically validate it. CSV can be super efficient, but it is error-prone, can be difficult to debug, and is not nearly as versatile as the other two formats.
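And a quick sketch of CSV in the same spirit (rows and column names are invented); note that every field comes back as an untyped string, which is exactly where the "simple to interpret" assumption earns its keep:

import csv, io

# Write rows of simple, flat data as CSV text.
rows = [["id", "name", "qty"], ["1", "widget", "12"], ["2", "gadget", "7"]]
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read them back; interpretation and validation are entirely up to the caller.
restored = list(csv.reader(io.StringIO(buffer.getvalue())))
assert restored == rows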

A good data service should support all three of these whenever and wherever it is feasible. And here is the obvious point that I do not want to escape me - these 3 formats are all any data service ever needs to provide. Any service that supports each of these will be "universally" (practically) interoperable. If all data services supported exactly these 3, then all the Web would be an oyster.

*Update: over at JSON.org you can find ports for every language under the sun. Actually, what I meant is that support for JSON isn't native in the main platforms. Hopefully it will be built into all of these languages in the future.

Sunday, April 6, 2008

Is Google Search Appliance Appropriate for Government and Corporate Enterprises?

Question to Ponder (QTP)

Google's smarts are based on PageRank, which is tied to popularity. Popularity on Google is heavily influenced by financial and other driving incentives. Google's strengths are based on ginormous machine clusters and on a huge critical mass of the widest diversity of users, who are mixing it up and providing feedback into the system both explicitly and implicitly by way of their behavior.

So is a Google search appliance a smart recommendation for government or corporate enterprise? I don't know exactly, but the last good info I had was that they cost about $150K per appliance rack that could search 2 or 3M items.  [Update 01.14.2009 - new box searches 10M docs]  One NASA center has two racks and found they chewed up 4-6M items in no time. I don't know what percentage that was, but I do know that they wanted more racks. Last I talked to them, they were getting good at setting up filters to weed stuff out. (Weeding the garden is probably not a bad thing in any event.)

Now, I suppose you could - antitrust laws and fear notwithstanding - scale Google's model to apply to the whole of the US federal government's information infrastructure, and we'd approach some aspect of critical mass. Popularity behavior would surely be different than in the commercial advertising model (I think). But that's not necessarily bad, if true.

It actually might not cost a lot of money in terms of federal budget scales. The way I figure, 1 billion records would cost $50M at full retail cost, i.e., no negotiated discount:

1,000,000,000 items / 3,000,000 items per rack ≈ 333 racks; 333 × $150,000 ≈ $50,000,000

Tack on 20% to implement? +$10M gets us to $60M
20% per year for maintenance? +$10M per year × 10 years and we're at $160M

Double it all for 2B items: $320M. Even with no write-downs for discount, that's a rounding error in the federal budget.
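Spelling out that back-of-the-envelope math as a tiny script (all figures are the rough retail numbers quoted above, not actual pricing):

items = 1_000_000_000
items_per_rack = 3_000_000
rack_cost = 150_000

racks = items / items_per_rack                       # ~333 racks
hardware = racks * rack_cost                         # ~$50M
implementation = 0.20 * hardware                     # ~$10M, so ~$60M up front
maintenance = 0.20 * hardware * 10                   # ~$10M per year for 10 years
total_1b = hardware + implementation + maintenance   # ~$160M
total_2b = 2 * total_1b                              # ~$320M for 2B items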

To avoid monopoly and other sticky issues with a deal like this, the government should buy Google's intellectual property and arrange a reseller license agreement so the government can contract out to the fairest qualified bidders (plural) the secure right to implement the IP on hardware on the government's behalf. Google could even be forbidden from competing in that way for some number of years. Government then leases time on the GIG to cover the bandwidth resources consumed by the huge clusters serving this purpose. (Ooh - is that in the cost?) Government already has the ability to secure this at multiple levels of security (Is that in the cost?). It is essentially, if perhaps not trivially, a matter of organizing search clusters to operate within NIPR or within SIPR or within JWICS firewalls.

[Update on 11 April based on conversation with Katharine: It might not be one big Google search farm pointed at these networks, at least not without a radical change in policy that may be intractable due to politics. There are many, many organizational firewalls. But at least get the bulk purchase order in, get a huge discount...for taxpayers, and place clusters where they need to be to get the biggest bang for those bucks. The essence is to take a strategic approach to search for the whole of government information. Search is so very fundamental that it deserves this kind of treatment.]

Ok...that demon thought is expelled now. I'm writing too many proposals. Approaching temporary burnout threshold. My brainstorming synapses may be totally misfiring at this point.


[Update 3 December 08:  2B searchable items is nary a drop in the bucket, I reckon.  Are we talking about 2 trillion?  200 trillion?  Large numbers that sound obscene when spoken?]

Wednesday, April 2, 2008

On Accessors and Mutators

I was just exploring the Google Visualization API reference and found myself a little hung up on something. It's not just Google's API, but they are certainly a huge propagator of code. It's not the first time I've been hung up on this, either. The problem is that most APIs that use the Accessor/Mutator pattern follow it in the same, and in my opinion wrong, way. What is the point of declaring and implementing a unique function - no - two functions for every property contained by an Object? In fact, why do we need any more than exactly two functions total to get or set any property or, in fact, any Object?

getColor()
getName()
getNumberOfColumns()
getNumberOfRows()

Why not just get(namedAttribute) and set(namedAttribute, value)?

get("color")
get("name")
get("numberOfColumns")
get("numberOfRows")
set("numberOfRows", 6);

Realize that 6 is probably not the literal integer 6, but rather an Object itself or perhaps a Collection (of size 1) containing literal, primitively typed values. But that is hardly of consequence in the scheme of things, it seems to me. I dread even having to make the case for efficiency. Programmers still get hung up too much on efficiency in terms of cycles and RAM. While I am sensitive to these concerns (for reasons of achieving enough throughput to support fluid user interaction), such concerns pale in comparison to what is lost overall to the complexity of integration. How many APIs do we want to learn, really? Domain-specific languages are much easier to learn and use, and more so when there aren't too many competing DSLs floating around a domain.
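To make the point concrete, here is a minimal sketch of the kind of generic accessor I have in mind, written in Python with placeholder names:

class Record:
    # A generic property bag: two methods cover every attribute.
    def __init__(self):
        self._properties = {}

    def get(self, name):
        return self._properties.get(name)

    def set(self, name, value):
        self._properties[name] = value
        return self  # allow chained sets

table = Record()
table.set("numberOfRows", 6).set("color", "blue")
print(table.get("numberOfRows"))  # 6

Adding a new attribute never changes the interface; it is just another name passed to the same two methods.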

Perhaps I have it wrong and it is all a trivial point I am making. But it seems an awful waste of time doing it the other way. What kind of code bloat is being propagated? What happens when you add a new attribute? In fact, the ability to declare an arbitrary property and assign it to any Object is a powerful mechanism in software. It will be used and abused one way or the other so we should anticipate and handle it. But specifically, this kind of abstract polymorphism is ideal for serialization and deserialization of database records without having to tightly bind classes to database constructs through object relational mapping. Whether using XML, standard object notation, or standard delimited formats, there are multiple ways to deserialize any Object into memory for use by business logic without having to declare and implement a class for the Object.

In fact, one interface for get and one for set is much more flexible on the whole. Integration and interoperability with other programs is much easier with a single get/set. The ability for interoperating software programs to bind solely to a universal input/output protocol, and for interoperating software to remain blissfully unconcerned with the inner workings of peers, is beyond powerful. Then we can focus on the functional roles of software. What does it do? I don't care how (beyond the I/O interface) (and I do care how well). When you get to that point, software can program other software (within a security model, of course). This kind of integration is also easier to maintain. Consider an implementation that uses XML as the carrier. The main Do-Until-Error loop is as simple as: While(Element is recognized) Process Element; Else Throw/Ignore (Unrecognized Element Error). A secondary error condition will exist in the event that expected inputs are no longer generated as outputs from peers*. In both cases, the situations are reduced to a matter of choosing how to process output from peers as it is provided, and not also a matter of refactoring the interface between peers. It's all the UNIX command line applied to the Web. And I am certainly not the first or only person to have noticed.
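For illustration, a rough sketch of that loop assuming an XML carrier, with a hypothetical handlers table mapping recognized element names to processing functions (the element names, the process function, and the strict flag are all invented for the example):

import xml.etree.ElementTree as ET

handlers = {
    "order": lambda el: print("processing order", el.get("id")),
    "customer": lambda el: print("processing customer", el.get("id")),
}

def process(xml_text, strict=False):
    root = ET.fromstring(xml_text)  # malformed input raises ParseError here
    for element in root:
        handler = handlers.get(element.tag)
        if handler is not None:
            handler(element)        # recognized: process it
        elif strict:
            raise ValueError("unrecognized element: " + element.tag)
        # otherwise, silently ignore the unrecognized element

process('<feed><order id="1"/><unknown/><customer id="7"/></feed>')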

* A third condition occurs when the input is malformed.