Sunday, June 27, 2010

Questions I Should Be Able to Ask My City (On the Web)

It's too hard to ask a simple question about a city and get a simple answer.


This notion goes well beyond satisfying the need to know the answers. Our inability to ask basic questions about a city and get simple answers is a more profound shortcoming than the missing answers themselves. Simple answers to basic questions are the foundational elements of more complex analyses.

When we rely on experts and representatives, elected or otherwise, to perform these analyses we expect accuracy and completeness. But how is this measured? No matter what the field of study, the answer has always been references, data sets, and peer review.

Today, especially in matters of public interest, the word "peer" is hard to define. All the more reason why satisfying the first two elements, references and data sets, is of paramount importance.

Cities should provide simple answers to basic questions.

Cities should provide simple answers to basic questions in a URL-addressable, well-formed, text format.

There are many mechanisms cities can use to satisfy this requirement: a Web API, an online spreadsheet, posted XML documents. Ideally, the technology exists and can be integrated so that cities can answer questions via several of these mechanisms.
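As a rough illustration of why a URL-addressable, well-formed text format matters, here is a minimal Python sketch of answering one basic question from CSV text. The URL in the comment, the column names, and the rows are all invented for the example; no city is known to publish this exact file.

```python
import csv
import io

# Hypothetical sample of what a city might publish at a URL such as
# https://example.gov/data/schools.csv (URL, columns, and rows are invented).
SAMPLE_CSV = """school,latitude,longitude,enrollment
Example Elementary,36.85,-76.29,612
Sample Middle,36.88,-76.26,840
"""

def load_rows(text):
    """Parse well-formed CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def total_enrollment(rows):
    """Answer one basic question: how many students are enrolled city-wide?"""
    return sum(int(row["enrollment"]) for row in rows)

rows = load_rows(SAMPLE_CSV)
print(total_enrollment(rows))
```

With a real published file, the text would come from the URL (for example via `urllib.request.urlopen`) instead of the inline sample, and everything after that would stay the same.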

So what are the questions we would ask our cities, if we could? Following are a few to which I would like to know the answers. Moreover, if provided in URL-addressable, well-formed, text format(s), the answers to these questions could make for some rather interesting, useful mash-ups.
  1. Where is the location of each road and intersection under construction in my city?
  2. How long has road/intersection X been under construction?
  3. When will road/intersection X construction project be complete?
  4. What is the list of each building construction permit in my city?
  5. Where is the location of each building construction permit in my city?
  6. What is the list of businesses that pay taxes in my city?
  7. What is the list of businesses that pay taxes in my city, by type of business?
  8. What is the list of how much each business paid in taxes in my city?
  9. Where is the location of each business that pays taxes in my city?
  10. What is the list of neighborhoods in my city?
  11. How many houses are in each neighborhood in my city?
  12. What is the list of schools in the city? (This should include private and church schools, too, if these schools are required to be registered with the city.)
  13. What is the location of each school in my city?
  14. What is the enrollment at each school in my city?
  15. How many cars are registered in my city?
  16. What is the list of most traveled streets in my city?
  17. What is the list of traffic accidents in my city?
  18. Where are the locations of each traffic accident in my city?
  19. How many pounds of garbage are collected in my city?
  20. How many pounds of garbage are collected in my city, by zipcode?
  21. How many pounds of garbage are collected in my city, by neighborhood?
  22. What is the list of events in my city?
  23. When are events occurring in my city, by time period?
  24. Where are events occurring in my city?
  25. What is the list of conditional use permits in my city?
  26. What is the list of all emergency response stations (fire, police, medical) in my city?
  27. Where are the locations of each emergency response station in my city?
  28. What is the number of emergency calls in my city, by time period?
  29. What is the list of stations responding to emergency calls, by time period?
  30. What is the frequency of emergency calls by station and time period?
[To be continued...]

Without a doubt, this list is far from exhaustive. Crime data is one of the most popular and reported-on data sets. I tried to come up with relevant questions for which data are not likely accessible. Certainly, schemas are required for declaring the answers to these questions in our required format. Not every question has to be anticipated a priori. We should also not assume that a particular question implies a specific data set. We can often easily get, for example, the location of each crime. Rarely, however, can we get aggregate data, which are useful in trend analyses. Perhaps what we need are structured data sets and Wolfram Alpha. In any case, a foundational platform of data is required, and we should be able to formulate a basic question and get a simple answer (on the Web).
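To make the point about aggregate data concrete, here is a small Python sketch that rolls record-level rows up into counts by period and by neighborhood. The records and neighborhood names are invented; the only assumption is that each record carries an ISO-formatted date.

```python
from collections import Counter

# Invented incident-level records: (ISO date, neighborhood).
INCIDENTS = [
    ("2009-01-14", "Northside"),
    ("2009-01-30", "Harbor"),
    ("2009-02-02", "Northside"),
    ("2009-02-17", "Northside"),
    ("2009-02-21", "Harbor"),
]

def count_by_month(records):
    """Roll record-level rows up into counts per YYYY-MM period."""
    return Counter(date[:7] for date, _ in records)

def count_by_neighborhood(records):
    """Roll record-level rows up into counts per neighborhood."""
    return Counter(place for _, place in records)

print(count_by_month(INCIDENTS))
print(count_by_neighborhood(INCIDENTS))
```

If the record-level data are published in a well-formed format, aggregates like these do not need to be anticipated by the city; anyone can derive them.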

Given the simplicity of what is being asked, however, this should not be complicated.

Transparency and Visualization of City Data

Last year my city was considering closing one or two schools (HamptonRoads.com, WAVY.com). One of the justifications cited was decline in student enrollment. Though I can no longer find it anywhere online, one of the enrollment artifacts put forth was this image of a column chart from a (presumably) Excel spreadsheet.



To look at this picture (of data) one would assume there was a HUGE drop in enrollment, something in the vicinity of 50-60%.

But a closer look at the values reveals quite a different quantitative result. In fact, the actual decline from the peak in 1994-95 to the valley in 2008-09 appears to be 5,600 students. Keep in mind, I'm doing my best to derive the actual number based on the Y-axis scale on the picture. The data weren't provided. To make matters more difficult, the column chart is shown in 3D. This undermines the utility of the visualization by making it hard to determine an actual value on the chart.

The actual difference is 6,824 students, from the high in 1994-95 to the projected low in 2012-13.

6,824 students is 18.1% of the peak (6,824/37,707). While significant, 18.1% is a far cry from what appears to be a 50-60% decline.
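The arithmetic is easy to check. This short Python sketch uses only the two figures quoted above; the function name is mine.

```python
PEAK_ENROLLMENT = 37_707  # 1994-95 high, from the post
DECLINE = 6_824           # peak to projected 2012-13 low, from the post

def percent_decline(decline, peak):
    """Decline expressed as a percentage of the peak value."""
    return 100 * decline / peak

print(f"{percent_decline(DECLINE, PEAK_ENROLLMENT):.1f}%")
```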

Fortunately, when I requested the data they were provided to me.

This is valuable for two reasons. First and foremost, I get the data, not an interpretation of the data. Second, and more the point of this post, I can show what this chart should look like and what else the data have to say.

Notice the slope of the decline and the differences between yearly observations. Neither is as exaggerated as in the first image.

The differences in view are the aspect ratio of the graph and the value of the Y-axis origin. In the first chart the aspect ratio of the picture is nearly square, or 1:1. The starting value of the Y-axis is 28000 (an arbitrary value not found anywhere in the data). In the second chart the Y-axis begins at zero (0) and the aspect ratio is closer to 3:1 (which is probably too wide).
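The effect of the Y-axis origin can be quantified. The Python sketch below compares the drop in drawn bar height for the same two values under different baselines. It uses the peak and the projected low implied by the figures in this post; the function is my own illustration, and the result depends on which two years are compared, so it is not meant to reproduce the visual impression of the original chart exactly.

```python
PEAK = 37_707        # 1994-95 high, from the post
LOW = PEAK - 6_824   # projected 2012-13 low implied by the post's figures

def apparent_decline(peak, low, baseline):
    """Percent drop in drawn bar height when the Y-axis starts at `baseline`."""
    peak_height = peak - baseline
    low_height = low - baseline
    return 100 * (peak_height - low_height) / peak_height

# Compare a zero baseline with the truncated baseline used in the first chart.
for baseline in (0, 28_000):
    print(baseline, round(apparent_decline(PEAK, LOW, baseline), 1))
```

With a zero baseline the bars shrink by the same percentage as the data. With a truncated baseline the bars shrink by a much larger percentage than the data do, which is the distortion described above.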

So then, which image and which set of chart variables is correct?

The answer depends on the basis for the query...or what you are trying to say. Actually the answer has nothing to do with "what you are trying to say" since you/we shouldn't be trying to say anything except to represent quantitative values in visual form. The right thing is to understand the nature of the analysis and the decision being made.

In this case, we can assume with confidence that the Y-axis has no business starting at any value other than zero. What is being claimed, via what is shown, is a long-term (since 1990) decline, not a year-to-year fluctuation. Over the 14-year span the decline is obviously not in the range of 50-60%, so the chart should not imply that it is. If, on the other hand, year-to-year deltas are of interest, it probably does not make sense to show a 14-year span.

By the way, the data for recent yearly decline are -216, -200, +30, -17 between the years 2006-2010. 200 students city wide is hardly justification for closing a school. (Our school has over 600 students.)
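For anyone who wants to check year-to-year figures like these against published enrollment counts, here is a minimal Python sketch. The four deltas are the ones quoted above; the `deltas` helper is my own and would be applied to a real enrollment series, which is not reproduced here.

```python
# Year-over-year enrollment changes quoted in the post for 2006-2010.
YEARLY_DELTAS = [-216, -200, 30, -17]

def deltas(series):
    """Differences between consecutive observations in a series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def net_change(changes):
    """Total change over the whole period."""
    return sum(changes)

print(net_change(YEARLY_DELTAS))
```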

In the end, my city chose not to close a school. I think that was a wise choice. But I have to wonder what was going on in the minds of the decision makers. What analyses were they using? Did they, too, realize the truth in these numbers, or was it some other (political) factor? Did we just get lucky?

When cities make claims based on data they should publish the data. When cities, or anyone, show you a picture of data, you should question the picture and politely request a reference to the data. At minimum, know what you are seeing.

Thursday, June 24, 2010

Mashing up Hampton Roads School System Data

I'm tempted to not provide a narrative for this. The pictures really tell the story.

I was reading an article from my local "paper," The Virginian-Pilot, online. When I finished the article I noticed the "Quick Links" just below. To my pleasant surprise, the first category of links was "Databases." Truth be told, I'm not that surprised to find data at The Pilot, but what unfolded when I followed that link was far from the norm when it comes to online (local) journalism and data.

Pilot_Databases_QuickLink


I clicked the link to "School Violence" and found the following page. If you look closely you will see there is a form:


Pilot_Databases_Form


Clicking "Search" took me to this next page. I was not surprised to see that the results were available in HTML only:


Pilot_Databases_Example

Whenever I come across data like this the first thing I want to know is whether or not it can actually be used as data. In order to be used/usable as data the contents of this HTML table need to be, at minimum, copy-and-paste-able into a spreadsheet.

Lo and behold... In about 20 minutes I had pasted all of the contents of the database into a Google spreadsheet:

Pilot_Databases_Google_Spreadsheet
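I did this by hand with copy and paste, but the same test ("can this HTML table be used as data?") can be sketched in Python with the standard library's `html.parser`. The sample table below is invented and much simpler than the real page; the class and function names are mine, and this is not how The Pilot's site works internally.

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html):
    """Convert the rows of an HTML table into CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

# Invented sample; the real results page has different columns and values.
SAMPLE_HTML = """
<table>
  <tr><th>Division</th><th>Violation</th><th>Count</th></tr>
  <tr><td>Division A</td><td>Fighting</td><td>12</td></tr>
  <tr><td>Division B</td><td>Fighting</td><td>7</td></tr>
</table>
"""
print(html_table_to_csv(SAMPLE_HTML))
```

The resulting CSV text can be saved to a file and imported into any spreadsheet.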

From there I was able to create a few transformations of the data and about a half-dozen meaningful charts like this:
number_of_violations_by_type_and_school_division
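A transformation like the one behind that chart amounts to a pivot: summing counts by violation type and school division. I did it in the spreadsheet, but a rough Python equivalent looks like this. The rows are invented stand-ins for the pasted data, and the function name is mine.

```python
from collections import defaultdict

# Invented rows in the shape (division, violation type, count).
ROWS = [
    ("Division A", "Fighting", 12),
    ("Division A", "Threats", 5),
    ("Division B", "Fighting", 7),
    ("Division B", "Threats", 9),
    ("Division A", "Fighting", 3),
]

def violations_by_type_and_division(rows):
    """Sum counts into a nested mapping: violation type -> division -> total."""
    table = defaultdict(lambda: defaultdict(int))
    for division, violation, count in rows:
        table[violation][division] += count
    # Convert to plain dictionaries so the result prints and compares cleanly.
    return {violation: dict(per_division) for violation, per_division in table.items()}

for violation, per_division in violations_by_type_and_division(ROWS).items():
    print(violation, per_division)
```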

What if more news organizations published databases? What if journalists had a habit of digging into those databases with these simple tools? Would we create better news products? Would we become more accurately and insightfully informed as citizens? Would our governments, businesses, and civic organizations make wiser strategy and better decisions? And if we think that the answer to any of these questions is yes, how do we sustain the "model?"