23 November 2010

Opening the information floodgates

An unexpectedly quick return to the Royal Society was again caused by the word "information".

This time it was bundled up in the phrase "Opening the information floodgates: the technologies and challenges of a web of linked data", which is enough to get any geek moist with anticipation.

Rising to that challenge was Professor Nigel Shadbolt of the University of Southampton (which is where I learnt all the maths that I have now forgotten) who gave us a view of how the web is evolving to encompass structured data.

Along the way he gave us this illuminating star system for assessing connectivity:
  1. Put your data on the web (any format)
  2. Make it available as structured data, e.g. csv
  3. Use open, standard data formats
  4. Use URLs to point to your data (so that people and machines can get to it)
  5. Link your data to other people's data

This is the essence of the semantic web, where content carries meaning and so allows new, deeper connections to be exploited. Some examples were given, such as the ASBOrometer iPhone app, but these were the familiar mash-ups of geographical data against one other set of data that have been around for years.
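
The star system can be read as a ladder, where each star only counts once the ones below it have been earned. That cumulative reading is an assumption rather than something spelt out in the talk, and the field names in this rough Python sketch are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset (field names are illustrative)."""
    on_web: bool        # 1. data is on the web in any format
    structured: bool    # 2. available as structured data, e.g. csv
    open_format: bool   # 3. uses an open, standard data format
    has_urls: bool      # 4. URLs point to the data
    linked: bool        # 5. links to other people's data


def star_rating(dataset: Dataset) -> int:
    """Count stars from the bottom up, stopping at the first criterion not met.

    Assumes the stars are cumulative: a dataset cannot earn star 4
    without first earning stars 1 to 3.
    """
    criteria = [
        dataset.on_web,
        dataset.structured,
        dataset.open_format,
        dataset.has_urls,
        dataset.linked,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars
```

On this reading, a spreadsheet in a proprietary format that has been put on the web scores two stars however many URLs point at it, because it falls at the third step.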

So far so good, but there is a big problem. And that's quality.

The example Professor Shadbolt gave us was the official data on the location of bus stops, in which 5% of the records contain errors.

This problem, while admitted, was rather glossed over with the enthusiastic claim that the crowd will fix the problem, as it has with matter-of-fact issues in Wikipedia.

But that is to ignore the examples that go against this.

For example, if you Google "Slovak Currency" you are still told that "1 Slovak koruna = 0.0280875591 British pounds", almost two years after the Slovaks upgraded to the Euro.

And I've pointed out problems with map data previously.

Data interpretation, or Information Literacy if you prefer, is another big issue that has yet to be addressed. Sharing data makes lots of assumptions about what it means, as anybody who has tried benchmarking knows.

For example, to compare data about hospitals you need to know about any specialities that they have (more people die of cancer in hospitals that specialise in cancer simply because they take proportionally more cancer patients) and the catchment areas they serve (proportionally more people die in hospitals that serve unhealthy regions).
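
The hospital point is easier to see with numbers. The figures below are invented purely for illustration, and the adjustment used (direct standardisation against a common patient mix) is one common technique chosen for this sketch, not something that was presented in the talk:

```python
# Hypothetical figures: (patients, deaths) per condition for two made-up hospitals.
hospitals = {
    "specialist": {"cancer": (800, 80), "other": (200, 2)},
    "general": {"cancer": (100, 12), "other": (900, 18)},
}


def crude_rate(cases):
    """Deaths divided by patients, ignoring what the patients were admitted for."""
    patients = sum(p for p, _ in cases.values())
    deaths = sum(d for _, d in cases.values())
    return deaths / patients


def standardised_rate(cases, reference_mix):
    """Death rate the hospital would have if it treated the reference patient mix.

    Applies the hospital's own per-condition death rate to the number of
    patients with that condition in the reference mix.
    """
    expected_deaths = sum(reference_mix[c] * d / p for c, (p, d) in cases.items())
    return expected_deaths / sum(reference_mix.values())


# Reference mix: the combined patients of both hospitals, per condition.
reference_mix = {
    condition: sum(h[condition][0] for h in hospitals.values())
    for condition in ("cancer", "other")
}

for name, cases in hospitals.items():
    print(name, f"crude={crude_rate(cases):.2%}",
          f"standardised={standardised_rate(cases, reference_mix):.2%}")
```

With these made-up numbers the specialist hospital looks far worse on the crude figure (8.2% against 3.0%) even though it does better than the general hospital for both cancer patients and everybody else. Once both are measured against the same patient mix the order reverses (5.05% against 6.5%), which is exactly the sort of context that a raw shared dataset does not carry with it.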

These concerns were obvious to the audience and most of the questions that were asked at the end were about quality or interpretation of data.

The semantic web sounds a good idea in principle but there is an awfully long way to go from PowerPoint to implementation.
