My interest in standards and exchange formats is driven by two facts. First, my colleagues and I are in the process of creating a statistical publication data warehouse, that is, a centralized repository for all published statistical data. One of the main reasons for doing so is to be able to centralize and effectively build services for our customers. We want to provide system-to-system access, and we want customers to be able to subscribe to and browse our data and metadata, free to use it efficiently for whatever purpose they see fit. Second, we have experienced much interest in system-to-system access to data from all parts of society. We would like to provide our customers with the service they want and need.
If you look at the spectrum of those who want to harvest data in a systematic way, there are at one end big users who are interested in large portions of data, either in a specific domain or as a whole. These are typically scientists, such as university researchers. Entrepreneurs are part of this group, and already there is a company called Datamarket http://datamarket.com that has downloaded an incredible amount of data from our website. This was done in good cooperation with us. The scientific user is characterized by large and complex datasets with metadata, fetched infrequently.
At the other end there is a quite different type of user. These users are typically interested in some small part of our data, for instance the monthly CPI or the export of a set of goods to a certain country. In my experience these are typically enterprises interested in getting statistical data to use within their own information systems. Due to the indexation of loans in Iceland, banks and other financial institutions have a great interest in getting indices in a systematic way as soon as they are published. In addition, software developers are interested in adding functionality to their applications. Let's call this group small data enterprises. Then we have all kinds of users in between, in all kinds of flavours. I think it is helpful to look at the requests of the users from this perspective.
In order to succeed we need to serve the data in a way that the customers are content with and that allows them to gain something by using the system-to-system service. It will always be measured against going to a website, selecting the data and clicking the Excel, csv or xml button. If the system-to-system solution doesn't beat that, it will not be used at all. What, then, are the key questions system developers ask when faced with a task such as the one described?
There are two key elements at work: the frequency of the data transfer and the development time/cost. If one is to write a program that harvests data every month and the lifetime of the program is five years, then we have 60 transmissions. If it takes one day (8 hrs) to write it, each transmission costs 8 minutes. So if it takes less than 8 minutes to get it by hand ...
There are other things to take into account. Is the timing critical? What are the consequences if the person responsible forgets? And there is the simple fact that these kinds of tasks are boring. But from a cost/benefit point of view the calculation above largely holds. Statistical data is seldom published more frequently than once a month, even though some offices have weekly publications. What if the developer doesn't have to download the data and store it in their own system? What if they could simply create a program that gets the data from the statistical office every time someone in their office needs it? The frequency is then much higher. Let's say that the data is used within the enterprise only twice a day: over the same 5 years it would be transmitted around 2000 times, and the development time is likely to be less than before, because you don't have to design and create storage for the data. But even with the same development time, 8 hrs, each transmission would cost around 15 seconds.
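The arithmetic behind the two scenarios can be written out as a small Python sketch. The function name and the constants are illustrative only; in particular, the figure of 200 working days per year is an assumption chosen to reproduce the "around 2000" transmissions mentioned above.

```python
def cost_per_transmission_seconds(dev_hours: float, transmissions: int) -> float:
    """Development time spread evenly over all transmissions, in seconds."""
    if transmissions <= 0:
        raise ValueError("transmissions must be positive")
    return dev_hours * 3600 / transmissions


DEV_HOURS = 8          # one working day of development, as in the text
LIFETIME_YEARS = 5     # lifetime of the program, as in the text

# Scenario 1: harvest once a month and store the data locally.
monthly_transmissions = 12 * LIFETIME_YEARS

# Scenario 2: fetch on demand, twice per working day.
# ASSUMPTION: roughly 200 working days per year, which gives about 2000 fetches.
WORKING_DAYS_PER_YEAR = 200
on_demand_transmissions = 2 * WORKING_DAYS_PER_YEAR * LIFETIME_YEARS

monthly_cost = cost_per_transmission_seconds(DEV_HOURS, monthly_transmissions)
on_demand_cost = cost_per_transmission_seconds(DEV_HOURS, on_demand_transmissions)

print(f"Monthly harvest: {monthly_transmissions} transmissions, "
      f"{monthly_cost / 60:.1f} minutes each")
print(f"On-demand fetch: {on_demand_transmissions} transmissions, "
      f"{on_demand_cost:.1f} seconds each")
```

With these inputs the monthly harvest comes out at 8 minutes per transmission and the on-demand fetch at a little under 15 seconds, which matches the rough figures in the text.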
This is important because we need to think about the motives for using the service that we want to create, and about its usability. From this simple example it seems that we need to design the service for small data enterprises in such a way that they can program their systems to get the data directly from the service. There isn't a very strong case for them to create a program that fetches a small amount of data and stores it in their own systems. The programming isn't likely to save them much. You need considerable frequency to justify the program, which is hardly the case with official statistical data.
What about the scientists? What is their biggest concern? They usually want data for the purpose of storing it within their own systems, so they will probably not consider fetching the data on demand. They usually fetch large amounts of data and they are interested in details. They are likely to be interested in all the metadata, particularly classifications. So they will quite likely accept a longer time for implementing and getting the data if they get access to good metadata. Since the frequency is of little concern, the main factor is development cost: how much time will it take to get the data from the statistical office and import it into my information structure, whether it is a database or something else?