Geodata, or more specifically places and locations, play a key role on the Web of Linked Data by serving as nexuses that interconnect different data and data sources. Geonames, for instance, is one of the most linked hubs. In fact, most Linked Data are either directly or indirectly linked to locations through various spatial and non-spatial relations. Thus, it makes sense to investigate more deeply the role that place and location play, how many degrees (links) it takes before all Linked Data is connected to some sort of geo-feature, how these geo-features are represented, how their density is distributed, how to clean them up, and so forth. We started to do research on these topics some time ago and are preparing a paper. In the meantime, we would like to show you some maps that illustrate the current state of locations on the Linked Data Web and the amount and types of errors we encountered. It turns out that more than 10% of these data have wrong or even impossible locations. We have found systematic errors and are beginning to figure out methods for cleaning them up. In case you are interested in Geospatial Semantics, Linked Spatiotemporal Data, and Geo-Ontologies, you may also check out this short overview as a starting point.
Fig. 1: A representative fraction of Linked Spatiotemporal Data (EPSG:4326, Plate Carrée).
Figure 1 shows the location information extracted from a representative fraction of the Linked Data Cloud using SPARQL endpoints and their geoindexing capabilities. What is remarkable is that there is no base map underlying this figure, i.e., all you see are millions and millions of single point locations. In other words, the coverage of location data is amazing! You can easily distinguish continents and countries, even in the far north and south. While this is great news, you can also see that there are a lot of errors, e.g., the huge cross in the middle of the map. There are many more (often systematic) errors that are easier to see in the high resolution version of this map that you can find here (6MB in size; you can also enlarge the maps below). For example, you can find parts of the USA in China, and we are beginning to understand what causes these errors. The grid you see in the high resolution version is caused by very coarse location information, e.g., by only using whole degrees, as well as by (mostly meaningless) centroids for larger regions. Features on the oceans are not necessarily errors; they may, e.g., report on oil spills and so forth. We will explain these and other problems in the aforementioned paper.
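The grid and cross patterns described above can be screened for mechanically. The following is a minimal Python sketch, not the method used for the maps. It assumes WGS84 latitude/longitude pairs, and it assumes that the cross consists of points whose latitude or longitude is exactly zero, which is one plausible reading of the figure. The function name `classify_point` and the flag labels are made up for illustration.

```python
def classify_point(lat, lon):
    """Return warning flags for one latitude/longitude pair (assumed WGS84)."""
    # Coordinates outside the valid range cannot be located at all.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return ["out_of_range"]

    flags = []
    # Points on the equator or prime meridian are suspicious in bulk
    # (assumption: missing values stored as 0 produce the cross pattern).
    if lat == 0.0 or lon == 0.0:
        flags.append("on_axis")
    # Whole-degree coordinates produce the regular grid pattern.
    if float(lat).is_integer() and float(lon).is_integer():
        flags.append("whole_degrees")
    return flags


if __name__ == "__main__":
    for lat, lon in [(0.0, 0.0), (34.0, -119.0), (95.0, 10.0), (34.41, -119.85)]:
        print((lat, lon), classify_point(lat, lon))
```

A flag is only a hint: a single feature at whole-degree coordinates may be perfectly valid, so such checks are most useful for spotting systematic patterns across a dataset.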
Fig. 2: A kernel density map (standard deviation 1.2) showing areas with a very high coverage per area unit.
Figure 2 shows a kernel density map that highlights regions with a very dense distribution of point-like features. A Gaussian kernel with a standard deviation of 1.2 is used to produce small regions. The results not only show how massive the errors are but also nicely highlight that most Linked Data is about the USA and especially Europe. However, we have to say that the extreme peak in central Europe is a data artifact; again, more on this later. If you did the same study using Twitter or Flickr, you would get very similar results (there is a lot of research that you can use for comparison). The reason is simply that most of these Volunteered Geographic Information (VGI) sources have a specific set of contributors, and the coverage of these data depends on multiple social and physical factors. Regions in India, China, South America, and Central Africa are always underrepresented in these data sets.
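To make the idea of a Gaussian kernel density surface concrete, here is a small, deliberately naive Python sketch. It is not the tool used to produce Figure 2. It assumes that the standard deviation of 1.2 is expressed in the same units as the coordinates (degrees in a Plate Carrée layout), and the function name `kernel_density` and the sample points are hypothetical.

```python
import math


def kernel_density(points, grid_x, grid_y, sigma=1.2):
    """Evaluate a 2D Gaussian kernel density surface on a regular grid.

    points : list of (x, y) tuples, e.g. (longitude, latitude)
    grid_x, grid_y : coordinates of the grid cell centers
    sigma : kernel standard deviation in coordinate units (assumption)
    """
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    surface = []
    for y in grid_y:
        row = []
        for x in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (x - px) ** 2 + (y - py) ** 2
                total += math.exp(-d2 / (2.0 * sigma ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface


if __name__ == "__main__":
    # Two nearby points and one isolated point (made-up sample data).
    pts = [(10.0, 50.0), (10.5, 50.2), (-100.0, 40.0)]
    dens = kernel_density(pts, grid_x=[-100.0, 0.0, 10.0], grid_y=[40.0, 50.0])
    for row in dens:
        print(["%.4f" % v for v in row])
```

The triple loop is quadratic in the number of cells and points; for millions of locations one would bin the points first and smooth the resulting raster instead.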
Fig. 3: Population density data as of 2000; blue colors indicate low densities and the color ramp moves towards green, yellow, and finally red for densely populated areas.
Figure 3, for comparison, shows the population density from 2000. This is not to say that the Web of Linked Data should only report from populated areas but to give a simple example of the blind spots in Linked Data.
Fig. 4: Areas in black have a relatively high population density and at the same time a low density in the Linked Data point cloud.
Figure 4 is a simple map algebra example that reclassifies those cells that are relatively high in population density and at the same time relatively low in point-like feature density from the investigated Linked Data sets. Thus, the black areas may indicate that the coverage is not very good; India, for instance, is a clear example.
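The reclassification behind such a map can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both rasters are aligned grids of the same size, and the two thresholds (`pop_threshold`, `ld_threshold`) are hypothetical cut-offs, since the values used for Figure 4 are not given here.

```python
def reclassify(population, linked_data, pop_threshold, ld_threshold):
    """Mark cells with high population density but low Linked Data density.

    population, linked_data : aligned 2D lists (rows of cell values)
    Returns a 2D list with 1 for 'underrepresented' cells and 0 otherwise.
    """
    result = []
    for pop_row, ld_row in zip(population, linked_data):
        result.append([
            1 if pop >= pop_threshold and ld < ld_threshold else 0
            for pop, ld in zip(pop_row, ld_row)
        ])
    return result


if __name__ == "__main__":
    # Toy 2x3 rasters with made-up values.
    population = [[500, 20, 900],
                  [5, 700, 300]]
    linked_data = [[0.9, 0.1, 0.05],
                   [0.0, 0.02, 0.8]]
    print(reclassify(population, linked_data, pop_threshold=400, ld_threshold=0.1))
```

On the toy rasters above, only the cells that are both populous and sparsely covered are set to 1, which corresponds to the black areas in the figure.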
Fig. 5: The famous Copernicus crater.
Figure 5 and the following pictures are a warning that, as a community, we need to move away from the "yet another domain" mentality of triplifying data sets that we don't fully understand (or care to understand). Otherwise we will turn Neil Armstrong's famous first step into a safari in South Sudan (by using the wrong CRS to locate the Sea of Tranquility). For instance, Figure 5 shows the Wikipedia page of the Copernicus crater on the Moon together with its centroid position.
Fig. 6: Wikipedia's GeoHack toolserver showing the crater's location.
Figure 6 correctly shows the position of the crater on the Moon.
Fig. 7: Fluidops displays Linked Data about the Copernicus crater taken from DBpedia.
Figure 7 shows the Fluidops interface, which renders the DBpedia RDF data about the Copernicus crater and places it on the surface of the Earth instead of realizing that the given coordinates are selenographic coordinates.
Fig. 8: DBpedia data about the Copernicus crater.
Figure 8 shows DBpedia data about the Copernicus crater. As you can see, geo:lat and geo:long are used, and the information that this place is not on Earth and that its coordinates use a different coordinate reference system got lost. The same is true for the landing site of Apollo 11 and all other locations on distant planets and their moons. While this should be easy to fix by including the CRS, semantic reference systems will be required to ensure that the same kind of error does not occur in more subtle ways in all of our datasets. This is not a criticism of the wonderful work on Linked Data but a reminder that data does not speak for itself; it requires methods that restrict the interpretation of domain vocabularies towards their intended meaning.
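What "including the CRS" buys a client can be illustrated with a short Python sketch. It is not how DBpedia or Fluidops model positions: the `Position` class, its `body` field, and `earth_map_point` are hypothetical names, and the crater coordinates are approximate. The point is only that a position that names its reference body can be rejected before it is drawn on a map of the Earth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    lat: float
    lon: float
    body: str  # reference body of the coordinate system (hypothetical field)


def earth_map_point(pos):
    """Return (lon, lat) for plotting on an Earth map, or refuse."""
    if pos.body != "Earth":
        raise ValueError(
            "coordinates refer to %s, not Earth; refusing to plot" % pos.body
        )
    return (pos.lon, pos.lat)


if __name__ == "__main__":
    # Approximate selenographic coordinates of the Copernicus crater.
    copernicus = Position(lat=9.62, lon=-20.08, body="Moon")
    try:
        earth_map_point(copernicus)
    except ValueError as err:
        print("rejected:", err)
```

A bare geo:lat/geo:long pair carries no such field, so a client has nothing to check against, which is exactly the situation in Figure 7.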
Let us dive into another example. Again, we are not interested in erroneous single data entries, as they will be present in any dataset, but in systematic errors as well as misconceptions about spatial and temporal data. The following example is again based on a lack of domain knowledge/understanding. For some reason, Geonames.org has decided to specify a point-like position to represent the Earth. Defining a centroid for the surface of a sphere is, mathematically speaking, venturesome, but if you really, really want to do this, you can calculate the center of land mass and call this the geographical center of the Earth. However, the computed position should not be confused with the Earth as such. Trying to use a geographic coordinate system to locate the Earth is impossible, as the coordinate system uses properties of the Earth (and social convention, e.g., for the prime meridian) to set up this reference system in the first place. Geonames decided to assign the coordinates 0,0 to the Earth; see Figure 9.
Fig. 9: Geonames trying to position the Earth via a point-like feature within a geographic coordinate system (which is impossible).
You may argue now that we are overly picky here and that nobody will be affected; after all, representing areas via point-like features is always dangerous. Yes, but there is a big difference here: those regions can be represented this way, the Earth cannot. Let us have a look at some simple consequences. What about a nearby query? Such queries are very popular for finding the nearest airport or movie theater, and whenever you use Google Maps you are invoking exactly such a query. A nearby query for the Earth would, for instance, return the Moon using a so-called celestial coordinate system. However, in our case it will unfortunately return geo-features from West Africa such as Ghana. In other words, Ghana and Nigeria are closer to Earth than Germany -- interesting.
Fig. 10: The Wikipedia page about the Gulf of Guinea.
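The effect is easy to reproduce with a great-circle distance. The Python sketch below uses the haversine formula on a spherical Earth with a mean radius of 6371 km (a simplifying assumption) and approximate coordinates for Accra and Berlin as stand-ins for Ghana and Germany. It does not reproduce how any particular SPARQL endpoint implements its nearby function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    earth_point = (0.0, 0.0)   # the position assigned to the Earth
    accra = (5.56, -0.20)      # approximate
    berlin = (52.52, 13.405)   # approximate
    print("Accra :", round(haversine_km(*earth_point, *accra)), "km")
    print("Berlin:", round(haversine_km(*earth_point, *berlin)), "km")
```

Any nearby query centered on 0,0 therefore ranks places around the Gulf of Guinea ahead of the rest of the world, regardless of what the point was meant to represent.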
IMHO, FactForge is a fantastic starting point to experiment with Linked Data. Thus, we will also use it for the next example. In the end, Linked Data will only be valuable if it can be used to find, integrate, and conflate data to answer some interesting questions. Hence, let us do a bit of Linked Science. The Gulf of Guinea (see Figure 10) is a fascinating region for many reasons, especially if you are interested in oil. Any meaningful query must be able to handle the huge heterogeneity resulting from the many neighboring states and the geo-political interests involved. This makes semantics, ontologies, and Linked Data the perfect infrastructure for data exploration. How many people are living along the coastline? This should be easy to answer. To keep the example simple, we will only use PROTON's populationCount relation here, but you can easily add the population relations from DBpedia to get more data. The following simplified listing queries for all entities within a buffer of 300 miles around the point-like feature representing the Gulf of Guinea.
SELECT distinct ?place ?populationCount
?place omgeo:nearby(?lat ?long "300mi");
This only gives us a small set of geo-features so that you can directly see what is happening. If you asked for a 500 mile buffer, you would not be able to keep track of all the different data anymore and would probably use the SPARQL SUM function. In our example query this will return about 7 billion people! Wow, this is even more than the population of the Earth. Figure 11 shows you why -- thanks to Geonames, the nearby query will contain the Earth as such and thus also its full population (and you would, of course, count some people twice). If you like to argue, you could now state that this could be easily fixed by asking for populated places only. Well, that is only possible if you (i) know about the typical pitfalls beforehand and (ii) can write a query that contains all rdf:types that you consider populated places across the different datasets -- keep in mind that in many regions of the world people live as nomads. Usually, you would do the following: for each returned entity you would add the population count and then run a contains query on Geonames to see whether the next entities are spatially contained in the previous entities. This is a very reasonable approach, and Geonames even directly provides such a query capability. However, as you can see by following the link, this is also true for the Earth and, hence, you would run into the same trouble again. Okay, enough of this; we hope these examples are useful demonstrations of some of the work that needs to be done. Next time we will dive a bit more into the idea of the n-degrees of spatial.
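The double-counting problem and the containment-based workaround can be illustrated with a toy Python sketch. The entity names and population figures below are made up purely for illustration, and `total_population` is a hypothetical helper, not a Geonames or FactForge API. The containment relation is assumed to be a simple child-to-parent mapping without cycles.

```python
def total_population(entities, parent_of, excluded=frozenset()):
    """Sum populations, skipping entities contained in another kept entity.

    entities  : dict mapping entity name -> population count
    parent_of : dict mapping entity name -> name of its containing entity
    excluded  : names to drop before summing (e.g. a planet-level feature)
    """
    kept = {name: pop for name, pop in entities.items() if name not in excluded}
    total = 0
    for name, pop in kept.items():
        # Walk up the containment chain; skip if any ancestor is also kept.
        ancestor = parent_of.get(name)
        contained = False
        while ancestor is not None:
            if ancestor in kept:
                contained = True
                break
            ancestor = parent_of.get(ancestor)
        if not contained:
            total += pop
    return total


if __name__ == "__main__":
    # Made-up toy result set of a nearby query.
    entities = {"Earth": 7000, "CountryA": 100, "CityA1": 30, "CountryB": 50}
    parent_of = {"CountryA": "Earth", "CityA1": "CountryA", "CountryB": "Earth"}
    print("naive sum        :", sum(entities.values()))
    print("containment only :", total_population(entities, parent_of))
    print("Earth excluded   :", total_population(entities, parent_of, {"Earth"}))
```

With the toy data, the containment check alone still returns the population of the Earth, because the Earth contains every other entity; only after excluding it does the sum reflect the regions actually inside the buffer.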
Fig. 11: Population of the coastal regions around the Gulf of Guinea.
We will update this blog posting from time to time, and you are most welcome to reuse the maps. Let me finally point out that by no means all depicted entities are places; popular classes also include people and events, which points to future directions for improving the underlying ontologies. Anyway, there is a lot of Linked Data out there, with millions and millions of places and locations. Keep in mind that this is not supposed to be nagging or blind criticism; our goal is to document the state of the art, provide best practices to reduce errors, and increase the quality of Linked Spatiotemporal Data.