
Loading Geonames in Virtuoso

  1. Convert the Geonames RDF XML dump to N-Triples.
    The RDF dump is a text file containing all the Geonames entries: for each feature, one line holds the feature URI and the next line holds its RDF XML description. For example:

    
    http://sws.geonames.org/6/
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <gn:Feature rdf:about="http://sws.geonames.org/6/">
    <rdfs:isDefinedBy>http://sws.geonames.org/6/about.rdf</rdfs:isDefinedBy><gn:name>Āb-e Yasī</gn:name><gn:featureClass rdf:resource="http://www.geonames.org/ontology#H"/>
    <gn:featureCode rdf:resource="http://www.geonames.org/ontology#H.STM"/><gn:countryCode>IR</gn:countryCode><wgs84_pos:lat>32.8</wgs84_pos:lat><wgs84_pos:long>48.8</wgs84_pos:long><gn:parentFeature rdf:resource="http://sws.geonames.org/127082/"/><gn:parentCountry rdf:resource="http://sws.geonames.org/130758/"/><gn:parentADM1 rdf:resource="http://sws.geonames.org/127082/"/><gn:nearbyFeatures rdf:resource="http://sws.geonames.org/6/nearby.rdf"/><gn:locationMap rdf:resource="http://www.geonames.org/6/ab-e-yasi.html"/></gn:Feature></rdf:RDF>
    
    

    This format cannot be parsed directly by RDF parsers. I wrote a Python script that converts the RDF dump into a single file containing all the triples from the dump, serialized as N-Triples. This N-Triples file can then be loaded easily into Virtuoso and other triple stores.
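    A minimal sketch of such a script is shown below, assuming the dump alternates between a feature URI line and a one-line RDF/XML description (as in the example above) and that the rdflib library is installed; the file names are placeholders:

    import rdflib

    # Placeholder input/output file names; adjust to your paths.
    with open("all-geonames-rdf.txt", encoding="utf-8") as dump, \
         open("geonames.nt", "w", encoding="utf-8") as out:
        for line in dump:
            line = line.strip()
            # Skip the feature URI lines; each RDF/XML line fully describes one feature.
            if not line.startswith("<?xml"):
                continue
            graph = rdflib.Graph()
            graph.parse(data=line, format="xml")
            # Note: rdflib 6+ returns str from serialize(); older versions return bytes.
            out.write(graph.serialize(format="nt"))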

  2. Install and configure Virtuoso
    I used yum on my Fedora machine to install Virtuoso Open Source and the related packages; other package managers should do the same job. The virtuoso-opensource package provides the Virtuoso Open Source database server. The virtuoso-opensource-utils package comes with isql-v, a command-line SQL client that can also be used for SPARQL queries. The virtuoso-opensource-conductor package provides a nice web user interface.

    sudo yum install virtuoso-opensource
    sudo yum install virtuoso-opensource-utils
    sudo yum install virtuoso-opensource-conductor
    

    Now, follow the steps below to configure Virtuoso.

    • Create a config directory for your current user (the user that will run the server). In my case it was /[my home]/virtuoso.
    • Copy the default virtuoso.ini (normally located at /var/lib/virtuoso/db/virtuoso.ini) to this directory and adjust its file permissions so that you can edit it.
    • Modify the following parameters in virtuoso.ini (NumberOfBuffers and MaxDirtyBuffers belong to the [Parameters] section).
      [Parameters]
      ;Depending on your memory size, change the following two parameters.
      ;You will find instructions in the default virtuoso.ini file
      NumberOfBuffers=680000
      MaxDirtyBuffers=500000
      
      [Database]
      DatabaseFile = [path to user's virtuoso config directory]/virtuoso.db
      ErrorLogFile = [path to user's virtuoso config directory]/virtuoso.log
      LockFile = [path to user's virtuoso config directory]/virtuoso.lck
      TransactionFile = [path to user's virtuoso config directory]/virtuoso.trx
      xa_persistent_file = [path to user's virtuoso config directory]/virtuoso.pxa
      
      [TempDatabase]
      DatabaseFile = [path to user's virtuoso config directory]/virtuoso-temp.db
      TransactionFile = [path to user's virtuoso config directory]/virtuoso-temp.trx
      
    • Add the directory that contains the geonames.nt file to the allowed directories of Virtuoso.
      DirsAllowed = ., /usr/share/virtuoso/vad, [directory that contains the geonames.nt]
      
    • Configure the odbc.ini file as below (create it if it doesn’t exist).
      [Local Virtuoso]
      Driver=/usr/lib64/virtodbc_r.so
      Address=127.0.0.1
      Port=1111
      UID=dba
      
    • Start Virtuoso by executing the following command in your Linux shell.
      virtuoso-t -df +configfile /[path to user's config directory]/virtuoso.ini
      
    • Log in to Virtuoso using the isql client and change the default password (‘dba’ is the default password of the dba user). If changing the password doesn’t work from the isql client, try the Conductor web client instead: open http://localhost:8890/conductor, log in with user dba and password dba, and execute the set password command from the Interactive SQL option.
      $ /usr/libexec/virtuoso/isql 127.0.0.1:1111 dba
      SQL> set password 'dba' 'new-password'
      
  3. Load the converted Geonames N-Triples into Virtuoso
    • Copy the Bulk Loader Procedure and Sub-procedures creation SQL script from the link here, save it as rdfloader.sql in your Virtuoso config directory, and modify line 331 from
      DECLARE gr VARCHAR;
      

      to

      DECLARE gr INT;
      
    • From the isql console execute the following command.
      SQL> load [path to ]rdfloader.sql;
      
    • If anything goes wrong, drop the load_list and ldlock tables by executing the commands below, then load the script again with the previous command.
      SQL> drop table load_list;
      SQL> drop table ldlock;
      
    • Register the geonames.nt file that you want to load (the third parameter is the name of the graph into which the triples will be loaded).
      SQL> ld_dir ('path to the directory where geonames.nt is located', 'geonames.nt', 'http://www.geonames.org');
      
    • Execute the loader (it will take a long time; about 9 hours on my computer).
      SQL> rdf_loader_run ();
      
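    • (Optional) While the loader runs, you can check its progress from another isql session by querying the load_list table; after rdf_loader_run () finishes, run a checkpoint to make the load durable, and count the triples in the graph as a quick sanity check. This is a sketch based on Virtuoso's standard bulk loader; the graph URI matches the ld_dir call above.
      SQL> select ll_file, ll_state, ll_error from DB.DBA.load_list;
      SQL> checkpoint;
      SQL> sparql select count(*) from <http://www.geonames.org> where { ?s ?p ?o };
      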
  4. Test
    You can access the SPARQL endpoint web interface at localhost:8890/sparql. In the default dataset URI field, type http://www.geonames.org. We will run a query that retrieves all the regions of France. First-level administrative divisions are linked through the <http://www.geonames.org/ontology#parentADM1> relation. The query looks like:
    select distinct ?uri ?name where {
      ?uri <http://www.geonames.org/ontology#parentCountry> <http://sws.geonames.org/3017382/> .
      ?t <http://www.geonames.org/ontology#parentADM1> ?uri .
      ?uri <http://www.geonames.org/ontology#name> ?name
    }
    

    Type this query in the query text field and press Run Query. It will return a list of region URIs and names.
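
    The same query can also be sent from the command line using the standard SPARQL protocol. Below is a sketch with curl, assuming Virtuoso's default HTTP port 8890 and JSON results; adjust to your setup:

    curl -H 'Accept: application/sparql-results+json' \
         --data-urlencode 'default-graph-uri=http://www.geonames.org' \
         --data-urlencode 'query=select distinct ?uri ?name where {
           ?uri <http://www.geonames.org/ontology#parentCountry> <http://sws.geonames.org/3017382/> .
           ?t <http://www.geonames.org/ontology#parentADM1> ?uri .
           ?uri <http://www.geonames.org/ontology#name> ?name }' \
         http://localhost:8890/sparql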

This tutorial has been adapted from the following tutorials:

How to produce Linked Data from SPARQL endpoints.

Imagine you have a triplestore that allows SPARQL queries. Now, how can someone link to the resources in your triplestore using the identifiers (which must be HTTP URIs) of those resources? Let’s elaborate with an example. In our example, we have the three triples shown below in Turtle notation.

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .

<http://localhost:8080/mydataset/People/Rakebul_Hasan> rdf:type contact:Person .
<http://localhost:8080/mydataset/People/Rakebul_Hasan> contact:fullName "Rakebul Hasan" .
<http://localhost:8080/mydataset/People/Rakebul_Hasan> contact:mailbox <mailto:me@mail.com> .

Now, imagine these triples reside in a triplestore and can be queried from a SPARQL endpoint at http://localhost:8080/openrdf-sesame/repositories/pubbytest. The idea is that anyone who performs an HTTP GET request on http://localhost:8080/mydataset/People/Rakebul_Hasan will get the description of this resource. This notion of providing information about a resource is one of the core principles of Linked Data outlined by Tim Berners-Lee.

We will use a tool called Pubby to do this. Pubby makes it possible to produce Linked Data from SPARQL endpoints. We will use Tomcat as a web server to host Pubby. Now, let’s install Pubby as a webapp in our Tomcat by following the steps below:

  1. Unzip Pubby and copy its webapp directory into the webapps directory of your Tomcat. Rename the copied webapp directory to mydataset (or whatever suits your needs).
  2. Modify the WEB-INF/config.ttl as below (or according to your needs).
# Prefix declarations to be used in RDF output
@prefix conf: <http://richard.cyganiak.de/2007/pubby/config.rdf#> .
@prefix meta: <http://example.org/metadata#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix dbpedia: <http://localhost:8080/resource/> .
@prefix p: <http://localhost:8080/property/> .
@prefix yago: <http://localhost:8080/class/yago/> .
@prefix units: <http://dbpedia.org/units/> .
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix prv:      <http://purl.org/net/provenance/ns#> .
@prefix prvTypes: <http://purl.org/net/provenance/types#> .
@prefix doap:     <http://usefulinc.com/ns/doap#> .
@prefix void:     <http://rdfs.org/ns/void#> .
@prefix ir:       <http://www.ontologydesignpatterns.org/cp/owl/informationrealization.owl#> .

# Server configuration section
<> a conf:Configuration;

    # Project name for display in page titles
    conf:projectName "Pubby Test";

    # Homepage with description of the project for the link in the page header
    conf:projectHomepage <http://www-sop.inria.fr/members/Hasan.Rakebul/>;

    # The Pubby root, where the webapp is running inside the servlet container.
    conf:webBase <http://localhost:8080/mydataset/>;

    # Dataset configuration section
    conf:dataset [
        # SPARQL endpoint URL of the dataset
        conf:sparqlEndpoint <http://localhost:8080/openrdf-sesame/repositories/pubbytest>;

        # Common URI prefix of all resource URIs in the SPARQL dataset
        conf:datasetBase <http://localhost:8080/mydataset/>;
        # The part of the request URI that follows conf:webBase is appended to conf:datasetBase
    ];
    .

Now, if you access the resource URI http://localhost:8080/mydataset/People/Rakebul_Hasan using a browser, it will return an HTML page aimed at human users (a redirect happens behind the scenes, which is why the URL in the address bar changes).

We will use the cURL tool to perform the HTTP GET operation from the command line. We set the Accept header to ‘Accept: text/turtle’ in order to receive the response in Turtle format. The curl command is:

curl -L -H 'Accept: text/turtle' http://localhost:8080/mydataset/People/Rakebul_Hasan

The response should be:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<http://localhost:8080/mydataset/People/Rakebul_Hasan>
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
        <http://www.w3.org/2000/10/swap/pim/contact#Person> ;
    <http://www.w3.org/2000/10/swap/pim/contact#fullName>
        "Rakebul Hasan" ;
    <http://www.w3.org/2000/10/swap/pim/contact#mailbox>
        <mailto:me@mail.com> .

<http://localhost:8080/mydataset/data/People/Rakebul_Hasan>
    rdfs:label "RDF description of Rakebul_Hasan" ;
    foaf:primaryTopic <http://localhost:8080/mydataset/People/Rakebul_Hasan> .

Pubby added two triples describing the RDF document itself: one specifying its label using the rdfs:label property, and another specifying its primary topic using the foaf:primaryTopic property.

Behind the scenes, a 303 redirect happens. The -L flag in the curl command makes curl follow the redirect. If you remove -L from the command, you will see the 303 response with the link, as shown below.

curl -H 'Accept: text/turtle' http://localhost:8080/mydataset/People/Rakebul_Hasan
303 See Other: For a description of this item, see http://localhost:8080/mydataset/data/People/Rakebul_Hasan

The link returned with the 303 response is the location of a machine-readable description of the requested resource. If you perform another curl request on this new link, you will get the same response as with our first curl command.
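
For example, dereferencing the returned location directly:

curl -H 'Accept: text/turtle' http://localhost:8080/mydataset/data/People/Rakebul_Hasan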

To conclude, we have seen how to publish Linked Data when the original RDF data sit behind a SPARQL endpoint, and we have walked through an example of a dereferenceable HTTP URI with content negotiation.