Author Archives: rakeb

Install graphframes in Spark

All you need to do is specify the graphframes version with --packages. For example, I start a notebook with the following command:

PYSPARK_DRIVER_PYTHON=`which ipython` PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip= ' ~/spark/bin/pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 --master yarn --num-executors 4 --driver-memory 8g --executor-memory 2g

Then get started with the Quick Start tutorial!

Make sure to remove your $HOME/.ivy2/ directory, and also remove $HOME/.m2/repository/org/scala-lang/ and $HOME/.m2/repository/org/slf4j/, if you get errors like these:

[NOT FOUND ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1ms)
==== local-m2-cache: tried

[NOT FOUND ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (0ms)
==== local-m2-cache: tried

:: ^ see resolution messages for details ^ ::
:: org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar
:: org.slf4j#slf4j-api;1.7.7!slf4j-api.jar



Linux screen commands

These commands assume ctrl+a has been replaced by ctrl+t, to avoid conflicts with emacs shortcuts.
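The remapping itself is a single line in ~/.screenrc; screen's escape command sets the command character and the character that, typed after it, inserts a literal command character:

# ~/.screenrc: replace the default ctrl+a command key with ctrl+t
escape ^Tt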
Tabs and navigation

ctrl+t ?          list all the key bindings available in screen.
ctrl+t ctrl+u     move between tabs up.
ctrl+t ctrl+j     move between tabs down.
ctrl+t 0          jump to tab number 0 (tab number).
ctrl+d            close a tab.
ctrl+t ctrl+c     create a new tab.

Copy and paste

ctrl+t [          goes into copy mode.
ctrl+p            move up in the copy mode.
ctrl+n            move down in the copy mode.
ctrl+b            move backward in the copy mode.
ctrl+f            move forward in the copy mode.
ctrl+space        abort copy mode.
space             set marker to select.
shift+>           copy selected text into buffer.
ctrl+t ]          write from buffer/paste.

Detach and reattach sessions

screen -d         detach a screen session.
screen -r         resume a screen session.

Splitting tabs

ctrl+t then S     split horizontally.
ctrl+t then |     split vertically.
ctrl+t then Q     unsplit (uppercase Q).
ctrl+t then tab   switch from one region to the other.

Useful video tutorial

Adding an existing eclipse project to github

cd to the project directory, then execute the following commands:

git init
git remote add origin <github repository location>
git config --add branch.master.remote origin
git config --add branch.master.merge refs/heads/master

Loading Geonames in Virtuoso

  1. Convert the Geonames RDF/XML dump to N-Triples.
    The RDF dump contains all the Geonames entries as text, formatted as a feature URI on one line followed by the RDF/XML description of that feature on the next line. For example:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <gn:Feature rdf:about="">
    <rdfs:isDefinedBy></rdfs:isDefinedBy>
    <gn:name>Āb-e Yasī</gn:name>
    <gn:featureClass rdf:resource=""/>
    <gn:featureCode rdf:resource=""/>
    <gn:countryCode>IR</gn:countryCode>
    <wgs84_pos:lat>32.8</wgs84_pos:lat>
    <wgs84_pos:long>48.8</wgs84_pos:long>
    <gn:parentFeature rdf:resource=""/>
    <gn:parentCountry rdf:resource=""/>
    <gn:parentADM1 rdf:resource=""/>
    <gn:nearbyFeatures rdf:resource=""/>
    <gn:locationMap rdf:resource=""/>
    </gn:Feature>
    </rdf:RDF>

    It’s not really possible to parse this file directly with RDF parsers. I wrote a Python script that converts the RDF dump file into a single file containing all the triples represented in the dump, serialized as N-Triples. This N-Triples file can easily be loaded into Virtuoso and other triple stores.
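    The splitting step of such a script can be sketched as follows (a minimal sketch of the format described above; `iter_features` and the sample lines are hypothetical, and the actual RDF/XML-to-N-Triples conversion would be delegated to an RDF library such as rdflib):

```python
def iter_features(lines):
    """Pair up the dump's alternating lines: a feature URI line
    followed by the RDF/XML description of that feature."""
    nonempty = (line.strip() for line in lines if line.strip())
    for uri in nonempty:
        # next() fetches the RDF/XML line that follows the URI line
        yield uri, next(nonempty)

# Hypothetical two-line excerpt in the dump's format
sample = [
    "http://sws.geonames.org/14/",
    '<?xml version="1.0"?><rdf:RDF>...</rdf:RDF>',
]
pairs = list(iter_features(sample))
```

    Each RDF/XML chunk would then be parsed and re-serialized as N-Triples into one output file.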

  2. Install and configure Virtuoso
    I used yum on my Fedora to install Virtuoso opensource and related packages; other package managers should do the same job. The virtuoso-opensource package provides the Virtuoso opensource database server itself. The virtuoso-opensource-utils package comes with isql-v, a command-line SQL client that can also be used for SPARQL queries. The virtuoso-opensource-conductor package provides a nice web user interface.

    sudo yum install virtuoso-opensource
    sudo yum install virtuoso-opensource-utils
    sudo yum install virtuoso-opensource-conductor

    Now, follow the steps below to configure virtuoso.

    • Create a config directory for your current user (the user that will run the server). In my case it was /[my home]/virtuoso.
    • Copy the default virtuoso.ini (normally located at /var/lib/virtuoso/db/virtuoso.ini) into this directory (make sure you adjust the access permissions so you can modify it).
    • Modify the following parameters in virtuoso.ini
      ;Depending on your memory size, change the two buffer parameters
      ;(NumberOfBuffers and MaxDirtyBuffers).
      ;You will find instructions in the default virtuoso.ini file.
      DatabaseFile = [path to user's virtuoso config directory]/virtuoso.db
      ErrorLogFile = [path to user's virtuoso config directory]/virtuoso.log
      LockFile = [path to user's virtuoso config directory]/virtuoso.lck
      TransactionFile = [path to user's virtuoso config directory]/virtuoso.trx
      xa_persistent_file = [path to user's virtuoso config directory]/virtuoso.pxa
      ;In the temp-database section:
      DatabaseFile = [path to user's virtuoso config directory]/virtuoso-temp.db
      TransactionFile = [path to user's virtuoso config directory]/virtuoso-temp.trx
    • Add the directory that contains the geonames.nt file to the allowed directories of Virtuoso.
      DirsAllowed = ., /usr/share/virtuoso/vad, [directory that contains the geonames.nt]
    • Configure the odbc.ini file as below (create it if it doesn’t exist).
      [Local Virtuoso]
    • Start virtuoso by executing the following command in your linux shell.
      virtuoso-t -df +configfile /[path to user's config directory]/virtuoso.ini
    • Log in to Virtuoso using the isql client and change the default password (‘dba’ is the default password of the user dba). If changing the password doesn’t work from the isql client, try the Conductor web client (http://localhost:8890/conductor): log in with user dba, password dba, and execute the set password command from the Interactive SQL option.
      $ /usr/libexec/virtuoso/isql dba
      SQL> set password 'dba' 'new-password'
  3. Load the converted Geonames N-Triples into Virtuoso
    • Copy the Bulk Loader Procedure and Sub-procedures creation SQL script from the link here, save it as rdfloader.sql in the user's config directory, and modify line 331 from


      DECLARE gr INT;
    • From the isql console execute the following command.
      SQL> load [path to ]rdfloader.sql;
    • If anything goes wrong, drop the load_list and ldlock tables by executing the commands below, then run the load again with the previous command.
      SQL> drop table load_list;
      SQL> drop table ldlock;
    • Select the geonames.nt file that you want to load (the third parameter is the graph name where the triples will be loaded).
      SQL> ld_dir ('path to the directory where geonames.nt is located', 'geonames.nt', '');
    • Execute the loader (it will take a long time; around 9 hours on my computer).
      SQL> rdf_loader_run ();
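    While the loader runs, its progress can be checked from another isql session; the bulk loader records the state of each registered file in the DB.DBA.load_list table (ll_state reaches 2 when a file is fully loaded, and ll_error records any failure):
      SQL> select ll_file, ll_state, ll_error from DB.DBA.load_list;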
  4. Test
    You can access the SPARQL endpoint web interface at localhost:8890/sparql. In the default dataset URI field, type the graph URI you loaded the triples into. We will run a query for getting all the regions of France. Regions of France are represented by the <> relation. The query will look like:

    select distinct ?uri, ?name where {
    ?uri <> <>.
    ?t <> ?uri.
    ?uri <> ?name}

    Type this query in the query text field and press Run Query. It will return a list of region URIs and names.
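    The same query can also be sent outside the web interface with a plain HTTP GET, following the SPARQL protocol. A minimal Python sketch (the endpoint URL is the default one from the setup above; the query keeps the <> placeholders, which you would fill in for your own graph):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Default Virtuoso SPARQL endpoint from the setup above
endpoint = "http://localhost:8890/sparql"
query = "select distinct ?uri, ?name where { ?uri <> <> . ?t <> ?uri . ?uri <> ?name }"

# SPARQL protocol: pass the query as the 'query' GET parameter;
# the 'format' parameter asks Virtuoso for JSON results.
url = endpoint + "?" + urlencode({"query": query,
                                  "format": "application/sparql-results+json"})
req = Request(url)
# urllib.request.urlopen(req) would run the query against a live server.
```

    The 'format' parameter is a Virtuoso convenience; a standards-only client would instead send an Accept header on the request.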

This tutorial has been adapted from several other tutorials.

Installing mysql-python in Mac OS X 10.6.7 Snow Leopard

Installing mysql-python on my Mac OS X 10.6.7 Snow Leopard was a painful experience. After some hours of trial and error and googling, I got it working. Here are the steps:

1. Make sure you have gcc installed. I installed Xcode 3.2.6, Apple’s development environment, which includes gcc. I lost my DVD somewhere, so I had to download it from their website; it’s around 4.4 GB. During installation it might ask you to close iTunes, and might keep asking even when iTunes is closed. You will have to quit iTunesHelper to continue the installation: open Activity Monitor (located in Applications/Utilities) and quit it from the list of processes.

2. Download the mysql-python tar.gz file from here. Untar it, then build and install it. To do this (assuming you have uncompressed the tar.gz file and pointed your terminal at the uncompressed directory):

$ sudo su
# python setup.py build
# python setup.py install

3. Finally, add your MySQL lib directory to DYLD_LIBRARY_PATH. You can add the following line to your .bash_profile file, adjusting the path to your MySQL directory location.

export DYLD_LIBRARY_PATH='/usr/local/mysql-5.5.25-osx10.6-x86_64/lib/'

4. Now, in your Python interactive console, try to import MySQLdb. If there is no error, you are done! Otherwise, please google 😉 These steps worked for me.

How to produce Linked Data from SPARQL endpoints.

Imagine you have a triplestore which allows SPARQL queries. Now, how can someone link to the resources in your triplestore using the identifiers (must be HTTP URIs) of those resources? Let’s elaborate a bit more with an example. In our example, we have three triples shown below in Turtle notation.

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .

<http://localhost:8080/mydataset/People/Rakebul_Hasan> rdf:type contact:Person .
<http://localhost:8080/mydataset/People/Rakebul_Hasan> contact:fullName "Rakebul Hasan" .
<http://localhost:8080/mydataset/People/Rakebul_Hasan> contact:mailbox <> .

Now, imagine these triples reside in a triplestore and can be queried from a SPARQL endpoint http://localhost:8080/openrdf-sesame/repositories/pubbytest. The idea is that if someone performs an HTTP GET request to http://localhost:8080/mydataset/People/Rakebul_Hasan, they will get the description of this resource. This notion of providing information about a resource is one of the core principles of Linked Data outlined by Tim Berners-Lee.

We will use a tool called Pubby to do this. Pubby makes it possible to produce Linked Data from SPARQL endpoints. We will use Tomcat as the web server to host Pubby. Now, let’s install Pubby as a webapp in our Tomcat. To do this, please follow the steps below:

  1. Unzip Pubby and copy the webapp directory into the webapps directory of your Tomcat. Rename the copied webapp directory to mydataset (or whatever suits your needs).
  2. Modify the WEB-INF/config.ttl as below (or according to your needs).
# Prefix declarations to be used in RDF output
@prefix conf: <http://richard.cyganiak.de/2007/pubby/config.rdf#> .
@prefix meta: <> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix dbpedia: <http://localhost:8080/resource/> .
@prefix p: <http://localhost:8080/property/> .
@prefix yago: <http://localhost:8080/class/yago/> .
@prefix units: <> .
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix prv:      <> .
@prefix prvTypes: <> .
@prefix doap:     <http://usefulinc.com/ns/doap#> .
@prefix void:     <http://rdfs.org/ns/void#> .
@prefix ir:       <> .

# Server configuration section
<> a conf:Configuration;

    # Project name for display in page titles
    conf:projectName "Pubby Test";

    # Homepage with description of the project for the link in the page header
    conf:projectHomepage <>;

    # The Pubby root, where the webapp is running inside the servlet container.
    conf:webBase <http://localhost:8080/mydataset/>;

    # Dataset configuration section
    conf:dataset [
        # SPARQL endpoint URL of the dataset
        conf:sparqlEndpoint <http://localhost:8080/openrdf-sesame/repositories/pubbytest>;

        # Common URI prefix of all resource URIs in the SPARQL dataset
        conf:datasetBase <http://localhost:8080/mydataset/>;
        # The part of the request URL that extends beyond conf:webBase is
        # appended to conf:datasetBase to form the resource URI that is
        # looked up in the SPARQL endpoint.
    ] .
Now, if you access the resource URI http://localhost:8080/mydataset/People/Rakebul_Hasan in a browser, it will return an HTML page aimed at human users (a redirection happens behind the scenes, which is why the URL shown in the browser’s address bar differs from the one you requested).

We will use the cURL tool to perform the HTTP GET operation from the command line, setting the Accept header to ‘Accept: text/turtle’ in order to receive the response in Turtle format. The curl command will be:

curl -L -H 'Accept: text/turtle' http://localhost:8080/mydataset/People/Rakebul_Hasan

The response should be:

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<http://localhost:8080/mydataset/People/Rakebul_Hasan>
    a <http://www.w3.org/2000/10/swap/pim/contact#Person> ;
    <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Rakebul Hasan" ;
    <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <> .

<http://localhost:8080/mydataset/data/People/Rakebul_Hasan>
    rdfs:label "RDF description of Rakebul_Hasan" ;
    foaf:primaryTopic <http://localhost:8080/mydataset/People/Rakebul_Hasan> .

Pubby added two additional triples to the set of triples that describes our resource: one specifying its label using the rdfs:label property, and another specifying its topic using the foaf:primaryTopic property.

Behind the scenes, a 303 redirection happens. The -L in the curl command makes sure that the redirect is followed. If you remove -L from the curl command, then you will see the 303 response with the link, as shown below.

curl -H 'Accept: text/turtle' http://localhost:8080/mydataset/People/Rakebul_Hasan
303 See Other: For a description of this item, see http://localhost:8080/mydataset/data/People/Rakebul_Hasan

The link returned with the 303 response is the location of the machine-readable description of the requested resource. If you perform another curl request with this new link, you will get the same response as we got with our first curl command.
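The same content-negotiated GET can be reproduced from Python; urllib follows the 303 redirect automatically, just like curl -L (a sketch assuming the Pubby setup above; the commented urlopen call needs the server to be running):

```python
from urllib.request import Request, urlopen

uri = "http://localhost:8080/mydataset/People/Rakebul_Hasan"
# Ask for Turtle instead of HTML via the Accept header
req = Request(uri, headers={"Accept": "text/turtle"})
# urlopen(req) would follow the 303 to .../data/People/Rakebul_Hasan
# and return the Turtle description:
# body = urlopen(req).read().decode("utf-8")
```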

To conclude, we have seen how to publish Linked Data where the original RDF data are behind a SPARQL endpoint. We have seen an example of dereferenceable HTTP URI with content negotiation.