I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.
To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.
So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on GitHub, as the Graphipedia project.
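To give an idea of what the link extraction step involves, here is a minimal sketch in Python. It is not the Graphipedia code: the function name, the regular expression and the rule of skipping every target that contains a colon (to drop File:, Category: and similar links) are assumptions made for illustration. That rule is a simplification, since it also discards genuine articles with a colon in their title.

```python
import re

# Matches [[Target]] and [[Target|label]]. Nested brackets are not handled.
LINK_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the set of page titles linked from a piece of wiki markup."""
    links = set()
    for match in LINK_PATTERN.finditer(wikitext):
        # Drop any "#section" suffix so the link points at the page itself.
        target = match.group(1).split("#", 1)[0].strip()
        # Assumption: treat anything with a colon as a namespaced link and skip it.
        if target and ":" not in target:
            links.add(target)
    return links


if __name__ == "__main__":
    sample = "A [[graph database]] is a kind of [[NoSQL|non-relational]] store."
    print(sorted(extract_links(sample)))
```

Each extracted title would then become a relationship from the current page's node to the node of the linked page.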
I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8 GB on disk, of which 650 MB is a Lucene index (on page titles).
With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.
Another fun thing is calculating the shortest path between two wiki pages – this requires only a few lines of code thanks to Neo4j’s graph algorithms. For example:
From Neo4j to Kevin Bacon:
Neo4j > Structured storage > NoSQL > InfiniteGraph > Kevin Bacon
From Kevin Bacon to Neo4j:
Kevin Bacon > Internet Movie Database > SQL > NoSQL > Neo4j
(I’m certainly not the first person to apply the six degrees of separation hypothesis to Wikipedia: there is even a Six degrees of Wikipedia page already.)
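For readers curious about what such a shortest-path search does under the hood, here is a rough sketch using a plain breadth-first search in Python. It is a stand-in for, not a reproduction of, Neo4j's graph algorithms: the function name and the dictionary-based graph are assumptions, and the toy graph contains only the links that appear in the two paths above.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over a dict mapping each title to the titles it links to.

    Links are directed, so the path from A to B may differ from B to A.
    Returns a list of titles, or None if the goal cannot be reached.
    """
    if start == goal:
        return [start]
    previous = {start: None}  # page -> page we reached it from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in previous:
                continue  # already reached by a path at least as short
            previous[target] = page
            if target == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(target)
    return None


# Toy graph built only from the links in the two example paths.
LINKS = {
    "Neo4j": ["Structured storage"],
    "Structured storage": ["NoSQL"],
    "NoSQL": ["InfiniteGraph", "Neo4j"],
    "InfiniteGraph": ["Kevin Bacon"],
    "Kevin Bacon": ["Internet Movie Database"],
    "Internet Movie Database": ["SQL"],
    "SQL": ["NoSQL"],
}

if __name__ == "__main__":
    print(" > ".join(shortest_path(LINKS, "Neo4j", "Kevin Bacon")))
    print(" > ".join(shortest_path(LINKS, "Kevin Bacon", "Neo4j")))
```

Because the links are directed, the two searches return different routes, just as in the examples above.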
Where does all this leave me in my evaluation of Neo4j? I’m not quite sure, as I haven’t done any benchmarks yet. But it certainly shows that graph databases allow you to do some interesting stuff that would be much harder to achieve with a relational database.