Creating a Neo4j graph of Wikipedia links

I started looking at Neo4j and thought: I need to write a simple but non-trivial application to really try it out. Something with lots of nodes and relationships. I need to find a large dataset that I can import into a graph database.

To my delight, I found that Wikipedia provides database dumps for download. That serves my purpose beautifully: I can represent each wiki page as a node, and the links between pages as relationships.

So I wrote some code to parse the Wikipedia XML dump and extract links from each article body, and some other code to import everything into a Neo4j store. It’s now on GitHub, as project Graphipedia.
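To give a feel for the link-extraction step, here is a minimal sketch in Python. It is not the Graphipedia code: the function name `extract_links` and the regular expression are my own illustration, and it assumes links appear in the article body as `[[Target]]`, `[[Target|label]]` or `[[Target#Section]]`. It does not filter out namespaced targets such as files or categories.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]].
# Only the target title is captured. Assumption: nested brackets and
# templates are not handled; this is an illustration, not a full parser.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the distinct link targets in order of first appearance."""
    seen = set()
    targets = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if title and title not in seen:
            seen.add(title)
            targets.append(title)
    return targets


sample = "[[Neo4j]] is a [[Graph database|graph db]]; see [[NoSQL#History]] and [[Neo4j]]."
print(extract_links(sample))  # ['Neo4j', 'Graph database', 'NoSQL']
```

Each returned title would become a relationship from the current page to the target page.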

I ended up with a graph database containing 9,006,704 nodes (pages, titles only) and 82,537,500 relationships (links). The whole database takes up 3.8G on disc, of which 650M is a Lucene index (on page titles).

With this wealth of well-connected data at my disposal, I can now do some interesting stuff. For one, I can simply open the database with the Neoclipse tool, find a page by title and visualise all links to/from that page. Here’s an example with the Neo4j page at its centre.

[Image: Link Graph]

Another fun thing is calculating the shortest path between two wiki pages – this requires only a few lines of code thanks to Neo4j’s graph algorithms. For example:

From Neo4j to Kevin Bacon:
Neo4j > Structured storage > NoSQL > InfiniteGraph > Kevin Bacon

From Kevin Bacon to Neo4j:
Kevin Bacon > Internet Movie Database > SQL > NoSQL > Neo4j
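Neo4j’s graph algorithms do the real work here. Purely as an illustration of what a shortest-path search over directed links involves, here is a minimal breadth-first search in Python over a plain in-memory dict. It does not use the Neo4j API; the function name `shortest_path` and the `LINKS` dict are hypothetical, and the toy graph contains only the links implied by the two paths above.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over directed links.

    Returns the list of page titles from start to goal, or None if
    goal cannot be reached.
    """
    if start == goal:
        return [start]
    previous = {start: None}  # page -> page it was first reached from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in previous:
                continue  # already reached by a path at least as short
            previous[target] = page
            if target == goal:
                # Walk back to the start, then reverse.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(target)
    return None


# Toy graph: only the links that appear in the two example paths.
LINKS = {
    "Neo4j": ["Structured storage"],
    "Structured storage": ["NoSQL"],
    "NoSQL": ["InfiniteGraph", "Neo4j"],
    "InfiniteGraph": ["Kevin Bacon"],
    "Kevin Bacon": ["Internet Movie Database"],
    "Internet Movie Database": ["SQL"],
    "SQL": ["NoSQL"],
}

print(" > ".join(shortest_path(LINKS, "Neo4j", "Kevin Bacon")))
print(" > ".join(shortest_path(LINKS, "Kevin Bacon", "Neo4j")))
```

Because links are directed, the path in one direction need not be the reverse of the path in the other, which is why the two examples differ.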

(I’m certainly not the first person to apply the six degrees of separation hypothesis to Wikipedia: there is even a Six degrees of Wikipedia page already.)

Where does all this leave me in my evaluation of Neo4j? I’m not quite sure, as I haven’t done any benchmarks yet. But it certainly shows that graph databases allow you to do some interesting stuff that would be much harder to achieve with a relational database.

14 thoughts on “Creating a Neo4j graph of Wikipedia links”

    • Nadeem says:

      @Mirko… Thanks for your time. Mine is a Core 2 Duo with 4 GB and a 64-bit OS. I was unable to finish due to the slow speed of my platform.

      Is it possible to send a link to the graph DB you created from the English dataset?

      I thank you for your time in advance.
      Nadeem

  1. Johnny Five says:

    This is amazing. Thanks for sharing. Your code really helped me learn neo4j better. Please post more cool stuff on your blog!

    • mirko says:

      How long depends heavily on your disk – in fact it’s an interesting I/O benchmark.

      Took me anywhere from 10-15 minutes with an SSD to “way too long” (killed the process after a few hours) with a 5,400 rpm disk.

  2. Rune says:

    Hi – Interesting work. Could you tell us how fast Neo4j calculates the two examples?

    From Neo4j to Kevin Bacon
    and
    From Kevin Bacon to Neo4j:

    Also – Are category pages (and links from articles to categories) included in the graph?

    Thx.
    Rune

  3. Matt Dee says:

    Hi there,

    I recently decided to begin a similar project, only later finding your code. I have been unable to get anywhere near your speeds due to the massive number of links.

    You said that your version “contains almost 10M pages, resulting in over 92M links to be extracted.”
    However, I am finding an average of about 85 links per page. I know that it has been some time, and maybe Wikipedia has changed, but are you sure that number is accurate? A 9:1 link-to-article ratio seems very low…

  4. Deniz says:

    Hello,

    First of all, great work. It really helped me. I wonder if it is possible to construct links in the form of LinksTo and LinksFrom. How can I do that?

    Thx, Deniz
