wikimedia google to neo

TLDR;

During the last weekend, I came accross a tweet announcing that Wikimedia released the dataset of the pages clickstreams for February 2015.

Not that I had nothing to do but I found it interesting to download this dataset and see how people arrive on the Neo4j’s Wikipedia page.

The data is quite simple, you have pages entities that relates to other pages. Pages can be or a Wikipedia page or representing a non Wikipedia page like Google. Relationships can be or a user click from on wikipedia page to another or a user using a search box like Google search or the Wikipedia search feature. The number of types the event occurs is also part of the dataset.

Importing the dataset

You can download the dataset here : http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82

Importing the dataset is really straightforward and the code I used can be found on this gist.

wikimedia

Analysing the data

We can start with simple queries like the pages from where people are clicking before landing on the Neo4j page :

MATCH (neo:Page {title:'Neo4j'})
MATCH (neo)<-[r:TARGET_BY_LINK]-(from)
RETURN from.title, r.occurences

NoSQL   282
FlockDB 19
Paxos_(computer_science)        22
Cypher_Query_Language   19
Graph_database  814
List_of_AGPL_web_applications   11
Spatial_database        25

Or maybe in the second degree :

MATCH (neo:Page {title:'Neo4j'})
MATCH p=(neo)<-[:TARGET_BY_LINK*2..2]-(from)
WHERE neo <> from
RETURN extract(n in nodes(p) | n.title) as pages,
reduce(x=0, r in rels(p) | x + r.occurences) as occ
ORDER BY occ DESC
LIMIT 10

Neo4j, Graph_database, NoSQL    1450
Neo4j, Graph_database, Big_data 1045
Neo4j, Graph_database, Graph_(abstract_data_type)       967
Neo4j, Graph_database, Triplestore      897
Neo4j, Graph_database, Network_model    882
Neo4j, Graph_database, Nested_set_model 865
Neo4j, Graph_database, Graph_theory     863
Neo4j, Graph_database, DEX_(Graph_database)     856
Neo4j, Graph_database, Graph_(mathematics)      856
Neo4j, Graph_database, Cypher_Query_Language    852

So the NoSQL movement and the Graph ecosystem are the most common referers, which is not suprising.

From Google to Neo4j

The last insight I wanted to gain is to know how people arrive at the Neo4j page from another one that they found via a google search.

The query seemed obvious to me and I expected really fast response times as I’m describing the whole pattern and the page titles are indexed :

MATCH (neo:Page {title:'Neo4j'}), (google:Page {title:'other-google'})
MATCH (neo)<-[:TARGET_BY_LINK]-(x)<-[:TARGET_BY_SEARCH]-(google)
RETURN x.title

Unfortunately, this query took 24 seconds to return on my machine with a default 4GB heap settings for Neo4j.

NB: Note that the other-google Page node has more than 2 millions TARGET_BY_SEARCH outgoing relationships.

My first reaction of course was to analyse the execution plan with PROFILE :

query plan wikimedia

So why so much time and database accesses ? I was really enjoying the 2.2 release however I was confused why the cost based planner couldn’t figure out to run this simple query efficiently.

I tried different ways of doing the same query. Variable length paths, multiple relationships types etc…​ without any performance improvement.

I spoke about this behavior with my friend Michael and he asked me to do change the query to the following. He said that calculating matches on a simple pattern like (x)←[:TARGET_BY_SEARCH]-(google) now takes the degree of both nodes at runtime into account and checks from the side with the smaller degree. The same applies to calculating the counts of simple path expressions like size( (page)-[:TARGET_BY_LINK]→()) where also the degree of the given node is used instead of iterating the relationships.

MATCH (neo:Page {title:'Neo4j'}), (google:Page {title:'other-google'})
MATCH (neo)<-[:TARGET_BY_LINK]-(x)
WHERE (x)<-[:TARGET_BY_SEARCH]-(google)
RETURN count(*);

count(*)
5
Returned 1 row in 43 ms.
plan where

Wow, stunning, amazing, query returning results in 14ms as I expected in my first attempts. It looks like Cypher needs more hints than in the previous 2.1.x versions.

You can still use the previous Cypher rule planner by prepending PLANNER RULE to your queries :

PLANNER RULE
MATCH (neo:Page {title:'Neo4j'}), (google:Page {title:'other-google'})
MATCH (neo)<-[:TARGET_BY_LINK]-(x)<-[:TARGET_BY_SEARCH]-(google)
RETURN x.title;

x.title
FlockDB
Cypher_Query_Language
Spatial_database
List_of_AGPL_web_applications
Graph_database
Returned 5 rows in 27596 ms.
plan rule

However I couldn’t accept it as a long-term solution but more as a temporary workaround. Mostly because such queries make summing the relationship properties not so user-friendly anymore.

Thanks again to Michael, he asked Neo4j Cypher team and their answer was the following :

The cost based planner only knows from the database statistics that a :Page node has between 0 and x millions relationships. So when planning the query it doesn’t have any runtime information and does not prefer either side of the query. At planning time there is no runtime information available, only counts for labels, relationships and property cardinalities and selectivities, so it actually doesn’t know if the assumption holds true later.

The solution is to tell Cypher which node to start from with the USING clause.

MATCH (page:Page {title:'Neo4j'}), (google:Page {title:'other-google'})
USING INDEX page:Page(title)
MATCH (page)<-[:TARGET_BY_LINK]-(x)<-[:TARGET_BY_SEARCH]-(google)
RETURN count(*);

count(*)
5
Returned 1 row in 11 ms.
plan using

And the query is done in 11ms in the browser !!!

Conclusion

As always, the PROFILER allows you to quickly identify performance problems of your queries.

Thanks to Michael and the Neo4j Cypher team for a fast support.

Arghh il n'y a pas de systèmes de commentaires ! Je sais je pourrais facilement ajouter un bloc disqus ou similaire, mais afin de ne pas devoir regarder partout à la fois, je vous propose de poser vos questions sur le Forum Neo4j FR ou sur StackOverflow en taggant votre question avec "neo4j".

Si vous avez aimé mon billet, n'hésitez pas à le partager sur la toile.