Analysis of MovieLens rating network using a novel Bipartite Graph Layout
- 1 Members
- 2 Description of dataset used
- 3 Headlines
- 4 Conclusions and critique of NodeXL
Members
Awalin Sopan awalinnabila at gmail dot com
Ching Lik Teo cteo at cs dot umd dot edu
Date: November 04, 2009
Description of dataset used
We decided to work with a bipartite (bi-modal) network, that is, a network with two different types of entities, of moderate dimensionality, and therefore chose the MovieLens dataset. Our dataset has two types of nodes, Movies and Users. An edge exists between a user node and a movie node if the user has rated that movie. The edge represents the rating, which ranges from 1 to 5.
Attributes of Movie node:
- Movie Name
- IMDB URL
- Release Date
- Genre [ Drama, Horror, Comedy, Animation, Action, Children's, Adventure, Thriller, Documentary etc. ]
Attributes of User node:
The data set can be downloaded from this site: MovieLens 100k data
Several datasets are available in varying degrees of complexity. We chose the smallest dataset (100k ratings) for this exercise, knowing that NodeXL has limitations in dealing with larger networks. This data contains 100,000 ratings for 1682 movies by 943 users. We used MATLAB to clean up and preprocess the data.
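The preprocessing itself was done in MATLAB and is not reproduced in this report. As a rough illustration only, the Python sketch below turns raw rating lines into bipartite edges. It assumes tab-separated lines of the form `user id, movie id, rating, timestamp`; the function name `parse_ratings` and the `u`/`m` prefixes are hypothetical and are used here only to keep the two node types distinct, since user and movie ids overlap numerically.

```python
def parse_ratings(lines):
    """Turn tab-separated rating lines into (user, movie, rating) edges.

    Assumed line format: user_id <TAB> movie_id <TAB> rating <TAB> timestamp
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating = line.split("\t")[:3]
        # Prefix the ids so that user 1 and movie 1 stay separate nodes.
        edges.append((f"u{user_id}", f"m{movie_id}", int(rating)))
    return edges


sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
for edge in parse_ratings(sample):
    print(edge)
```

Each resulting tuple corresponds to one user-to-movie edge carrying its rating.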
First try at visualizing the data
Due to the large number of edges in the full data, NodeXL was unable to visualize the entire dataset. After consultation with Cody Dunne and Dr. Shneiderman, we realized that it was necessary to prune the number of edges in the data to around 10000, which is more manageable. Since we have 1600+ movies versus 900+ users, we decided to remove certain movies so that enough ratings remain for the analysis to be meaningful. The final reduced dataset therefore contains 100 movies and 943 users, with a total of 14000 edges. The initial attempt to visualize the data using NodeXL looks like this:
Although NodeXL was able to visualize the data, we have the usual mess or hairball visualization which is the bane of large complex networks. For the case of the MovieLens data, the bipartite network presents even greater challenges as it differs from standard network graphs in several ways:
- There are at least two different types of nodes, in this case movies and users. A user can have edges to any number of movies, with no particular pattern.
- There are no edges between users or between movies, only directed edges from a user to a movie, each representing a rating. This lack of topology within each node type sets this network apart and increases the visualization challenge, as standard clustering techniques and layouts will not work.
- The high dimensionality of the data means that, for any insight to be gained, the correlations and patterns in the users' ratings must be laid out clearly and concisely so that the viewer can easily infer the information. No standard layout technique in NodeXL allows us to achieve this.
From these observations and initial experiments with the data, we concluded that discovering useful information in this data requires a novel layout for bipartite networks, one that presents the data in a more structured manner so that the embedded information becomes visible. We conjectured that this is possible while keeping the rules of "Network Nirvana" and the concept of "semantic substrate" in mind. As we are using a bimodal network, we have focused on the bipartite layout, since traditional network features like betweenness centrality, clustering coefficient etc. are not so meaningful for such networks.
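The edge-pruning step described at the start of this section can be sketched as follows. The report does not say which movies were kept, so the criterion used here (keep the movies with the most ratings) is purely an assumption for illustration, and the function name `prune_to_top_movies` is hypothetical.

```python
from collections import Counter


def prune_to_top_movies(edges, n_movies):
    """Keep only the edges that touch the n_movies most-rated movies.

    edges: list of (user, movie, rating) tuples.
    The "most-rated" criterion is an assumption; any other selection
    rule could be substituted when choosing the `keep` set.
    """
    counts = Counter(movie for _user, movie, _rating in edges)
    keep = {movie for movie, _count in counts.most_common(n_movies)}
    return [edge for edge in edges if edge[1] in keep]


edges = [
    ("u1", "m1", 5), ("u2", "m1", 4), ("u3", "m1", 3),
    ("u1", "m2", 2), ("u2", "m2", 5),
    ("u3", "m3", 1),
]
pruned = prune_to_top_movies(edges, n_movies=2)
print(len(pruned), "edges remain")
```

All users are retained; only the movie side of the network shrinks, which matches the approach of reducing movies rather than users.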
The Bipartite Graph Layout
In this section we briefly introduce the novel Bipartite Graph Layout (BGL), which we believe is a step forward in bipartite network visualization. We show that this layout has several interesting properties that lend themselves to several well known visualization techniques, and that it includes several aspects of "Network Nirvana" as well. An example of this layout (which will be used throughout the report) is shown below, visualized using NodeXL.
Comparing this layout to the original hairball above makes it obvious that BGL is superior in several aspects:
- The two different nodes are completely separated. This reveals the intrinsic bipartite nature of this network immediately.
- The nodes are stacked along a grid in the y direction as well. In the example above, the "age" of the users is used together with the "ordinal age" of the movies. This arrangement of the nodes on a meaningful substrate allows us to add another dimension to the data. For example, are certain movies preferred by people of certain ages, or are they popular over a range of ages? No other standard layout algorithm allows for this.
- Stacking nodes along several x coordinates creates "poles", each of which shows a cluster, while preventing nodes from being occluded. Multiple poles can be created, making this layout scalable to N-partite graphs.
- The poles greatly reduce the number of edge crossings. In fact, no edge will cross a node in this layout.
- This layout closely resembles the parallel coordinates visualization, which is a powerful technique for visualizing high dimensional data. The BGL affords the user the same visual information that parallel coordinates provide: patterns in the edges and symmetries/asymmetries between poles are readily visible.
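The properties listed above can be illustrated with a minimal coordinate computation. This is a sketch under stated assumptions, not the procedure actually used with NodeXL: each pole gets a fixed x coordinate, y is the node's ordering attribute (for example age) rescaled to [0, 1] within its pole, and nodes that share the same value are nudged sideways by a small `spread` so they do not sit on top of each other. The function name, the `spread` parameter and the rescaling are all hypothetical choices.

```python
from collections import defaultdict


def bipartite_layout(nodes, pole_x, spread=0.02):
    """Return {node_id: (x, y)} for a pole-based layout.

    nodes  : iterable of (node_id, pole_name, order_value) tuples
    pole_x : {pole_name: x coordinate of that pole}
    spread : horizontal nudge between nodes sharing a value (assumption)
    """
    by_pole = defaultdict(list)
    for node_id, pole, value in nodes:
        by_pole[pole].append((value, node_id))

    positions = {}
    for pole, members in by_pole.items():
        values = [value for value, _ in members]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1  # avoid dividing by zero for a single value
        seen = defaultdict(int)  # nodes already placed at each value
        for value, node_id in sorted(members, key=lambda m: m[0]):
            y = (value - lo) / span
            x = pole_x[pole] + spread * seen[value]
            seen[value] += 1
            positions[node_id] = (x, y)
    return positions


nodes = [
    ("Star Wars", "movies", 1977), ("Stargate", "movies", 1994),
    ("u1", "users", 20), ("u2", "users", 30), ("u3", "users", 20),
]
layout = bipartite_layout(nodes, {"movies": 0.0, "users": 1.0})
for node_id, (x, y) in layout.items():
    print(f"{node_id}: x={x:.2f}, y={y:.2f}")
```

Adding a third entry to `pole_x` (for example separate male and female user poles on either side of the movies) gives a multi-pole variant without changing the function.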
The headlines presented in the rest of this report validate the usefulness of the proposed BGL visualization.
Headlines
In this series of headlines, we explore how age and gender differences affect the choice of different genres of movies. We demonstrate the usefulness of the proposed bipartite graph layout (BGL) with two variants:
- A simple BGL with 2 poles representing Movies and Users ordered by age. Movies are on the left side, users are on the right side.
- A 3-pole BGL with 2 clusters for the user nodes, Male (blue discs) and Female (pink triangles), with the Movie nodes in the middle.
We have chosen to represent the movie ratings by coloring the edges from red (lowest rating) to green (highest rating).
The popularity of each movie is determined by the number of votes it receives and is again indicated by a range of colors, from yellow squares (low popularity) to bright green squares for movies that have been reviewed by many viewers.
The in-degree of a movie node indicates how many people have watched/rated the movie.
We have deliberately chosen to use color instead of size to encode this information, as we want to reduce clutter and occlusion in the graph, which become more likely when nodes and edges increase in size.
Finally, the opacity of the edges is set to 30% to reveal underlying edge crossings. A black background was chosen to increase the contrast for easy viewing and analysis of the resulting graph.
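The color encodings described above can be sketched as simple linear interpolations. In the report they were set through NodeXL itself; the code below is only an illustration, and the helper names (`lerp_rgb`, `rating_color`, `popularity_color`) and the choice of straight-line interpolation in RGB space are assumptions.

```python
RED, GREEN, YELLOW = (255, 0, 0), (0, 255, 0), (255, 255, 0)
EDGE_OPACITY = 0.30  # 30% edge opacity, as stated in the text


def lerp_rgb(start, end, t):
    """Linearly interpolate between two RGB colors, with t clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def rating_color(rating, lo=1, hi=5):
    """Edge color: red for the lowest rating, green for the highest."""
    return lerp_rgb(RED, GREEN, (rating - lo) / (hi - lo))


def popularity_color(votes, min_votes, max_votes):
    """Movie node color: yellow for few votes, bright green for many."""
    span = (max_votes - min_votes) or 1  # guard against equal bounds
    return lerp_rgb(YELLOW, GREEN, (votes - min_votes) / span)


for rating in range(1, 6):
    print(rating, rating_color(rating), f"opacity={EDGE_OPACITY}")
```

Encoding both quantities as color keeps node and edge sizes constant, which is the clutter-reduction argument made above.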
Viewers are generous
In our general overview of the BGL, we have clustered the movie nodes and user nodes in opposite parts of the graph. The overview shows the edges color-coded according to rating.
After the overview, we filtered the edges according to rating. Three visualizations are presented here, showing the edges with the highest, medium and lowest ratings respectively.
It is clear that highly rated edges dominate, whereas low rated edges form a sparse graph. Fewer movies received the lowest ratings from people.
Women are from Venus, Men are from Mars
In this section, we explore how gender could affect ratings of different movies in general and focus on a specific movie genre to prove our point. We also show how BGL's unique representation aids insight discovery of the underlying high dimensional data.
The general overview of the data is shown below where the men and women are on opposite poles with movie nodes in the middle:
The real movie buffs are men!
Among the 100 movies rated, the ladies seemed to be more forgiving in general, handing out fewer red (rating <= 2) edges than the gentlemen, who do not hesitate to rate movies badly. Does this finding mean that men are better movie critics? Or are the men just too ignorant to appreciate certain movies? This distinct difference in the ratings of men and women suggests an effect of gender on how the quality of a film is assessed and warrants further research. Showing only the edges with the highest rating of 5 reveals a further interesting result:
From the graph above, we can see that at the other end of the spectrum, the male critics are just as eager to give the best rating to the movies they liked, given the density of the edges coming from the male cluster (pole). This is in contrast with the ratings from their female counterparts on the left. From these two graphs, we conclude that the men in this particular network are more into movies than the ladies, who tend to give rather neutral ratings. We are not sure if this result is generalizable, given the small number of users and movies in this dataset, but it is certainly a good hint!
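The visual comparison of edge density can be complemented by a simple count. Edge density in each pole also depends on how many users that pole contains, so the sketch below reports low and top ratings as shares of each gender's total ratings. The function name `rating_shares`, the `gender_of` lookup and the toy data are hypothetical; the thresholds (ratings <= 2 as low, 5 as top) follow the text.

```python
from collections import Counter


def rating_shares(edges, gender_of, low_max=2, top=5):
    """Per-gender share of low (<= low_max) and top (== top) ratings.

    edges     : iterable of (user, movie, rating) tuples
    gender_of : {user: gender label}
    """
    totals, low, high = Counter(), Counter(), Counter()
    for user, _movie, rating in edges:
        gender = gender_of[user]
        totals[gender] += 1
        if rating <= low_max:
            low[gender] += 1
        if rating == top:
            high[gender] += 1
    return {
        gender: {
            "n": totals[gender],
            "low": low[gender] / totals[gender],
            "top": high[gender] / totals[gender],
        }
        for gender in totals
    }


# Toy data for illustration only, not taken from the MovieLens dataset.
edges = [("u1", "m1", 5), ("u1", "m2", 1), ("u2", "m1", 3), ("u2", "m2", 4)]
print(rating_shares(edges, {"u1": "M", "u2": "F"}))
```

Comparing the `low` and `top` shares between genders would indicate whether the asymmetry seen in the graphs persists once the different group sizes are accounted for.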
Two "Stars" with different fates
We now turn our attention to one of our all time favorite movie types - Science Fiction - to see if we can gain any genre specific insights and gender specific rating traits or patterns. Filtering the movie data to show only Sci-Fi titles reveals the following graph:
Immediately we can make two general observations. Most of the green (high rating) arrows go from the younger audience to the older science fiction movies, notably Star Wars and Blade Runner, which are in fact the 2 most popular movies (their nodes are colored green). It is encouraging to see that old classics like these two movies are able to capture so many good ratings from such a diverse range of people (male, female, young and old), which is testimony to their long-standing status as classic Sci-Fi movies. This observation is confirmed when we filter the data to reveal only the highest rated edges (5):
Notice once again the asymmetry in ratings between the ladies and gentlemen. Without doubt, Sci-Fi movies are more attractive to men (and boys) of all ages. Star Wars is a classic that defies this trend, and its universal appeal is shared among people of all ages:
The interesting insight that follows concerns Stargate, the poor cousin of Star Wars, which was a long running TV series that never really took off. It received very few high ratings, as shown above, and it was given low ratings by some of the ladies and by many of the male movie critics in the data, as shown below:
It is interesting to see that most of the highest rated Sci-Fi movies were the classics, while none of the later ones could quite match their quality and timeless appeal. This is even more poignant given that young viewers still prefer them. This could perhaps be a wake-up call to the movie industry to look into making new Sci-Fi movies that will continue to capture our imagination of the future.
Kids don't actually care about age restriction of movies
Our viewers range in age from 7 to 80, and by analyzing movies of various genres we found that most users in our dataset are younger people, as edges are dense around users aged 20 to 30.
We see that popular movies attract viewers of all ages, irrespective of the age ratings assigned by the Motion Picture Association of America. What should concern us more is that even very young people, especially children, are watching movies that are not meant for them. For example, the movie The Silence of the Lambs gets high recommendations from several viewers aged between 7 and 16, even though the movie was rated R.
Conclusions and critique of NodeXL
We have shown in this report several interesting headlines from the MovieLens dataset, using graphs generated by NodeXL. Throughout this exercise, we have gained considerable experience in NodeXL, and have come to appreciate the many issues faced in graph network visualization, of which the importance of having an effective and meaningful layout was clearly demonstrated. We have also shown the usefulness of the proposed BGL in enabling insight generation, which would not have been possible if standard and classical layout algorithms were used. We believe that BGL is an important contribution to the growing arsenal of graph layout algorithms that address the difficult problem of visualizing bipartite graphs. Future work includes the application of BGL to larger and more complex N-partite graphs, and to cases where multiple clusters exist.
The importance of a network visualization tool like NodeXL cannot be overemphasized. Our experience of working with NodeXL throughout this exercise has been one of ups and downs. Recognizing that NodeXL is still a product under development, we have been tolerant of the occasional crash, hang and glitch, and of the overall slowness of NodeXL when handling the massive dataset that we attempted to use. The functions to compute graph metrics are pretty good, and the ability to save the graph in a variety of resolutions is a plus. In the end, NodeXL did its job, and the fact that we were able to test a new layout in it is testimony to its potential simplicity and ease of use (once the final version is released). In order to aid the developers in achieving this goal, we have compiled a list of bugs that we encountered, which we updated faithfully during the course of this exercise.
- Bug in the newest version (v. 22.214.171.124) of NodeXL, where the autofilling of the Primary label does not work anymore. Need to change Primary Label heading to "Label" before autofill can work. Tested with v 126.96.36.199 and everything worked well.
- Some options in NodeXL remain the same across different sessions. The parameters of autofill remain the same, and even the global options for different graphs remain the same. For example, on a very dense graph we had set the edge opacity to 1% to see it more clearly, but when that setting carried over to another graph it made us think that the edges were not drawn, and that something was wrong with our input data.
- The hiding of certain columns in the vertex layout is troublesome - as it resets itself after each session, unlike the behavior of other parameters. Such inconsistency in restoring previous program states must be fixed.
- NodeXL shows an inappropriate error message if older data from previous versions is not converted into the new format, and the conversion may cause a system crash. Furthermore, users have to convert the data themselves.
- NodeXL crashed numerous times during the project. This occurs most often when the system is computing some data or becomes unresponsive while doing something.
- It is hard to clear up previously autofilled columns. There is no way to reset it. We had to create a dummy column of zeros/ones to help us in this case - Autofill should have an autoempty too.
- Autofill is really slow with filtered cells, and this is a limitation of Excel, since we cannot copy and paste filtered cells. Should really ask Microsoft to fix these Excel limitations first.
- Laying out graph does not work properly when in filtering mode. Look at the example below:
It creates a cluster which is non-existent! It is very frustrating to press the big "LAYOUT GRAPH" button only to realize that it will be wrong as you forgot to turn off filtering; and there is no way to cancel or stop the process.
- NodeXL cannot do multiple filtering of several parameters. We had wanted to filter ACTION movies that have high in-degree and are rated above 5, but can only filter action movies and ratings, not genre AND high degree. There should be extensions similar to Spotfire to support custom filtering.
- Saving a graph as an image file with axis and legend shown results in certain parts of the graph going missing.
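The combined filter requested in the list above (genre AND high in-degree AND rating) is straightforward to express outside NodeXL. The following is a minimal sketch, not a NodeXL feature: the function name `filter_edges`, the `genres_of` lookup and the threshold parameters are hypothetical, and the in-degree is counted over the unfiltered edge list.

```python
from collections import Counter


def filter_edges(edges, genres_of, genre, min_in_degree, min_rating):
    """Keep edges whose movie matches the genre AND has enough ratings
    AND whose own rating is at least min_rating.

    edges     : list of (user, movie, rating) tuples
    genres_of : {movie: set of genre names}
    """
    in_degree = Counter(movie for _user, movie, _rating in edges)
    return [
        (user, movie, rating)
        for user, movie, rating in edges
        if genre in genres_of.get(movie, ())
        and in_degree[movie] >= min_in_degree
        and rating >= min_rating
    ]


# Toy data for illustration only.
edges = [
    ("u1", "m1", 5), ("u2", "m1", 4), ("u3", "m1", 2),
    ("u1", "m2", 5),
    ("u1", "m3", 5), ("u2", "m3", 5),
]
genres_of = {"m1": {"Action"}, "m2": {"Action"}, "m3": {"Drama"}}
print(filter_edges(edges, genres_of, "Action", min_in_degree=2, min_rating=4))
```

The filtered edge list could then be pasted back into a worksheet, sidestepping the single-criterion filtering limitation described above.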