PubMed Document Network
- 1 Project Description
- 2 Team Members
- 3 Literature Review
- 4 Software/Links of Interest
- 5 To-Do
- 6 Meeting Notes
- 7 Software Design
Project Description
The National Center for Biotechnology Information, or NCBI, maintains over 30 public databases containing biomedical information of various types, such as published medical documents, gene listings, protein listings, and DNA sequence information. It also manages associations between the various databases according to the various types of content. For example, a particular publication might be associated with all genes mentioned in the document. Likewise, a particular gene might have associations with proteins for which the gene codes.
Note that the NCBI's multiple databases can be viewed as a massive information graph. In this graph, documents, genes, proteins, and other object types correspond to nodes of the graph, while associations correspond to links between the nodes. Furthermore, nodes are of different general types (document, gene, protein, ...) and links may likewise have different types (document citation link, "contained in" link, ...). Links may also have weights (e.g., content similarity edges, where weights correspond to the measure of similarity) and may be directed (e.g., document citation links) or undirected. Even though the NCBI databases form an implicit graph, the website offers no interface to navigate multiple nodes of this graph explicitly. Instead, users explore the NCBI databases by retrieving at most one record at a time, essentially limiting them to viewing a single node at a time.
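The graph view described above can be made concrete with a small data model. The following is only a minimal sketch: the class names (`Node`, `Link`, `InfoGraph`), the identifier strings, and the link-type labels are hypothetical and do not correspond to any NCBI schema. It shows typed nodes, typed links, optional weights, and the directed/undirected distinction.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Node:
    node_id: str    # hypothetical identifier, e.g. "doc1"
    node_type: str  # "document", "gene", "protein", ...


@dataclass(frozen=True)
class Link:
    source: Node
    target: Node
    link_type: str                  # "citation", "contained_in", "similarity", ...
    directed: bool = False          # citation links are directed, similarity links are not
    weight: Optional[float] = None  # e.g. a content-similarity score


@dataclass
class InfoGraph:
    nodes: Set[Node] = field(default_factory=set)
    links: List[Link] = field(default_factory=list)

    def add_link(self, link: Link) -> None:
        # Adding a link implicitly registers both endpoints.
        self.nodes.add(link.source)
        self.nodes.add(link.target)
        self.links.append(link)

    def neighbors(self, node: Node, link_type: Optional[str] = None) -> List[Node]:
        """Nodes reachable from `node` in one step, optionally restricted to one link type."""
        result = []
        for link in self.links:
            if link_type is not None and link.link_type != link_type:
                continue
            if link.source == node:
                result.append(link.target)
            elif link.target == node and not link.directed:
                # Undirected links can be followed in both directions.
                result.append(link.source)
        return result
```

Under this model, "viewing a single node at a time" corresponds to looking at one `Node` with no access to `neighbors`, which is the limitation the project aims to remove.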
We believe that viewing and exploring multiple nodes in parallel will improve user efficiency in finding relevant information. Therefore, our primary task is to create a visual, interactive interface to NCBI's online databases that makes query-relevant information easier to retrieve by allowing exploration of multiple nodes in parallel. Our current plan is to begin with a query-based subset of PubMed obtained through a traditional keyword search. A graphical tool such as SocialAction or NVSS could then be used to explore the returned results, including selective expansion of result documents' links to other nodes of various types within NCBI's databases. Furthermore, this graph could be built and expanded dynamically, using NCBI's Entrez XML API to retrieve additional nodes and links as necessary.
For example, a user might search for the keyword melanoma to retrieve several document nodes. She could then expand three result nodes according to their gene links to discover what genes those three documents reference, and then expand some of those genes according to the proteins they code for. We are currently enlisting the help of several domain experts who query the NCBI databases regularly in their work, and who will therefore be able to comment on our design's comparative ease of use. We will involve them in informal user studies comparing NCBI's existing exploratory interface (textual, web-based) with our own visual interface, to measure the effectiveness of our design.
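As a rough illustration of the melanoma scenario, the two Entrez requests involved (a keyword search, then a link expansion from documents to genes) could be built as follows. This is a sketch under assumptions: the helper names `esearch_url` and `elink_url` are hypothetical, the PubMed IDs in the usage note are placeholders, and the code only constructs request URLs for the E-utilities `esearch.fcgi` and `elink.fcgi` endpoints without sending them or parsing the XML response.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for a keyword search in one Entrez database (returns record IDs as XML)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def elink_url(dbfrom, db, ids):
    """URL that asks which records in `db` are linked to the given records in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(ids)}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"
```

For instance, `esearch_url("pubmed", "melanoma")` would describe the initial search, and `elink_url("pubmed", "gene", ["111", "222", "333"])` would describe the expansion of three result documents (placeholder IDs) to their linked genes. A further `elink_url("gene", "protein", ...)` call would cover the gene-to-protein step.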
|Jimmy Lin||jimmylin at umd||Main project contact|
|Adam Perer||adamp at cs||SocialAction developer|
|Aleks Aris||aris at cs||NVSS developer|
|Louiqa Raschid||louiqa at umiacs||Potential domain-expert|
|Adam Lee||adamlee at umiacs||Potential domain-expert|
|Adam Phillippy||amp at umiacs||Potential domain-expert|
|Michael Schatz||mschatz at umiacs||Potential domain-expert|
|G. Craig Murray||gcraigm at umd||Potential domain-expert|
|Michael Galperin||galperin at ncbi nlm nih gov||NCBI scientist|
|NCBI (PubMed, Genbank, Molec, ...)||Contains many types of data of interest to the biomedical community, such as medical document citations, genes, proteins, sequences, etc. 17m+ medical documents as part of PubMed. Features XML API to allow easy dynamic interactivity in apps.|
|IMDB||A database of 15m+ individual TV/film credits, available as downloadable text files. Includes listings for actors, producers, directors, awards, genres, ratings, release dates, etc. (data description)|
|ACM Digital Library||A collection of more than 40 ACM journals, magazines, peer reviewed articles, conference proceedings and ACM SIG newsletters. Contains nearly 2m pages of text. Search by Keyword or Phrase, by the ACM Computing Classification Scheme, by Publication, or by Affiliation. Table of contents, abstracts, and citations are free to the world.|
|arXiv||More than 460 thousand e-prints in the fields of Physics (main), Mathematics, Computer Science, Quantitative Biology and Statistics. Papers are submitted, not crawled as in CiteSeer. Simultaneous view of content, network, and usage.|
|CiteSeer||A collection of over 700,000 documents, primarily in the fields of computer and information science and engineering. The first digital library and search engine to provide automated citation indexing and citation linking using the method of Autonomous Citation Indexing. Provides Open Archives Initiative metadata. The CiteSeer.IST algorithms, software (source code), and data are available.|
|Sports dataset (?)|
|Physics dataset (?)|
|Edinburgh Word Association Thesaurus||A linguistic resource - Not a developed semantic network, but empirical association data. Use stimulus words to generate a 'growing' network from a small nucleus. Included in MRC Psycholinguistic Database, a machine usable dictionary.|
Team Members
|Huimin Guo||hmguo at cs|
|Mike Lieberman||codepoet at cs|
|Fatemeh Mir Rashed||fatemeh at cs|
|Sima Taheri||taheri at cs|
|Inbal Yahav||iyahav at rhsmith|
Literature Review
|Quickref||Source||Reviewer||Link to Review|
|SocialAction|| Adam Perer and Ben Shneiderman. Integrating Statistics and Visualization: Case Studies of Gaining Clarity during Exploratory Data Analysis. SIGCHI Conference on Human Factors in Computing Systems (CHI 2008).
|CiteSpace|| Chen, C. 2006. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. J. Am. Soc. Inf. Sci. Technol. 57, 3 (Feb. 2006), 359-377.
|Melanoma||Boyack, K. W., Mane, K., and Börner, K. 2004. Mapping Medline Papers, Genes, and Proteins Related to Melanoma Research. In Proceedings of the Eighth International Conference on Information Visualisation (IV'04) (July 14-16, 2004).
|NetLens|| Kang, H., Plaisant, C., Lee, B., and Bederson, B. B. 2007. NetLens: iterative exploration of content-actor network data. Information Visualization 6, 1 (Mar. 2007), 18-31.
|Community Structure|| Girvan, M. and Newman, M. E. J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
|Science Mapping|| Boyack, Kevin W., Klavans, R. and Börner, Katy. (2005). Mapping the Backbone of Science. Scientometrics. 64(3), 351-374.
|NVSS 06|| Aris, A. 2006. Network Visualization by Semantic Substrates. IEEE Transactions on Visualization and Computer Graphics 12, 5 (Sep. 2006), 733-740.
|NVSS 07|| Aris, A. and Shneiderman, B. 2007. Designing semantic substrates for visual network exploration. Information Visualization, 6, 4, (2007), 1-20.
|VxInsight||Boyack, Kevin W., Wylie, Brian N., and Davidson, George S. Domain Visualization Using VxInsight for Science and Technology Management. Journal of American Society for Information Science and Technology, 53, 2002, 764-774.
|Grouse||Archambault, D., Munzner, T., and Auber, D. Grouse: Feature-Based, Steerable Graph Hierarchy Exploration. Proceedings of 2007 Eurographics / IEEE VGTC Symposium on Visualization, pages 67--74.
|DualNet||Namata, G. M., Staats, B., Getoor, L., and Shneiderman, B. 2007. A dual-view approach to interactive network visualization. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 939-942.
|Pattern Recognition||W. J. Lee, L. Raschid, H. Sayyadi, and P. Srinivasan. 2007.
Exploiting Ontology Structure and Patterns of Annotation to Mine Significant Associations between Pairs of CV Terms. Under review. At the authors' request, the paper is not uploaded.
|Ranking||R. Varadarajan, V. Hristidis, L. Raschid, M. E. Vidal, H. Rodriguez, and L. Ibanez. 2007.
Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources. Under review. At the authors' request, the paper is not uploaded.
|Data Collection|| Hasan Davulcu, Zoe Lacroix, Kaushal Parekh, I. V. Ramakrishnan, Nikeeta Julasana
Exploiting Agent and Database Technologies for Biological Data Collection Designing. Proceedings of DEXA Workshops 2004, pp. 376-381.
|Data Collection|| Zoe Lacroix
Public data sources and applications used by scientists. ASU SDML TR-01-03
Evaluating implementations of a scientific protocol on NCBI resources. ASU SDML TR-01-05 and Appendices
|ConeTree|| Marti A. Hearst, Chandu Karadi
Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. SIGIR Forum 31, SI (Dec. 1997), 246-255.
Software/Links of Interest
- "Dummy" dataset (see README.txt in archive for usage)
NCBI E-Utils Jar File for Java
- Java API Documentation at NCBI website
NetBLAST: program that retrieves BLAST results from NCBI
Existing Vis/Analysis Systems
- NVSS - Network Visualization by Semantic Substrates by Aleks Aris. Groups nodes by attribute and selectively displays links.
- NetLens - Iterative Exploration of Content-Actor Network Data.
- Arrowsmith - A system for analysis of Medline and PubMed data.
- Map of Science - A visualization by Richard Klavans and Kevin Boyack (cf. "Mapping Medline Papers, Genes, and Proteins") that displays communities within various fields of science.
- VxInsight - A knowledge mining tool that displays objects as a 3D landscape.
- Prefuse - Tools/APIs for building visualizations.
- Piccolo - A toolkit for creating full-featured graphical applications in Java and C#.
- JUNG - Java Universal Network/Graph Framework.
To-Do
|Get copies of SocialAction + NVSS and test them||Mike||Mar. 4|
|Download sample NCBI data||Inbal||Mar. 4|
|Contact Louiqa Raschid, Adam Lee re: domain-expert||Inbal||Mar. 4|
|Contact G. Craig Murray re: domain-expert||Fatemeh||Mar. 4|
|Contact Adam Phillippy+Mike Schatz re: domain-expert||Mike||Mar. 4|
|Remind Jimmy Lin about domain-expert||Fatemeh||Mar. 4|
|Two paper reviews||Sima||Mar. 4|
|Annotate network datasets||Huimin||Mar. 4|
|New data collector||Inbal||Apr. 10|
Meeting Notes
|Apr. 1, 10am||Ben Shneiderman||Seek to reduce scope of project|
|Mar. 18, 12pm||(group sync)||Define tasks|
|Mar. 18, 10am NLM||Michael Galperin||Learn tasks that he does using NCBI site|
|Mar. 12, 10am AVW 3165||Adam Phillippy||Learn tasks that AP & MS do on NCBI site|
|Mar. 11, after class||(group sync)||Update status on experts, data, tasks|
|Mar. 7, 6pm VMH (Louiqa's office?)||Louiqa Raschid||Find out possible queries and data to use|
|Mar. 4, after class||(group sync)||Format data, arrange meetings with experts|
|Feb. 28, after class||(group sync)||Move toward using PubMed and related NCBI datasets|
|Feb. 27, 12pm AVW 2120||Jimmy Lin||Try to get in contact with biomedical domain expert|
|Feb. 26, 1pm HCIL||Adam Perer||Offered his assistance and described general challenges to consider|
|Feb. 26, 10am Office Hrs||Ben Shneiderman||Feedback on project ideas and potential directions for project|
|Feb. 13, 1pm HCIL||Jimmy Lin||Initial orientation to general ideas and dataset|
Software Design
Expert Tasks to Accomplish
- Louiqa: Start with a MeSH term query, and explore genes and documents related to the terms.
- Adam: Start with a sequence query (FASTA file) as input to NCBI BLAST database, and display results grouped by species. Clicking BLAST results will further expand them to find genes.
- Michael: Start with a query to OMIM, and retrieve links to other types of data referenced in OMIM: protein, publication, and so on.
Envision: illustration of the end goal system
- Initial network is a search result from a single DB of NCBI
- Network is extended on demand, and on the fly (using NCBI eUtils)
- Important: visualization should change smoothly, so that the user can follow the addition of new nodes/arcs
- Important: show only one level of arcs (from a specific node in one window, to all linked nodes in other windows)
- (S) What about the links among nodes in one window, like citation or co-authorship links for papers and co-mention links for genes/proteins?
- Detail on demand: open the NCBI website with the relevant page
- By node attributes (e.g., year, relevance)
- By arc attributes (e.g., confidence)
- By DB
- Number of windows (DBs) is dynamic
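If the "by node attributes", "by arc attributes", and "by DB" items above are read as filtering criteria (an assumption, since the list does not label them), filtering could look roughly like the sketch below. The dictionary keys (`id`, `db`, `year`, `source`, `target`, `confidence`) and both function names are hypothetical and chosen only for illustration.

```python
def filter_nodes(nodes, db=None, min_year=None):
    """Keep nodes from one DB and/or from `min_year` onward.

    Nodes without a "year" attribute (e.g. genes) are never dropped by the year filter.
    """
    kept = []
    for node in nodes:
        if db is not None and node["db"] != db:
            continue
        year = node.get("year")
        if min_year is not None and year is not None and year < min_year:
            continue
        kept.append(node)
    return kept


def filter_arcs(arcs, kept_ids, min_confidence=0.0):
    """Keep arcs whose endpoints both survived node filtering and whose
    confidence reaches the threshold (arcs without a confidence value are kept)."""
    return [
        arc for arc in arcs
        if arc["source"] in kept_ids
        and arc["target"] in kept_ids
        and arc.get("confidence", 1.0) >= min_confidence
    ]
```

Filtering nodes first and then dropping arcs with a missing endpoint keeps the display consistent: no arc is drawn toward a node that is currently hidden.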
(S) Additional features
- Size coding for the nodes based on their importance, e.g.,
- Papers: number of citations
- Genes/Proteins: number of times they have been mentioned in the corresponding papers
- There can also be location coding for genes/proteins using the time they have been explored (if available).
- Clustering-on-demand (like HCE and SocialAction): cluster the nodes in each window using criteria such as co-authorship or citation (for papers) and co-mention in one paper (for genes/proteins).
- Overview: all nodes are shown in one window (shape/color coding for papers, genes, proteins,...) to see their relative distributions with respect to time and each other.
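The size coding suggested above needs some mapping from an importance count (citations, mentions) to a drawn node size. One possible mapping is sketched here; the function name, the pixel bounds, and the choice of square-root scaling are all assumptions, not part of the design notes. Square-root scaling is used so that a few very highly cited papers do not dwarf everything else.

```python
import math


def node_radius(count, max_count, min_radius=4.0, max_radius=20.0):
    """Map an importance count to a radius between min_radius and max_radius."""
    if max_count <= 0:
        return min_radius
    # Clamp the count into [0, max_count] before scaling.
    clamped = max(0, min(count, max_count))
    fraction = math.sqrt(clamped / max_count)
    return min_radius + fraction * (max_radius - min_radius)
```

With the default bounds, a paper with a quarter of the maximum citation count gets a radius halfway between the smallest and largest node.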
(M) Additional features (thanks Alex for ideas)
- An "undo" button to undo the last action(s).
- As the user keeps clicking and clicking, the graph will get cluttered with lots of nodes. We should therefore have an option to remove one or multiple nodes from the graph, along with all attached links.
- It may be problematic to add new nodes to existing substrate regions. For example, say a document region has 20 nodes, and the user has just clicked a different gene that added more nodes to the same document region. There needs to be some explicit way of showing which nodes were just added, e.g., with color coding. We can't just place nodes left-to-right, because the region may already have some other ordering (for example, ordering documents by publication date).
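The three features above (undo, node removal with attached links, and marking newly added nodes) can share one piece of bookkeeping. The sketch below is one possible arrangement, not a committed design: the class `ExplorationGraph` and its method names are hypothetical, nodes are plain strings, and undo is implemented by storing full snapshots, which is simple but copies the whole graph on every action.

```python
class ExplorationGraph:
    def __init__(self):
        self.nodes = set()
        self.links = set()       # (source, target) pairs
        self.last_added = set()  # nodes new in the latest expansion, e.g. for color coding
        self._history = []       # snapshots taken before each action, used by undo()

    def _snapshot(self):
        self._history.append((set(self.nodes), set(self.links), set(self.last_added)))

    def expand(self, source, new_nodes):
        """Add the nodes linked from `source`; remember which ones were not shown before."""
        self._snapshot()
        fresh = set(new_nodes) - self.nodes
        self.nodes |= fresh | {source}
        self.links |= {(source, node) for node in new_nodes}
        self.last_added = fresh

    def remove(self, doomed):
        """Remove one or more nodes along with all attached links."""
        self._snapshot()
        doomed = set(doomed)
        self.nodes -= doomed
        self.links = {(s, t) for (s, t) in self.links
                      if s not in doomed and t not in doomed}
        self.last_added -= doomed

    def undo(self):
        """Restore the state before the last action; returns False if nothing to undo."""
        if not self._history:
            return False
        self.nodes, self.links, self.last_added = self._history.pop()
        return True
```

Because `last_added` only contains nodes that were not already displayed, a renderer can color exactly those nodes without disturbing the existing ordering of a region.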
Abstract idea: substrate tree
Slightly different approach: a user may be interested in recounting or seeing how they explored the graph. This would require keeping track of what nodes were added when and where.
To this end we could display, instead of a single substrate with multiple regions, a tree of substrates. This substrate tree corresponds to the exploration hierarchy that the user engaged in. Nodes of the tree are substrates. Expanding nodes within a substrate is equivalent to adding a child node to the substrate tree. The user can then zoom around the substrate tree and easily navigate back to where they were before, possibly performing different queries on different sets of nodes.
Producing an appropriate view of the substrate tree could be challenging as it would quickly get cluttered with many nodes. We should therefore have some way of collapsing or hiding nodes, and only give details for those few nodes the user is interested in.
An additional interesting feature here would be to show links between sibling nodes in the substrate tree.
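The substrate tree idea can be sketched as a small recursive structure. Everything named here (`Substrate`, `expand`, `path_from_root`, `visible`, and the labels in the usage note) is hypothetical and meant only to illustrate how the exploration history, navigation back to earlier states, and collapsing of subtrees might fit together.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)  # compare substrates by identity; parent/child references form cycles
class Substrate:
    label: str                 # what the user did to obtain this substrate
    node_ids: List[str]        # the graph nodes shown in this substrate
    parent: Optional["Substrate"] = None
    children: List["Substrate"] = field(default_factory=list)
    collapsed: bool = False    # when True, descendants are hidden from the view

    def expand(self, label: str, node_ids: List[str]) -> "Substrate":
        """Expanding nodes within this substrate adds a child to the tree."""
        child = Substrate(label, list(node_ids), parent=self)
        self.children.append(child)
        return child

    def path_from_root(self) -> List[str]:
        """Labels of the steps that led here, i.e. how the user explored the graph."""
        path, current = [], self
        while current is not None:
            path.append(current.label)
            current = current.parent
        return list(reversed(path))

    def visible(self) -> Iterator["Substrate"]:
        """Depth-first walk that skips the descendants of collapsed substrates."""
        yield self
        if not self.collapsed:
            for child in self.children:
                yield from child.visible()
```

For example, starting from `root = Substrate("melanoma search", [...])`, a call such as `genes = root.expand("genes of doc1", [...])` records one exploration step, `genes.path_from_root()` recounts how the user got there, and setting `genes.collapsed = True` hides everything explored beneath it. Links between sibling substrates would need an extra structure that this sketch does not cover.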