PubMed Document Network

Final Project Report

Project Description

The National Center for Biotechnology Information, or NCBI, maintains over 30 public databases containing biomedical information of various types, such as published medical documents, gene listings, protein listings, and DNA sequence information. It also manages associations between records in different databases, based on their content. For example, a particular publication might be associated with all genes mentioned in the document. Likewise, a particular gene might be associated with the proteins it codes for.

Note that the NCBI's multiple databases can be viewed as a massive information graph. In this graph, documents, genes, proteins, and other object types correspond to nodes of the graph, while associations correspond to links between the nodes. Furthermore, nodes are of different general types (document, gene, protein, ...) and links may likewise have different types (document citation link, "contained in" link, ...). Links may also have weights (e.g., content similarity edges, where weights correspond to the measure of similarity) and may be directed (e.g., document citation links) or undirected. Even though the NCBI databases form an implicit graph, the website offers no interface to navigate multiple nodes of this graph explicitly. Instead, users explore the NCBI databases by retrieving at most one record at a time, essentially limiting them to viewing a single node at a time.
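
The graph view described above can be made concrete with a small data structure. The following is a minimal sketch in Python, not part of the project code; the class names (Node, Link, InfoGraph) and the placeholder identifiers are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    db: str    # node type, e.g. "pubmed", "gene", "protein"
    uid: str   # record identifier within that database


@dataclass(frozen=True)
class Link:
    source: Node
    target: Node
    kind: str               # link type, e.g. "cites", "mentions", "similar"
    weight: float = 1.0     # e.g. a content-similarity score
    directed: bool = False  # citation links are directed, similarity is not


@dataclass
class InfoGraph:
    nodes: set = field(default_factory=set)
    links: list = field(default_factory=list)

    def add_link(self, link: Link) -> None:
        # Adding a link implicitly registers both endpoints as nodes.
        self.nodes.update((link.source, link.target))
        self.links.append(link)

    def neighbors(self, node: Node, kind: Optional[str] = None) -> List[Node]:
        """Nodes one step away from `node`, optionally restricted to a link type."""
        found = []
        for link in self.links:
            if kind is not None and link.kind != kind:
                continue
            if link.source == node:
                found.append(link.target)
            elif link.target == node and not link.directed:
                # Undirected links can be followed from either end.
                found.append(link.source)
        return found


# Placeholder identifiers, not real NCBI records.
paper_a = Node("pubmed", "doc-A")
paper_b = Node("pubmed", "doc-B")
gene_x = Node("gene", "gene-X")

graph = InfoGraph()
graph.add_link(Link(paper_a, paper_b, kind="cites", directed=True))
graph.add_link(Link(paper_a, gene_x, kind="mentions"))
graph.add_link(Link(paper_a, paper_b, kind="similar", weight=0.8))
```

With this sketch, graph.neighbors(paper_a, "cites") follows the directed citation link forward only, while graph.neighbors(gene_x) reaches paper_a through the undirected "mentions" link.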

We believe that viewing and exploring multiple nodes in parallel will improve user efficiency in finding relevant information. Therefore, our primary task is to create a visual, interactive interface to NCBI's online databases that makes query-relevant information easier to retrieve by allowing exploration of multiple nodes in parallel. Our current plan is to start with a query-based subset of PubMed, obtained through a traditional keyword search. A graphical tool such as SocialAction or NVSS could then be used to explore the returned results, including selective expansion of result documents' links to other nodes of various types within NCBI's databases. Furthermore, this graph could be built and expanded dynamically, using NCBI's Entrez XML API to retrieve additional nodes and links as necessary.
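
To illustrate how nodes and links might be retrieved dynamically, the sketch below builds request URLs for two Entrez E-utilities calls (ESearch for a keyword search, ELink for cross-database links) and parses an ESearch-style XML response. It makes no network requests. The helper names are hypothetical, the base URL is the one we believe is current and should be checked against the NCBI documentation, and the sample response body uses made-up placeholder IDs.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed base URL of the E-utilities service; verify before use.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for a keyword search that returns matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, uid):
    """URL asking for records in `db` that are linked to one record in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


def parse_esearch_ids(xml_text):
    """Extract the list of record IDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Made-up response body with placeholder IDs, for illustration only.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

print(esearch_url("pubmed", "melanoma"))
print(elink_url("pubmed", "gene", "111"))
print(parse_esearch_ids(SAMPLE_RESPONSE))
```

A real client would fetch these URLs (for example with urllib.request), respect NCBI's usage guidelines, and parse the ELink response in a similar way.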

For example, a user might search for the keyword melanoma to retrieve several document nodes. She could then expand three result nodes according to their gene links to discover what genes those three documents reference, and then expand some of those genes according to the proteins they code for. We are currently enlisting the help of several domain experts who query the NCBI databases regularly in their work, and who will therefore be able to comment on our design's comparative ease of use. We will involve them in informal user studies comparing NCBI's existing exploratory interface (textual, web-based) with our own visual interface, to measure the effectiveness of our design.
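
The melanoma walk-through above (documents, then genes, then proteins) can be sketched as a selective expansion step. This is only an illustration under stated assumptions: expand and fake_fetch are hypothetical names, and the link table is invented stand-in data that takes the place of real Entrez link lookups.

```python
def expand(nodes, edges, selected, target_db, fetch_links):
    """Expand `selected` nodes into `target_db`; return the nodes that are new."""
    added = []
    for node in selected:
        for neighbor in fetch_links(node, target_db):
            edges.add((node, neighbor))
            if neighbor not in nodes:
                nodes.add(neighbor)
                added.append(neighbor)
    return added


# Invented link table standing in for live link retrieval.
FAKE_LINKS = {
    (("pubmed", "doc1"), "gene"): [("gene", "g1"), ("gene", "g2")],
    (("pubmed", "doc2"), "gene"): [("gene", "g2")],
    (("gene", "g1"), "protein"): [("protein", "p1")],
}


def fake_fetch(node, target_db):
    return FAKE_LINKS.get((node, target_db), [])


# Start from the (pretend) results of a keyword search.
nodes = {("pubmed", "doc1"), ("pubmed", "doc2"), ("pubmed", "doc3")}
edges = set()

# Expand two of the documents by their gene links ...
new_genes = expand(nodes, edges,
                   [("pubmed", "doc1"), ("pubmed", "doc2")], "gene", fake_fetch)
# ... then expand one of the resulting genes by its protein links.
new_proteins = expand(nodes, edges, new_genes[:1], "protein", fake_fetch)
```

Passing the link fetcher in as a function keeps the expansion logic independent of how links are actually retrieved, so the same step could be driven by cached data or by live queries.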


Contacts

Contact E-mail Description
Jimmy Lin jimmylin at umd Main project contact
Adam Perer adamp at cs SocialAction developer
Aleks Aris aris at cs NVSS developer
Louiqa Raschid louiqa at umiacs Potential domain-expert
Adam Lee adamlee at umiacs Potential domain-expert
Adam Phillippy amp at umiacs Potential domain-expert
Michael Schatz mschatz at umiacs Potential domain-expert
G. Craig Murray gcraigm at umd Potential domain-expert
Michael Galperin galperin at ncbi nlm nih gov NCBI scientist

Questions for Experts

Network Datasets

Dataset Description
NCBI (PubMed, Genbank, Molec, ...) Contains many types of data of interest to the biomedical community, such as medical document citations, genes, proteins, sequences, etc. 17m+ medical documents as part of PubMed. Features XML API to allow easy dynamic interactivity in apps.
IMDB A database of 15m+ individual TV/film credits, available as downloadable text files. Includes listings for actors, producers, directors, awards, genres, ratings, release dates, etc. (data description)
ACM Digital Library A collection of more than 40 ACM journals, magazines, peer reviewed articles, conference proceedings and ACM SIG newsletters. Contains nearly 2m pages of text. Search by Keyword or Phrase, by the ACM Computing Classification Scheme, by Publication, or by Affiliation. Table of contents, abstracts, and citations are free to the world.
arXiv More than 460 thousand e-prints in the fields of Physics (main), Mathematics, Computer Science, Quantitative Biology and Statistics. Papers are submitted, not crawled as in CiteSeer. Simultaneous view of content, network, and usage.
CiteSeer A collection of over 700,000 documents, primarily in the fields of computer and information science and engineering. The first digital library and search engine to provide automated citation indexing and citation linking using the method of Autonomous Citation Indexing. Provides Open Archives Initiative metadata. The CiteSeer.IST algorithms, software (source code), and data are available.
Sports dataset (?)
Physics dataset (?)
Edinburgh Word Association Thesaurus A linguistic resource - Not a developed semantic network, but empirical association data. Use stimulus words to generate a 'growing' network from a small nucleus. Included in MRC Psycholinguistic Database, a machine usable dictionary.

Team Members

Huimin Guo hmguo at cs
Mike Lieberman codepoet at cs
Fatemeh Mir Rashed fatemeh at cs
Sima Taheri taheri at cs
Inbal Yahav iyahav at rhsmith

Literature Review

Quickref Source Reviewer Link to Review
SocialAction Adam Perer and Ben Shneiderman, Integrating Statistics and Visualization: Case Studies of Gaining Clarity during Exploratory Data Analysis SIGCHI Conference on Human Factors in Computing Systems (CHI 2008).
  • One case study in SocialAction is to visualize clusters of medical documents when performing the related article search in PubMed.
Huimin Link+
CiteSpace Chen, C. 2006. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. J. Am. Soc. Inf. Sci. Technol. 57, 3 (Feb. 2006), 359-377.
  • CiteSpace II is a system for detecting and visualizing trends and changes in scientific disciplines over time. Two complementary visualization views are designed and implemented: cluster views and time-zone views. The main connection to our project is the co-citation (cluster) network mapping.
unassigned Link
Melanoma Boyack, K. W., Mane, K., and Börner, K. 2004. Mapping Medline Papers, Genes, and Proteins Related to Melanoma Research. In Proceedings of the Eighth International Conference on Information Visualisation (IV'04) (July 14 - 16, 2004).
  • The goal of the paper is to show how genes, proteins, and papers are interconnected via co-occurrence patterns in the melanoma research field, which is similar to our task. The paper generates a Paper-Gene-Protein Map (papers from Medline, genes from the Entrez Gene database, and proteins from UniProt) to show their co-occurrence relationships. Given the three different node types (papers, genes, and proteins) and their diverse associations, a number of association networks can be mapped within a common context. The process of generating the map has four steps: data collection, calculation of pair-wise similarities between records, layout of the records based on the calculated similarities, and visualization and exploration of the data.
Sima Link+
NetLens Kang, H., Plaisant, C., Lee, B., and Bederson, B. B. 2007. NetLens: iterative exploration of content-actor network data. Information Visualization 6, 1 (Mar. 2007), 18-31.
  • Our work is a visual interface to digital libraries. The paper uses a subset of the ACM Digital Library to describe the NetLens interface. NetLens shows paired networks of content and actors (e.g. publications and authors) in coordinated views. It supports complex queries by allowing users to pose a series of elementary queries and iteratively refine them with visual overviews and sorted lists.
Fatemeh Link
Community Structure Girvan, M. and Newman, M. E. J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
  • The article introduces a method for detecting community structure in networks (e.g., large collaboration networks). Our initial goal is to visualize a large network of PubMed documents in order to enable some kind of community structure discovery or evaluation.
Sima Link
Science Mapping Boyack, Kevin W., Klavans, R. and Börner, Katy. (2005). Mapping the Backbone of Science. Scientometrics. 64(3), 351-374.
  • This paper maps journal articles. Science mapping first defines the relationships between objects (articles, terms, authors, ...) and computes a similarity matrix based on co-citation, co-occurrence, co-authorship, ... It then clusters or lays out the data using common methods such as hierarchical clustering, k-means algorithms, multidimensional scaling, principal component analysis, and self-organizing maps.
unassigned Link
NVSS 06 Aris, A. 2006. Network Visualization by Semantic Substrates. IEEE Transactions on Visualization and Computer Graphics 12, 5 (Sep. 2006), 733-740.
  • We are likely to develop on top of NVSS to explore a query-based subset of the NCBI databases, by expanding the result documents' links to other nodes of various types.
Fatemeh Link
NVSS 07 Aris, A. and Shneiderman, B. 2007. Designing semantic substrates for visual network exploration. Information Visualization, 6, 4, (2007), 1-20.
  • This paper demonstrates how users can design their own substrates to explore network data with NVSS2. It also shows the process in two case studies with domain experts.
Huimin Link+
VxInsight Boyack, Kevin W., Wylie, Brian N., and Davidson, George S. Domain Visualization Using VxInsight for Science and Technology Management. Journal of American Society for Information Science and Technology, 53, 2002, 764-774.
  • VxInsight is a knowledge visualization tool that transforms information such as documents, patents, or even genomic data into an intuitive visual format that is easy to interpret and that allows natural navigation and query. It presents information as a landscape, which allows very large datasets to be represented (info about the implicit structure of the data).
Sima Link+
Grouse Archambault, D., Munzner, T., and Auber, D. Grouse: Feature-Based, Steerable Graph Hierarchy Exploration. Proceedings of 2007 Eurographics / IEEE VGTC Symposium on Visualization, pages 67--74.
  • Grouse uses multiscale visualization to reduce the display complexity of a network: nodes are grouped into metanodes using hierarchy information, and the layout is based on topological features that are computed from the graph structure. Analysts are able to open and close metanodes on demand, and layouts are computed as needed as the user explores the network, which enables fast response for very large networks.
unassigned Link
DualNet Namata, G. M., Staats, B., Getoor, L., and Shneiderman, B. 2007. A dual-view approach to interactive network visualization. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 939-942.
  • The paper shows how using multiple coordinated views improves navigation and provides insight into large networks with multiple node and link types. The tool enhances the display of attributes at one time and allows comparisons of different subsets. Our work might need multiple representations and comparisons of different subsets. The features of the tool (Network, Filters, Properties, Search) also provide some ideas for our software design.
unassigned Link
Pattern Recognition W.J. Lee, L. Raschid, H. Sayyadi and P. Srinivasan 2007

Exploiting Ontology Structure and Patterns of Annotation to Mine Significant Associations between Pairs of CV Terms. Under Review. At the authors' request, the paper is not uploaded.

Inbal --
Ranking R. Varadarajan, V. Hristidis, L. Raschid, M.E. Vidal, H. Rodriguez and L. Ibanez 2007

Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources. Under Review. At the authors' request, the paper is not uploaded.

Inbal --
Data Collection Hasan Davulcu, Zoe Lacroix, Kaushal Parekh, I. V. Ramakrishnan, Nikeeta Julasana

Exploiting Agent and Database Technologies for Biological Data Collection Designing. Proceedings of DEXA Workshops 2004, pp. 376-381.

  • The proposed approach combines two software tools (WinAgent, for building agents, and dbXML, for XML data management) for biological data integration. The interface can retrieve information from different Web sites and store the extracted data in a standardized format for efficient later use. The authors give an example of navigating from OMIM to PubMed.
Huimin --
Data Collection Zoe Lacroix

Public data sources and applications used by scientists. ASU SDML TR-01-03

Evaluating implementations of a scientific protocol on NCBI resources. ASU SDML TR-01-05 and Appendices

  • These two technical reports experiment with a query similar to ours: "retrieve bibliographical references related to a genetic disorder". They use four NCBI data sources (OMIM, Nucleotide, Protein, PubMed) and 21 links. Their experiment shows that adequate resource selection is critical to the quality of the data collection process. Data collection process:
Huimin --
ConeTree Marti A. Hearst, Chandu Karadi

Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. SIGIR Forum 31, SI (Dec. 1997), 246-255.

unassigned --

Software/Links of Interest

NVSS Datasets

Visualization of the OMIM, PubMed, and Gene databases and the links among them for the "cervical cancer" query.

NCBI E-Utils

NCBI E-Utils Jar File for Java

NetBLAST: program that retrieves BLAST results from NCBI

Existing Vis/Analysis Systems

  • NVSS - Network Visualization by Semantic Substrates by Aleks Aris. Groups nodes by attribute and selectively displays links.
  • NetLens - Iterative Exploration of Content-Actor Network Data.
  • Arrowsmith - A system for analysis of Medline and Pubmed data.
  • Map of Science - A visualization by Richard Klavans and Kevin Boyack (cf. "Mapping Medline Papers, Genes, and Proteins") that displays communities within various fields of science.
  • VxInsight - A knowledge mining tool that displays objects as a 3D landscape.

Programming APIs

  • Prefuse - Tools/APIs for building visualizations.
  • Piccolo - A toolkit for creating full-featured graphical applications in Java and C#.
  • JUNG - Java Universal Network/Graph Framework.

To-Do

Task Assigned To Deadline
Get copies of SocialAction + NVSS and test them Mike Mar. 4
Download sample NCBI data Inbal Mar. 4
Contact Louiqa Raschid, Adam Lee re: domain-expert Inbal Mar. 4
Contact G. Craig Murray re: domain-expert Fatemeh Mar. 4
Contact Adam Phillippy+Mike Schatz re: domain-expert Mike Mar. 4
Remind Jimmy Lin about domain-expert Fatemeh Mar. 4
Two paper reviews Sima Mar. 4
Annotate network datasets Huimin Mar. 4
New data collector Inbal Apr. 10

Meeting Notes

Date/Time Met With Synopsis
Apr. 1, 10am Ben Shneiderman Seek to reduce scope of project
Mar. 18, 12pm (group sync) Define tasks
Mar. 18, 10am NLM Michael Galperin Learn tasks that he does using NCBI site
Mar. 12, 10am AVW 3165 Adam Phillippy Learn tasks that AP & MS do on NCBI site
Mar. 11, after class (group sync) Update status on experts, data, tasks
Mar. 7, 6pm VMH (Louiqa's office?) Louiqa Raschid Find out possible queries and data to use
Mar. 4, after class (group sync) Format data, arrange meetings with experts
Feb. 28, after class (group sync) Move toward using PubMed and related NCBI datasets
Feb. 27, 12pm AVW 2120 Jimmy Lin Try to get in contact with biomedical domain expert
Feb. 26, 1pm HCIL Adam Perer Offered his assistance and described general challenges to consider
Feb. 26, 10am Office Hrs Ben Shneiderman Feedback on project ideas and potential directions for project
Feb. 13, 1pm HCIL Jimmy Lin Initial orientation to general ideas and dataset

Software Design

Expert Tasks to Accomplish

  1. Louiqa: Start with a MeSH term query, and explore genes and documents related to the terms.
  2. Adam: Start with a sequence query (FASTA file) as input to NCBI BLAST database, and display results grouped by species. Clicking BLAST results will further expand them to find genes.
  3. Michael: Start with a query to OMIM, and retrieve links to other types of data referenced in OMIM: protein, publication, and so on.

Envision: illustration of the end goal system

Schematic illustration

Source file (Power Point)

Important features

  • Initial network is a search result from a single DB of NCBI
  • Network is extended on demand, and on the fly (using NCBI eUtils)
    • Important: the visualization should change smoothly, so that the user can follow the addition of new nodes/arcs
    • Important: show only one level of arcs (from a specific node in one window, to all linked nodes in other windows)
      • (S) What about the links among nodes in one window, like citation or co-authorship links for papers and co-mention links for genes/proteins?
  • Detail on demand: open the NCBI website with the relevant page
  • Filter:
    • By node attributes (e.g., year, relevance)
    • By arc attributes (e.g., confidence)
    • By DB
  • Number of windows (DBs) is dynamic
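
The filtering features listed above (by node attributes, by arc attributes, and by DB) could look roughly like the following sketch. The Record and Arc classes, their fields, and the sample values are assumptions made for illustration; real records would carry whatever attributes the NCBI data provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    db: str           # source database, e.g. "pubmed" or "gene"
    uid: str
    year: int         # hypothetical node attribute
    relevance: float  # hypothetical node attribute


@dataclass(frozen=True)
class Arc:
    source: str       # uid of the source record
    target: str       # uid of the target record
    confidence: float # hypothetical arc attribute


def filter_records(records, dbs=None, min_year=None, min_relevance=None):
    """Keep records that pass every filter that is switched on (not None)."""
    kept = []
    for rec in records:
        if dbs is not None and rec.db not in dbs:
            continue
        if min_year is not None and rec.year < min_year:
            continue
        if min_relevance is not None and rec.relevance < min_relevance:
            continue
        kept.append(rec)
    return kept


def filter_arcs(arcs, visible_uids, min_confidence=0.0):
    """Keep arcs above the threshold whose two endpoints are both visible."""
    return [arc for arc in arcs
            if arc.confidence >= min_confidence
            and arc.source in visible_uids
            and arc.target in visible_uids]


# Invented sample data.
records = [
    Record("pubmed", "d1", 2005, 0.9),
    Record("pubmed", "d2", 1998, 0.7),
    Record("gene", "g1", 2006, 0.4),
]
arcs = [Arc("d1", "g1", 0.8), Arc("d2", "g1", 0.9), Arc("d1", "d2", 0.2)]

visible = filter_records(records, min_year=2000)
visible_uids = {rec.uid for rec in visible}
shown_arcs = filter_arcs(arcs, visible_uids, min_confidence=0.5)
```

Filtering nodes first and then dropping arcs with a hidden endpoint avoids drawing arcs that lead to nothing on screen.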

(S) Additional features

  • Size coding for the nodes based on their importance, e.g.,
    • Papers: number of citations
    • Genes/Proteins: number of times they have been mentioned in the corresponding papers
  • There can also be location coding for genes/proteins using the time they have been explored (if available).
  • Clustering-on-demand (like HCE and SocialAction): cluster the nodes in each window using some criteria like co-authorship, citation (for papers), co-mentioned in one paper (for genes/proteins).
  • Overview: all nodes are shown in one window (shape/color coding for papers, genes, proteins,...) to see their relative distributions with respect to time and each other.
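
As one possible reading of the size-coding idea above, the sketch below maps an importance count (citations for papers, mentions for genes/proteins) to a node radius. The square-root scaling and the radius limits are assumptions, chosen so that a few very highly cited nodes do not dwarf everything else; node_radius is a hypothetical helper name.

```python
import math


def node_radius(count, max_count, min_radius=4.0, max_radius=20.0):
    """Map a count in [0, max_count] to a radius using square-root scaling."""
    if max_count <= 0:
        # No usable scale: draw every node at the minimum size.
        return min_radius
    clamped = max(0, min(count, max_count))
    fraction = math.sqrt(clamped / max_count)
    return min_radius + fraction * (max_radius - min_radius)


# A paper with 25 citations, when the most-cited paper on screen has 100.
print(node_radius(25, 100))
```

Other mappings (linear, logarithmic) are equally plausible; which one reads best would need to be checked with the domain experts.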

(M) Additional features (thanks Alex for ideas)

Simplish features

  • An "undo" button to undo the last action(s).
  • As the user keeps clicking and clicking, the graph will get cluttered with lots of nodes. We should therefore have an option to remove one or multiple nodes from the graph, along with all attached links.
  • It may be problematic to add new nodes to existing substrate regions. For example, say a document region has 20 nodes, and the user just clicked a different gene that added more nodes to the same document region. There needs to be some explicit way of showing which nodes were just added, e.g., with color coding. We can't just place nodes left-to-right, because the region may have some other ordering already (for example, ordering documents by publication date).
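
The three features above (undo, removing nodes together with their attached links, and tracking which nodes were just added so they can be highlighted) fit together in a small state-keeping class. This is a sketch under assumptions, not the planned implementation: ExplorationGraph is a hypothetical name, and undo is done with full snapshots, which is simple but would need a more economical scheme for large graphs.

```python
class ExplorationGraph:
    def __init__(self):
        self.nodes = set()
        self.edges = set()       # (source, target) pairs
        self.just_added = set()  # nodes added by the latest action
        self._history = []       # snapshots of earlier states, for undo

    def _snapshot(self):
        self._history.append(
            (set(self.nodes), set(self.edges), set(self.just_added)))

    def add(self, new_nodes, new_edges=()):
        """Add nodes and edges; remember which nodes are new for highlighting."""
        self._snapshot()
        fresh = set(new_nodes) - self.nodes
        self.nodes |= fresh
        self.edges |= set(new_edges)
        self.just_added = fresh

    def remove(self, doomed):
        """Remove nodes along with every edge attached to them."""
        self._snapshot()
        doomed = set(doomed)
        self.nodes -= doomed
        self.edges = {(s, t) for s, t in self.edges
                      if s not in doomed and t not in doomed}
        self.just_added -= doomed

    def undo(self):
        """Restore the state before the last action; False if nothing to undo."""
        if not self._history:
            return False
        self.nodes, self.edges, self.just_added = self._history.pop()
        return True


g = ExplorationGraph()
g.add({"doc1", "doc2"})
g.add({"gene1", "doc2"}, {("doc1", "gene1"), ("doc2", "gene1")})
print(sorted(g.just_added))  # only gene1 is new; doc2 was already present
g.remove({"doc2"})
g.undo()                     # doc2 and its edge are back
```

A renderer could color the nodes in just_added differently from the rest without disturbing whatever ordering the region already uses.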

Abstract idea: substrate tree

One approach to the substrate tree, showing only those parts of the tree that are selected in the overview
Another approach that allows user to group links between substrates into fat links

Substrate Tree source file (Power Point)

Slightly different approach: a user may be interested in recounting or seeing how they explored the graph. This would require keeping track of what nodes were added when and where.

To this end we could display, instead of a single substrate with multiple regions, a tree of substrates. This substrate tree corresponds to the exploration hierarchy that the user engaged in. Nodes of the tree are substrates. Expanding nodes within a substrate is equivalent to adding a child node to the substrate tree. The user can then zoom around the substrate tree and easily navigate back to where they were before, possibly performing different queries on different sets of nodes.

Producing an appropriate view of the substrate tree could be challenging as it would quickly get cluttered with many nodes. We should therefore have some way of collapsing or hiding nodes, and only give details for those few nodes the user is interested in.

An additional interesting feature here would be to show links between sibling nodes in the substrate tree.
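
The substrate tree described above can be sketched as an ordinary tree whose nodes are substrates. The class and method names below are hypothetical, and the labels and identifiers are invented; the sketch only shows the bookkeeping (adding a child on expansion, retracing the path back to the initial query, and hiding the children of collapsed substrates), not any drawing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Substrate:
    label: str                 # the query or expansion that produced it
    node_ids: List[str]        # graph nodes shown in this substrate
    parent: Optional["Substrate"] = field(default=None, repr=False)
    children: List["Substrate"] = field(default_factory=list)
    collapsed: bool = False

    def expand(self, label, node_ids):
        """Expanding nodes in this substrate adds a child to the tree."""
        child = Substrate(label, list(node_ids), parent=self)
        self.children.append(child)
        return child

    def path_from_root(self):
        """Labels from the initial query down to this substrate."""
        path, current = [], self
        while current is not None:
            path.append(current.label)
            current = current.parent
        return list(reversed(path))

    def visible(self):
        """Substrates to draw; children of a collapsed substrate are hidden."""
        shown = [self]
        if not self.collapsed:
            for child in self.children:
                shown.extend(child.visible())
        return shown


root = Substrate("search: melanoma", ["doc1", "doc2", "doc3"])
genes = root.expand("genes of doc1, doc2", ["g1", "g2"])
proteins = genes.expand("proteins of g1", ["p1"])
other = root.expand("genes of doc3", ["g3"])

print(proteins.path_from_root())
genes.collapsed = True
print([s.label for s in root.visible()])
```

Sibling links, mentioned above as a possible extra, would need an additional structure, since this sketch only records parent-child relationships.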