- Behjat Siddiquie (behjat at cs.umd.edu)
- Chang-Han Jong
- Prahalad Rajkumar (bridgenbarbu at gmail.com)
- Tanya Clement
- Catherine Plaisant
- 1 Background
- 2 Goal
- 3 Status/Summaries of Meetings (in reverse chronological order)
- 4 Definitions
- 5 Possible Enhancements of the Existing Versioning Machine
- 6 Software and Methods
- 7 Snapshots
- 8 References
- 9 Credits
Background
The Versioning Machine (VM, http://www.v-machine.org) is an interface for viewing multiple versions of a document, along with the changes across versions. VM uses XML to represent the versioning data. For two versions of a text, each line is encoded in a single XML markup block that holds the different versions of that line, including deleted words.
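As an illustration of this kind of encoding, the sketch below reads one encoded line and reconstructs the text of each version. It assumes markup in the TEI parallel-segmentation style, where an `app` element groups the alternative readings and each `rdg` element names its version in a `wit` attribute. The element names, the witness ids `v1` and `v2`, the sample line, and the class `AppLineReader` are all assumptions made for this example; none of them is taken from the project's data or code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the Versioning Machine.
class AppLineReader {

    /** Maps each requested witness id to its reading of a single encoded line. */
    static Map<String, String> readLine(String xml, String... witnesses) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse line XML", e);
        }

        // One text buffer per version, in the order the caller asked for them.
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (String w : witnesses) {
            buffers.put(w, new StringBuilder());
        }

        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                // Text outside any <app> is shared by every version.
                for (StringBuilder sb : buffers.values()) {
                    sb.append(node.getTextContent());
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE
                    && "app".equals(node.getNodeName())) {
                // Each <rdg> contributes its text only to the versions it names.
                NodeList readings = ((Element) node).getElementsByTagName("rdg");
                for (int j = 0; j < readings.getLength(); j++) {
                    Element rdg = (Element) readings.item(j);
                    for (String w : rdg.getAttribute("wit").trim().split("\\s+")) {
                        StringBuilder sb = buffers.get(w);
                        if (sb != null) {
                            sb.append(rdg.getTextContent());
                        }
                    }
                }
            }
            // Any other child element is skipped in this sketch.
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<l>The <app><rdg wit=\"v1\">quick</rdg>"
                + "<rdg wit=\"v2\">slow</rdg></app> fox</l>";
        System.out.println(readLine(line, "v1", "v2"));
    }
}
```

For the sample line, the two versions come out as "The quick fox" and "The slow fox". Markup for deleted words inside a reading is not treated specially here: `getTextContent` would include that text as ordinary words.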
Goal
Our goal is to implement a versioning tool that lets the user effectively visualize the versions of a document along with the changes across those versions.
Status/Summaries of Meetings (in reverse chronological order)
Meeting - 4/18
Suggestions from the mid-semester class presentation were instrumental in formulating our plans for the rest of the semester.
- Display the actual text rather than labels (boxes)
- Associate detail windows with their origin, either by arrows or by color coding
- Use as much of the screen space as possible to display contents effectively
- Change the background color (which is currently black)
- Compute statistics of words in the documents, in order to display unique words in each version, versions that have similar phrases, etc.
- Finish the parser to convert data in XML format to our format
- Get a head start on writing the paper, come up with a basic template
Meeting with Tanya - 4/10
This was our first opportunity to demonstrate the program to Tanya. Chang conducted the demonstration, Prahalad took notes on Tanya's comments and wrote a summary of the meeting, and all three of us took part in discussing the development and progress of the project.
The program was loaded on Tanya's computer, and Chang opened a couple of text files to begin the demo. Chang described the general organization of the program and explained specific features such as the search capability and the representation of text as boxes. Tanya provided feedback as we went along.
Tanya's feedback and suggestions:
- The first priority is to work on obtaining data from her poems (Tanya would send updated files to Behjat)
- Link the text box generated by clicking on a word to the box (word) it originated from
- Tanya liked the idea of synchronizing multiple versions, such that when one version is scrolled, the other version scrolls as well
- She approved of the idea of synchronizing the panels containing boxes with the corresponding panels containing the actual text.
- Tanya provided positive feedback about the search capabilities of the program
- We spoke about possibilities in similarity searching and hotspots:
- Displaying the unique words in each version
- Phrases (see app below) that don't change across versions
App: an XML tag that marks a set of words that differ in at least one version. As a general rule, any phrase that differs in at least one version is represented by an app tag, although this rule could in theory be relaxed as Tanya sees fit.
Meeting - 4/09
Agreed on the following:
- Windows containing text can be embedded in the boxes, and will correspond to the SubjectPanelBox, similar to the functionality in Spotfire
- Make the WordBox bigger in order to fit the text, to facilitate readability
- Keep the possibility of word wrap in mind, to cater to documents containing lengthy paragraphs
- Use a format similar to <ADD>, <DEL> for representing metadata
A rough template of the paper:
- Introduction and Background
- Overview of the Architecture
- Similarity Searching (though this is not one of our main focuses)
- References and Related Work
Meeting - 4/02
Agreed upon the following tasks:
- Think of a revised, catchy title for the project
- Search and highlighting - the details of highlighted words or lines can be shown in a floating transparent text window.
- Come up with an algorithm to identify unique words in each version, and to search for unique words.
- Possibly show the added/deleted words
- Displaying and highlighting similarity across documents
- Layout for displaying the transparent text window:
- Make sure that overlapping does not occur
- Make sure that the display stays close to the original article
- Find the same text paragraph (spot)
- An arrow or curved line is a possibility (may consider it later)
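One possible approach to the unique-word task listed above is to count, for every word, how many versions contain it, and then report the words whose count is exactly one. The sketch below is only an illustration under stated assumptions: the class `UniqueWords`, the tokenization rule (lowercased runs of letters, digits, and apostrophes), and the sample sentences are hypothetical and not part of the project's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only.
class UniqueWords {

    /** Splits text into a sorted set of lowercased words. */
    static Set<String> tokenize(String text) {
        Set<String> words = new TreeSet<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}']+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** For each version, returns the words that appear in no other version. */
    static List<Set<String>> uniquePerVersion(List<String> versions) {
        // Number of versions in which each word occurs at least once.
        Map<String, Integer> versionCount = new HashMap<>();
        List<Set<String>> vocabularies = new ArrayList<>();
        for (String version : versions) {
            Set<String> words = tokenize(version);
            vocabularies.add(words);
            for (String word : words) {
                versionCount.merge(word, 1, Integer::sum);
            }
        }

        List<Set<String>> result = new ArrayList<>();
        for (Set<String> words : vocabularies) {
            Set<String> unique = new TreeSet<>();
            for (String word : words) {
                if (versionCount.get(word) == 1) {
                    unique.add(word);
                }
            }
            result.add(unique);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> versions = List.of("The quick brown fox", "The slow brown fox");
        System.out.println(uniquePerVersion(versions));
    }
}
```

Because each version is reduced to a set before counting, a word repeated many times within one version still counts as unique to it. Searching for unique words then amounts to a lookup in the returned sets.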
Meeting with Prof. Shneiderman - 2/28
Meeting - 3/06
- From WinDiff, we got an idea on how to present the 'Movement' of text.
- One way to enhance the bar representation of the articles is to use anchors to divide the bars, so that a length-based or ratio-based presentation can be used within a segment of an article. Multiple segments of different documents or versions can also be compared.
- We decided to use Java with Swing, developed in the Eclipse IDE
- XML parsing and the first two topics should be implemented first
- We talked to Tanya and got an idea of what additions she wants to improve the viz application, but we want to formalize them in the context of information visualization.
- Get some references on previous work, current research, and existing similar apps/software in this area.
XML data format
The XML language used by the Versioning Machine is defined by the Text Encoding Initiative (TEI).
Current implementation of the Versioning Machine
Data is stored in XML format. XSLT is used to convert the XML to HTML on the fly in the browser. The current implementation therefore serves only as a graphical viewer and does not provide additional features or visualizations. It may also run into compatibility problems with the XML implementations of different browsers.
Definitions
- Versions: the different variants of a document
- Document: a set of versions of a text
Possible Enhancements of the Existing Versioning Machine
- Specific additions required by Tanya, Part I
- Searching across documents.
- Finding patterns of a poet (additions/deletions of a specific word).
- Finding hotspots (regions of similarity/difference) in versions/documents.
- create a visualization/snapshot of all the poems and all the versions (this was the very first thing Tanya spoke about)
- Specific additions required by Tanya Part II
- In an overview mode, all versions of multiple documents are shown
- In overview mode, she wants to find things she knows are there, i.e., a word or a phrase that is alike across documents (not just across versions)
- Search for documents that are most alike and least alike between versions
- Hotspots for unique versus common words and phrases
- Hotspots for unique versus common changes such as adding and/or deleting the same letters, words, punctuation, parts of speech
- In detail view, she would like to take out the deleted and added phrases so that she can see a "clean" version (like reviewing in Word)
- In detail view, she wants to compare minute differences between whatever versions she has on display instead of just the line (this would change dynamically depending on which versions are on display)
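The "clean" detail view requested above could be sketched as a small text filter. The example assumes that revisions are marked inline with the `<ADD>` and `<DEL>` style tags mentioned in the 4/09 meeting notes, and it interprets "clean" as the final reading: deleted text is dropped and added text is kept without its markers. Both the interpretation and the class name `CleanView` are assumptions for illustration, not the project's actual design.

```java
// Hypothetical helper for illustration only.
class CleanView {

    /** Returns the final reading of a line: deletions removed, additions unwrapped. */
    static String clean(String marked) {
        // Drop each deleted span together with its markers (non-greedy, may span lines).
        String text = marked.replaceAll("(?s)<DEL>.*?</DEL>", "");
        // Keep added text but remove the markers around it.
        text = text.replaceAll("</?ADD>", "");
        // Collapse runs of spaces or tabs left behind; line breaks are preserved.
        return text.replaceAll("[ \\t]{2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("The <DEL>quick</DEL> <ADD>slow</ADD> fox"));
    }
}
```

Regular expressions are enough only while the markers are never nested. If the markup is kept in XML, filtering the parsed tree would be the more robust choice.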
Software and Methods
|JPlag||Academic||To detect source code similarity with awareness of programming language structures||Text||Web software|
|Piccolo||Academic||Supports Java, .NET, Pocket .NET||Download|
References
Susan Schreibman, Amit Kumar and Jarom McDonald. The Versioning Machine. This is the paper describing the Versioning Machine that inspired our project. It contains a detailed description of how the Versioning Machine works and how the tool facilitates comparing different versions of a document.
Fernanda B. Viégas, Martin Wattenberg, Kushal Dave. Studying cooperation and conflict between authors with history flow visualizations. This paper deals with monitoring the history of changes in Wikipedia articles. Since Wikipedia has millions of articles, this research is a superset of our project.
Benjamin B. Bederson, Jesse Grosjean, Jon Meyer. Toolkit Design for Interactive Structured Graphics. This paper describes tools which support 2D structured graphical applications, zoomable user interface applications in particular. We would like to try and use Piccolo to facilitate zooming in our project.
Nancy E. Miller, Pak Chung Wong, Mary Brewster, Harlan Foote. TOPIC ISLANDS TM – A Wavelet-Based Text Visualization System. This work uses wavelets to represent textual information. While this in itself may not be relevant to our project, the paper does a good job of outlining some of the difficulties of representing text, which we should be aware of.
A Comparison of Reading Paper and On-Line Documents. This paper covers movement of text within the original document, which is relevant to our project.
Ed H. Chi, Lichan Hong, Michelle Gumbrecht, Stuart K. Card. ScentHighlights: highlighting conceptually-related sentences during reading. This research uses highlighting to help people find the important words in a given text. The authors approach the goal by looking at words relevant to the title/topic sentences.
Vladislav Daniel Veksler, Wayne D. Gray. Mapping semantic relevancy of information displays. This work tries to predict where the human eye is likely to catch information. Semantic relevancy maps contain two elements: 1) a statistical measure of similarity and 2) information about the position and size of each text string. This paper is highly related to our research, but the charts in it are not very clear.
Antonio Si, Hong Va Leong, Rynson W. H. Lau. CHECK: a document plagiarism detection system. This paper describes CHECK, a document plagiarism detection tool. Since plagiarism detection compares different documents to find similarities, this work is directly related to versioning.
Sergey Brin, James Davis, Hector Garcia-Molina. Copy detection mechanisms for digital documents. This work deals with detecting similarities among digital documents, and is therefore directly related to our project.
Hongyuan Zha, Xiang Ji. Correlating multilingual documents via bipartite graph modeling. This paper provides a method to compare two documents using linear algebra and graph theory. The authors model the similarity problem as a weighted bipartite graph G(A,B,W), where A and B are the sentences of the two articles and W is an adjacency matrix holding the similarity between elements of A and B. They then try to find the dense subgraph by using the singular property of matrices.
Khoo Yit Phang, Jeffrey S. Foster, Michael Hicks. Path Projection for User-Centered Static Analysis Tools. This is current work by Khoo Yit Phang, a UMD student. We hope to see how highlighting was used in this program exploration tool.
G. Lommerse, F. Nossin, L. Voinea, A.Telea, The Visual Code Navigator: An Interactive Toolset for Source Code Investigation, in Proc. IEEE InfoVis’05, IEEE CS Press, 24 – 31, 2005.