Visualizing Large Collections: Reviewing the Mapping Texts Project

In this post I would like to review the digital humanities project Mapping Texts. Mapping Texts was a collaboration between scholars, staff, and students at Stanford University and the University of North Texas, supported by the National Endowment for the Humanities. The purpose of the project was “experimenting with new methods for finding and analyzing meaningful patterns embedded in massive collections of digital newspapers”. The database was a collection of 232,500 pages of historical newspapers digitized for the Chronicling America project. The project was finished in 2011 and, from what you can see on the website, there have not been any later updates or new analyses.

Accessing the project

You access the website through the home page. There you see the project’s general information: who developed it, in collaboration with whom, and the sources they used. The home page contains the project’s logo on top, followed by a navigation bar and the main text. The one small problem I found on the home page is the repetition of the information it contains. The same text that appears in the main section also appears on the right side of the page, followed by a series of links that lead to the different pages of the project, the same links you can find in the navigation bar. That is to say, there seem to be two navigation bars on the home page: one horizontal on top and one vertical on the right. Moreover, if we take a look at the footer, we find the same links one more time, under the section “Pages”. The main section of the home page also has links that direct you to other pages of the project, in this case the two interactive visualizations that they developed. In sum: repeated links that lead to the same pages, and duplicated information that may confuse the viewer and lead you to think the site contains more than it actually does.

Mission & People

The first paragraph of the Mission section contains the same information that we already read twice on the home page. It then goes on to describe the idea behind the project, the sources, and the two interactive visualizations they built, ending with the collaboration between the University of North Texas and Stanford’s Bill Lane Center for the American West. Aside from the duplicated information at the top of the page, I found the rest clear and effective at describing the main purposes of the project. The People page lists the people who were involved in the project and their contact information. Note that these pages have the same issue as the home page: a navigation bar on top, a navigation bar on the right, and links to the different pages in the footer. You also have a list of the posts included in the website and their categories. I don’t think this last list is useful, since there are only five posts on the whole website and you can access them through a specific section.

Mapping Quality

In Mapping Quality we access the first visualization. As explained on the website,

“This visualization plots the quantity and quality of 232,567 pages of historical Texas newspapers, as they spread out over time and space. The graphs plot the overall quantity of information available by year and the quality of the corpus (by comparing the number of words we can recognize to the total number scanned).”

The text sits on the left and the graph on the right. The description continues:

“The map shows the geography of the collection, grouping all newspapers by their publication city, and can show both the quantity and quality of the newspapers from various locations. Clicking on a particular city will provide a detailed view of the individual newspapers, where you can examine both the quantity and quality of information.”

It ends with:

“A timeline of historical events related to Texas is also available for context.”

Unfortunately, the timeline is not fully accessible, since a text box containing the map’s filters overlaps part of it. In this section I found the description clear, and the graph and map offer a nice interactive experience. You can see the different results easily, and they are not difficult to understand. However, most of the links from the timeline do not work; I assume this is because the project was last updated in 2011 and has not been maintained since. One last thing to note is that this page does not share the layout of the others: there is no navigation bar, no links on the right, and no footer. You only have a header with the project name and logo which, when clicked, takes you to the home page. This is the only way back to the rest of the website.
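As an aside, the quality metric the project describes (the number of recognizable words compared to the total number scanned) is straightforward to reproduce. Below is a minimal sketch of that ratio; the dictionary file and its location are my own assumptions, not details from the project’s pipeline.

```python
# Minimal sketch of the corpus-quality metric the project describes:
# the share of OCR'd words that are recognizable dictionary words.
# The word-list path below is a common Unix location, not something
# documented by Mapping Texts.

import re

def load_dictionary(path: str) -> set:
    """Load a plain word list (one word per line) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def ocr_quality(text: str, dictionary: set) -> float:
    """Return the fraction of alphabetic tokens found in the dictionary."""
    tokens = re.findall(r"[a-zA-Z]+", text)
    if not tokens:
        return 0.0
    recognized = sum(1 for t in tokens if t.lower() in dictionary)
    return recognized / len(tokens)

if __name__ == "__main__":
    words = load_dictionary("/usr/share/dict/words")  # assumed word list
    sample = "Tle quick brown fox jumqed over the lazy dog"
    print(f"quality: {ocr_quality(sample, words):.2%}")
```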

Mapping Language

The second visualization is the one you can find on the Mapping Language page:

“This visualization plots the language patterns embedded in 232,567 pages of historical Texas newspapers, as they evolved over time and space. For any date range and location, you can browse the most common words (word counts), named entities (people, places, etc), and highly correlated words (topic models).”

This description ends with an “about mapping texts” link. If you click on it, a black text box appears describing the Mapping Texts project. The link doesn’t really add any information, as we already know what the project is about; it just repeats the same text we found on the home and Mission pages. My guess is that it was added while this visualization was being built, before the rest of the website existed, and was simply left in place. The useful aspects of this visualization are that:

  1. It is easy to interact with.
  2. It is much clearer than the previous visualization in terms of the amount of information displayed on one page.
  3. You can copy and export the lists of your searches.
  4. If you click on the “about” button of the total word counts, named entity counts, and topic models analyses, you can see a description and explanation of each of them, including some useful links to other pages and resources (a concrete sketch of these analyses follows below).

I found that last aspect especially useful if you are not familiar with these contents. Lastly, just as on the previous visualization page, there is no navigation bar, no links on the right, and no footer. You have a header with the project name and logo, but this time clicking it opens the home page in a new tab.
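For readers less familiar with what “word counts” and “named entity counts” mean in practice, here is a minimal, hypothetical illustration. spaCy is my own choice of tool for the sketch; the project does not document what toolchain it used.

```python
# Hypothetical illustration of two of the three analyses the visualization
# offers: most common words (word counts) and named entity counts.
# Requires: pip install spacy && python -m spacy download en_core_web_sm

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

# Invented stand-in text; a real run would use an OCR'd newspaper page.
text = (
    "The Houston Daily Post reported that Governor James Hogg "
    "spoke in Austin about the railroad commission."
)
doc = nlp(text)

# Word counts: lowercase alphabetic tokens with stop words removed.
word_counts = Counter(
    tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop
)
print(word_counts.most_common(5))

# Named entity counts: recognized entities grouped by label and text.
entity_counts = Counter((ent.label_, ent.text) for ent in doc.ents)
print(entity_counts.most_common(5))
```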

Publications

On the Publications page you have two articles published in 2011 that describe the project and offer an analysis of topic modeling applied to a large collection of newspapers. They are both interesting articles, but I would have appreciated seeing other articles published after that date, relating either to the project or to some parallel analysis. Unfortunately, the website was last updated in 2011, and no further information or analysis was included.

Data

The last page is Data. It begins by describing the source materials for the project:

“The Mapping Texts project team relied heavily on the Texas Digital Newspaper Collection’s archive of historic documents. The main primary material for this project was 232,500 pages that were digitized and converted to plain text using optical character recognition (OCR). This collection was processed through several different computational analyses to help the team explore the possibilities for computer aided “distant reading” of large document collections.”

Although it repeats some of the information we already read on the home and Mission pages, I found it clear and useful. The other good aspect of this last page is that you can download the data files containing the results of the natural language processing tools they ran on the document collection. Moreover, you can access the original source code of the data visualizations they created for the project. The code is available in GitHub repositories and can be reused. I found that last possibility extremely useful, and I believe the reusability of their code and data that they promote is a great point to highlight. I’m personally getting involved with a project that will do some text mining and topic modeling analysis of a similarly large collection of newspapers from California, and having their code can help me a lot in experimenting with my own database. Thanks for that, Mapping Texts.
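As a first step toward that kind of experiment, here is a minimal topic-modeling sketch using scikit-learn. This is my own illustration of the general technique, not the Mapping Texts code, and the stand-in documents are invented.

```python
# Minimal topic-modeling sketch in the spirit of the project's analyses,
# using scikit-learn's LDA. Not the Mapping Texts code itself.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-in documents; in practice these would be OCR'd pages.
docs = [
    "cotton prices market trade export harvest",
    "railroad commission governor election legislature",
    "cattle ranch drought rainfall crops farmers",
    "election ballot candidate governor campaign vote",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)  # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the top words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"topic {i}: {', '.join(top)}")
```

On a real collection the interesting work is in the preprocessing (OCR cleanup, stop-word choices, date and place metadata), which is exactly where the project’s released data and code should help.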

Overall considerations

Overall, I found Mapping Texts a well-built project, clear in its objectives and able to present its experiments in a clear, interactive way. I would have liked to see some updates after 2011, or the re-application of this code and these experiments to different sets of newspaper collections, as well as more articles discussing this or related projects. The design of the site is simple; most of the pages have the interface of a blog. I would have removed the duplicated links and navigation bars from the different pages, since they confuse the viewer about where to click to move around the website and access its different pages and sections. The most outstanding aspect is the sharing of the code and the data that they offer. It encourages you to take a deeper look at their sources, as well as at how they created the code for the different visualizations.
