Three different tools were used in the last three modules to visualize the WPA Narratives: Voyant Tools, Kepler.gl, and Palladio.
The dataset used with the three tools throughout modules 7-9 came from the Library of Congress’ Born in Slavery: Slave Narratives from the Federal Writers’ Project, 1936-1938. The collection consists of more than two thousand interviews with former slaves from seventeen states, gathered in 1936-1938 by staff of the Federal Writers’ Project (FWP) of the Works Progress Administration, later renamed the Work Projects Administration (WPA). The interviews are available in the public domain as images and uncorrected OCR on the Library of Congress American Memory site, and as transcriptions through Project Gutenberg. These interviews are a complex source; make sure you read the background material before you work with them.
CONSIDERATIONS REGARDING INTERVIEWS
Interviewing is a data-gathering method built on a spoken exchange of information. Oral history can therefore be used in geographical research to trace the complex lines that ultimately form the web of an individual’s life experiences. It is a powerful source of situated learning that can supplement understandings of space, place, environment, and relationships.
However, the method of approach and the scope of the interviews need to be defined from the start. The WPA interviews were conducted face-to-face. In her article “The Other Slave Narratives: The Works Progress Administration Interviews,” Sharon Ann Musher identifies three interview pitfalls: authenticity, bias, and candor. Another thing to note is the dialect used in a specific region or among a social group.
While oral history is a valuable technique for gathering information, insights, and knowledge from participants in social research, no amount of technology can create new insight if the data is garbage. “When using data, most people agree that your insights and analysis are only as good as the data you are using. Essentially, garbage data in is garbage analysis out.”1
DIGITAL TOOLS: NETWORKS AND VISUALIZATION
Voyant Tools is an open-source, web-based application for performing text analysis; Kepler.gl is an advanced geospatial visualization tool; Palladio is a web-based platform for visualizing and exploring relationships in humanities data.
When it comes to data preparation, each tool has its own rules for how the data needs to be set up so it can be easily imported, organized, and manipulated. If your data is properly prepared, the analysis can be quick and clean.
- Each column in your file is equivalent to a variable, and each row is a case or observation. You should also have one sheet per file: an Excel workbook can contain many sheets, but a CSV file has only one, and each file should contain just one level of observations. Following these rules makes it easy to import the data and get the program up and running.
- Titles, images, figures, graphs, merged cells, color used to indicate a data value, sub-tables within the sheet, summary values, and comments or notes that might actually contain important data can all be useful if you never go beyond that particular spreadsheet. But if you are trying to take the data into another program, all of it gets in the way.
- Other problems show up in any kind of data. Do you actually know what the variable and value labels are? Do you have missing values where you should have data? Do you have misspelled text? If people write down the name of the town they live in or the company they work for, they can write it in a nearly infinite number of ways. It is also not uncommon for numbers to accidentally be stored in a spreadsheet as text, which makes numerical manipulation impossible.
- Then there is the question of what to do with outliers, and there is metadata: where did the data come from? How was it processed?
All of this is information you need in order to have a clean dataset whose context and circumstances you understand well enough to analyze it. And that is to say nothing of trying to get data out of things like scanned PDFs or printed graphs, which require either a lot of manual transcription or a lot of coding. Data prep is a necessary, vital step in getting something meaningful out of your data, so give it the time and attention it deserves; you will be richly rewarded.
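The checks described above can be sketched in a few lines of Python. The column names and values below are invented for illustration; the script flags missing values and numbers accidentally stored as text, two of the most common problems named above.

```python
import csv
import io

# A small, deliberately messy table, inline for illustration
# (column names and values are invented).
raw = """name,age,state
John Smith, 84 ,Texas
Mary Jones,,Georgia
Sam Brown,seventy,Alabama
"""

rows = list(csv.DictReader(io.StringIO(raw)))

problems = []
for i, row in enumerate(rows, start=1):
    # Flag missing values in any column.
    for col, val in row.items():
        if not val.strip():
            problems.append(f"row {i}: missing value in '{col}'")
    # Flag numbers stored as non-numeric text.
    age = row["age"].strip()
    if age and not age.isdigit():
        problems.append(f"row {i}: 'age' is not numeric: {age!r}")

for p in problems:
    print(p)
```

Running this reports the empty age in row 2 and the spelled-out number in row 3; real-world checks would add rules for dates, coordinates, and controlled vocabularies.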
Text Mining/Topic Modeling with VOYANT
Voyant Tools is a web-based (optionally locally installed) reading and analysis environment for digital texts, suitable for exploring patterns in words or phrases. It can read .txt, .pdf, .doc, and other formats. You can also perform topic modeling to discover abstract “topics” that occur in your corpus, a technique frequently used in text mining to uncover hidden semantic structures in a body of text.
After you paste or upload text, the screen opens to reveal five nested windows, each containing a tool: Cirrus, Reader, Trends, Summary, and Contexts. These tools sift through the entire corpus to surface insights that otherwise would not be immediately apparent. Analyzing text poses unique challenges: text data is typically far larger than numeric data, and it has no fixed structure or schema, which makes it difficult to work with.
The Cirrus tool is a word cloud generator. It highlights which words are used, and with what frequency, in a body of text or corpus. The size of each word in the cloud reflects its number of occurrences: the more occurrences, the bigger the word. A word cloud can also be limited to the most frequent words, making it a quick way to show the popularity of key words visually.
You may want to remove frequently appearing words, like ‘the’ or ‘a’, or certain dialect spellings used by an interviewee, that add no value to the analysis. To remove unwanted words, add them to the stopword list and plot the word cloud again. Voyant ships with a universal stopword list, but you can add your own words relative to the corpus being analyzed. The WPA Narratives contain dialect words, so I added them to the stopword list.
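The stopword idea can be sketched with the Python standard library. The sample text and the dialect additions to the stopword list below are invented for illustration, not taken from the actual corpus:

```python
from collections import Counter

# A generic English stopword list, extended with a few dialect
# spellings (these particular additions are illustrative).
stopwords = {"the", "a", "and", "was", "dey", "dat", "de"}

text = "dey say de old master was old and de old house was old"
words = [w for w in text.lower().split() if w not in stopwords]

freq = Counter(words)
print(freq.most_common(3))
```

Once the dialect forms and function words are filtered out, the substantive vocabulary (here, ‘old’) rises to the top of the frequency list, which is exactly what Cirrus does behind the scenes.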
If you click on Terms in the menu bar (next to Cirrus), the window changes to a list of words and their frequency in the document. The list shows that ‘old’ appears most often throughout the corpus of narratives, and the usage of different words associated with race and gender varies across it. If you click on Links in the menu bar, the window changes to a visualization of the highest-frequency words that occur close to the specified search terms. The Context slider on the bottom right adjusts how close to the search term a word must occur in order to appear.
Voyant can compare documents/interviews from different states. You can switch between viewing the frequency of the selected word within specific documents and its total frequency across the corpus.
Each document appears in the graph as a colored block; the wider the block, the more words the document contains. When you select a word, a line graph appears that shows the frequency of the selected word across each document. The vertical blue line indicates which part of the corpus is being displayed in the Reader tool window.
Click on a word in the Reader window – a frequency graph for that word will appear in the Trends tool, and that word will appear in the Contexts tool, beginning with the instance that you clicked on.
The Contexts tool shows all of the appearances of a selected word along with the words to its left and right. The Context slider on the bottom of the window adjusts how many words surrounding the search term are displayed.
The line graph shows the total frequency of the selected word in each document or corpus.
Clicking on ‘old’ produces the graph below.
This corpus has 17 documents with 2,442,169 total words and 41,128 unique word forms.
Most frequent words in the corpus: old (10094); â (8554); come (8000); got (7604); time (7472)
Spatial Visualization with KEPLER.GL
Spatial data tools allow you to extract deeper insight from data using a set of analytical methods and spatial algorithms.
Kepler.gl can draw connections between two points in 2D and 3D. I found that the 3D Arc layers didn’t reveal more than 2D lines could. The key is that the dataset must contain the latitude and longitude of both endpoints for each arc. Some of the interviews didn’t have a precise location, so not everything could be visualized consistently.
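A sketch of the dataset shape an arc layer needs: each row carries the coordinates of both endpoints, so one column pair can be mapped to the arc’s origin and the other to its target. The field names and coordinates below are invented for illustration.

```python
import csv
import io

# Each row holds both endpoints of one arc: where the person was
# enslaved and where they were interviewed (values illustrative).
rows = [
    {"interviewee": "A", "enslaved_lat": 32.3, "enslaved_lng": -86.3,
     "interview_lat": 33.5, "interview_lng": -86.8},
    {"interviewee": "B", "enslaved_lat": 30.4, "enslaved_lng": -84.3,
     "interview_lat": 30.3, "interview_lng": -81.7},
]

# Write a CSV in the flat, one-sheet shape the tool can ingest.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Rows whose origin or destination coordinates are missing simply cannot be drawn, which is why the imprecise interview locations led to inconsistent visualization.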
Kepler’s heat map visualizes how “hot” an area is in terms of the number of points in a particular region. You can create a heat map from any point layer; it turns discrete data points into a more continuous field. The advantage of mapping layers is the quick visualization of various attributes recorded within the same feature class. Take the two point layers recording where people were enslaved and where they were interviewed: converting the points to a heat map makes it easier to understand the migration of formerly enslaved people between the two locations. It looks like they didn’t venture far at all. While the point locations may have been inconsistently generalized when no granular data was available, the ability to ‘see’ a migration spread, no matter how small, adds to the narrative patterns.
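Under the hood, a heat map is essentially a density estimate: discrete points are binned or smoothed into a continuous field. A minimal binning sketch in Python, with invented coordinates:

```python
from collections import Counter

# Interview locations as (lon, lat) pairs, invented for illustration.
points = [(-86.3, 32.3), (-86.31, 32.29), (-86.29, 32.31), (-84.3, 30.4)]

cell = 0.5  # bin size in degrees

# Assign each point to a grid cell and count points per cell.
bins = Counter((round(lon / cell), round(lat / cell)) for lon, lat in points)

# The densest cell is the "hottest" area of the map.
hottest, count = bins.most_common(1)[0]
print(hottest, count)
```

Real heat maps typically apply a smoothing kernel rather than hard grid cells, but the principle is the same: clusters of nearby points become high-intensity regions.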
Network Visualization with PALLADIO
Palladio is a visualization platform with tools for visualizing and exploring map, graph, and list data that can help find insights. It offers four views for making sense of complex data: Map, Graph, List, and Gallery. The module activities were specific to maps and graphs.
Points on the map can be sized to represent their relative magnitude within the data. With the map’s tooltip function, you can select which information is displayed when hovering over a specific point.
You might want to compare interviews by interviewer, by race, by date, or by other variables beyond the state where they took place. While a tabular view of the data can be revealing, Palladio was helpful in surfacing relationships with more emphasis through the network graph’s nodes and edges. Edges can also carry information, indicated by the width of the arc.
The graph visualization provides a unique capability for analysis. Graph databases are all about relationships between entities. While other databases primarily focus on entities and their attributes, graph databases focus on the relationships between entities and the attributes of those relationships. A graph database stores nodes and their attributes; relationships can have their own set of attributes that describe them, and it is possible to run queries based on those relationship attributes. A graph database also allows chaining of relationships and provides the means to query those chains.
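These ideas can be sketched with a toy in-memory graph. The node names, relationship types, and attributes below are invented; the point is that edges carry their own attributes and queries can chain across relationships:

```python
# Nodes with their own attributes (all names invented).
nodes = {
    "interview_1": {"state": "Texas"},
    "interviewer_a": {"name": "J. Doe"},
    "topic_family": {"label": "family"},
}

# Edges as (source, relationship, target, edge attributes);
# queries can filter on the edge attributes too.
edges = [
    ("interviewer_a", "conducted", "interview_1", {"year": 1937}),
    ("interview_1", "mentions", "topic_family", {"count": 5}),
]

def neighbors(node, rel):
    """Follow edges of one relationship type out of a node."""
    return [dst for src, r, dst, _ in edges if src == node and r == rel]

# Chain two relationships: interviewer -> interviews -> topics mentioned.
topics = [t for iv in neighbors("interviewer_a", "conducted")
            for t in neighbors(iv, "mentions")]
print(topics)
```

In a relational database the same chained query would require a join per hop; in the graph model it is just repeated edge traversal.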
In the Graph view, you can visualize the relationships between any two dimensions of the data. Displaying the topics discussed as nodes connected by lines was powerful, since nodes can be scaled to reflect their relative magnitude.
Like a graph database, Palladio is optimized for working with complex data. By complex data, I mean connected data: data that contains references to other data points. In graph terms, we call these relationships. In a relational table, that would be data whose queries involve writing lots of joins. Data fusion, combining datasets and querying across them, is where Palladio is strong.
Palladio’s graph isn’t as good at working with discrete locational data. While you can certainly work with this type of data in Palladio, you won’t see many of the graph-specific benefits. It doesn’t make much sense to store individual observations with no base map behind them, because that data isn’t connected, and Palladio isn’t suited for feature layers with multiple attributes. Storing the structure of the network as a graph makes more sense. There is also no native datetime support, which can make working with dates more challenging.
The WPA Narratives contained the attributes Age, Coordinates: Exists, Interview Subject, Interviewee, Interviewer, M/F, Place Names, State Where Interviewed, Topics, Type of Slave, When Interviewed, Where Enslaved, and Where Interviewed.