
Getting Started with ProQuest TDM Studio: Geographic Analysis

Geographic Data Visualization

The Geographic Visualization is valuable for exploring trends over space and over time. You can use Geographic Analysis to tackle questions like:

  • How did the Flint water crisis unfold from 2014 to 2020?
  • What countries are most interested in electric cars and solar energy?
  • Where are the highest levels of homelessness in the US? Are these states talking about these challenges in the public sphere?

The following geographic visualization is created from a set of US newspapers using the search terms “drinking water” AND lead AND “contaminat*”.

Geographic Analysis is a difficult task, and there will be instances in the data visualization and results export where the TDM Studio algorithm picks the incorrect location, e.g. placing “London” in Ontario instead of England. In a teaching and learning context, these geocoding or NER errors can serve as opportunities to understand the limitations of algorithmic text mining as well as some of the challenges surrounding the task.

Each cluster or circle on the map represents a count of locations identified in the documents. For example, in the cluster over central Europe, there are 224 occurrences of underlying locations which have been resolved to this area. The 224 locations will likely come from fewer than 224 articles, because a single article often contains more than one location.

By adjusting the time slider, it is possible to see how the number of locations on the map changes over time. The time slider works like a date filter: all of the points which occur within the selected date range (inclusive) are included on the map.
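The inclusive behavior of the slider can be pictured with a minimal Python sketch. The sample points and the `filter_by_date` helper are hypothetical and only illustrate the idea of an inclusive date range; they are not taken from TDM Studio.

```python
from datetime import date

# Hypothetical sample of geocoded points: (place name, publication date).
points = [
    ("Flint", date(2014, 4, 25)),
    ("Flint", date(2016, 1, 5)),
    ("Detroit", date(2016, 1, 16)),
    ("Lansing", date(2020, 8, 20)),
]

def filter_by_date(points, start, end):
    """Keep points whose date falls within [start, end], both ends included."""
    return [p for p in points if start <= p[1] <= end]

# Points dated exactly on the start or end of the range are kept.
visible = filter_by_date(points, date(2016, 1, 5), date(2016, 1, 16))
```

Here `visible` holds the two January 2016 points, including the ones that fall exactly on the range boundaries.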

Important Note: Creating interactive data visualizations can be computationally intensive. In order to expedite the availability of your visualization, it may be necessary to select a sample from your dataset. The locations presented in the geographic analysis visualization are likely a sample of the total locations present in the entire dataset. Thus, the minimum and maximum dates on the time slider are based on this sample of locations which are visualized on the map. This is often different from the project date range which is presented in the project header. For example, if a project dataset contains an article which has the earliest publication date but does not contain any locations then the project date range will be different than the Geographic Analysis time slider date range.

List of Articles

When you click on a cluster, a drawer opens presenting the list of articles which contain locations included in the selected cluster. The articles are listed in order from most recent to oldest publication date.

In the above example, I am interested in learning more about why Flint has more locations identified than Chicago even though Chicago has a far greater population. This is due to the Flint Water Crisis, which is also apparent from the list of articles.

By clicking on a specific article title from the list, a new tab will be opened with the full-text view of the article.

Important Note: The list of articles may have fewer articles than the number of locations in the selected cluster. This is because most articles contain more than one location.

Export Data

It is possible to export the geographic data as well as the article metadata via “Export Data.” By clicking on “Export Data,” you can select the data format which works best for you (.csv or geojson), and the selected file will begin to download. Depending on the size of the selected file, this can take a few minutes.

You can then use this exported data for further text mining analysis. For example, if I wanted to analyze how income and education related to water crises, I could export the geographic data from TDM Studio for my project and pair this data with other available datasets.

Geographic Named Entity Recognition (NER)

The location information in the data visualization is produced by a two-step process: Geotagging and Geocoding. For Geographic Analysis in TDM Studio, the algorithms and approaches were specifically chosen to be intelligible and open-source licensed.

The first step is to identify geographic entities within each newspaper document. This can be a challenging task because words such as “Charlotte” can be used both as a person’s name and as the name of a location. For this process, TDM Studio uses spaCy’s NER model and pipeline, which perform well on general NER tasks as well as geographic NER tasks. spaCy’s documentation provides an overview.

Candidate Selection

Once spaCy has identified location entities within each newspaper article (Title, Abstract, Text), TDM Studio then uses GeoNames to create a list of candidate places to which the geographic entity could be linked. In other words, when a newspaper article mentions the geographic entity “London,” is it referring to the “London” in England or the “London” in Canada?

To select candidates from GeoNames, TDM Studio uses exact, lower-cased token matching. Both the official names and the alternate names from GeoNames are included as candidates.
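As an illustration, exact lower-cased matching against official and alternate names could look like the following Python sketch. The toy `GAZETTEER` rows, their field names, and the `build_index` and `candidates` helpers are assumptions made for this example; they do not reflect the actual GeoNames schema or the TDM Studio implementation.

```python
from collections import defaultdict

# Toy gazetteer rows; the field names are illustrative, not the GeoNames schema.
GAZETTEER = [
    {"id": 1, "name": "London", "country": "GB", "alternate_names": ["Londres", "Londinium"]},
    {"id": 2, "name": "London", "country": "CA", "alternate_names": []},
    {"id": 3, "name": "Charlotte", "country": "US", "alternate_names": ["Queen City"]},
]

def build_index(gazetteer):
    """Map each lower-cased official or alternate name to the places that carry it."""
    index = defaultdict(list)
    for place in gazetteer:
        names = {n.lower() for n in [place["name"], *place["alternate_names"]]}
        for name in names:
            index[name].append(place)
    return index

def candidates(entity_text, index):
    """Exact match on the lower-cased entity text; no fuzzy or partial matching."""
    return index.get(entity_text.lower(), [])

INDEX = build_index(GAZETTEER)
london_candidates = candidates("London", INDEX)
```

In this sketch `london_candidates` contains both the British and the Canadian entry, which is exactly the ambiguity the geocoding step then has to resolve. A partial string such as "Lond" returns no candidates because the match is exact.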

Geocoding

To pick between candidates, TDM Studio uses a gravity-inspired geocoding algorithm which ProQuest has developed. The initial inspiration and pilot work were completed via a collaboration with the University of Michigan.

The primary intuition behind the gravity geocoder is that newspapers have a geographic center and are more likely to discuss places which are closer to that geographic center. For example, when The Guardian mentions “London”, it is more likely to be referring to London, England than to London, Ontario. On the other hand, when The Globe and Mail refers to “London”, it is more likely to be referring to London, Ontario than to London, England.

To pick which “London” the article is referring to, TDM Studio uses Newton’s formula for gravity:

$$F = G \, \frac{m_1 m_2}{r^2}$$

where the population of the candidate is used for mass and the distance between the publisher location and the candidate is used for distance (r). TDM Studio then chooses the candidate with the greatest gravitational force. This approach means that a given publication (e.g. The Guardian) will always pick London, England when “London” occurs in one of its articles. This approach has been benchmarked against internal newspaper datasets as well as external evaluation datasets.
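The ranking idea can be sketched in Python under several stated assumptions. The guide does not say how distance is measured or what the second mass is, so this sketch assumes great-circle (haversine) distance and treats G and the publisher-side mass as constants that cancel when candidates are compared, leaving population divided by squared distance. The minimum-distance floor, the coordinates, and the population figures are rough illustrative values, not TDM Studio data.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (assumed distance measure)."""
    r_earth = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r_earth * math.asin(math.sqrt(a))

def gravity_score(population, distance_km, min_distance_km=1.0):
    """Population / r^2; G and the publisher mass are dropped as constants.

    The floor on r is an assumption that avoids division by zero when the
    publisher sits on top of the candidate.
    """
    r = max(distance_km, min_distance_km)
    return population / (r * r)

def pick_candidate(publisher, candidates):
    """Return the candidate with the greatest gravity score for this publisher."""
    def score(c):
        d = haversine_km(publisher["lat"], publisher["lon"], c["lat"], c["lon"])
        return gravity_score(c["population"], d)
    return max(candidates, key=score)

# Rough, illustrative coordinates and populations.
LONDONS = [
    {"name": "London, England", "lat": 51.51, "lon": -0.13, "population": 8_900_000},
    {"name": "London, Ontario", "lat": 42.98, "lon": -81.25, "population": 400_000},
]
GUARDIAN = {"lat": 51.53, "lon": -0.12}          # assumed publisher location: London, England
GLOBE_AND_MAIL = {"lat": 43.65, "lon": -79.38}   # assumed publisher location: Toronto

guardian_pick = pick_candidate(GUARDIAN, LONDONS)
globe_pick = pick_candidate(GLOBE_AND_MAIL, LONDONS)
```

With these values the Guardian resolves “London” to London, England, while The Globe and Mail resolves it to London, Ontario: the much smaller distance outweighs the larger population of the English city. Because the score depends only on the publisher location and the candidate, the same publication always gets the same answer for the same name.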

Subsampling for Visualization

Each newspaper article may have multiple locations, and it may also mention the same location multiple times. This can result in a very large number of locations and can create challenges for visualization performance. For the Geographic Visualization, random sampling is used to limit the maximum number of points on the map to 4,000. For example, if the project dataset results in 15,000 total locations, we will take a random sample of 4,000 locations from these 15,000 and plot them on the map.

All locations (in this example, 15,000) which have been geocoded are included in the exportable csv / geojson files.
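A minimal Python sketch of this kind of subsampling is shown below. The `sample_for_map` helper and its `seed` parameter are hypothetical; the guide states only that a random sample capped at 4,000 points is drawn, not how it is implemented.

```python
import random

MAX_POINTS = 4000  # cap stated in the guide

def sample_for_map(locations, max_points=MAX_POINTS, seed=None):
    """Return all locations if under the cap, else a random sample without replacement."""
    if len(locations) <= max_points:
        return list(locations)
    rng = random.Random(seed)  # seed is optional and only makes the sketch repeatable
    return rng.sample(locations, max_points)

# Stand-in for 15,000 geocoded locations; the export would still contain all of them.
all_locations = list(range(15_000))
shown_on_map = sample_for_map(all_locations, seed=0)
```

In this sketch `shown_on_map` holds 4,000 distinct locations, while `all_locations` is left untouched, mirroring the distinction between the map and the exported files.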

Important Note: In rare cases, long documents will have hundreds or even thousands of locations. If a document has more than twenty locations, only the first twenty locations are included in the results.

Additional Recommended Reading

Below are a few selected articles which discuss some of the challenges and solutions to Geotagging and Geocoding. These articles were important to the development of a gravity-inspired geocoding algorithm for TDM Studio and are valuable for further exploration.

Buscaldi, D. and Magnini, B., 2010, February. Grounding toponyms in an Italian local news corpus. In Proceedings of the 6th workshop on geographic information retrieval (pp. 1-5).

DeLozier, G., Baldridge, J. and London, L., 2015, February. Gazetteer-independent toponym resolution using geographic word profiles. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 29, No. 1).

Gritta, M., Pilehvar, M.T., Limsopatham, N. and Collier, N., 2018. What’s missing in geographical parsing? Language Resources and Evaluation, 52(2), pp. 603-623.