Getting Started with ProQuest TDM Studio: Topic Modeling

Topic Modeling Data Visualization

Topic Modeling is a text-mining approach for identifying which topics or subjects are present in a dataset. In TDM Studio, Topic Modeling can be applied to both newspaper content and dissertation and thesis content, for several different objectives. For example:

  • If I am interested in understanding the relationship between what is discussed on the front page of the newspaper and the 2009 financial crisis, Topic Modeling can be valuable. How do public narratives impact economic recovery? Or how does economic recovery impact reported narratives?
  • Topic Modeling can be used to analyze recent Computer Science dissertations and theses to determine which machine learning methodologies have been trending over the past five years. This can also be valuable from a discovery standpoint for finding dissertations and theses related to my research (e.g., for a literature review).

In the example below, we use LDA to analyze a set of 8,851 newspaper articles from the New York Times: all of the articles the newspaper published in September 2001. How does the news cycle change in response to the tragic terrorist attacks of September 11? How does this differ from one newspaper to another?

Topic Modeling (Latent Dirichlet Allocation) and Pre-Processing

LDA (Latent Dirichlet Allocation) is a generative model which attempts to discover ‘latent’ or hidden topics within a collection of documents. The only observed variable in the model is the occurrence of words in documents. The number of topics is provided as an input by the user (in TDM Studio via the ‘Number of Topics’ dropdown) and affects the resulting topic model.
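
As a rough sketch of what ‘generative’ means here, the standard smoothed LDA formulation (following Blei et al., 2003) can be written as below. The symbols are conventional notation and are not taken from TDM Studio: K is the number of topics, D the number of documents, β_k the word distribution of topic k, θ_d the topic distribution of document d, z_{d,n} the hidden topic assignment of the n-th word in document d, and w_{d,n} the word itself, which is the only observed quantity.

```latex
\begin{aligned}
\beta_k  &\sim \mathrm{Dirichlet}(\eta),   & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
\end{aligned}
```

Fitting the model means inferring the hidden θ, β, and z from the observed words alone, for a value of K chosen in advance.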

For TDM Studio, we use scikit-learn’s implementation of Latent Dirichlet Allocation.
The scikit-learn User Guide provides further details on how the word and topic distributions are computed.
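
To illustrate how the number of topics enters the model, here is a minimal sketch using scikit-learn’s `LatentDirichletAllocation`. The toy documents and the `random_state` setting are assumptions made for illustration; the settings TDM Studio actually uses are not stated here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a real dataset (hypothetical documents).
docs = [
    "stocks fell as markets reacted to the bank crisis",
    "the bank reported losses and markets fell again",
    "the team won the game in the final inning",
    "fans cheered as the team scored in the game",
]

# Word counts per document: the only observed input to the model.
counts = CountVectorizer().fit_transform(docs)

# Plays the role of the 'Number of Topics' dropdown.
n_topics = 2

# random_state is fixed only so that repeated runs give the same result.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)

# One row per document, one column per topic; each row sums to 1.
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)
print(lda.components_.shape)  # (n_topics, vocabulary size)
```

Changing `n_topics` and refitting produces a different model, which is why the choice of topic count affects the topics that are shown.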

For preparing documents for topic modeling, we rely upon scikit-learn’s CountVectorizer.
For newspaper articles, we use title, abstract, and full text as input. Because dissertations and theses are often hundreds of pages long, we use only their title and abstract as input.
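
The sketch below shows one way this input preparation could look with `CountVectorizer`. The record layout, the field names (`title`, `abstract`, `full_text`), and the two helper functions are hypothetical, and `CountVectorizer` is used with its default settings because the options TDM Studio passes are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical records; field names are assumptions for illustration.
articles = [
    {"title": "Markets Fall", "abstract": "Stocks drop sharply.",
     "full_text": "Stocks fell across markets today."},
    {"title": "Team Wins", "abstract": "A late rally.",
     "full_text": "The team won the game late."},
]
dissertation = {"title": "Topic Models for Text",
                "abstract": "We study topic models.",
                "full_text": "hundreds of pages ..."}


def newspaper_input(record):
    """Newspaper articles contribute title, abstract, and full text."""
    return " ".join([record["title"], record["abstract"], record["full_text"]])


def dissertation_input(record):
    """Dissertations and theses contribute title and abstract only."""
    return " ".join([record["title"], record["abstract"]])


texts = [newspaper_input(a) for a in articles] + [dissertation_input(dissertation)]

# Default settings: lowercases text and counts tokens of two or more characters.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)  # sparse matrix: documents x vocabulary

stocks_column = vectorizer.vocabulary_["stocks"]
print(counts.shape[0], counts[0, stocks_column])
```

Because the dissertation’s full text is never passed to the vectorizer, none of its words enter the vocabulary.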

Topic Modeling Keywords and Topic Documents

For each topic, we list the ten words with the highest probability under that topic. These words often, though not always, give an indication of what the topic is about.
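
As a sketch of how such a keyword list can be derived, the helper below (a hypothetical function, not part of TDM Studio or scikit-learn) picks the highest-probability words from a topic-word weight matrix. With scikit-learn, a matrix of this shape would come from the fitted model’s `components_` attribute; here a small hand-written matrix is used so the result is easy to check.

```python
import numpy as np


def top_words(topic_word, vocab, n_words=10):
    """Return the n_words highest-probability words per topic, best first."""
    topic_word = np.asarray(topic_word, dtype=float)
    # Normalise each row so the weights read as probabilities over the vocabulary.
    probs = topic_word / topic_word.sum(axis=1, keepdims=True)
    # Sort each row in descending order and keep the first n_words columns.
    order = np.argsort(-probs, axis=1)[:, :n_words]
    return [[vocab[j] for j in row] for row in order]


# Hypothetical vocabulary and topic-word weights (2 topics, 5 words).
vocab = ["bank", "game", "market", "team", "stock"]
topic_word = [
    [5.0, 0.1, 4.0, 0.2, 3.0],
    [0.1, 6.0, 0.3, 5.0, 0.2],
]

cards = top_words(topic_word, vocab, n_words=3)
print(cards)  # [['bank', 'market', 'stock'], ['game', 'team', 'market']]
```

Note that the third word of the second topic, "market", has a very low weight: a word can appear in a top-N list without being strongly tied to the topic, which is one reason the keywords do not always indicate what a topic is about.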

Clicking on a topic card presents a list of up to fifty documents related to the selected topic. These are the documents in which the selected topic has a high probability of occurring. Clicking on the title of a document opens a new window with the full text of that document.
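
One plausible way to build such a list is to rank documents by the probability of the selected topic and keep the first fifty. The sketch below assumes simple ranking with no probability threshold, which may differ from what TDM Studio does; `top_documents` is a hypothetical helper, and the document-topic matrix is hand-written (with scikit-learn it would come from the fitted model’s `transform` output).

```python
import numpy as np


def top_documents(doc_topic, titles, topic, max_docs=50):
    """Return up to max_docs (title, probability) pairs for one topic, best first."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    # Rank documents by the selected topic's probability, highest first.
    order = np.argsort(-doc_topic[:, topic])[:max_docs]
    return [(titles[i], float(doc_topic[i, topic])) for i in order]


# Hypothetical document-topic probabilities (3 documents, 2 topics).
titles = ["Article A", "Article B", "Article C"]
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

related = top_documents(doc_topic, titles, topic=0, max_docs=2)
print([title for title, _ in related])  # ['Article A', 'Article C']
```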

Additional Recommended Reading

Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, pp.993-1022.

Hall, D., Jurafsky, D. and Manning, C.D., 2008, October. Studying the history of ideas using topic models. In Proceedings of the 2008 conference on empirical methods in natural language processing (pp. 363-371).

Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S. and Blei, D.M., 2009, December. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (Vol. 22, pp. 288-296).

Dieng, A.B., Ruiz, F.J. and Blei, D.M., 2019. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545.