If you're embarking on a machine learning project in the digital humanities, you're most likely not going to call it that. Machine learning drives techniques in text mining, network analysis, and more—but often that machine learning is already baked into the tools you're using. Knowing a little bit more about how it works can help you assess the tools you plan to use.
In some cases, you may wish to use software libraries such as TensorFlow or PyTorch to build your own machine learning projects. If so, book a consultation with the DHLab to talk to us about using the Machine Learning Cube, which is specially set up to support GPU-accelerated machine learning.
Artificial Intelligence describes, broadly speaking, any occasion on which some form of intelligence appears in a machine. “Intelligence” in this instance often applies to perception, reasoning, or learning—tasks like recognizing objects in a photograph, finding a way through a maze, or even making conversation.
Machine Learning is a specific approach to Artificial Intelligence by which programs improve how well they work (they “learn”) based on data or attempts at a task. Instead of the programmer choosing how the computer approaches its problem directly (by setting conditions under which it might provide different answers, for example), they instead design how the computer “learns” how to approach a problem based on examples or other input.
In machine learning, a feature is a measurable aspect of your data—so if your data is a pixel, a feature might be its hue. Or take speech recognition as a more elaborate example. An important step towards speech recognition is usually identifying phonemes—the basic sounds that build words. (These have some overlap with, but are not the same as, the characters of the Latin script in English. ‘Th’ is a different sound from ‘t’ and ‘h’ separately, for example, and furthermore the ‘th’ in breath is different from the ‘th’ in breathe.) To identify phonemes, a computer compares the sounds of recorded speech to previous examples with known phonemes. This is where features come in: features are the measurable aspects of the recorded sound by which the computer makes this comparison. So the length of a sound might be one such feature.
A feature vector is a mathematical encapsulation of a group of numerical features for a given object. (An object might be a picture, a piece of text, a phoneme—whatever the unit of data is that you’re using your machine learning program on.) It is essentially just a list of numbers that represent your object and can be worked with as a group.
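To make this concrete, here is a toy sketch of turning an object into a feature vector. The object is a word, and the features (length, vowel count, capitalization) are invented purely for illustration:

```python
# Toy sketch: representing an object as a feature vector.
# Each "object" here is a word; the features are invented for
# illustration: length, number of vowels, and whether the word
# is capitalized (1 or 0).

def feature_vector(word):
    vowels = sum(1 for ch in word.lower() if ch in "aeiou")
    capitalized = 1 if word[0].isupper() else 0
    return [len(word), vowels, capitalized]

print(feature_vector("Breathe"))  # [7, 3, 1]
```

Whatever the features are in a real project, the point is the same: the object is reduced to a list of numbers that a program can compare and compute with.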
Deep learning describes a use of machine learning at more than one layer of processing for a given input (see neural nets). Speech recognition generally relies on deep learning. As described in the definition for features, speech recognition takes more than one step to go from raw sound to finished text. There might be machine learning steps to recognize which phonemes a raw sound might be—but then there might also be steps for recognizing which phonemes go together into words, or even which of the possibly-heard next words makes the most sense after the previous one.
Training data is a representative sample of data, which may or may not be labeled with the desired output from a program, used to develop machine learning programs. For example, you might use pictures of documents and transcriptions of their contents as training data for an OCR program. Training data can potentially introduce algorithmic bias, on which more below. It is worth remembering also that some machine learning tools available to you were designed with a specific purpose in mind that may not apply to your project: a program trained on contemporary customer reviews might generate unexpected results when applied to eighteenth-century letters.
Supervised learning is a way of approaching machine learning that hinges on labeled training data: already solved versions of a task or problem. For example, you might use a set of scanned texts you’ve already transcribed by hand to train a machine learning program for recognizing printed characters (OCR). The machine learning program identifies the patterns in the features of scanned images of characters that correspond to each transcribed letter. (If height and number of crossed lines were hypothetically features, for example, it might notice that lower-case ‘t’s are tall and crossed once, while ‘l’s are tall but not crossed.) Then, when you apply your machine learning model to new scans, it will compare them to its map of which features point to which letters and generate a possible transcript—when it runs across a tall character crossed once, it will spit out a ‘t’.
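The hypothetical OCR example above can be sketched in a few lines of code. This is a minimal nearest-neighbor classifier, not the algorithm any real OCR tool uses; the feature values (height, number of crossed lines) and training examples are invented:

```python
# Minimal supervised-learning sketch using the hypothetical
# character features from the text: height and number of crossed
# lines. The "training data" pairs feature vectors with known
# letters; classification finds the nearest labeled example.

training_data = [
    ([0.9, 1], "t"),   # tall, crossed once
    ([0.9, 0], "l"),   # tall, not crossed
    ([0.4, 0], "o"),   # short, not crossed
]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(features):
    # Label a new scan by its closest known training example.
    return min(training_data, key=lambda pair: distance(features, pair[0]))[1]

print(classify([0.85, 1]))  # "t": tall and crossed once
```

The labeled examples do all the work here: change the training data and the same code produces different answers, which is also how bias in training data becomes bias in output.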
Unsupervised learning is a way of approaching machine learning that creates programs that attempt to identify patterns in their input without using prior solved examples to guide them. This still relies on features, the numerical characteristics of a given object or data point that you’ve identified. You could, for example, design a machine learning program to find groups (“clusters”) within the input, or to find outliers. In digital humanities, topic modeling relies on unsupervised learning to identify words that frequently co-occur (but don’t occur as frequently around other words), suggesting that they relate—and therefore form a topic.
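Clustering is the easiest form of unsupervised learning to see in miniature. The sketch below groups invented two-dimensional points in the style of k-means, with no labels involved; a real project would use a library such as scikit-learn rather than this hand-rolled version:

```python
# Toy clustering sketch (k-means style). The points and starting
# centers are invented for illustration; no labeled examples are
# used, only the features (x, y) of each point.

points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9), (9, 8)]

def mean(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans(points, centers, steps=10):
    for _ in range(steps):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            i = min(range(len(centers)),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # Move each center to the middle of its cluster.
        centers = [mean(c) for c in clusters if c]
    return clusters

for cluster in kmeans(points, centers=[(0, 0), (10, 10)]):
    print(cluster)
```

The program "discovers" the two groups from the geometry of the features alone, which is the essential move topic modeling makes with word co-occurrence instead of coordinates.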
Reinforcement learning is a way of approaching machine learning that encourages programs to attain specific goals by attaching numerical rewards to them without providing completed examples—the idea is to find the most efficient solution, or path to a solution, through exploration. For example, you might design a program to solve a maze that has hazards and a few exits. The exits might have a high reward value, while the hazards might have a very negative reward value. A reinforcement learning program might “test” different choices in the maze, and then assign values to each choice based on the value (positive or negative) of their result and the cost of getting there. It might then test another set of choices, weighted in favor of those that got positive results previously. What this looks like exactly depends on the implementation, but the idea is that the rewards “shape” the kinds of answers or solutions the program gives, while still allowing it to find potentially novel or more efficient solutions than supervised learning (which depends on following the pattern of pre-determined solutions) might allow.
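A maze can be shrunk to a single corridor to show the mechanics. The sketch below is tabular Q-learning, one common reinforcement-learning technique, on a one-dimensional "maze": the positions, rewards, and learning parameters are all invented for illustration:

```python
import random

# Toy reinforcement-learning sketch (tabular Q-learning) on a
# one-dimensional "maze": positions 0-4, a hazard at position 0
# (reward -10) and an exit at position 4 (reward +10). The program
# learns a value for each (position, action) pair by trial and error.

rewards = {0: -10, 4: 10}   # terminal positions
actions = [-1, +1]          # step left or step right
q = {(s, a): 0.0 for s in range(5) for a in actions}

random.seed(0)
for episode in range(500):
    state = 2  # start in the middle of the corridor
    while state not in rewards:
        a = random.choice(actions)  # explore by picking randomly
        nxt = state + a
        reward = rewards.get(nxt, 0)
        best_next = 0 if nxt in rewards else max(q[(nxt, b)] for b in actions)
        # Nudge the value of this choice toward reward + future value.
        q[(state, a)] += 0.5 * (reward + 0.9 * best_next - q[(state, a)])
        state = nxt

# After training, stepping toward the exit is valued more highly
# than stepping toward the hazard.
print(q[(2, 1)] > q[(2, -1)])  # True
```

No completed solutions were supplied; the reward values alone "shaped" the learned preferences, exactly as the definition above describes.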
A generative adversarial network is a system of two machine learning programs that compete to improve their results—hence “adversarial.” One is designed to identify something real—most often, a real photograph. The other is designed to generate fake examples of that something. Each is improved based on its ability in competition with the other—so the identifier is trained on the generator’s examples, while the generator is trained on the identifier’s feedback. Thus you might use a generative adversarial network to generate pictures of dogs—your generated dog pictures would get more and more dog-like as the identifier got better and better at recognizing real pictures of dogs.
Artificial Neural Networks—the technology usually meant by the term neural net—make decisions, analyze information, or perform other tasks using a model inspired by the networks of neurons and synapses in human or other animal brains. Each “neuron” of an artificial neural network—often called a node (see this guide's section on networks)—accepts one or more numerical inputs (see features) and uses these to generate a numerical output. Taken together with the output of other nodes, and sometimes even further processed through additional layers of nodes (see deep learning), these results can provide or represent complex answers or decisions about the input. They might process pieces of an image in their context, or analyze aspects of a sound. Machine learning enters the picture when it comes to the formulae by which the nodes generate their output: they can use machine learning (see unsupervised learning, supervised learning, and reinforcement learning) to create or adjust these to produce the desired results.
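A single node of such a network is small enough to write out. In this sketch the weights and bias are invented; in a real network, "learning" consists of adjusting these numbers based on examples or rewards:

```python
import math

# Sketch of a single artificial "neuron" (node): it takes numerical
# inputs (features), weights them, adds a bias, and squashes the sum
# through an activation function into a single numerical output.
# The weights and bias here are invented for illustration.

def sigmoid(x):
    # Maps any number into the range (0, 1).
    return 1 / (1 + math.exp(-x))

def node(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

output = node(inputs=[0.5, 0.8], weights=[1.2, -0.7], bias=0.1)
print(round(output, 3))  # 0.535
```

A network is just many of these nodes wired together, with each layer's outputs serving as the next layer's inputs (hence "deep" learning when there are several layers).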
Algorithmic bias describes unintentionally skewed output or analysis resulting from the design or history of a computer program. In the context of machine learning, this can often result from what training data is used in supervised learning. For example, if there is bias in the training data—if the phrase “he is a doctor” appears more often than “she is a doctor”—the resulting model will conform to this bias. Similarly, if there is more coverage of certain kinds of training data than others—if many fewer (or no) recordings of AAVE speakers are included in generating a speech recognition program, for example—the final program may produce unpredictable, undesirable, or unusable results when used with data from this missing group. This is not exclusive to supervised learning programs, which use training data—unsupervised learning and reinforcement learning can also generate biased results, since their features and rewards are designed to get desired outputs from expected inputs, which may reflect the implicit biases or expectations held by the programs' makers.
The term Natural Language Processing (NLP) describes the use of a computer to analyze human language, usually text. This might include everything from labeling parts of speech to identifying probable tone or emotion. Some NLP methods rely on machine learning, including sentiment analysis and topic modeling.
Sentiment analysis is an NLP technique that identifies the probable tone or emotion of words and phrases in a body of text by assigning numerical scores. Sentiment analysis can potentially be performed in a few ways, but most out-of-the-box solutions available to you are likely to rely at least in part on supervised learning based on a training set of labeled texts or words. While it is possible to use sentiment analysis for scholarly projects, it is especially important to be aware of potential algorithmic bias and to investigate how comparable the training set for a given library or tool is to your texts. ("Awesome" might mean something very different in a customer review in the United States in 2023 and in a religious text in England in 1885.)
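At its simplest, scoring sentiment means summing per-word values. The tiny hand-made lexicon below is invented for illustration; real tools learn such scores from labeled training data, which is exactly where the training-set mismatch described above can creep in:

```python
# Toy sentiment-analysis sketch using a tiny invented lexicon of
# word scores. A real tool's scores come from labeled training
# data, so they reflect that data's time, place, and genre.

lexicon = {"awesome": 2, "great": 2, "dreary": -1, "terrible": -2}

def sentiment(text):
    # Sum the scores of all known words; unknown words count as 0.
    words = text.lower().split()
    return sum(lexicon.get(w, 0) for w in words)

print(sentiment("an awesome product, simply great"))   # positive total
print(sentiment("a terrible and dreary experience"))   # negative total
```

A lexicon built from 2023 customer reviews would score an 1885 sermon's "awesome" as enthusiasm rather than awe, which is the algorithmic-bias problem in miniature.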
Topic modeling is an NLP technique for identifying themes and topics in a large body of text. Topic modeling uses unsupervised learning to identify words and phrases with potentially related meanings within a text that meaningfully co-occur, create groups of these words and phrases, and then rate how relevant each group of words and phrases is to a given part of the text. These groups are the "topics" of topic modeling. For example, you might identify the primary topic of a few chapters in a set of books as "rainy sunshine storm lightning dreary sky" and conclude that these were primarily discussing weather.
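The co-occurrence counting at the heart of this can be sketched briefly. Real topic modeling (for example, Latent Dirichlet Allocation) is probabilistic and far more involved; the miniature "documents" below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Toy sketch of the word co-occurrence counting that underlies
# topic modeling: count which word pairs appear together in the
# same "document". The documents are invented for illustration.

documents = [
    "rainy sky dreary storm",
    "storm lightning sky",
    "sunshine rainy sky",
    "harvest grain field",
    "field grain plough",
]

pairs = Counter()
for doc in documents:
    words = sorted(set(doc.split()))
    pairs.update(combinations(words, 2))

# The most frequent pairs hint at clusters of related words
# ("topics"): weather words co-occur, farming words co-occur.
for pair, count in pairs.most_common(3):
    print(pair, count)
```

Even at this scale, the weather vocabulary and the farming vocabulary separate into two groups of frequently co-occurring words, which is the intuition behind the "topics" a real model produces.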
Optical Character Recognition, better known as OCR, refers to the automatic transcription of letters in pictures of printed documents to machine-readable text. For example, you might use Adobe Acrobat or ABBYY FineReader to make scans of a book searchable. OCR relies on a supervised learning model, using transcribed examples as the training set. If you’ve used ABBYY FineReader, you may know that you can improve its results for scans using unusual characters, for example—it does so by adding these to its training set for your documents.
Handwritten Text Recognition (HTR) is more complicated than OCR for printed documents, given the variability in how a letter might appear even within one individual’s handwriting. However, machine learning can still assist with the transcription of these documents with a relatively high accuracy rate if given a large and comparable enough training set. (As this implies, HTR relies on supervised learning.) Transkribus is a tool that allows users to generate models based on their own training set, which can make HTR achievable to a degree without too much specialized knowledge.