Recently I’ve been reading a great book called Building Machine Learning Systems with Python. The book has two authors: Willi Richert and Luis Pedro Coelho. As is often the case for books with multiple authors, the individual chapters have a different literary feel to them. The following meta-idea occurred to me:
Can the tools and techniques from the book be used to identify who wrote each chapter?
A person’s writing style is an example of a behavioral biometric. The words people use and the way they structure their sentences are distinctive, and can often be used to identify the author of a particular work. This is a widely studied problem, with hundreds of academic papers on the subject.
There are two high-level ways to attack the chapter attribution problem:
- Supervised learning: One approach would be to gather ground truth from external sources. For example, find works for each author from other publications, blogs, etc. These samples would be used to learn a model for each author’s writing style. Determining who wrote each chapter would be a binary classification problem.
- Unsupervised learning: A second approach is unsupervised, meaning that the analysis is conducted without ground truth. In this method, the chapters are analysed to find two subsets that appear to have been written by the same person.
On this page I will consider the unsupervised problem. There are three steps:
- Preparing and loading the data
- Feature extraction: We will experiment with a few different feature sets. Even though the focus is on the unsupervised problem, the feature extraction code can also be used for supervised learning.
- Classification: We will use clustering to find natural groupings in the data. Since we have several feature sets, we will use ensemble learning: learn multiple models, each built using different features, that vote to determine who wrote each chapter.
First, install the following Python libraries: NumPy, SciPy, scikit-learn, and NLTK. Second, you will need the raw text of a book (any book with two or more authors will do). Convert it to plain text (e.g. using PDFMiner), and remove anything that isn’t body text (e.g. chapter and section headings, tables, code snippets, etc.). Finally, divide the book into one file per chapter, with names such as chapter02.txt, so that they match the chapter*.txt pattern used when loading. Run the following code to import the libraries and load the text:

    import numpy as np
    import nltk
    import glob
    import os
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans
    from scipy.cluster.vq import whiten

    # Tokenizers: Punkt for sentences, a regex for words (drops punctuation)
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    word_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

    # Load data: one string per chapter, with newlines flattened to spaces
    data_folder = r"[path to chapters]"
    files = sorted(glob.glob(os.path.join(data_folder, "chapter*.txt")))
    chapters = []
    for fn in files:
        with open(fn) as f:
            chapters.append(f.read().replace('\n', ' '))
    all_text = ' '.join(chapters)

Dozens of features for authorship attribution have been proposed in the literature. Good features for this problem (1) capture the distinctive aspects of someone’s writing style, and (2) stay consistent even when the author is writing on different subjects. We will experiment with a few different approaches.
Lexical and punctuation features
- Lexical features:
  - The average number of words per sentence
  - Sentence length variation
  - Lexical diversity, which is a measure of the richness of the author’s vocabulary
- Punctuation features:
  - Average number of commas, semicolons and colons per sentence
The following code extracts these features:

    # create feature vectors
    num_chapters = len(chapters)
    fvs_lexical = np.zeros((num_chapters, 3), np.float64)
    fvs_punct = np.zeros((num_chapters, 3), np.float64)
    for e, ch_text in enumerate(chapters):
        # note: nltk.word_tokenize keeps punctuation marks as separate tokens
        tokens = nltk.word_tokenize(ch_text.lower())
        words = word_tokenizer.tokenize(ch_text.lower())
        sentences = sentence_tokenizer.tokenize(ch_text)
        vocab = set(words)
        words_per_sentence = np.array([len(word_tokenizer.tokenize(s))
                                       for s in sentences])

        # average number of words per sentence
        fvs_lexical[e, 0] = words_per_sentence.mean()
        # sentence length variation
        fvs_lexical[e, 1] = words_per_sentence.std()
        # lexical diversity: distinct words divided by total words
        fvs_lexical[e, 2] = len(vocab) / float(len(words))

        # commas per sentence
        fvs_punct[e, 0] = tokens.count(',') / float(len(sentences))
        # semicolons per sentence
        fvs_punct[e, 1] = tokens.count(';') / float(len(sentences))
        # colons per sentence
        fvs_punct[e, 2] = tokens.count(':') / float(len(sentences))

    # whiten rescales each feature (column) to unit variance across chapters,
    # so that no single feature dominates the distance computation
    fvs_lexical = whiten(fvs_lexical)
    fvs_punct = whiten(fvs_punct)

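To make the six numbers concrete, here is a rough standard-library sketch that computes them for a toy string. It is not the NLTK pipeline: the regex sentence splitter and the `lexical_punct_features` name are my own simplifying assumptions, chosen so the example runs without any downloads.

```python
import re
import statistics


def lexical_punct_features(text):
    """Return (mean words/sentence, std of words/sentence, lexical diversity,
    commas/sentence, semicolons/sentence, colons/sentence) for one text."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    words_per_sentence = [len(re.findall(r'\w+', s)) for s in sentences]
    words = re.findall(r'\w+', text.lower())
    n = len(sentences)
    return (
        statistics.mean(words_per_sentence),
        statistics.pstdev(words_per_sentence),  # population std, like ndarray.std()
        len(set(words)) / len(words),           # distinct words / total words
        text.count(',') / n,
        text.count(';') / n,
        text.count(':') / n,
    )


sample = "The cat sat. The dog, the cat, and the bird sat; all of them sat."
features = lexical_punct_features(sample)
print(features)
```

The toy text has two sentences of 3 and 12 words, so the mean is 7.5 and the spread is 4.5; 9 distinct words out of 15 give a lexical diversity of 0.6.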
Bag of Words features
Our second feature set is Bag of Words, which represents the frequencies of different words in each chapter. This feature vector is commonly used for text classification. However, unlike text classification, we need to use topic-independent keywords (aka “function words”), since each author is writing on a variety of subjects. Our vocabulary will be the most common words across all chapters (e.g. words like ‘a’, ‘is’, ‘the’, etc.). The idea is that the authors use these common words in a distinctive, but consistent, manner.
In the following code, we use NLTK to find the most common words in the book, and scikit-learn to create the feature vectors for each chapter:

    # get most common words in the whole book
    NUM_TOP_WORDS = 10
    all_tokens = nltk.word_tokenize(all_text)
    fdist = nltk.FreqDist(all_tokens)
    # most_common returns (token, count) pairs sorted by descending count
    vocab = [token for token, _ in fdist.most_common(NUM_TOP_WORDS)]

    # use sklearn to create the bag of words feature vector for each chapter
    vectorizer = CountVectorizer(vocabulary=vocab, tokenizer=nltk.word_tokenize)
    fvs_bow = vectorizer.fit_transform(chapters).toarray().astype(np.float64)

    # normalise by dividing each row by its Euclidean norm
    fvs_bow /= np.c_[np.apply_along_axis(np.linalg.norm, 1, fvs_bow)]

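The same idea can be sketched with only the standard library, which may help when checking the vectors by hand. The helper names (`top_words`, `bow_vector`) and the two toy "chapters" are hypothetical, and the regex tokenizer is a stand-in for `nltk.word_tokenize`.

```python
import math
import re
from collections import Counter


def top_words(texts, n):
    """Return the n most frequent lowercase words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r'\w+', text.lower()))
    return [word for word, _ in counts.most_common(n)]


def bow_vector(text, vocab):
    """Count each vocabulary word in text, then scale to unit Euclidean norm."""
    counts = Counter(re.findall(r'\w+', text.lower()))
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return [v / norm for v in vec] if norm else vec


toy_chapters = [
    "the cat and the dog and the bird",
    "the sea is the sea and a boat is a boat",
]
toy_vocab = top_words(toy_chapters, 2)
toy_vectors = [bow_vector(ch, toy_vocab) for ch in toy_chapters]
print(toy_vocab, toy_vectors)
```

Here the vocabulary comes out as 'the' and 'and'. The first toy chapter uses them in a 3:2 ratio and the second in a 2:1 ratio, and that difference in proportions is what the clustering step gets to work with.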
For our final feature set, we extract syntactic features of the text. Part-of-speech (POS) tagging assigns each token to a lexical category (e.g. noun). NLTK has a function for POS tagging, and our feature vector consists of the relative frequencies of the most common POS tags:

    # get part of speech for each token in each chapter
    def token_to_pos(ch):
        tokens = nltk.word_tokenize(ch)
        # nltk.pos_tag returns (token, tag) pairs; keep only the tags
        return [tag for _, tag in nltk.pos_tag(tokens)]
    chapters_pos = [token_to_pos(ch) for ch in chapters]

    # count frequencies for common POS types
    pos_list = ['NN', 'NNP', 'DT', 'IN', 'JJ', 'NNS']
    fvs_syntax = np.array([[ch.count(pos) for pos in pos_list]
                           for ch in chapters_pos]).astype(np.float64)

    # normalise by dividing each row by number of tokens in the chapter
    fvs_syntax /= np.c_[np.array([len(ch) for ch in chapters_pos])]

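As a small illustration of the counting step, the sketch below works on a hand-tagged sentence in the (token, tag) shape that `nltk.pos_tag` returns, so it runs without the tagger itself. The `pos_frequencies` helper and the tags on the toy sentence are mine, not tagger output.

```python
from collections import Counter


def pos_frequencies(tagged_tokens, pos_list):
    """Relative frequency of each tag in pos_list among all tokens."""
    tags = [tag for _, tag in tagged_tokens]  # drop the words, keep the tags
    counts = Counter(tags)
    return [counts[pos] / len(tags) for pos in pos_list]


# Hand-tagged toy sentence (an assumption for illustration only).
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("dog", "NN"), (".", ".")]
freqs = pos_frequencies(tagged, ["NN", "DT", "IN", "JJ"])
print(freqs)
```

Note that the counts are taken over tags only. Counting a tag string directly in a list of (token, tag) pairs would always give zero.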
Our goal in the modeling stage is to find two groups, or “clusters”, in the feature space, with each group being the chapters written by an author. To find the clusters we use scikit-learn’s implementation of k-means with k=2:

    def PredictAuthors(fvs):
        # fit k-means with two clusters, one per presumed author
        km = KMeans(n_clusters=2, init='k-means++', n_init=10, verbose=0)
        km.fit(fvs)
        return km

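For readers who want to see what k-means does under the hood, here is a bare-bones sketch of Lloyd's algorithm for k=2 in plain Python. It is not scikit-learn's implementation: the `two_means` name is hypothetical, and the centres are seeded deterministically (first point, plus the point farthest from it) instead of with k-means++, purely so the toy run is reproducible.

```python
def two_means(points, max_iter=100):
    """Lloyd's algorithm for k=2 on a list of equal-length float tuples.

    Returns (labels, centres). Seeding is deterministic, not k-means++.
    """
    def dist2(p, q):
        # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean(group):
        # component-wise mean of a non-empty list of points
        return tuple(sum(col) / len(group) for col in zip(*group))

    centres = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centre
        new_labels = [0 if dist2(p, centres[0]) <= dist2(p, centres[1]) else 1
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # update step: move each centre to the mean of its members
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = mean(members)
    return labels, centres


toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (1.0, 1.1), (1.2, 0.9), (0.9, 1.0)]
toy_labels, toy_centres = two_means(toy_points)
print(toy_labels)
```

On well-separated toy data like this, the two groups are recovered in a couple of iterations. The chapter features are nowhere near this tidy.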
Results and Conclusions
I will make the assumption that Luis Pedro Coelho wrote chapter 10, as it is on computer vision (Luis is the author of a popular computer vision library called mahotas, which I use quite a bit in other projects). Using this fixed data point, we can assign a name to each cluster, and subsequently an author to each chapter. Here are the results for each feature set:
| Bag of Words | WR | LC | WR | LC | WR | WR | LC | LC | WR | LC | WR | LC |
(WR = Willi Richert, LC = Luis Pedro Coelho)
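Since k-means only returns anonymous cluster ids (0 or 1), the naming step relies on one chapter whose author is assumed to be known. A minimal sketch of that step follows; the `name_clusters` helper and the five-chapter label list are made up for illustration and are not results from the book.

```python
def name_clusters(labels, anchor_index, anchor_author, other_author):
    """Map raw cluster ids to author names using one chapter of known authorship."""
    anchor_label = labels[anchor_index]
    return [anchor_author if lab == anchor_label else other_author
            for lab in labels]


# Hypothetical raw cluster ids for a five-chapter toy book.
raw_labels = [1, 0, 1, 0, 0]
# Assume the author of the fourth chapter (index 3) is known to be "LC".
named = name_clusters(raw_labels, anchor_index=3,
                      anchor_author="LC", other_author="WR")
print(named)
```

With `km` returned by `PredictAuthors`, the raw ids would come from `km.labels_`, and the anchor for chapter 10 would be index 9 if the files start at chapter 1.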
After counting up the votes, two chapters are a tie (“Clustering – Finding Related Posts” and “Topic Modeling”). Here are the chapters with a majority win:
| Willi Richert | Luis Pedro Coelho |
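The vote counting itself is simple enough to sketch in a few lines. The `majority_vote` helper and the toy predictions below are hypothetical (three chapters, four feature sets); with an even number of voters, ties are possible and are reported as `None`.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions maps feature-set name -> list of author labels per chapter.

    Returns one winning label per chapter, or None where the top two tie.
    """
    winners = []
    for votes in zip(*predictions.values()):  # one tuple of votes per chapter
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            winners.append(None)  # tie between the two authors
        else:
            winners.append(ranked[0][0])
    return winners


# Made-up votes for illustration only.
toy_predictions = {
    "lexical":      ["WR", "LC", "WR"],
    "punctuation":  ["WR", "LC", "LC"],
    "bag_of_words": ["WR", "WR", "LC"],
    "syntactic":    ["LC", "LC", "WR"],
}
toy_winners = majority_vote(toy_predictions)
print(toy_winners)
```

In this toy run the first chapter goes to WR by three votes to one, the second to LC, and the third is a two-two tie.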
How confident am I in these results? Not very. Overall, the problem was much harder than I anticipated:
- Selecting good features for unsupervised learning is difficult: it is like feeling your way around in the dark. It is likely that several of the features I’ve used are not informative.
- The chapters are unlikely to be “pure”. The authors may have collaborated on some sections, read over and modified each other’s work, and the whole book was probably sterilized by copy editors. All of these add noise to the data.
- The results are not stable. For example, if I make a minor change to the code (e.g. change the normalization method), or even run k-means again (which has randomness in its initialization), the clusters change. This indicates that clusters are not well separated in the feature space. In fact, it is this instability that motivated me to use ensemble learning: as long as some of the models are performing better than chance, the hope is that the results of voting will be consistent.
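One way to put a number on the instability described above is to compare the labelings from two runs while ignoring which cluster happens to be called 0 or 1. The sketch below assumes exactly two clusters; the function name and the toy runs are hypothetical.

```python
def clustering_agreement(labels_a, labels_b):
    """Fraction of items on which two 2-cluster labelings agree,
    ignoring which cluster is called 0 and which is called 1."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    # With two clusters, swapping the ids turns agreement `same` into 1 - same.
    return max(same, 1.0 - same)


# Made-up labelings from three imaginary k-means runs.
run_1 = [0, 0, 1, 1, 0, 1]
run_2 = [1, 1, 0, 0, 1, 0]  # same partition as run_1, ids swapped
run_3 = [0, 1, 1, 1, 0, 0]  # two items moved to the other cluster

print(clustering_agreement(run_1, run_2))
print(clustering_agreement(run_1, run_3))
```

A score of 1.0 means the two runs found the same partition, while values near 0.5 mean they agree no better than chance.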
Code can be found here. Now I will try to contact the authors, so stay tuned!
I’ve heard back from Willi Richert, and all of the guesses were correct! He also had some good ideas on how to improve the classifier. I’ll give one hint: maybe some sections of a chapter are more distinctive than others? He gave me the answer for the chapters that tied, but I’ll keep them secret and pass the challenge on to you. Can you break the tie? (discuss at: http://twotoreal.com)