Linguistic Fingerprinting with Python | by Lee Vaughan | Aug, 2023

Attributing authorship with punctuation heatmaps

Single forensic fingerprint in yellow tones with blue semicolons (image by DALL-E2 and author)

Stylometry is the quantitative study of literary style through computational text analysis. It’s based on the idea that we all have a unique, consistent, and recognizable style in our writing. This includes our vocabulary, our use of punctuation, the average length of our words and sentences, and so on.
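To make these style markers concrete, here is a minimal sketch that measures a few of them using only the Python standard library. The function name `style_features`, the feature names, and the naive sentence splitter are illustrative assumptions, not part of the project described in this article.

```python
import re
from collections import Counter

def style_features(text):
    """Return a few simple stylometric measurements for a text sample.

    Hypothetical helper for illustration; the feature names are assumptions.
    """
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on terminal punctuation; good enough for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Count a handful of common punctuation marks.
    punct = Counter(ch for ch in text if ch in ",;:-!?")
    n_words = max(len(words), 1)          # guard against empty input
    n_sentences = max(len(sentences), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / n_sentences,
        "vocab_ratio": len(set(words)) / n_words,
        "punctuation": dict(punct),
    }

sample = "It was dark; it was cold. We waited, and we listened!"
features = style_features(sample)
print(features)
```

Comparing dictionaries like this across samples from different authors gives a rough first look at how their habits differ, although a real study would use much longer texts and a proper tokenizer.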

A common application of stylometry is authorship attribution. This is the process of identifying the author of a document, such as when investigating plagiarism or resolving disputes over the origin of a historical document.

In this Quick Success Data Science project, we’ll use Python, seaborn, and the Natural Language Toolkit (NLTK) to see if Sir Arthur Conan Doyle left behind a linguistic fingerprint in his novel, The Lost World. More specifically, we’ll use semicolons to determine whether Sir Arthur or his contemporary, H.G. Wells, is the likely author of the book.
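As a rough illustration of how semicolon use might be turned into something plottable, the sketch below encodes each punctuation mark in a text as a number and arranges the codes in a square grid. The function name `punctuation_grid`, the set of marks, and the 1/2 coding are assumptions made for this example; the implementation used in the project may differ.

```python
import math

# Marks to track; this particular set is an assumption for the sketch.
PUNCT = set(".,;:!?\"'()-")

def punctuation_grid(text):
    """Encode punctuation as a square grid: 1 for a semicolon, 2 for any other mark."""
    codes = [1 if ch == ";" else 2 for ch in text if ch in PUNCT]
    side = math.isqrt(len(codes))   # side of the largest square that fits
    codes = codes[: side * side]    # drop leftover marks that don't fill a row
    return [codes[r * side:(r + 1) * side] for r in range(side)]

grid = punctuation_grid("Yes; no, maybe. Stop; go! Why? Wait; here, there.")
for row in grid:
    print(row)
```

A grid like this could then be passed to seaborn’s `heatmap` function with a two-color palette, so that semicolons stand out against all other punctuation and the overall pattern can be compared between books.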

Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes stories. H.G. Wells (1866–1946) is famous for several groundbreaking science fiction novels, such as The Invisible Man.

In 1912, Strand Magazine published The Lost World, a serialized version of a science fiction novel. Although its author is known, let’s pretend it’s in dispute and it’s our job to solve the mystery. Experts have narrowed the field down to two authors: Doyle and Wells. Wells is slightly favored because The Lost World is a work of science fiction and includes troglodytes similar to the Morlocks in his 1895 book, The Time Machine.

To solve this problem, we’ll need representative works for each author. For Doyle, we’ll use The Hound of the Baskervilles, published in 1901. For Wells, we’ll use The War of the Worlds, published in 1898.

Fortunately for us, all three novels are in the public domain and available through Project Gutenberg. For convenience, I’ve downloaded them to this Gist and stripped out the licensing information.

Authorship attribution requires the application of Natural Language Processing (NLP). NLP is a…
