Thursday, February 28, 2013

Canonic authors and the pronouns that they used

My last post gave aggregate statistics about which parts of the library have relatively more female characters. But in some ways it's more interesting to think about the ratio of male and female pronouns in terms of authors we already know. So I thought I'd look at the ratios of gendered pronouns in the most-collected authors of the late nineteenth and early twentieth centuries, to see what comes out.
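To make the measurement concrete, here is a minimal sketch in Python of the kind of counting involved. The pronoun lists, the author-to-file mapping, and the file names are all illustrative placeholders, not the actual corpus or pipeline behind these numbers.

```python
import re
from collections import Counter

# Illustrative pronoun sets; the real analysis may use a different list.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_counts(path):
    """Return (male_count, female_count) for one plain-text file."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())
    counts = Counter(tokens)
    return (sum(counts[w] for w in MALE),
            sum(counts[w] for w in FEMALE))

# Hypothetical mapping of authors to the texts the library holds for them.
corpus = {
    "Washington Irving": ["irving_sketchbook.txt"],
    "Jane Austen": ["austen_emma.txt", "austen_persuasion.txt"],
}

for author, files in corpus.items():
    male = female = 0
    for path in files:
        m, f = pronoun_counts(path)
        male, female = male + m, female + f
    total = male + female
    share = female / total if total else float("nan")
    print(f"{author}: {share:.1%} of gendered pronouns are female")
```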

On the one hand, I don't want to claim too much for this: anyone can go to a library and see that Washington Irving doesn't write female characters. But as one of many possible exercises in reducing the size of the library in order to rethink broad aspects of the literary canon, c. 1910, I do think it's suggestive; and, as I'll argue towards the end, knowing these practical details can help us explore the instability of 'subject' or 'genre' as expressed by the librarians who choose where to put these books on the shelves.

Monday, February 25, 2013

Genders and Genres: tracking pronouns

Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn't follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it with Ryan Cordell's class at Northeastern last week, I wanted to think a little more about the representation of male and female subjects in late-19th-century texts. Further spurs: Matt Jockers recently posted the pronoun usage in his corpus of novels, and Jeana Jorgensen pointed to recent research by Kathleen Ragan suggesting that editorial and teller effects have a massive influence on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.

Thursday, February 14, 2013

Anachronism patterns suggest there's nothing special about words

I'm cross-posting here a piece from my language anachronisms blog, Prochronisms.

It won't appear on the language blog for a week or two, to keep the posting schedule there more regular. But I wanted to put it here now, because it ties directly into the conversation in my last post about whether words are the atomic units of languages. The presumption of some physics-inflected linguistics research is that they are; I was putting forward the claim that the real units are Ngrams of any length. This question is closely tied to the definition of what a 'word' is (although, as I said in the comments, I think statistical regularities tend to happen at levels that no one would ever call a 'word,' however broad a definition they take).

The piece from Prochronisms is about whether writers have a harder time avoiding anachronisms when they appear as parts of multi-word phrases. Anachronisms are a great test case for observing what writers know about language. Writers are trying to talk as if they're from the past; but--and this is the fundamental point I've been making over there--it's all but impossible for them to succeed. So by looking at their failures, we can see at what level writers "know" language. If there's something special about words, we might expect writers to know more about words than about phrases. But if--as these preliminary data seem to indicate--users of language don't have any special knowledge of individual words, that calls into question cognitive accounts of changes in language, like the one the physicists offered, that rely on some fixed 'vocabulary' limit enumerated in unigrams.
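As a rough sketch of the comparison at issue--not the Prochronisms pipeline itself--here is how one might flag anachronisms at both the word and the phrase level against a table of first-attestation dates. The dates and the example line are placeholders for illustration; a real test would draw on Google Ngrams frequencies.

```python
# Placeholder attestation dates, for illustration only.
FIRST_ATTESTED = {
    "parole": 1700,
    "parole officer": 1930,
    "contact": 1600,
    "contact us": 1920,
}

def anachronisms(text, setting_year, max_n=2):
    """Yield ngrams (up to max_n words) first attested after setting_year."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            first = FIRST_ATTESTED.get(gram)
            if first is not None and first > setting_year:
                yield gram, first

line = "talk to your parole officer before you contact us"
for gram, year in anachronisms(line, setting_year=1890):
    print(f"'{gram}' looks anachronistic: first attested around {year}")
```

In this toy case neither unigram is flagged, but both bigrams are: the anachronism lives at the phrase level, which is exactly the kind of pattern the post is looking for.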

Anyhow, here's the Prochronisms post:

Wednesday, February 6, 2013

Are words the atomic unit of a dynamic system?

My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.

I've been thinking about a recent article by Alexander M. Petersen et al., "Languages cool as they expand: Allometric scaling and the decreasing need for new words." The paper uses Ngrams data to establish the dynamics of the entry of new words into natural languages. Mark Liberman argues that the bulk of change in the Ngrams corpus involves things like proper names and alphanumeric strings, rather than actual vocabulary change, which keeps the paper from being more than 'thought-provoking.' Liberman's fundamental objection is that although the authors say they are talking about 'words,' it would be better for them to describe their findings in terms of 'tokens.' Words seem good and basic, but they dissolve on close inspection into a forest of inflected forms, capitals, and OCR misreadings. So it's hard to know whether the conclusions really apply to 'words,' even if they do to 'tokens.'
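Liberman's point is easy to see in miniature: counting distinct surface tokens and counting distinct 'words' give different vocabulary sizes as soon as capitals, inflections, and OCR noise enter. The toy text and the crude normalization below are mine, not the paper's method.

```python
# Toy text with a capitalized variant, an inflected form, a proper name,
# and an OCR misreading ('rnat' for 'mat').
text = "The cat sat. Cats sat on the mat, said Mr. Petersen; the rnat was warm."

tokens = text.replace(".", " ").replace(",", " ").replace(";", " ").split()

# 'Token' vocabulary: every distinct surface string counts as a new type.
token_types = set(tokens)

# Crude 'word' vocabulary: lowercase and strip a plural -s.
def normalize(tok):
    tok = tok.lower()
    return tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok

word_types = {normalize(t) for t in tokens}

print(len(token_types), "token types vs.", len(word_types), "word-ish types")
# The OCR misreading 'rnat' counts as a new type under both schemes --
# growth that looks like vocabulary change but isn't.
```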