Psycholinguistic Databases & Miscellaneous Resources

I update this list in my spare time based on random PubMed alerts. It’s not the most systematic way of populating this kind of list, but I’ve got a kid in high school. Times are hard! Please email me with suggestions for databases, corpora, and stimulus sets I have missed. Several other language/psycholing labs maintain their own metabases (metabase: is that a word? It is now). See the lab pages for Marc Brysbaert and Dan Mirman. For some hardcore linguistic databases, visit Words of the World and Experimental Linguistics in the Field.


Affective and Social Cognition Norms

affective ratings for ~14k English words

Norms of valence, arousal, and dominance for 13,915 English lemmas from Warriner et al (2013).

affectvec

Vector-based norms for over 70,000 English lemmas across over 200 fine-grained affective dimensions, really down to the nittiest of the nitty gritty. These are interesting because they appear to have been generated using an embedding approach but are normalized to something that looks like a -1 to 1 Pearson correlation for each particular dimension, making them more feature-based and useful for single words.

bilingual valence and arousal ratings for L1-L2

This very cool database lists valence and arousal ratings for bilingual adults evaluating English words (i.e., English as L2). Click here to see the paper by Imbault and colleagues (2020), How are words felt in a second language: Norms for 2,628 English words for valence and arousal by L2 speakers, Bilingualism: Language and Cognition.

croatian >3k words on 5 emotions (Crowd-5e)

Thanks to Coso and colleagues for producing this database of affective norms for a chunky set of Croatian words. It’s pretty fascinating to think of how much cross-linguistic variability there is in lexical affect. With the Crowd-5e database, it should be possible to contrast English translation equivalents to get at this question. That’s a great idea for a master’s thesis (pssst… to you linguists out there).

grievance dictionary: language use in the context of grievance-fueled violence threat

This is a cool resource examining how language differs in the context of threat. van der Vegt and colleagues (2021) report norms for word usage commonly used in automated identification of threat in social media posts, etc. Click on the link above to access the data via the OSF.

NRC Word-Emotion Association Lexicon (emo-lex)

Emo-lex is a terrific set of crowdsourced word norms for many English words characterized across eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). This work was spearheaded by computational linguist, Dr. Saif Mohammad, at the National Research Council Canada.

social norms for 8388 english words

Diveica and colleagues have collected a terrific set of norms along with an inclusive definition of socialness (no easy feat!). Link to the data above on the OSF. To view the preprint, see Diveica, Pexman, & Binney (2021) Quantifying Social Semantics: an Inclusive Definition of Socialness and Ratings for 8,388 English Words. PsyArXiv.

Valence, Arousal, Dominance Lexicon (NRC-VAD)

Check out this terrific resource from Saif Mohammad: 20,000 English words rated on valence, arousal, and dominance using Best-Worst scaling. Plus, these words have translations to over 100 languages. Wow! The empirical paper describing the methodology is: Mohammad, S. (2018). Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.
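
If you are wondering how Best-Worst scaling turns forced choices into a number, here is a minimal Python sketch of the simple counting approach: the proportion of times an item was picked as "best" minus the proportion of times it was picked as "worst". The function name and the toy trials are made up for illustration, and this is not claimed to be the exact NRC-VAD pipeline.

```python
from collections import Counter


def best_worst_scores(trials):
    """Score each item as (times chosen best - times chosen worst) / times shown.

    `trials` is a list of (items_shown, best, worst) tuples.
    Scores fall between -1 (always worst) and +1 (always best).
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in trials:
        shown.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical valence judgments: raters see four words and pick the most
# positive and the most negative one. These data are invented.
toy_trials = [
    (("joy", "table", "funeral", "picnic"), "joy", "funeral"),
    (("joy", "table", "murder", "picnic"), "picnic", "murder"),
    (("joy", "funeral", "murder", "table"), "joy", "murder"),
]
scores = best_worst_scores(toy_trials)
```

The appeal of this design is that raters only make relative choices, so there is no rating scale for different people to use differently.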

valence norms COVID19 effects of age & pandemic

Very cool work showing resilience among older US and UK adults to the effects of pandemic on ratings of positivity for thousands of English words. Link to the data above. Read the paper from Kyröläinen AJ, Luke J, Libben G, Kuperman V. Valence norms for 3,600 English words collected during the COVID-19 pandemic: Effects of age and the pandemic. Behav Res Methods. 2021 Dec 16:1-12. doi: 10.3758/s13428-021-01740-0


Age of Acquisition & Early Childhood Language

AoA English words for Spanish L2 speakers

Thanks very much to my new Twitter friend, Dr. Carlos Romero-Rivas, for pointing out this very interesting database and article on subjective AoA ratings for English words by Spanish-English bilinguals. Come for the data but read the article! It’s pretty damn fascinating.

aoa ratings >50k english words

The Kuperman et al. norms represent an expanded list from Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 1-13. To access the norms visit here.

children’s picture book lexical database

We can’t just apply lexical norms (e.g., frequency, imageability) gleaned from adults to understand the linguistic world of children. That’s what’s so awesome about this work by Green and colleagues (2023 - click link above). The authors report norms for >25k words, including bigrams and multiword utterances. AWESOME stuff for you developmentalists out there. CLICK HERE for the data.


American Sign Language

asl-lex

This visually stunning and beautifully crafted database is from Professors Naomi Caselli, Zed Sevcikova Sehyr, Ariel Cohen-Goldberg, & Karen Emmorey. ASL-LEX provides lexical and phonological properties for about 1,000 signs of American Sign Language, including iconicity, frequency, and many other variables.

asl signbank

Wow! Here’s a huge dictionary of ASL signs linked with ID glosses. Signbank is linkable to ELAN and is part of the SLAAASh (“Sign Language Acquisition, Annotation, Archiving and Sharing”) project (link here to read about it) through UConn and Gallaudet Universities.


Arabic

LexArabic: Receptive vocabulary test for estimating Arabic proficiency

Hats off to Dr. Alaa Alzahrani for creating this open-source (yes!) resource for assessing L2 Arabic proficiency. Click above to link to the article in Behavior Research Methods. Super useful!


Bilingualism & Multilingualism

iris digital repository of materials for research into second languages

We should all be conducting language research with an eye toward multilingualism and cross-language generalization. Here is a set of resources for people on the forefront of these efforts. Thanks to Cylcia Bolibaugh for linking me to this resource. Read one of the origin papers here. Click the link above to view the IRIS portal.

multilingual eye-movement corpus (MECO)

Wow oh wow. Led by Victor Kuperman and Noam Siegelman, this data repository reflects eye movement data from reading studies in native readers of 13 languages. Access the data by linking above.


Blackfoot

Blackfoot Words: a database of Blackfoot lexical forms

This awesome work from Weber et al (2023) includes (in the authors’ own words)…”structure and creation of Blackfoot Words, a new relational database of lexical forms (inflected words, stems, and morphemes) in Blackfoot (Algonquian; ISO 639-3: bla). To date, we have digitized 63,493 individual lexical forms from 30 sources, representing all four major dialects, and spanning the years 1743-2017. Version 1.1 of the database includes lexical forms from nine of these sources.”


Chinese

Norms for 1286 colored pictures in Cantonese

Zhong et al. (2024) report this picture norming study using stimuli from Multipic and the 2005 classic McRae norms applied to Cantonese. Link to the OSF repository to grab the norms directly!

chinese: ANCW: Affective norms for 4030 Chinese words

Link above to read the article by Ying and colleagues (2023, in press) in BRM. These words have arousal, concreteness, valence, and dominance norms. Link to the supplemental material from the article to access the stimuli (spreadsheet form).

chinese valence & arousal ratings >11k words

Read the article by Xu and colleagues (2021), and access the word ratings (valence and arousal) by clicking on the link above.

chinese lexicon project II: >25k lex decision, naming traditional Chinese two-character words

This terrific resource by Tse et al (2022) is one of a number of recent works that are allowing researchers to make great strides in understanding lexical access and word recognition in languages other than English. Awesome work!

chinese and english six semantic dimension database: a large database of semantic ratings and its computational extension

This awesome resource from Dr. Shaonan Wang at NYU and Dr. Nan Lin at the Institute of Psychology, Chinese Academy of Sciences, includes ratings of 17,940 commonly used Chinese words (and phrases) and a computational extension consisting of semantic ratings for 1,427,992 Chinese words and 1,515,633 English words on six major semantic dimensions: vision, motor, socialness, emotion, time, and space.

chinese imageability ratings for >10k 2-character words

This work by Su and colleagues (2022) reflects imageability ratings (i.e., rate the extent to which this word conjures a mental image) for over 10,000 words and examines the relationship between imageability and other lexical variables. Click here for just the data!

chinese verb semantic features

Verbs are really difficult to characterize. Verbs are less imageable than nouns and have argument structures. Verb path/manner distinctions differ across natural languages, but much of what we know about verbs has been informed by English. Deng and colleagues have produced a valuable resource for analyzing semantic features of verbs in Chinese. Click above to visit their paper — data are here.


Concreteness, Imageability, Sensorimotor norms

abstract conceptual feature ratings English nouns

These reflect MTurk ratings (N>350 people) on 15 different cognitive dimensions for 750 abstract and concrete English nouns, as described in our work on the semantic space approach in Frontiers in Human Neuroscience (see Troche et al., 2014; Crutch et al., 2013).

concreteness ratings for 40k English Lemmas

Here's a mammoth set of word concreteness ratings from the great Professor Marc Brysbaert and colleagues.  To retrieve these word concreteness norms, click here.

concreteness ratings for 62k English multiword expressions

They’re at it again! Click on the link above for the preprint of Muraki et al (2022) Concreteness ratings for 62 thousand English multiword expressions. If you just want to get your grubby hands on the data, link here.

lancaster sensorimotor norms

Here’s a giant set of effector-specific norms for many English words from Lynott and colleagues (2019). These people are the absolute shit.


Corpora: Language Samples

candor corpus >1 TB of multimodal corpus of human speech

You want over 850 hours of conversations transcribed and segmented down to the millisecond? Well… say no more. Thanks to Reece and colleagues for making this wonderful resource publicly available in the very best spirit of science. Click to link to the preprint above.

corpus of contemporary american english (coca)

The Corpus of Contemporary American English (COCA) is a very large corpus of English for you text miners and NLP folk. There’s a fee, but it’s not too exorbitant.

concretext norms: concreteness ratings for English and Italian words in context

Some pretty cool norms here reflecting concreteness ratings (how strongly can you perceive this word through the senses?) for English and Italian words in sentence contexts. Click on the link above for the PLoS One article from Montefinese and colleagues. Click here to access the data directly.

global vectors for word representation (glove)

From Pennington et al. (2014) out of the Stanford NLP group: “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.”

88-million-word language of conspiracy corpus (loco)

I’m almost afraid to look at this, but it looks like an amazing corpus for mining the weird language of conspiracy theorists. Be sure to check out the paper from Miani et al (2021) in Behavior Research Methods.

spotify spoken language podcast corpus (spotify corpus)

47,000 hours of podcast transcriptions are downloadable here. Wow!

talkbank

AphasiaBank, DementiaBank, BilingBank: this terrific resource by Professor Brian MacWhinney and his many colleagues and collaborators is one of the best resources around for analyzing natural language samples (stories, dyads, etc.) in numerous clinical and non-clinical populations. Thanks to Professor MacWhinney for all his hard work on collecting these language samples and building the complex data structures to make them public.


Databases General (including aggregation tools)

glasgow psycholinguistic norms (imageability, valence, etc.)

Normative ratings for 5,553 English words on nine psycholinguistic dimensions: arousal, valence, dominance, concreteness, imageability, familiarity, age of acquisition, semantic size, and gender association reported by Scott et al 2018 in Behavior Research Methods. 

LexiCAL: A calculator for lexical variables

Chee and colleagues (2021) published the nuts and bolts of this tool for computing numerous properties of any corpus you feed it. Check out their article in PLoS One, which includes Python scripts for deriving the norms. I wish I knew how to program in Python a bit better than I do now. I’m stuck in the tidyverse.

LexOPS: R-Package & User Interface for the Controlled Generation of Word Stimuli

One database to rule them all! Where was this when I was trying to match stimuli for my doctoral dissertation? A Shiny app interface, too? Get out of town! Link to the PsyArXiv preprint here.

mrc psycholinguistic database

Here's the queen mother of psycholinguistic databases from the MRC/CBU (Cambridge).  Many of the word frequency and concreteness measures are too dated at this point, but the filtering features, concreteness, familiarity, etc. make this wonderful resource tough to beat. 

scope: the south carolina metabase

Per Gao and colleagues… “The South CarOlina Psycholinguistic MEtabase (SCOPE) is a curated collection of psycholinguistic properties of words from major databases. It currently contains more than 200 variables and over 79,000 words and nonwords”. Read the preprint while it’s hot! Anything from Rutvik Desai’s group is off the hook. This metabase approach is the way to go….

word.norms database aggregation

Professor Erin Buchanan’s terrific resource from the Doom Lab pooling databases for word associations, frequency, etc. Link to the word norm database to specify ranges and generate your own output.


Data Visualization & Graphic Design Resources

brain illustrating: shading the freesurfer brain

I often find myself making illustrations of brains and highlighting particular regions for talks. Here's a document on how to do this in Photoshop using the Freesurfer brain rendering as a base.

dataviz

Technically this isn’t a psycholinguistic database or a stimuli bank but whatta resource! It’s a beautiful website organized like a decision tree for different plot and figure options along with links to R code and galleries. Thanks to software engineer Yan Holtz and designer Conor Healy for this beautiful and useful resource.

google n-gram viewer

Plot the frequency of any word or combination of words (n-grams) across many texts from 1800 to the present using Google’s interactive ngram viewer. Visit the ngram site here.

data visualization: plots, plots, plots

A gallery of plots generated in R with associated code from me, good old Jamie Reilly. Don’t mock.

ten simple rules for designing graphical abstracts

This 2024 article by Jambor and Bornhäuser in PLoS Computational Biology lays out some terrific guidelines for how to produce an effective graphical abstract for distilling your complex mechanisms or processing pipelines into digestible chunks. I love it!


Dictionaries and Related Lexical Resources

hunspell

*Note to self add link, dummy

urban dictionary

Today’s featured entry is “back burner bitch” or Triple B. It’s a friend who is your last resort for hanging out with (but doesn’t know it). Urban dictionary has zillions of these entries. We’ve used the urban dictionary extensively in our work on taboo word usage.


Dutch

arousal, valence, happiness, anger, fear etc. for 24k Dutch words

Is there anything that Dr. Laura Speed is not capable of accomplishing? Here she goes again with the inevitable Marc Brysbaert on this terrific set of Dutch word norms. Read the article or skip right to the data; click here to visit their OSF site.

bank of standardized stimuli (boss): dutch names for 1400 photographs

The title pretty much says it all. Visit this work by Decuyper et al. (2021) in the Journal of Cognition (click link above). To get your grubby hands on the data directly, click here.

semantic gender norms for 24k dutch words

Semantic gender is a pretty mindblowing phenomenon. I remember taking German in high school and wondering why Tisch (table) is der Tisch (masculine). Semantic gender isn’t really about that but more about priming: when you hear ‘der’ you are primed for only a subset of the lexicon, relative to hearing ‘the’, which could be followed by just about anything. Read what Vankrunkelsven and colleagues have to say about how semantic gender facilitates our processing of word meaning. Oh yeah, norms too!


Embodied Cognition and Related Phenomena

calgary semantic decision project and embodied cognition ratings

Link above for category decision norms (concrete or abstract) and embodiment ratings from my Canadian idol, Dr. Penny Pexman and her co-authors Allison Heard, Ellen Lloyd, and Melvin Yap. Great people. Great data.


Estonian

Concreteness ratings for 36k Estonian words

Congratulations to Proos and Aigro (2023) on their article in Behavior Research Methods (link above) reporting concreteness values for almost 36k Estonian words as derived from over 2k Estonian native speakers. The authors also contrasted these human-generated norms with a larger set of concreteness values generated by machine learning, as reported by Aedmaa et al (2018). Humans and machines, although strongly positively correlated in their ratings (R=.70), diverged in some pretty substantial ways. CHECK IT OUT!


Finnish

LASTU: A psycholinguistic search tool for Finnish lexical stimuli

There’s a special place in my heart for Finland. I’ve visited the country five times and have some very dear friends there. It’s a beautiful place, and Finnish is an astounding language. Thanks to Itkonen and colleagues (2024) for producing this lexical database of Finnish. Link to the paper above or, to view the database directly, click here. An added bonus is that the senior author is my friend, Minna Lehtonen.


French

Conceptual familiarity for 4k French nouns

Oui! Chedid et al (2019) report norms for conceptual familiarity of (wait for it) 4000 French nouns. This is some interesting work. There is a lot of controversy around disentangling lexical concepts from concepts. It’s nice to see more psycholinguistic norms from French coming down the pike.

morpholex-FR derivational morphology for almost 39k French words

Who doesn’t love French morphology? Actually, I am embarrassed to say that I don’t know anything about French morphology, but my ignorance shouldn’t stop you from caring about French morphology. Thanks to Mailhot and colleagues (2020) for publishing this very cool paper and associated set of norms. Click above to link to their article in Behavior Research Methods.


Greek

GreekLex: a lexical database of Modern Greek

I have a confession. When I first saw this title I thought it said ‘geeklex’, which would not have been even half as impressive as the actual work by Ktori and colleagues (2008) reporting lexical information for over 35k modern Greek words.


Hindi

shabd: a psycholinguistic database for Hindi

Verma and colleagues (2022) provide this resource based on a 1.2 billion word query of Hindi. Data include frequency counts and part-of-speech tags for a subset of words. Click on the link to access the article! Thanks to these authors for this terrific resource.


Iconicity

iconicity norms for >14k English words

This work led by the great Bodo Winter reports iconicity ratings for lots and lots of English words. This is a TERRIFIC resource for anyone interested in iconicity, sound symbolism, and related phenomena. Click here to access the data directly.


Italian

italian sensorimotor norms: perception & action strength >900 words

These norms by Repetto and colleagues (2022) BRM reflect perceptual and motor salience for over 900 words, adding to a growing list of modality norms. Click here to visit the data or the link in the title for the article description.


Megastudies (lex decision, naming)

chinese lexicon project II: >25k lex decision, naming traditional Chinese two-character words

This terrific resource by Tse et al (2022) is one of a number of recent works that are allowing researchers to make great strides in understanding lexical access and word recognition in languages other than English. Awesome work!

english lexicon project (lexical decision and speeded naming)

Here's another bread-and-butter psycholinguistic database from Professor David Balota at Washington University in Saint Louis. This monster has trial level naming and lexical decision data for zillions of English words.


Miscellaneous Resources (aka the island of lost toys)

context availability norms for 3k English words and their association with lexical processing

I love the old research by Schwanenflugel on context availability as an alternative to concreteness effects in lexical processing (e.g., fork conjures the context of a kitchen schema). Taylor and colleagues (J Cognition 2022) present norms for context availability for >3k English words. Click here to get your hands on the data!

general knowledge norms

Did you know that Tasmanian Devils are biofluorescent? This wasn’t one of the general world knowledge questions Coane & Umanath (2021) assessed in their norms, but I like to think that one day this little factoid will become general knowledge.

idiom norms for english and german

Link to the English-German Database of Idiom Norms (DIN). These include a set of 300 idioms and associated norms collected by Sara D. Beck & Andrea Weber (2016) at the University of Tübingen.

oddity detection in real world scenes (ODDS database)

Click on the link above to access the OSF site and database of real world scenes by Hout and colleagues (2022). This is like a YUGE pile of real world Where’s Waldo photos where they don’t tell you what Waldo looks like. I love it!

prevalence ratings >60k words

Brysbaert and colleagues (2019) reported prevalence estimates for 61,800 words. ‘Prevalence’ is the relative proportion of people who know a particular word normalized using a probit transformation (see Marc Brysbaert’s webpage for a simple explanation). Prevalence can provide complementary information to word frequency and familiarity.
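
To make the probit idea concrete, here is a minimal Python sketch that maps the proportion of people who know a word onto a z-score scale using the inverse of the standard normal cumulative distribution. The function name and the clamping value are my own assumptions for illustration; the published prevalence estimates involve more modeling than this one step.

```python
from statistics import NormalDist


def prevalence_probit(proportion_known, eps=0.005):
    """Convert a proportion (0 to 1) into a probit (z-score) value.

    `eps` is an illustrative clamp so that proportions of exactly 0 or 1
    do not map to minus or plus infinity; it is not taken from the paper.
    """
    p = min(max(proportion_known, eps), 1 - eps)
    return NormalDist().inv_cdf(p)


# A word known by half of the raters sits at 0; a word known by 97.5%
# of raters sits close to +1.96.
half_known = prevalence_probit(0.50)
widely_known = prevalence_probit(0.975)
```

The transformation stretches out the crowded top of the scale, so the difference between words known by 95% and 99% of people is not squashed into a few percentage points.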


Morphology and Compounding

ladec: large database of english compounds

If you’re out taking your bulldog for a walk and want a catfish sandwich, look no further than this database of >8000 English compound words from Gagné, CL., Spalding, TL., & Schmidtke, D. (2019). LADEC: Large database of English compounds. Behavior Research Methods. Link to the data here. Link above to the article in BRM.

morpholex english

English morphology for 70k-ish words as reported by Sánchez-Gutiérrez et al (2017).


Narrative Stimuli, Stories

aesop’s fables

Check out this paper by Ward et al (2015) examining the effects of age, acoustic challenge, and verbal working memory on recall of narrative speech. Audio files matched on all sorts of shit, and Aesop too. What’s not to like?

fMRI open narratives

Holy moly! The best of open science is upon us. Here’s a massive set of functional imaging data on naturalistic speech comprehension in the scanner. Thanks to Professor Uri Hasson for maintaining this resource. To read the paper and related documentation, see Nastase et al (2019), OpenNeuro, ds002345. https://doi.org/10.18112/openneuro.ds002345.v1.1.3

stories! NYU-BU contextually controlled stories corpus

These spoken narratives reflect meticulous experimental control as reported by Lewis and colleagues. The stimuli consist of, “16 high-quality recordings of 8 unique stories, spoken both by a female and a male actor. Each story consists of 128 sentences (~2000 words per story) organized around critical keywords, which have been matched along multiple linguistic dimensions”


Nonword & Pseudoword Stimuli

ARC nonword Database

Working on a lexical decision task and need some weird nonword foils?   Hmmm... but what if I need to specify some weird orthotactic constraints on my nonword stimuli?  Never fear. The bulk of the work has been done for you with the ARC nonword database.

klingon pocket dictionary

I am embarrassed to admit this, but I hit the Klingon dictionary for nonwords for one of the lexical decision experiments in my doctoral dissertation. So what! Just live with it.

wuggy multilingual pseudoword generator

This very cool database generates pseudowords aligned with the phonotactic rules of a specific target language as outlined in the paper from Keuleers, E., & Brysbaert, M. (2010).


Picture Stimuli & Naming Norms

age positive image gallery

There are so many things to like about this gallery of free images depicting a wide variety of older adults in a positive light! Click on the link above to access the photo gallery (no associated article or norming data).

flaticon

>10.5 million free icons if you need an icon. It’s not a scientific source, but don’t be such a snob. Did you know that Dr. Amy Vogel Eyny’s sister adds fake reviews to her Rate My Professor page? For real!

international picture naming database (IPNP)

Here's a great set of pictures of actions (n=275) and objects (n=520) normed across multiple languages from the International Picture Naming Project out of UCSD.

multi-language written picture naming dataset

This triumph by Torrance and colleagues (2017) involves trial level reaction time and naming agreement norms for the 260 colorized Snodgrass and Vanderwart pictures by Rossion and Pourtois (2004) for over 1200 participants across the following languages: (Bulgarian, Dutch, English, Finnish, French, German, Greek, Icelandic, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish). WOW!!!!

multipic

The Multilingual Picture (MultiPic) databank is the result of an international collaborative project intended to provide the scientific community with a set of publicly available 750 drawings from common concrete concepts created by the same author, standardized for name agreement and visual complexity in several languages. See Duñabeitia, J.A., Crepaldi, D., Meyer, A.S., New, B., Pliatsikas, C., Smolka, E., & Brysbaert, M. (in press). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology.

noun project

For one million icons and symbols, link here. There’s a subscription fee for use, but it won’t break your bank unless you live in Silicon Valley (burn).

pisces pictures with social context and emotional scenes

Click on the link above for some great scene stimuli by Teh and colleagues (2018) normed for emotional valence, intensity, and social engagement. Someone must have been visiting my family at Thanksgiving to get this dark...

proper noun and place name norms in younger and older adults

This work by Souza et al (2022) examined naming in younger and older adults for famous places and people. These norms are culturally adapted and include all the usual suspects in terms of familiarity, etc.

scidraw

A database of free pictures for scientific presentations and whatever else you like. These aren’t normed or anything, but they might be useful for someone, and God if they aren’t cute. Check out the rats boxing c/o Antonis Asiminas

things database 1854 object concepts, 26k natural images

Click on the link above to access the PLoS One article from Hebart and colleagues (2019). If you can’t wait to get your grubby hands on the stimuli, then click here to visit the OSF site.

sun scene database (mit)

Need a picture of a beach or a kitchen -- or 15,000 other naturalistic scenes?  Here's the database for you.


Norwegian

Norwegian words: A lexical database for clinicians and researchers

Welcome to our Norwegian friends! This resource from Lind and colleagues (2015) appeared in the journal Clinical Linguistics & Phonetics. The database contains extensive phonological and morphological coding for 1600 or so Norwegian words.


R Packages for Language Science (links to github pages)

ConversationAlign

Reads dyadic transcripts, yokes numeric values to each word, and computes indices of alignment across pairs of interlocutors. Install the package from github using devtools.

curser

Generates novel combinations of curse words and common nouns using algorithms described in Reilly J, Kelly A, Zuckerman B, *Twigg P, *Wells M, *Jobson K, & *Flurie M (2020) Building the perfect curse word: A psycholinguistic investigation of the form and meaning of taboo words. Psychonomic Bulletin & Review. 27(1).

semdistflow

Computes running bigram semantic distances for every adjacent pair of words in any length text you feed the program. Uses algorithms described in Reilly J, *Finley AM, Litovsky C, & Kenett Y (2023) Bigram semantic distance in continuous language narratives: Theory, method, applications. Journal of Experimental Psychology: General. 152(9), 2578-2590.
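
For the Python folk, here is a rough sketch of what a running bigram semantic distance looks like: step through a text one word at a time and take the cosine distance between the vectors of each adjacent word pair. The three-dimensional toy vectors, the function names, and the choice to skip words without a vector are all assumptions made for illustration. This is not the semdistflow code or its embedding model.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def running_bigram_distances(words, vectors):
    """Return (word1, word2, distance) for each adjacent pair of words.

    Pairs are skipped when either word has no vector (an illustrative
    choice, not necessarily what any published tool does).
    """
    results = []
    for w1, w2 in zip(words, words[1:]):
        if w1 in vectors and w2 in vectors:
            results.append((w1, w2, cosine_distance(vectors[w1], vectors[w2])))
    return results


# Invented toy embeddings; real work would use trained vectors such as GloVe.
toy_vectors = {
    "dog": (0.9, 0.1, 0.0),
    "cat": (0.8, 0.2, 0.0),
    "banana": (0.0, 0.1, 0.9),
}
pairs = running_bigram_distances(["dog", "cat", "banana"], toy_vectors)
```

In this toy example the jump from "cat" to "banana" is much larger than the jump from "dog" to "cat", which is the kind of semantic leap the running measure is meant to pick up.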

usapresidentialdebates

Two-party presidential debate transcripts from 1960 to 2020 with metadata on the candidates (e.g., party, party winner, age) and economic indicators (e.g., GDP). Package is optimized for use with its companion R package, ConversationAlign.


Reading, Spelling, Orthography

false fonts: a compendium of fonts

Here’s a really useful set of novel constructed orthographies from the FontStruct website. We are planning on using one of these fonts as a lower level visual baseline for English orthography (similar luminance and complexity, minus the semantics) for an EEG study we are soon launching.

orthographic consistency norms for 37k English words

With all the Yachts, Colonels, and Wednesdays in the English language, it’s a wonder anyone ever learns to read. Chee and colleagues (2020) report norms for feedforward (spelling-to-sound) and feedback (sound-to-spelling) consistency among 37,677 English words. This is a terrific resource for anyone investigating reading, writing, and disorders thereof.
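
Here is a small Python sketch of one common way to think about consistency: for a given spelling unit, what proportion of words containing it share the same pronunciation (feedforward), and the reverse for a given sound unit (feedback). The toy lexicon, the made-up phonological labels, and the unweighted type counts are assumptions for illustration, and the published norms may be computed differently.

```python
from collections import Counter, defaultdict


def consistency(pairs):
    """Return {(source, target): proportion} for a list of (source, target) units.

    The proportion is the number of words with this source-target mapping
    divided by the number of words containing the source unit.
    """
    by_source = defaultdict(Counter)
    for source, target in pairs:
        by_source[source][target] += 1
    return {
        (source, target): count / sum(targets.values())
        for source, targets in by_source.items()
        for target, count in targets.items()
    }


# Hypothetical lexicon: word -> (spelling of the rime, invented sound label)
toy_lexicon = {
    "mint": ("int", "INT"),
    "hint": ("int", "INT"),
    "lint": ("int", "INT"),
    "pint": ("int", "AYNT"),
    "beef": ("eef", "EEF"),
    "leaf": ("eaf", "EEF"),
    "chief": ("ief", "EEF"),
}

# Feedforward: spelling-to-sound. Feedback: sound-to-spelling.
feedforward = consistency([(sp, ph) for sp, ph in toy_lexicon.values()])
feedback = consistency([(ph, sp) for sp, ph in toy_lexicon.values()])
```

In this toy set, "pint" is the feedforward troublemaker (one of four "-int" words breaks the pattern), while "beef", "leaf", and "chief" show the feedback problem: one sound, three spellings.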

spelling-to-pronunciation norms for 20k English words

English is such a tangled web. It’s a wonder any of us ever learns to read. Check out this work by Edwards and colleagues (2023) in BRM. Visit the authors’ github repo to steal the data for the low cost of free!

text readability via CLEAR: a corpus of normed reading passages

Visit Crossley et al (2022). A large-scale corpus for assessing text readability to read about this dataset. Click on the link above to get your hands on the reading passages and all of the beautiful analytics on the complexity of each passage.



Semantic Category Norms, Features, & Networks

decompositional semantics initiative

I confess. I study semantic memory but have never taken a formal semantics class. It’s pretty obvious if you’ve read anything I’ve written that most of the time I have no idea what I’m talking about, but you can bet that these people do. Visit to find out all about the many tools these computational linguists and computer scientists have created for elucidating semantic composition.

Feats: A database of semantic features for early produced noun concepts

Congratulations to Borovsky and colleagues (2023) on their recent publication in BRM! I’ll let the authors explain these norms in their own words, “Feats—a tool that was designed to make headway on these challenges by providing a database, the Language Learning and Meaning Acquisition (LLaMA) lab Noun Norms that extends a widely used set of feature norms McRae et al. Behavior Research Methods 37, 547–559, (2005) to include full coverage of noun concepts on a commonly used early vocabulary assessment” — Ken McRae is the big daddy of all semantic feature norms, so this dataset is bound to be terrific.

a large database of semantic norms and their computational extension

How cool is this article from Wang and colleagues (2023) appearing in Nature Scientific Data? Using embeddings to extrapolate semantic ratings is such a cool idea. From the authors: “Six Semantic Dimension Database (SSDD), which contains subjective ratings for 17,940 commonly used Chinese words on six major semantic dimensions: vision, motor, socialness, emotion, time, and space. Furthermore, using computational models to learn the mapping relations between subjective ratings and word embeddings, we include the estimated semantic ratings for 1,427,992 Chinese and 1,515,633 English words in the SSDD.”

semantic congruency norms: object-scene matching the ObScene database

I honestly can’t think of a database with a better name than this one — Ob- for object and scene for scene makes obscene. Love this. Andrade and colleagues report 898 object-scene pairs (e.g., suitcase-airport). I can think of so many uses for these stimuli! Visit their OSF site to access the data directly.

semantic category production norms for 117 concrete and abstract categories

Thanks to Banks and Connell (2022) for publishing this massive dataset of semantic category norms. People produced as many exemplars as they could in 60 seconds for 117 categories. It’s like a GIANT verbal fluency task. Read the paper above. Link to the OSF HERE for the data.

semantic distance web interface (snaut norms)

This simple web-based interface allows the user to derive distances between words or documents based on a continuous bag of words (CBOW) embedding model trained on subtitles for English and Dutch. For methods, see Mandera, P., Keuleers, E., & Brysbaert, M. (in press). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language.
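
If you have never computed a semantic distance by hand, here is a minimal sketch. It assumes cosine distance (a common choice for embedding models; check the Mandera et al. paper for exactly what the interface reports) and uses made-up three-dimensional vectors. Real CBOW embeddings have hundreds of dimensions, and the names below are hypothetical.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [1.0, 1.0, 0.0],
    "tax": [0.0, 0.0, 1.0],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Smaller distance means the two words point in more similar directions.
print(cosine_distance(TOY_VECTORS["dog"], TOY_VECTORS["cat"]))
print(cosine_distance(TOY_VECTORS["dog"], TOY_VECTORS["tax"]))
```

With these toy vectors, dog lands closer to cat than to tax, which is the pattern you would hope to see from a real model.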

semantic feature coding for concrete and abstract concepts in fMRI

Wang and colleagues (2022) present feature norms for 600+ concepts by 11 participants undergoing fMRI. The concepts are abstract and concrete, and the features were generated offline using crowdsourcing. Click on the link above to access the data on OpenNeuro. Very cool!

semantic feature production norms for 4436 English concepts

We have the great Erin Buchanan and a gang of roving misfits to thank for this extended database of feature production norms for 4436 concepts. Here’s a link to the paper in Behavior Research Methods. Click on the title above to link to the data on the OSF.

semantic feature norms for manipulable objects

Thanks to Valerio et al (2023) for publishing these norms. Read the article in Cognitive Neuropsychology, all you embodied cognition freaks. 130 participants, 80 objects. Link to the database on the OSF here!

synonymy and semantic feature generation in younger and older adults

Here’s some great data from two fantastic people. Read the 2022 paper by Wei Wu and Paul Hoffman in Royal Society Open Science. Link to their data on the OSF by clicking above. These scholars report synonymy judgments and feature matching provided by 200 older and younger adults.

wordnet

The shadowy company that created the Terminator? Or was that Skynet? No matter… WordNet has been around for a long time. It’s one of those bread-and-butter sources for constructing semantic networks. WordNet plots distances between many English words using something called synsets. I think they’re synonym-like, but what do I know? Link to WordNet above to find out.


Sentence Processing & Syntax

cloze probability, predictability, and alignment w/ EEG for 205 English sentences

“Cat on a hot tin ______.” This feels like a USA Today crossword puzzle clue, but it’s also a good demonstration of cloze probability. ‘Roof’ in this context is a highly constrained candidate. Violations of cloze probability expectations (e.g., cat on a hot tin banana) are a longtime love of EEG language researchers. These sentences from Varga et al. are very carefully aligned with respect to cloze probability and predictability.
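
Cloze probability is just the proportion of people who fill the blank with a given word. Here is a minimal sketch with invented responses. The data and the function name are hypothetical and are not drawn from the Varga et al. norms.

```python
from collections import Counter


def cloze_probabilities(responses):
    """Map each completion to the proportion of respondents who produced it."""
    # Normalize case and whitespace, and drop blank responses.
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    return {word: n / total for word, n in counts.items()}


# Ten made-up completions for "Cat on a hot tin ______".
responses = ["roof", "Roof", "roof ", "roof", "porch",
             "roof", "stove", "roof", "roof", "roof"]

probs = cloze_probabilities(responses)
print(probs)
```

In this made-up sample, eight of ten people said roof, so roof has a cloze probability of 0.8. Nobody said banana, so banana does not appear at all (a cloze probability of zero).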

sentence completion norms

Need some sentence completion norms? Don’t we all! Well, first read this paper by Peelle et al. (2020) if you can stomach it. Then link to the stimuli by clicking above.


Spanish

SPALEX: A Spanish Lexical Decision Database From a Massive Online Data Collection

Pretty amazing work by Aguasvivas et al (2018) appearing in Frontiers in Psychology representing a welcome megastudy of word recognition latencies as judged by lexical decision (i.e., Is this a word? Y/N) for Spanish.

spanish positive emotion norms

I love this work by Hinojosa and colleagues in BRM (2023, in press). The authors report norms for 9000 Spanish words across 7 positive emotions. This is an awesome resource for you affect-heads out there. LINK HERE to access the data.

spanish verb naming norms for psycholinguistic and motor content variables

Link above to the data reported by San Miguel-Abella et al (2021) for verb naming. These norms include over 4000 Spanish verbs. This is an awesome resource.


Speech Perception and Speech Reading (aka lip reading)

Mouth and Facial Informativeness Norms for 2,276 English Words

Face/mouth visual articulatory norms for thousands of English words. This terrific work by Krason and colleagues in Behavior Research Methods includes norms for visual articulatory salience of spoken words derived from mturkers viewing silent videos of people speaking. Click here to zap right over to the data.


Symbol Processing & Symbolic Cognition

symcog: an open source toolkit for assessing human symbolic cognition

Click on this awesome resource by Flurie et al (2022) which includes a set of Heider-Simmel-like animations depicting abstract concepts such as heaviness. There’s also an extensive list of concepts without words. I should know - I’m a co-author on this!


Tabooness

tabooness norms for american english

Click on the link above to access norms for word length and other formal (e.g., phonological) variables for a set of taboo words. WARNING: Some of these words include hate speech. This database reflects 1205 English high frequency words coded across 22 psycholinguistic variables. Click HERE to download data on combinatorial cursing (i.e., what makes a good combination of a curse word and a common noun in American English). We reported these data in Reilly et al (2020).


Welsh

SUBTLEX-CY: A new word frequency database for Welsh

So cool! Word frequency norms for Welsh as reported by van Heuven et al (2023) based on a >30 million word corpus of Welsh subtitles. Here’s the weird thing about generating frequency norms from news subtitles: you tend to radically underestimate the prevalence of cursing in daily life since many countries impose restrictions on what you can/can’t say in media.


Word Associations

small world of words English word association norms for 12k words

De Deyne and colleagues have been collecting word association norms a la Doug Nelson’s classic USF norms for the past few years. They now have word associations for 12,000 words. That’s yuge!

university of south florida (USF) word association norms

When I say ‘dog’, what’s the first word that comes to mind? That’s word association, and it tells us a lot about language and semantic memory. The USF database is the OG of word association norms from Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms.

words (Dutch and English) with corresponding pictures matched on visual and semantic similarity

This is a terrific resource for those of us interested in interactions between language, vision, and semantic memory. Thanks to Falk Huettig for sending these our way! The stimuli and matching procedures are described by de Groot and colleagues (2015). Think picture tetrads varied in semantic similarity.


Word Frequency (words & multiword utterances)

subtlex US

Everybody uses these word frequency norms from Marc Brysbaert! These word frequency norms reflect frequency counts derived from movie and news subtitles. If you’re using CELEX or Kucera and Francis, drop those zeroes and get with the hero.

multilex: word frequency for multiword utterances in French and English

It is difficult enough just to interpret a frequency value for one word (e.g., dog). It could be a noun or a verb or have an alternate meaning altogether. Multilex moves beyond single words to produce frequency values for multiword utterances in English and in French. The authors made creative use of the Google n-gram database here.
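
To make the idea of multiword frequency concrete, here is a minimal sketch that counts word sequences (n-grams) in a toy sentence. The sentence and the function name are hypothetical. This only illustrates the counting idea and is not the Multilex pipeline, which drew on the Google n-gram database.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# A made-up mini corpus, split on whitespace.
tokens = "the dog chased the cat and the dog barked".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams[("dog",)])        # frequency of the single word
print(bigrams[("the", "dog")])   # frequency of the two-word sequence
```

Real norms would then scale these raw counts by corpus size (e.g., occurrences per million) so that values are comparable across corpora.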

WorldLex Blog, Twitter and Newspapers Word Frequencies for 66 languages

Linguists rejoice! Lexical frequency data across many natural languages scraped from Twitter, Blogs, and other such media as reported by Gimenes, M., & New, B. (2015).