Using natural language processing to analyse text data in behavioural science


Abstract

Language is a uniquely human trait at the core of human interactions. The language people use often reflects their personality, intentions and state of mind. With the integration of the Internet and social media into everyday life, much of human communication is documented as written text. These online forms of communication (for example, blogs, reviews, social media posts and emails) provide a window into human behaviour and therefore present abundant research opportunities for behavioural science. In this Review, we describe how natural language processing (NLP) can be used to analyse text data in behavioural science. First, we review applications of text data in behavioural science. Second, we describe the NLP pipeline and explain the underlying modelling approaches (for example, dictionary-based approaches and large language models). We discuss the advantages and disadvantages of these methods for behavioural science, in particular with respect to the trade-off between interpretability and accuracy. Finally, we provide actionable recommendations for using NLP to ensure rigour and reproducibility.

Fig. 1: Different objectives of natural language processing (NLP) in behavioural science.
Fig. 2: Overview of the natural language processing (NLP) pipeline.
Fig. 3: Interpretability–accuracy trade-off in supervised natural language processing (NLP) models.

References

  1. Dixon, S. J. Number of social media users worldwide from 2017 to 2028. Statista https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/ (2024).

  2. Ceci, L. Number of sent and received e-mails per day worldwide from 2018 to 2027. Statista https://www.statista.com/statistics/456500/daily-number-of-e-mails-worldwide/ (2024).

  3. GilPress. WhatsApp statistics, users, demographics as of 2024. What’s the Big Data https://whatsthebigdata.com/whatsapp-statistics/ (2023).

  4. Robertson, C. E., Shariff, A. & van Bavel, J. J. Morality in the anthropocene: the perversion of compassion and punishment in the online world. PNAS Nexus 3, pgae193 (2024).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  5. Morant, L. The truth behind 6 second ads. Medium https://medium.com/@Lyndon/the-tyranny-of-six-seconds-592b94160877 (2019).

  6. Wilkerson, J. & Casas, A. Large-scale computerized text analysis in political science: opportunities and challenges. Annu. Rev. Political Sci. 20, 529–544 (2017).

    Article 

    Google Scholar 

  7. Kennedy, B., Ashokkumar, A., Boyd, R. L. & Dehghani, M. in Handbook of Language Analysis in Psychology (eds Dehghani M. & Boyd, R. L.) 3–62 (Guilford, 2022).

  8. Jackson, J. C. et al. From text to thought: how analyzing language can advance psychological science. Perspect. Psychol. Sci. 17, 805–826 (2022).

    Article 
    PubMed 

    Google Scholar 

  9. Boyd, R. L. & Pennebaker, J. W. Language-based personality: a new approach to personality in a digital world. Curr. Opin. Behav. Sci. 18, 63–68 (2017).

    Article 

    Google Scholar 

  10. Kahn, J. H., Tobin, R. M., Massey, A. E. & Anderson, J. A. Measuring emotional expression with the linguistic inquiry and word count. Am. J. Psychol. 120, 263–286 (2007).

    Article 
    PubMed 

    Google Scholar 

  11. Rocklage, M. D., Rucker, D. D. & Nordgren, L. F. Persuasion, emotion, and language: the intent to persuade transforms language via emotionality. Psychol. Sci. 29, 749–760 (2018).

    Article 
    PubMed 

    Google Scholar 

  12. Rathje, S., van Bavel, J. J. & van der Linden, S. Out-group animosity drives engagement on social media. Proc. Natl Acad. Sci. USA 118, e2024292118 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  13. Rogers, N. & Jones, J. J. Using Twitter bios to measure changes in self-identity: are Americans defining themselves more politically over time? J. Soc. Comput. 2, 1–13 (2021).

    Article 

    Google Scholar 

  14. Guntuku, S. C., Yaden, D. B., Kern, M. L., Ungar, L. H. & Eichstaedt, J. C. Detecting depression and mental illness on social media: an integrative review. Curr. Opin. Behav. Sci. 18, 43–49 (2017).

    Article 

    Google Scholar 

  15. Pennebaker, J. W. & King, L. A. Linguistic styles: language use as an individual difference. J. Pers. Soc. Psychol. 77, 1296–1312 (1999).

    Article 
    PubMed 

    Google Scholar 

  16. Pennebaker, J. W., Chung, C. K., Frazee, J., Lavergne, G. M. & Beaver, D. I. When small words foretell academic success: the case of college admissions essays. PLoS ONE 9, e115844 (2014).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  17. Pennebaker, J. W. & Francis, M. E. Cognitive, emotional, and language processes in disclosure. Cogn. Emot. 10, 601–626 (1996).

    Article 

    Google Scholar 

  18. Manning, C. & Schütze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).

  19. Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29, 24–54 (2009).

    Article 

    Google Scholar 

  20. Feuerriegel, S., Hartmann, J., Janiesch, C. & Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 66, 111–126 (2024).

    Article 

    Google Scholar 

  21. Rathje, S. et al. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl Acad. Sci. USA 121, e2308950121 (2024).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  22. Steigerwald, E. et al. Overcoming language barriers in academia: machine translation tools and a vision for a multilingual future. BioScience 72, 988–998 (2022).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  23. Henrich, J., Heine, S. J. & Norenzayan, A. Most people are not WEIRD. Nature 466, 29 (2010).

    Article 
    PubMed 

    Google Scholar 

  24. Ghai, S. It’s time to reimagine sample diversity and retire the WEIRD dichotomy. Nat. Hum. Behav. 5, 971–972 (2021).

    Article 
    PubMed 

    Google Scholar 

  25. Blasi, D. E., Henrich, J., Adamou, E., Kemmerer, D. & Majid, A. Over-reliance on English hinders cognitive science. Trends Cognit. Sci. 26, 1153–1170 (2022).

    Article 

    Google Scholar 

  26. Shibayama, S., Yin, D. & Matsumoto, K. Measuring novelty in science with word embedding. PLoS ONE 16, e0254034 (2021).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  27. Just, J., Ströhle, T., Füller, J. & Hutter, K. AI-based novelty detection in crowdsourced idea spaces. Innovation 6, 359–386 (2023).

    Google Scholar 

  28. Toubia, O. & Netzer, O. Idea generation, creativity, and prototypicality. Mark. Sci. 36, 1–20 (2017).

    Article 

    Google Scholar 

  29. Blodgett, S. L., Barocas, S., Daumé III, H. & Wallach, H. Language (technology) is power: a critical survey of “bias” in NLP. In Proc. Annual Meet. Assoc. Computational Linguistics (eds. Jurafsky, D. et al.) 5454–5476 (ACL, 2020).

  30. Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan & Zou, James Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl Acad. Sci. USA 115, E3635–E3644 (2018).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  31. Page, R. Narratives Online: Shared Stories in Social Media (Cambridge Univ. Press, 2018).

  32. Yu, C. H., Jannasch-Pennell, A. & DiGangi, S. Compatibility between text mining and qualitative research in the perspectives of grounded theory, content analysis, and reliability. Qualitative Rep. 16, 730–744 (2011).

    Google Scholar 

  33. Hamilton, W. L., Leskovec, J. & Jurafsky, D. Diachronic word embeddings reveal statistical laws of semantic change. In Proc. Annual Meet. Assoc. Computational Linguistics (eds. Erk, K. & Smith, N.) 1489–1501 (ACL, 2016)

  34. Kulkarni, V., Al-Rfou, R., Perozzi, B. & Skiena, S. Statistically significant detection of linguistic change. In Proc. Int. Conf. World Wide Web (eds. Gangemi, A. et al.) 625–635 (ACM, 2015)

  35. Dunivin, Z. O., Yan, H. Y., Ince, J. & Rojas, F. Black lives matter protests shift public discourse. Proc. Natl Acad. Sci. USA 119, e2117320119 (2022).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  36. Jakubik, J., Vössing, M., Pröllochs, N., Bär, D. & Feuerriegel, S. Online emotions during the storming of the US capitol: evidence from the social media network Parler. In Proc. Int. AAAI Conf. Web and Social Media 423–434 (AAAI, 2023).

  37. Murphy, G. The Big Book of Concepts. (MIT Press, 2004).

  38. Boroditsky, L. Does language shape thought?: Mandarin and English speakers’ conceptions of time. Cognit. Psychol. 43, 1–22 (2001).

    Article 
    PubMed 

    Google Scholar 

  39. Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl Acad. Sci. USA 120, e2305016120 (2023).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  40. Ziabari, A. S. et al. Reinforced multiple instance selection for speaker attribute prediction. In Proc. Conf. North American Chapter of the Assoc. Computational Linguistics: Human Language Technologies (eds. Duh, K., Gomez, H. & Bethard, S.) 3307–3321 (ACL, 2024)

  41. Krugmann, J. O. & Hartmann, J. Sentiment analysis in the age of generative AI. Customer Needs Solut. 11, 3 (2024).

    Article 

    Google Scholar 

  42. Mohammad, S. M. in Emotion Measurement (ed. Meiselman, H. L.) 201–237 (Elsevier, 2016)

  43. Kratzwald, B., Ilić, S., Kraus, M., Feuerriegel, S. & Prendinger, H. Deep learning for affective computing: text-based emotion recognition in decision support. Decis. Support. Syst. 115, 24–35 (2018).

    Article 

    Google Scholar 

  44. Hartmann, J., Heitmann, M., Siebert, C. & Schamp, C. More than a feeling: accuracy and application of sentiment analysis. Int. J. Res. Mark. 40, 75–87 (2023).

    Article 

    Google Scholar 

  45. Mohammad, S. M., Kiritchenko, S., Sobhani, P., Zhu, X. & Cherry, C. SemEval-2016 Task 6: detecting stance in tweets. In Proc. Int. Workshop on Semantic Evaluation (eds. Bethard, S. et al.) 31–41 (ACL, 2016).

  46. Mohammad, S. M., Sobhani, P. & Kiritchenko, S. Stance and sentiment in tweets. ACM Trans. Internet Technol. Argumentati. Soc. Media 17, 3 (2017).

    Google Scholar 

  47. Liu, B. & Zhang, L. in Mining Text Data (eds Aggarwal, C. C. & Zhai, C.) 415–463 (Springer US, 2012).

  48. Spitzley, L. A. et al. Linguistic measures of personality in group discussions. Front. Psychol. 13, 887616 (2022).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  49. Lutz, B., Adam, M., Feuerriegel, S., Pröllochs, N. & Neumann, D. Which linguistic cues make people fall for fake news? A comparison of cognitive and affective processing. In Proc. ACM on Human–Computer Interaction (eds. Nichols, Jeff) 1–22 (ACM, 2024).

  50. van Kleef, G. A., van den Berg, H. & Heerdink, M. W. The persuasive power of emotions: effects of emotional expressions on attitude formation and change. J. Appl. Psychol. 100, 1124–1142 (2015).

    Article 
    PubMed 

    Google Scholar 

  51. Schwartz, H. A. et al. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8, e73791 (2013).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  52. Vine, V., Boyd, R. L. & Pennebaker, J. W. Natural emotion vocabularies as windows on distress and well-being. Nat. Commun. 11, 4525 (2020).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  53. Eichstaedt, J. C. et al. Facebook language predicts depression in medical records. Proc. Natl Acad. Sci. USA 115, 11203–11208 (2018).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  54. Chen, S., Zhang, Z., Wu, M. & Zhu, K. Detection of multiple mental disorders from social media with two-stream psychiatric experts. In Proc. Conf. Empirical Methods in Natural Language Processing (eds. Bouamor, H., Pino, J. & Bali, K.) 9071–9084 (ACL, 2023).

  55. Eichstaedt, J. C. et al. Psychological language on Twitter predicts county-level heart disease mortality. Psychol. Sci. 26, 159–169 (2015).

    Article 
    PubMed 

    Google Scholar 

  56. Mooijman, M., Hoover, J., Lin, Y., Ji, H. & Dehghani, M. Moralization in social networks and the emergence of violence during protests. Nat. Hum. Behav. 2, 389–396 (2018).

    Article 
    PubMed 

    Google Scholar 

  57. Tan, C., Niculae, V., Danescu-Niculescu-Mizil, C. & Lee, L. Winning arguments: interaction dynamics and persuasion strategies in good-faith online discussions. In Proc. Int. Conf. World Wide Web (eds. Bourdeau, J. et al.) 613–624 (ACM, 2016).

  58. Denny, M. J. & Spirling, A. Text preprocessing for unsupervised learning: why it matters, when it misleads, and what to do about it. Political Anal. 26, 168–189 (2018).

    Article 

    Google Scholar 

  59. Toetzke, M., Banholzer, N. & Feuerriegel, S. Monitoring global development aid with machine learning. Nat. Sustain. 5, 533–541 (2022).

    Article 

    Google Scholar 

  60. Tenzer, H., Feuerriegel, S. & Piekkari, R. AI machine translation tools must be taught cultural differences too. Nature 630, 820 (2024).

    Article 
    PubMed 

    Google Scholar 

  61. Fokkens, A. et al. Offspring from reproduction problems: what replication failure teaches us. In Proc. Annual Meet. Assoc. Computational Linguistics (eds. Schuetze, H., Fung, P. & Poesio, M.) 1691–1701 (ACL, 2013).

  62. Ulmer, D. et al. Experimental standards for deep learning in natural language processing research. In Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing (eds. Goldberg, Y., Kozareva, Z. & Zhang, Y.) 2673–2692 (ACL, 2022).

  63. Salton, G. A Theory of Indexing (Society for Industrial and Applied Mathematics, 1975).

  64. Le, Q. & Mikolov, T. Distributed representations of sentences and documents. In Proc. Int. Conf. Machine Learning 1188–1196 (PMLR, 2014)

  65. Collobert, R. & Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. Int. Conf. Machine Learning 160–167 (ACM, 2008).

  66. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (eds. Burges, C. J. et al.) 3111–3119 (Curran Associates Inc., 2013).

  67. Pennington, J., Socher, R. & Manning, C. D. GloVe: global vectors for word representation. In Proc. Conf. Empirical Methods in Natural Language Processing (eds. Moschitti, A., Pang, B. & Daelemans, W.) 1532–1543 (ACL, 2014).

  68. Dai, A. M., Olah, C. & Le, Q. V. Document embedding with paragraph vectors. Preprint at https://doi.org/10.48550/arXiv.1507.07998 (2015).

  69. Harris, Z. S. Distributional Structure (Word, 1954).

  70. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding.In Proc. Conf. North American Chapter of the Assoc. Computational Linguistics (eds. Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (ACL, 2019).

  71. Tokita, C. K. et al. Measuring receptivity to misinformation at scale on a social media platform. PNAS Nexus 3, page396 (2024).

  72. Hart, R. P. & Carroll, C. DICTION: The Text-Analysis Program (Sage, 2011).

  73. Stone, P. J., Dunphy, D. C. & Smith, M. S. The General Inquirer: A Computer Approach to Content Analysis (The MIT Press, 1966).

  74. Rinker, T., Goodrich, B. & Kurkiewicz, D. qdap: Bridging the Gap between Qualitative Data and Quantitative Analysis (R Project for Statistical Computing, 2013).

  75. Mohammad, S. M. & Turney, P. D. Crowdsourcing a word–emotion association lexicon. Comput. Intell. 29, 436–465 (2013).

    Article 

    Google Scholar 

  76. Graham, J., Haidt, J. & Nosek, B. A. Liberals and conservatives rely on different sets of moral foundations. J. Pers. Soc. Psychol. 96, 1029–1046 (2009).

    Article 
    PubMed 

    Google Scholar 

  77. The Weaponized Word. Lexicons. Weaponized Word https://weaponizedword.org/lexicons (2024).

  78. Robertson, C. E. et al. Negativity drives online news consumption. Nat. Hum. Behav. 7, 812–822 (2023).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  79. Boyd, R. L., Ashokkumar, A., Seraj, S. & Pennebaker, J. W. The Development and Psychometric Properties of LIWC-22 (Univ. of Texas at Austin, 2022).

  80. Thelwall, M., Buckley, K. & Paltoglou, G. Sentiment strength detection for the social web. J. Am. Soc. Inf. Sci. Technol. 63, 163–173 (2011).

    Article 

    Google Scholar 

  81. Baccianella, S., Esuli, A. & Sebastiani, F. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Proc. Seventh International Conference on Language Resources and Evaluation (LREC’10) (eds. Calzolari, N., et al.) http://www.lrec-conf.org/proceedings/lrec2010/pdf/769_Paper.pdf (European Language Resources Association, 2010).

  82. Hutto, C. & Gilbert, E. VADER: a parsimonious rule-based model for sentiment analysis of social media text. In Proc. Int. AAAI Conf. Web and Social Media 216–225 (AAAI, 2014).

  83. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D. I. & Kappas, A. Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol 61, 2544–2558 (2010).

    Article 

    Google Scholar 

  84. Pröllochs, N., Feuerriegel, S. & Neumann, D. Statistical inferences for polarity identification in natural language. PLoS ONE 13, e0209323 (2018).

    Article 
    PubMed 
    PubMed Central 

    Google Scholar 

  85. Song, H. et al. In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Commun. 37, 550–572 (2020).

  86. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  87. Hussain, Z., Mata, R. & Wulff, D. U. Novel embeddings improve the prediction of risk perception. EPJ Data Sci. 13, Article 38 (2024).

  88. Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (eds. Larochelle, H. et al.) 1877–1901 (Curran Associates Inc., 2020).

  89. Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://doi.org/10.48550/arXiv.2302.13971 (2023).

  90. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds. Guyon, I. et al.) 5998–6008 (2017).

  91. Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).

  92. Abdurahman, S. et al. Perils and opportunities in using large language models in psychological research. PNAS Nexus 3, 245 (2024).

  93. Kamalloo, E., Dziri, N., Clarke, C. & Rafiei, D. Evaluating open-domain question answering in the era of large language models. In Proc. Annual Meet. Assoc. Computational Linguistics (eds. Rogers, A. et al.) 5591–5606 (ACL, 2023).

  94. Zhang, T. et al. Benchmarking large language models for news summarization. Trans. Assoc. Comput. Linguist. 12, 39–57 (2024).

  95. Zhu, W. et al. Multilingual machine translation with large language models: empirical results and analysis. In Findings of the ACL: North American Chapter of the Assoc. Computational Linguistics (eds. Duh, K. et al.) 2765–2781 (ACL, 2024).

  96. Lin, Z. How to write effective prompts for large language models. Nat. Hum. Behav. 8, 611–615 (2024).

  97. Atreja, S., Ashkinaze, J., Li, L., Mendelsohn, J. & Hemphill, L. Prompt design matters for computational social science tasks but in unpredictable ways. Preprint at https://doi.org/10.48550/arXiv.2406.11980 (2024).

  98. Kuribayashi, T., Oseki, Y. & Baldwin, T. Psychometric predictive power of large language models. In Findings of the ACL: North American Chapter of the Assoc. Computational Linguistics (eds. Duh, K. et al.) 1983–2005 (ACL, 2024).

  99. Zhang, B., Liu, Z., Cherry, C. & Firat, O. When scaling meets LLM finetuning: the effect of data, model and finetuning method. In Proc. Int. Conf. Learn. Representations https://doi.org/10.48550/arXiv.2402.17193 (2024).

  100. Wulff, D. U. & Mata, R. Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nat. Hum. Behav. https://doi.org/10.1038/s41562-024-02089-y (2025).

  101. Dubey, A. et al. The Llama 3 herd of models. Preprint at https://doi.org/10.48550/arXiv.2407.21783 (2024).

  102. Grimes, M., von Krogh, G., Feuerriegel, S., Rink, F. & Gruber, M. From scarcity to abundance: scholars and scholarship in an age of generative artificial intelligence. Acad. Manag. J. 66, 1617–1624 (2023).

  103. Shu, B. et al. You don’t need a personality test to know these models are unreliable: assessing the reliability of large language models on psychometric instruments. In Proc. Conf. North American Chapter of the Assoc. Computational Linguistics: Human Language Technologies (eds. Duh, K. et al.) 5263–5281 (ACL, 2024).

  104. Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154 (2024).

  105. Caliskan, A., Bryson, J. J. & Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science 356, 183–186 (2017).

  106. Hartmann, J., Schwenzow, J. & Witte, M. The political ideology of conversational AI: converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation. Preprint at https://doi.org/10.48550/arXiv.2301.01768 (2023).

  107. Hu, T. et al. Generative language models exhibit social identity biases. Preprint at https://doi.org/10.48550/arXiv.2310.15819 (2023).

  108. Balloccu, S., Schmidtová, P., Lango, M. & Dusek, O. Leak, cheat, repeat: data contamination and evaluation malpractices in closed-source LLMs. In Proc. Conf. European Chapter of the Assoc. Computational Linguistics (eds. Graham, Y. & Purver, M.) 67–93 (ACL, 2024).

  109. Palmer, A., Smith, N. A. & Spirling, A. Using proprietary language models in academic research requires explicit justification. Nat. Comput. Sci. 4, 2–3 (2024).

  110. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990).

  111. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).

  112. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. Int. Conf. Knowledge Discovery and Data Mining (eds. Simoudis, E. et al.) 226–231 (AAAI, 1996).

  113. Grootendorst, M. BERTopic: neural topic modeling with a class-based TF-IDF procedure. Preprint at https://doi.org/10.48550/arXiv.2203.05794 (2022).

  114. Jelinek, F., Mercer, R. L., Bahl, L. R. & Baker, J. K. Perplexity: a measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 62, S63 (1977).

  115. Campello, R. J., Moulavi, D. & Sander, J. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conf. Knowledge Discovery and Data Mining (eds. Pei, J. et al.) https://doi.org/10.1007/978-3-642-37456-2_14 (2013).

  116. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  117. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. & Blei, D. Reading tea leaves: how humans interpret topic models. In Adv. Neural Inf. Process. Syst. (eds. Bengio, Y. et al.) 288–296 (Curran Associates Inc., 2009).

  118. Sievert, C. & Shirley, K. LDAvis: a method for visualizing and interpreting topics. In Proc. Workshop on Interactive Language Learning, Visualization, and Interfaces (eds. Chuang, J. et al.) 63–70 (ACL, 2014).

  119. Kosar, A., De Pauw, G. & Daelemans, W. Comparative evaluation of topic detection: humans vs. LLMs. Comput. Linguist. Neth. J. 13, 91–120 (2024).

  120. DiStefano, P. V., Patterson, J. D. & Beaty, R. E. Automatic scoring of metaphor creativity with large language models. Creativity Res. J. https://doi.org/10.1080/10400419.2024.2326343 (2023).

  121. Yu, Y., Chen, L., Jiang, J. & Zhao, N. Measuring patent similarity with word embedding and statistical features. Data Anal. Knowl. Discov. 3, 53–59 (2019).

  122. Kelly, B., Papanikolaou, D., Seru, A. & Taddy, M. Measuring technological innovation over the long run. Am. Econ. Rev. Insights 3, 303–320 (2021).

  123. Goldberg, A., Srivastava, S. B., Manian, V. G., Monroe, W. & Potts, C. Fitting in or standing out? The tradeoffs of structural and cultural embeddedness. Am. Sociol. Rev. 81, 1190–1222 (2016).

  124. Ireland, M. E. et al. Language style matching predicts relationship initiation and stability. Psychol. Sci. 22, 39–44 (2011).

  125. Niederhoffer, K. G. & Pennebaker, J. W. Linguistic style matching in social interaction. J. Lang. Soc. Psychol. 21, 337–360 (2002).

  126. Dhillon, I. S. & Modha, D. S. Concept decompositions for large sparse text data using clustering. Mach. Learn. 42, 143–175 (2001).

  127. Steck, H., Ekanadham, C. & Kallus, N. Is cosine-similarity of embeddings really about similarity? In Companion Proc. ACM Web Conf. (eds. Chua, T. et al.) 887–890 (ACM, 2024).

  128. Lederer, W. & Küchenhoff, H. A short introduction to the SIMEX and MCSIMEX. Newsl. R. Proj. 6, 26–31 (2006).

  129. Burton, J. W., Cruz, N. & Hahn, U. Reconsidering evidence of moral contagion in online social networks. Nat. Hum. Behav. 5, 1629–1635 (2021).

  130. Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E. & Stewart, B. M. How to make causal inferences using texts. Sci. Adv. 8, eabg2652 (2022).

  131. Feder, A. et al. Causal inference in natural language processing: estimation, prediction, interpretation and beyond. Trans. Assoc. Comput. Linguist. 10, 1138–1158 (2022).

  132. Maarouf, A., Bär, D., Geissler, D. & Feuerriegel, S. HQP: a human-annotated dataset for detecting online propaganda. In Findings of the ACL (eds. Ku, L. et al.) 6064–6089 (ACL, 2024).

  133. Berger, J. et al. Uniting the tribes: using text for marketing insight. J. Mark. 84, 1–25 (2020).

  134. Mohammad, S. M. Ethics sheet for automatic emotion recognition and sentiment analysis. Comput. Linguist. 48, 239–278 (2022).

  135. Rivers, C. M. & Lewis, B. L. Ethical research standards in a world of big data. F1000Research 3, 38 (2014).

  136. Boegershausen, J., Datta, H., Borah, A. & Stephen, A. T. Fields of gold: scraping web data for marketing insights. J. Mark. 86, 1–20 (2022).

  137. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. 54, 1–35 (2021).

  138. Zhao, J., Wang, T., Yatskar, M., Ordonez, V. & Chang, K.-W. Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proc. Conf. Empirical Methods in Natural Language Processing (eds. Palmer, M. et al.) 2979–2989 (ACL, 2017).

  139. Hackenburg, K. & Margetts, H. Evaluating the persuasive influence of political microtargeting with large language models. Proc. Natl Acad. Sci. USA 121, e2403116121 (2024).

  140. Sharma, A., Lin, I. W., Miner, A. S., Atkins, D. C. & Althoff, T. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nat. Mach. Intell. 5, 46–57 (2023).

  141. Colleoni, E., Rozza, A. & Arvidsson, A. Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. J. Commun. 64, 317–332 (2014).

  142. Wojcik, S. P., Hovasapian, A., Graham, J., Motyl, M. & Ditto, P. H. Conservatives report, but liberals display, greater happiness. Science 347, 1243–1246 (2015).

  143. Frimer, J. A., Brandt, M. J., Melton, Z. & Motyl, M. Extremists on the left and right use angry, negative language. Pers. Soc. Psychol. Bull. 45, 1216–1231 (2019).

  144. Sterling, J., Jost, J. T. & Bonneau, R. Political psycholinguistics: a comprehensive analysis of the language habits of liberal and conservative social media users. J. Pers. Soc. Psychol. 118, 805–834 (2020).

  145. Brady, W. J., Wills, J. A., Jost, J. T., Tucker, J. A. & van Bavel, J. J. Emotion shapes the diffusion of moralized content in social networks. Proc. Natl Acad. Sci. USA 114, 7313–7318 (2017).

  146. Brady, W. J., Wills, J. A., Burkart, D., Jost, J. T. & van Bavel, J. J. An ideological asymmetry in the diffusion of moralized content on social media among political leaders. J. Exp. Psychol.: Gen. 148, 1802–1813 (2019).

  147. Lanning, K., Pauletti, R. E., King, L. A. & McAdams, D. P. Personality development through natural language. Nat. Hum. Behav. 2, 327–334 (2018).

  148. Slatcher, R. B., Chung, C. K., Pennebaker, J. W. & Stone, L. D. Winning words: individual differences in linguistic style among US presidential and vice presidential candidates. J. Res. Pers. 41, 63–75 (2007).

  149. Wiechmann, P., Lora, K., Branscum, P. & Fu, J. Identifying discriminative attributes to gain insights regarding child obesity in Hispanic preschoolers using machine learning techniques. In Proc. IEEE Int. Conf. Tools with Artificial Intelligence 11–15 (IEEE, 2017).

  150. Teague, S. J. & Shatte, A. B. R. Exploring the transition to fatherhood: feasibility study using social media and machine learning. JMIR Pediatrics Parent. 1, e12371 (2018).

  151. Joel, S., Eastwick, P. W. & Finkel, E. J. Is romantic desire predictable? Machine learning applied to initial romantic attraction. Psychol. Sci. 28, 1478–1489 (2017).

  152. Lasser, J. et al. From alternative conceptions of honesty to alternative facts in communications by US politicians. Nat. Hum. Behav. 7, 2140–2151 (2023).

  153. Frimer, J. A. et al. Incivility is rising among American politicians on Twitter. Soc. Psychol. Pers. Sci. 14, 259–269 (2023).

  154. Shulman, H. C., Markowitz, D. M. & Rogers, T. Reading dies in complexity: online news consumers prefer simple writing. Sci. Adv. 10, eadn2555 (2024).

  155. Newman, M. L., Pennebaker, J. W., Berry, D. S. & Richards, J. M. Lying words: predicting deception from linguistic styles. Pers. Soc. Psychol. Bull. 29, 665–675 (2003).

  156. Zhou, L., Burgoon, J. K., Nunamaker, J. F. & Twitchell, D. Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communications. Group. Decis. Negotiation 13, 81–106 (2004).

  157. Ho, S. M., Hancock, J. T., Booth, C. & Liu, X. Computer-mediated deception: strategies revealed by language–action cues in spontaneous communication. J. Manag. Inf. Syst. 33, 393–420 (2016).

  158. Siering, M., Koch, J.-A. & Deokar, A. V. Detecting fraudulent behavior on crowdfunding platforms: the role of linguistic and content-based cues in static and dynamic contexts. J. Manag. Inf. Syst. 33, 421–455 (2016).

  159. Zhang, D., Zhou, L., Kehoe, J. L. & Kilic, I. Y. What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. J. Manag. Inf. Syst. 33, 456–481 (2016).

  160. Constâncio, A. S., Tsunoda, D. F., Silva, H. F. N., Da Silveira, J. M. & Carvalho, D. R. Deception detection with machine learning: a systematic review and statistical analysis. PLoS ONE 18, e0281323 (2023).

  161. Thompson, B., Roberts, S. G. & Lupyan, G. Cultural influences on word meanings revealed through large-scale semantic alignment. Nat. Hum. Behav. 4, 1029–1038 (2020).

  162. Morin, O. & Acerbi, A. Birth of the cool: a two-centuries decline in emotional expression in Anglophone fiction. Cogn. Emot. 31, 1663–1675 (2017).

  163. Jackson, J. C., Gelfand, M., De, S. & Fox, A. The loosening of American culture over 200 years is associated with a creativity‐order trade-off. Nat. Hum. Behav. 3, 244–250 (2019).

  164. Charlesworth, T. E. S. & Banaji, M. R. Patterns of implicit and explicit attitudes: I. Long-term change and stability from 2007 to 2016. Psychol. Sci. 30, 174–192 (2019).

  165. Charlesworth, T. E. S., Caliskan, A. & Banaji, M. R. Historical representations of social groups across 200 years of word embeddings from Google Books. Proc. Natl Acad. Sci. USA 119, e2121798119 (2022).

  166. Simchon, A., Brady, W. J. & van Bavel, J. J. Troll and divide: the language of online polarization. PNAS Nexus 1, pgac019 (2022).

  167. Pröllochs, N., Bär, D. & Feuerriegel, S. Emotions explain differences in the diffusion of true vs. false social media rumors. Sci. Rep. 11, 22721 (2021).

  168. Pröllochs, N., Bär, D. & Feuerriegel, S. Emotions in online rumor diffusion. EPJ Data Sci. 10, 51 (2021).

  169. Yin, D., Bond, S. D. & Zhang, H. Anxious or angry? Effects of discrete emotions on the perceived helpfulness of online reviews. MIS Q. 38, 539–560 (2014).

  170. Chung, J., Johar, G. V., Li, Y., Netzer, O. & Pearson, M. Mining consumer minds: downstream consequences of host motivations for home-sharing platforms. J. Consum. Res. 48, 817–838 (2022).

  171. Park, G. et al. Automatic personality assessment through social media language. J. Pers. Soc. Psychol. 108, 934–952 (2015).

  172. O’Dea, B. et al. The relationship between linguistic expression in blog content and symptoms of depression, anxiety, and suicidal thoughts: a longitudinal study. PLoS ONE 16, e0251787 (2021).

  173. Preotiuc-Pietro, D. et al. The role of personality, age, and gender in tweeting about mental illness. In Proc. 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality 21–30 (ACL, 2015).

  174. Cohn, M. A., Mehl, M. R. & Pennebaker, J. W. Linguistic markers of psychological change surrounding September 11, 2001. Psychol. Sci. 15, 687–693 (2004).

  175. Garcia, D. & Rimé, B. Collective emotions and social resilience in the digital traces after a terrorist attack. Psychol. Sci. 30, 617–628 (2019).

  176. Ashokkumar, A. & Pennebaker, J. W. Social media conversations reveal large psychological shifts caused by COVID-19’s onset across US cities. Sci. Adv. 7, eabg7843 (2021).

  177. Kramer, A. D. I., Guillory, J. E. & Hancock, J. T. Experimental evidence of massive-scale emotional contagion through social networks. Proc. Natl Acad. Sci. USA 111, 8788–8790 (2014).

  178. Kacewicz, E., Pennebaker, J. W., Davis, M., Jeon, M. & Graesser, A. C. Pronoun use reflects standings in social hierarchies. J. Lang. Soc. Psychol. 33, 125–143 (2014).

  179. Rude, S., Gortner, E.-M. & Pennebaker, J. Language use of depressed and depression-vulnerable college students. Cogn. Emot. 18, 1121–1133 (2004).

  180. Netzer, O., Feldman, R., Goldenberg, J. & Fresko, M. Mine your own business: market-structure surveillance through text mining. Mark. Sci. 31, 521–543 (2012).

  181. Seraj, S., Blackburn, K. G. & Pennebaker, J. W. Language left behind on social media exposes the emotional and cognitive costs of a romantic breakup. Proc. Natl Acad. Sci. USA 118, e2017154118 (2021).

  182. Berger, J. & Milkman, K. L. What makes online content viral? J. Mark. Res. 49, 192–205 (2012).

  183. Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M. & Duncan, J. W. Predicting consumer behavior with web search. Proc. Natl Acad. Sci. USA 107, 17486–17490 (2010).

  184. Scheffer, M., van de Leemput, I., Weinans, E. & Bollen, J. The rise and fall of rationality in language. Proc. Natl Acad. Sci. USA 118, e2107848118 (2021).

  185. Ferrara, E., Varol, O., Davis, C., Menczer, F. & Flammini, A. The rise of social bots. Commun. ACM 59, 96–104 (2016).

  186. Auxier, B. & Anderson, M. Social Media Use in 2021 (Pew Research Center, 2021).

  187. Barberá, P. & Rivero, G. Understanding the political representativeness of Twitter users. Soc. Sci. Comput. Rev. 33, 712–729 (2015).

  188. Schoenmueller, V., Netzer, O. & Stahl, F. The polarity of online reviews: prevalence, drivers and implications. J. Mark. Res. 57, 853–877 (2020).

  189. Robertson, C. E., Del Rosario, K., Rathje, S. & van Bavel, J. J. Changing the incentive structure of social media may reduce online proxy failure and proliferation of negativity. Behav. Brain Sci. 47, e81 (2024).

  190. Robertson, C., Del Rosario, K. & van Bavel, J. J. Inside the Funhouse Mirror Factory: How Social Media Distorts Perceptions of Norms (OSF, 2024).

  191. Bär, D., Pröllochs, N. & Feuerriegel, S. New threats to society from free-speech social media platforms. Commun. ACM 66, 37–40 (2023).

  192. Zhunis, A., Lima, G., Song, H., Han, J. & Cha, M. Emotion bubbles: emotional composition of online discourse before and after the COVID-19 outbreak. In Proc. ACM Web Conf. (eds. Laforest, F. et al.) 2603–2613 (ACM, 2022).

  193. Rathje, S., He, J. K., Roozenbeek, J., van Bavel, J. J. & van der Linden, S. Social media behavior is associated with vaccine hesitancy. PNAS Nexus 1, pgac207 (2022).

  194. Canché, M. S. G. Machine driven classification of open-ended responses (MDCOR): an analytic framework and no-code, free software application to classify longitudinal and cross-sectional text responses in survey and social media research. Expert. Syst. Appl. 215, 119265 (2023).

  195. Hartmann, J., Bergner, A. & Hildebrand, C. MindMiner: uncovering linguistic markers of mind perception as a new lens to understand consumer‐smart object relationships. J. Consum. Psychol. 33, 645–667 (2023).

Author information

Contributions

S.F., A.M., D.B., D.G., J.S. and N.P. outlined the article and wrote the first draft. A.M. created the first draft of the figures. All authors contributed to subsequent iterations of the article. All authors reviewed, edited and approved the manuscript before submission.

Corresponding author

Correspondence to Stefan Feuerriegel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Psychology thanks April Bailey, Morteza Dehghani and Dirk Wulff for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Feuerriegel, S., Maarouf, A., Bär, D. et al. Using natural language processing to analyse text data in behavioural science. Nat. Rev. Psychol. (2025). https://doi.org/10.1038/s44159-024-00392-z

  • Accepted: 22 November 2024

  • Published: 02 January 2025

  • DOI: https://doi.org/10.1038/s44159-024-00392-z

