Research Applications#
Textual analysis#
Textual analysis is becoming increasingly popular in asset pricing, macroeconomics, and other fields.
Most successful applications use text to measure economic concepts that are otherwise hard or impossible to measure.
So far, the simplest applications have been the most successful.
Many cutting-edge machine learning methods are unnecessary or even counterproductive; like kitchen-sink regressions, they are prone to overfitting.
Advice:
Keep it simple.
Stay close to the text, and read a lot.
The frontier lies more in learning from new data than in fancy techniques.
Firm-level political risk#
by Hassan, Hollander, van Lent, and Tahoun (2019)
What share of the conversation between management and participants centers on political risks?
Extract all two-word combinations (“bigrams”) from training libraries indicative of discussion of political topics, \(\mathbb{P}\), and of non-political topics, \(\mathbb{N}\).
Conference Call Transcripts
Transcripts of 326,247 earnings conference calls of 11,943 firms headquartered in 84 different countries, available 2002-20 from EIKON.
Typically four calls per year, after earnings releases.
Management presentation followed by Q&A with firm’s analysts (0-70 questions, average duration 45 min).
Measure of Political Risk
Count the number of occurrences of (exclusively) political bigrams in conjunction with a synonym for risk or uncertainty and divide by the total number of bigrams in the transcript:
\(PRisk_{it} = \frac{1}{B_{it}}\sum_{b}^{B_{it}}\{1[ b \in \mathbb{P}\setminus\mathbb{N}] \times 1[|b-r| \lt 10] \times f_{b,\mathbb{P}}/B_{\mathbb{P}}\}\)
where \(r\) is the position of the nearest synonym of risk or uncertainty and \(b = 0,1,\ldots B_{it}\) are the bigrams contained in call of firm \(i\) at time \(t\).
(Application of “\(tf \times idf\)”)
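A minimal sketch of this count for a single transcript, assuming the political-bigram weights \(f_{b,\mathbb{P}}/B_{\mathbb{P}}\) and the positions of risk synonyms have already been extracted (all names below are illustrative):

```python
# Sketch of the PRisk_{it} count for one transcript (illustrative names).
def prisk(bigrams, political_weights, risk_positions, window=10):
    """bigrams: list of bigrams in transcript order (b = 0, 1, ..., B_it).
    political_weights: dict mapping exclusively political bigrams to f_{b,P} / B_P.
    risk_positions: set of positions of risk/uncertainty synonyms."""
    if not bigrams:
        return 0.0
    total = 0.0
    for pos, b in enumerate(bigrams):
        # bigram must be exclusively political and within 10 positions of a risk synonym
        if b in political_weights and any(abs(pos - r) < window for r in risk_positions):
            total += political_weights[b]
    return total / len(bigrams)
```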
Synonyms for “risk” or “uncertainty”
Measuring news about the mean: \(PSentiment_{it}\)
Use same approach to measure mean of political news:
Count positive and negative words (“sentiment”) used in conjunction with a political bigram:
\(PSentiment_{i,t} = \frac{1}{B_{it}} \sum_{b}^{B_{it}} \Big( 1[ b \in \mathbb{P}\setminus\mathbb{N}] \times \frac{f_{b,\mathbb{P}}}{B_{\mathbb{P}}} \times \sum_{c=b-10}^{b+10} S(c) \Big) \),
where \(S\) assigns sentiment to each word \(c\) (Loughran & McDonald 2011)
\(S(c) = \begin{cases} +1 \text{ if } c \in \mathbb{S}^{+} \\ -1 \text{ if } c \in \mathbb{S}^{-} \\ 0 \text{ otherwise} \end{cases}\)
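A corresponding sketch for \(PSentiment_{it}\), taking the positive/negative word lists as given and treating word and bigram positions as aligned (a simplification; names illustrative):

```python
# Sketch of PSentiment_{it}: tone words within 10 positions of a political bigram,
# weighted by the bigram's political score.
def psentiment(tokens, bigrams, political_weights, positive, negative, window=10):
    if not bigrams:
        return 0.0
    score = 0.0
    for pos, b in enumerate(bigrams):
        if b in political_weights:
            nearby = tokens[max(0, pos - window): pos + window + 1]
            s = sum(1 if w in positive else -1 if w in negative else 0 for w in nearby)
            score += political_weights[b] * s
    return score / len(bigrams)
```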
Find that \(Corr(PRisk_{it}, PSentiment_{it}) = −0.095^{***}\)
\(PRisk_{it}\) identifies conversations about risks associated with political topics.
Bigrams with the highest scores are intuitively linked to politics (‘the constitution,’ ‘public opinion,’ ‘interest groups,’ ‘the FAA,’ …). Transcripts with the highest \(PRisk_{it}\) indeed center on discussions of ballot initiatives, legislation, regulation, government expenditure, …
Transcripts with highest PRisk
Example: Duke Energy Corporation
A coal company’s \(PRisk_{it}\)
Sources and Transmission of Country Risk#
by Hassan, Schreger, Schwedeler, and Tahoun (2021)
For each of 56 countries assemble a training library, \(\mathbb{T}^{C}\)
All “Country Commerce” reports published by the Economist Intelligence Unit 2002-2016
All names of the country, names of towns with more than 15,000 inhabitants in 2018, and administrative subdivisions from geonames.org and CIA World Factbook.
Use \(tf \times idf\) to identify two-word combinations (‘bigrams’) that are indicative of discussion of each country. For example, a mention of “Angela Merkel” is indicative of discussions of country \(C\) if:
The bigram is frequent in \(C\)’s training library
It is rarely used in other countries’ libraries
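A rough sketch of this scoring step, using a plain tf \(\times\) idf weight as a stand-in for the paper's exact weighting (the structure of `libraries` is an assumption):

```python
import math
from collections import Counter

# Sketch: score how indicative each bigram is of a country's training library,
# using a simple tf x idf weight (the paper's exact weighting may differ).
def country_bigram_scores(libraries):
    """libraries: dict mapping country -> list of bigrams in its training library."""
    n_countries = len(libraries)
    counts = {c: Counter(bgs) for c, bgs in libraries.items()}
    doc_freq = Counter()                      # in how many libraries does the bigram appear?
    for cnt in counts.values():
        doc_freq.update(set(cnt))
    scores = {}
    for c, cnt in counts.items():
        total = sum(cnt.values())
        scores[c] = {b: (n / total) * math.log(n_countries / doc_freq[b])
                     for b, n in cnt.items()}
    return scores                             # high score: frequent in C, rare elsewhere
```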
Four Dimensions of \(CountryRisk_{i,c,t}\)
Risk that a given set of firms \(K\) associates with country \(c\):
\(CountryRisk_{c,t}^{K} = \frac{1}{N_K}\sum_{i \in K} CountryRisk_{i,c,t}\)
Foreign risks perceived by firm i at time t:
\(ForeignRisk_{i,t} = \sum_{c \ne d(i)} CountryRisk_{i,c,t}\)
Transmission of risk from origin country \(o\) to destination country \(d\) at time \(t\):
\(TransmissionRisk_{o \to d,t} = \frac{1}{N_d}\sum_{i:\, d(i)=d} CountryRisk_{i,o,t}\)
Global Risk at time t:
\(GlobalRisk_{t} = \frac{1}{N_I}\frac{1}{N_C}\sum_{i \in I}\sum_{c \in C} CountryRisk_{i,c,t}\)
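A toy sketch of these aggregations on a firm-country-quarter panel; the column names and data are illustrative, and only the grouping logic follows the definitions above:

```python
import pandas as pd

# Toy firm-country-quarter panel of CountryRisk_{i,c,t} (column names illustrative).
panel = pd.DataFrame({
    "firm":         ["A", "A", "B"],
    "hq_country":   ["US", "US", "DE"],
    "country":      ["GR", "TH", "GR"],
    "quarter":      ["2010Q1", "2010Q1", "2010Q1"],
    "country_risk": [0.8, 0.1, 0.5],
})

# Risk a set of firms associates with country c (here: all firms)
risk_of_country = panel.groupby(["country", "quarter"])["country_risk"].mean()

# Foreign risk perceived by firm i: sum over countries other than its headquarters
foreign_risk = (panel[panel["country"] != panel["hq_country"]]
                .groupby(["firm", "quarter"])["country_risk"].sum())

# Transmission from origin o to destination d: mean over firms headquartered in d
transmission = panel.groupby(["hq_country", "country", "quarter"])["country_risk"].mean()

# Global risk: mean over all firm-country observations in a quarter
global_risk = panel.groupby("quarter")["country_risk"].mean()
```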
Measuring Exposure, Sentiment, and Firm Risk
\(CountryExposure_{i,c,t}\) : tf × idf weighted share of words related to country c
\(CountrySentiment_{i,c,t}\) : tf × idf weighted sum of tone words toward country c (Loughran & McDonald 2011) (Proxy for positive/negative news about country c)
\(FirmRisk_{i,t}\) : Unweighted count of risk words. (Proxy for overall risk faced by the firm)
Financial and Non-Financial Risk: Greece
Financial and Non-Financial Risk: Thailand
Financial and Non-Financial Risk: Global
Hoberg and Phillips (2016)#
Use similarity between product descriptions in 10Ks to identify industry clusters.
How similar are two firms’ products?
The most popular way of calculating similarity is cosine similarity.
\( S_{i,j} = c_i \cdot c_j \)
where \(c_i\) is the normalized representative vector of words for document \(i\).
A creative way of figuring out who is competing with whom!
Product Descriptions to Vector
Only keep nouns (webster.com) and proper nouns. Drop most commonly used nouns.
The vector \(c_i\) holds binary values for the included words (1 if the word appears in firm \(i\)'s product description).
Cosine similarity between all firm-year pairs results in a huge matrix with firm-year as rows and columns.
Cluster firm-pairs year by year to form yearly industry clusters.
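A small sketch of the similarity computation on binary word vectors (toy data; the actual vocabulary construction follows the noun filtering described above):

```python
import numpy as np

# Toy binary word vectors: rows are firms, columns are included nouns
# (1 if the noun appears in the firm's product description).
C = np.array([[1, 0, 1, 1, 0],
              [1, 1, 1, 0, 0],
              [0, 1, 0, 0, 1]], dtype=float)

# Normalize each row to unit length, then S_ij = c_i . c_j
C_norm = C / np.linalg.norm(C, axis=1, keepdims=True)
S = C_norm @ C_norm.T            # pairwise cosine similarity; diagonal is 1
print(np.round(S, 2))
```

With thousands of firms per year, the same computation just produces a much larger firm-by-firm similarity matrix for each year.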
Result
Sample industry that changed a lot
Use this to track who is competing with whom, form industry definitions.
One drawback of 10-Ks: they are only available for US firms.
Measuring Economic Policy Uncertainty (EPU)#
This proxy for Economic Policy Uncertainty (EPU) comes from computer searches of newspapers
US index: for 10 major papers, get monthly counts of articles containing:
E: {economic or economy}, and
P: {regulation or deficit or federal reserve or congress or legislation or white house}, and
U: {uncertain or uncertainty}
Divide the count for each month by the count of all articles in that paper.
Normalize and sum across the 10 papers to get the US monthly index.
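A rough sketch of the article filter and the monthly share for one newspaper, leaving the cross-paper normalization to a later step (function names are illustrative):

```python
# Term sets from the slide above.
E = ["economic", "economy"]
P = ["regulation", "deficit", "federal reserve", "congress", "legislation", "white house"]
U = ["uncertain", "uncertainty"]

def is_epu_article(text):
    """An article counts if it contains at least one term from each of E, P, and U."""
    t = text.lower()
    return (any(w in t for w in E) and
            any(w in t for w in P) and
            any(w in t for w in U))

def monthly_epu_share(articles_by_month):
    """articles_by_month: dict month -> list of article texts for one newspaper.
    Returns the share of that paper's articles matching the EPU filter each month."""
    return {m: sum(is_epu_article(a) for a in arts) / len(arts)
            for m, arts in articles_by_month.items() if arts}
```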
Constructing the US News-Based EPU Index
Newspapers:
Boston Globe
Chicago Tribune
Dallas Morning News
Los Angeles Times
Miami Herald
New York Times
SF Chronicle
USA Today
Wall Street Journal
Washington Post
Validation: Running Detailed Human Audits
10 undergraduates have read ≈ 10,000 newspaper articles to date, using a 63-page audit guide to code whether articles discuss “economic uncertainty” and “economic policy uncertainty”.
US News-based economic policy uncertainty index
Twitter text uncertainty measures
“world uncertainty index” covering 143 countries from Economist Intelligence Unit text
Global average of all 143 countries
The Diffusion of Disruptive Technologies#
Bloom, Kalyani, Lerner, and Tahoun (2021), The Diffusion of Disruptive Technologies
Construct text-based measures of exposure to 20 different technologies at the firm, patent, and job level, 2002-19.
Use these novel data to study the spread of new technologies across firms, regions, occupations, and skill-levels.
Five Stylized Facts on Disruptive Technologies
Development & initial employment in disruptive technologies is geographically highly concentrated.
Over time, hiring associated with new technologies gradually spreads: “region broadening.”
Over time, skill level in tech jobs declines sharply: “skill broadening.”
Low-skill jobs associated with a given technology spread out significantly faster than high-skill jobs.
Pioneer locations retain long-lasting advantage in high-skilled jobs.
Data Sources
Full text of USPTO patents (1976-2016)
Typically follow a research paper format – invention title, abstract, claim, description.
Transcripts of Earnings Conference Calls (2002-19)
Discussions of 300k+ quarterly earnings by 12k publicly listed firms.
Typically contains management presentation followed by analyst Q & A.
Full text of 200 M+ online job postings from Burning Glass (BG) (2007, 2010-19).
Scraped from job forums (e.g., Glassdoor.com) and employer websites.
Geo-coded and assigned to SOC Codes
Step 1: Identify Technical Bigrams from Patents
Identify two-word combinations (bigrams) that are indicative of discussion of novel technologies.
Extract all (17 million+) bigrams from US patents (1976-2016)
Remove any bigrams that were commonly in use prior to 1970 (Corpus of Historical American English)
Keep bigrams which account for at least 1000 citations.
List of 35,063 ‘technical bigrams’ associated with influential inventions.
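A sketch of the three filters in this step, with the inputs (patent bigram counts, the pre-1970 COHA bigram set, citation counts) taken as given and names illustrative:

```python
# Sketch of the Step 1 filters (inputs taken as given; names illustrative).
def technical_bigrams(patent_bigram_counts, pre1970_bigrams, citations_by_bigram,
                      min_citations=1000):
    """patent_bigram_counts: dict/Counter of bigram counts across all US patents.
    pre1970_bigrams: set of bigrams commonly in use before 1970 (from COHA).
    citations_by_bigram: dict mapping bigram -> citations to patents using it."""
    keep = set()
    for b in patent_bigram_counts:
        if b in pre1970_bigrams:
            continue                                    # drop bigrams common before 1970
        if citations_by_bigram.get(b, 0) >= min_citations:
            keep.add(b)                                 # keep bigrams tied to influential patents
    return keep
```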
Top Bigrams in Patents
Step 2: Identify Disruptive Technologies from Earnings Calls
Identify technical bigrams that are discussed in earnings calls with increasing frequency (keep those at <10% of their maximum in the first year) – 305 in total.
Technical vs non-technical bigrams
Non-technical bigrams = bigrams in earnings calls and NOT in patents
Step 3: Bigrams to Technologies
Two alternative approaches
“Supervised”: Group bigrams with similar meaning to measure the spread of 29 specific technologies, add ‘synonyms’, and manually audit each bigram. (Main specification)
Smart Devices - mobile devices; smartphone tablet; android phones; smart phones …
3d printing - 3d printer; 3d printing; additive manufacturing; d printed
“Unsupervised”: Treat each tech bigram as a separate technology without any further intervention. (Robustness check)
Technology Exposure
Measure technology exposure at the patent, earnings call, and job level as
\( \text{exposure}_{i,\tau,t} = 1\{b_{\tau} \in D_{i,t}\} \)
where \(D_{i,t}\) is the set of bigrams contained in a job posting/earnings call posted at time \(t\) and \(b_{\tau}\) is a bigram associated with technology \(\tau\).
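A minimal sketch of this indicator; when a technology groups several bigrams (as in the supervised approach), one natural reading is exposure if any of its bigrams appears:

```python
# Sketch of the exposure indicator for a single document (job posting, call, or patent).
def exposure(doc_bigrams, tech_bigrams):
    """doc_bigrams: set of bigrams in document D_{i,t}.
    tech_bigrams: set of bigrams associated with technology tau."""
    return int(bool(doc_bigrams & tech_bigrams))   # 1 if any technology bigram appears
```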
Example Jobs Exposed to Smart Devices
On average, each technical bigram appears in 59,013 job postings. Compare to 157 average mentions of top non-technical bigrams from earnings calls.
Define an Emergence Year for each Technology
Measure the share of earnings calls mentioning a technology
Define a “technology year of emergence” as the year in which the earnings-call time series first attains at least 10% of its maximum.
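A small sketch of this rule on toy data (the share series itself would come from the exposure measure above):

```python
# Toy share of earnings calls mentioning a technology, by year.
shares = {2004: 0.001, 2007: 0.004, 2010: 0.02, 2013: 0.06, 2016: 0.05}

def emergence_year(share_by_year, threshold=0.10):
    """First year the series reaches `threshold` of its maximum."""
    peak = max(share_by_year.values())
    for year in sorted(share_by_year):
        if share_by_year[year] >= threshold * peak:
            return year

print(emergence_year(shares))   # -> 2010 in this toy example
```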
Share Exposed Firms and Job Postings – Corr. 80%
Pioneer Locations
Define pioneer locations as those that account for 50% of a technology's patents 10 years before the emergence year.
Broadening over Time and Pioneer Locations
Parts of Speech Predict Loan Repayment#
Netzer, Lemaire, and Herzenstein (2019), “When Words Sweat”
Imagine you consider lending $2,000 to one of two borrowers on a crowdfunding website. The borrowers are identical in terms of demographic and financial characteristics. However, the text they provided when applying for a loan differs:
Borrower #1:
“I am a hard working person, married for 25 years, and have two wonderful boys.
Please let me explain why I need help.
I would use the $2,000 loan to fix our roof.
Thank you, god bless you, and I promise to pay you back.”
Borrower #2:
“While the past year in our new place has been more than great,
the roof is now leaking and I need to borrow $2,000 to cover the cost of the repair.
I pay all bills (e.g., car loans, cable, utilities) on time.”
Which borrower is more likely to default?
“Loan requests written by defaulting borrowers are more likely to include words (or themes) related to the borrower’s family, financial and general hardship, mentions of god, and the near future, as well as pleading lenders for help, and using verbs in present and future tenses.”
Loan Application Words Predicting Repayment
“Legislative Influence Detectors”#
by Burgess et al.
The two largest interest group associations: ALEC (on the conservative side) and ALICE (on the liberal side)
Compare bill texts across states in two-step process:
find candidates using elasticsearch (tf-idf similarity);
compare candidates using text reuse score.
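A rough stand-in for this two-step pipeline: the paper queries elasticsearch for tf-idf candidates and then applies a dedicated text-reuse score, whereas the sketch below substitutes a local TfidfVectorizer screen and a difflib ratio purely for illustration:

```python
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_bills(query_bill, bill_texts, n_candidates=5):
    """Step 1: screen candidates by tf-idf similarity (stand-in for elasticsearch).
    Step 2: score candidates with a crude text-reuse proxy (difflib ratio)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([query_bill] + bill_texts)
    sims = cosine_similarity(X[0], X[1:]).ravel()
    candidates = sims.argsort()[::-1][:n_candidates]
    return [(i, SequenceMatcher(None, query_bill, bill_texts[i]).ratio())
            for i in candidates]
```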
From Pork to Policy#
Topic modeling Federal Reserve Bank transcripts#
Analyze speech transcripts from FOMC (Federal Open Market Committee).
private discussions among committee members at Federal Reserve (U.S. Central Bank)
150 meetings, 20 years, 26,000 speeches, 24,000 unique words.
Pre-processing:
drop stopwords, stem words; vocabulary = 10,000 words
LDA:
K = 40 topics selected for interpretability / topic coherence.
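A minimal sketch of the LDA step with scikit-learn, using placeholder pre-processed speeches (the paper's own estimation details may differ):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder pre-processed speeches (already stop-worded and stemmed).
speeches = ["inflat expect remain anchor despit energi price",
            "labor market slack wage growth moder",
            "asset purchas balanc sheet normal polici"]

counts = CountVectorizer(max_features=10_000)                       # vocabulary capped at 10,000
X = counts.fit_transform(speeches)
lda = LatentDirichletAllocation(n_components=40, random_state=0)    # K = 40 topics
theta = lda.fit_transform(X)                                        # per-speech topic shares
```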
Pro-Cyclical Topics
Counter-Cyclical Topics
Effect of Transparency
In 1993, there was an unexpected transparency shock when transcripts became public.
Increasing transparency results in:
higher discipline / technocratic language (probably beneficial)
higher conformity (probably costly)
Highlights tradeoffs from transparency in bureaucratic organizations.
Text matching for causal inference#
Application to online censorship in China by Roberts, Stewart, and Nielsen (2018)
Construct a corpus of Chinese social media posts, some of which are censored.
593 bloggers, 150,000 posts, 6 months
They use a variation of propensity score matching to identify almost identical posts, some of which were censored, and some of which were not.
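A crude stand-in for this matching step, pairing each censored post with its most similar uncensored post and keeping only near-identical pairs (the paper's actual procedure is more careful than this cosine-similarity screen):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_posts(censored, uncensored, min_similarity=0.9):
    """Pair each censored post with its most similar uncensored post,
    keeping only near-identical pairs (illustration, not the paper's method)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(censored + uncensored)
    sims = cosine_similarity(X[:len(censored)], X[len(censored):])
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= min_similarity:
            pairs.append((i, j, row[j]))
    return pairs
```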
Outcome:
Using text of subsequent posts, measure how likely they are to be censored (how censorable)
Can see whether censorship has a deterrence or backlash effect.
Censorship has a backlash effect
Bloggers who are censored respond with more censorable content.