An Introduction to Text Mining - Gabe Ignatow - Page 31


Online Data Sources

Researchers often prefer to use ready-made data instead of, or in addition to, constructing their own data sets with crawling and scraping tools. While many sources of data are in the public domain, some require access through a university subscription. For example, sources of news data include the websites of local and regional news outlets as well as private databases such as EBSCO, Factiva, and LexisNexis, which provide access to tens of thousands of global news sources, including blogs, television and radio transcripts, and traditional print news. One example of the use of such databases is a study of academic research on international entrepreneurship by the management researchers Jones, Coviello, and Tang (2011). Jones and colleagues used EBSCO and ABI/INFORM search tools to select their final data set of 323 journal articles on international entrepreneurship published between 1989 and 2009. They then used thematic analysis (see Chapter 11) to identify themes and subthemes in their data.

In addition to digitized news sources, researchers have access to writing produced by organizations, including political statements, organizational calendars, and event reports. These data include recent online writing as well as digitized historical archives. Unfortunately, many online data sources are not simple to access. Most news databases allow access to a few articles at a time but generally do not allow access to their entire database, as the subscriptions universities pay for are based on the assumption that researchers want to read a few articles on a subject rather than use large numbers of articles as primary data. Yet despite these limitations, a large and growing number of digital text collections are available for text mining researchers to use (see Appendix A). Among the most useful of these collections is the Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca), the largest public access corpus of English. Created by Davies of Brigham Young University, the corpus contains more than 520 million words of text and is divided equally among spoken language, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2015 and is updated regularly. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. COCA and related corpora are often used by social scientists as secondary data sources in order to compare word frequencies between their main data source and “standard” English (e.g., Baker et al., 2008).
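The frequency comparison described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the studies cited: it assumes both the main data source and the reference corpus are already available as plain text, and the function names (tokenize, relative_frequencies, overuse_ratios), the simple regular-expression tokenizer, and the floor value are all assumptions made for the example. A ratio above 1 suggests a word is more frequent in the main data than in the reference corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Map each word to its share of all tokens (count / total)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def overuse_ratios(study_tokens, reference_tokens, floor=1e-6):
    """Ratio of each word's relative frequency in the study corpus
    to its relative frequency in the reference corpus.

    Words absent from the reference corpus are compared against
    `floor` (a hypothetical smoothing value) to avoid division by zero.
    """
    study = relative_frequencies(study_tokens)
    reference = relative_frequencies(reference_tokens)
    return {
        word: freq / max(reference.get(word, 0.0), floor)
        for word, freq in study.items()
    }


if __name__ == "__main__":
    # Toy texts standing in for a main data source and a reference corpus.
    study = tokenize("refugees arrive and refugees wait")
    reference = tokenize("people arrive and people wait and refugees rest")
    ratios = overuse_ratios(study, reference)
    for word, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
        print(word, round(ratio, 2))
```

In practice, researchers typically use a statistical keyness measure rather than a raw ratio, and a reference corpus such as COCA would supply the reference counts; the toy texts here only show the shape of the calculation.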

Another major source of digital data is represented by social media platforms, many of which provide their own application programming interfaces (APIs) for programmatic access to their data. The Twitter APIs (http://dev.twitter.com), for instance, allow one to access a small set of random tweets every day, or larger keyword-based collections of tweets (e.g., all the recent tweets with the hashtag #amused). If larger collections are necessary, they can be obtained through third-party vendors such as Gnip, which cover several social media sites and often partly curate the data. Twitter also provides limited demographic information on its users, such as location and self-maintained free-text profiles that sometimes include gender, age, industry, interests, and other details.
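As a rough illustration of a keyword-based collection, the sketch below filters posts that have already been collected and stored as Python dictionaries, keeping those that contain a given hashtag. It does not call the Twitter APIs. The record layout, the "text" field name, and the function names are assumptions made for the example.

```python
import re


def has_hashtag(text, tag):
    """Return True if `text` contains #tag as a whole hashtag.

    Matching is case-insensitive, and longer hashtags that merely
    start with the same letters (e.g. #amusedly) are not matched.
    """
    pattern = r"(?<!\w)#" + re.escape(tag) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_by_hashtag(records, tag, text_field="text"):
    """Keep the records whose text field contains the hashtag.

    `records` is assumed to be a list of dicts; `text_field` is a
    hypothetical field name holding the post's text.
    """
    return [r for r in records if has_hashtag(r.get(text_field, ""), tag)]


if __name__ == "__main__":
    posts = [
        {"text": "So #amused today"},
        {"text": "not amused at all"},
        {"text": "#Amused!"},
    ]
    for post in filter_by_hashtag(posts, "amused"):
        print(post["text"])
```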

Blogs can also be accessed through an API. For instance, the Blogger platform offers programmatic access to blogs and to bloggers' profiles, which include a rich set of fields covering location, gender, age, industry, favorite books and movies, interests, and so on. Other blog sites, such as LiveJournal, include additional information on the bloggers, for instance, their mood when writing a blog post.

Facebook is another very large social media platform, although it is less available for public access. The main way for developers to access Facebook data is via its Graph API, but access is limited to the content of profiles that are either publicly available or are “friends” (in Facebook terms) of the developers. An interesting data set for social science research is the myPersonality¹ data set: It was compiled using a Facebook application, and it includes the profiles and updates of a large number of Facebook users who have also completed a battery of psychological surveys (e.g., personality, values).

¹ Available upon request from http://mypersonality.org.

In addition, there are several other social media websites, with different target audiences, such as Instagram (where users upload mainly images they take), Pinterest (with “pins” of interesting things, covering a variety of domains from DIY to fashion to design and decoration), and many review platforms such as Amazon, Yelp, and others.

If you are interested in assembling your own data set, Chapter 6 provides an overview of software tools for scraping and crawling websites, and Chapter 5 provides instruction on data selection and sampling.
