Machine Learning For Dummies - John Paul Mueller, Luca Massaron - Page 34

Obtaining data from public sources

Governments, universities, nonprofit organizations, and other entities often maintain publicly available databases that you can use alone or combine with other databases to create big data for machine learning. For example, you can combine data from several Geographic Information Systems (GIS) to create the big data required to make decisions such as where to put new stores or factories. The machine learning algorithm can take all sorts of information into account, from the amount of taxes you have to pay to the elevation of the land (which can make your store easier to see).
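
To make the idea of combining sources concrete, here is a minimal sketch that joins two small tables on a shared key. The parcel IDs, column names, and values are invented for illustration and don't come from any real GIS or tax database; a real project would load downloaded files and would likely use a dedicated data library.

```python
import csv
import io

# Hypothetical extracts from two public sources (all values invented).
ELEVATION_CSV = """parcel_id,elevation_m
A1,120
A2,85
A3,240
"""

TAX_CSV = """parcel_id,annual_tax_usd
A1,5400
A2,3100
A4,7800
"""


def read_by_key(text, key):
    """Parse CSV text into a dict of rows, indexed by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def inner_join(left, right):
    """Keep only the keys found in both sources and merge their fields."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


elevation = read_by_key(ELEVATION_CSV, "parcel_id")
taxes = read_by_key(TAX_CSV, "parcel_id")
combined = inner_join(elevation, taxes)

for parcel_id in sorted(combined):
    print(combined[parcel_id])
```

Parcels that appear in only one source (A3 and A4 here) drop out of the result, which is a reminder that combined data is only as complete as the overlap between its sources.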

The best part about using public data is that it’s usually free, even for commercial use (or you pay a nominal fee for it). In addition, many of the organizations that created these sources maintain them in nearly perfect condition because the organization has a mandate, uses the data to attract income, or uses the data internally. When obtaining public source data, you need to consider a number of issues to ensure that you actually get something useful. Here are some of the criteria you should think about when making a decision:

 The cost, if any, of using the data source

 The formatting of the data source

 Access to the data source (which means having the proper infrastructure in place, such as an Internet connection when using Twitter data)

 Permission to use the data source (some data sources are copyrighted)

 Potential issues in cleaning the data to make it useful for machine learning

 Potential security issues in accessing the data, adding it to other data sources, and managing it locally

 Ensuring that the data is the original data, rather than data that purports to be original but has been biased or modified in other ways that would change the results of using it

 Determining that the data doesn’t contain personally identifiable information that the data source originator may not have permission to use. (Chapter 22 covers issues like this one.)
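
Two of the criteria above, confirming that the data is the original and checking for personally identifiable information, lend themselves to a quick automated first pass. The sketch below assumes the data provider publishes a SHA-256 checksum alongside the download, which not every source does. The sample data, the function names, and the two regular expressions are hypothetical, and a pattern scan like this can only flag obvious cases. It can't prove that the data is free of personal information.

```python
import hashlib
import re


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_checksum(data: bytes, published_hex: str) -> bool:
    """Compare the downloaded bytes against a checksum from the provider."""
    return sha256_of(data) == published_hex.strip().lower()


# Rough, illustrative patterns only; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def find_possible_pii(text: str):
    """Return (line_number, kind) pairs for lines that look like PII."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
            if pattern.search(line):
                hits.append((line_no, kind))
    return hits


# Invented sample standing in for a downloaded file.
sample = b"store_id,contact\n1,jane@example.org\n2,555-867-5309\n3,none\n"
published = sha256_of(sample)  # stand-in for the provider's published value

print(matches_published_checksum(sample, published))
print(find_possible_pii(sample.decode("utf-8")))
```

A checksum mismatch tells you only that the bytes differ from what the provider published. It says nothing about whether the provider's own copy is unbiased, so it complements the judgment calls in the list rather than replacing them.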
