Data Science For Dummies - Lillian Pierson
Defining data engineering
If engineering is the practice of using science and technology to design and build systems that solve problems, you can think of data engineering as the engineering domain dedicated to building and maintaining data systems that overcome the processing bottlenecks and handling problems caused by the high volume, velocity, and variety of big data.
Data engineers use skills in computer science and software engineering to design systems for, and solve problems with, handling and manipulating big datasets. Data engineers often have experience working with (and designing) real-time processing frameworks and massively parallel processing (MPP) platforms (discussed later in this chapter), as well as with RDBMSs. They generally code in Java, C++, Scala, or Python. They know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big data into datasets with more manageable sizes. Simply put, with respect to data science, the purpose of data engineering is to engineer large-scale data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
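To make the idea of refining big data into a more manageable dataset concrete, here is a minimal sketch of the map, shuffle, and reduce pattern in plain Python. It does not use Hadoop or Spark. The sample records and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `summarize`) are hypothetical and chosen only for illustration. A real framework would distribute each phase across many machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw event records: (user_id, bytes_transferred).
RAW_EVENTS = [
    ("alice", 120),
    ("bob", 300),
    ("alice", 80),
    ("carol", 50),
    ("bob", 100),
]

def map_phase(records):
    """Emit one (key, value) pair per raw record."""
    for user_id, nbytes in records:
        yield user_id, nbytes

def shuffle_phase(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }

def summarize(records):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))

summary = summarize(RAW_EVENTS)
print(summary)  # {'alice': 200, 'bob': 400, 'carol': 50}
```

Five raw records collapse into three per-user totals. At big data scale the same pattern can shrink billions of raw events into a summary table small enough for a data scientist to analyze directly.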
Most engineered systems are built systems — they are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world — a world that can seem abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.
Using data engineering skills, you can, for example:
Integrate data pipelines with the natural language processing (NLP) services that were built by data scientists at your company.
Build mission-critical data platforms capable of processing more than 10 billion transactions per day.
Tear down data silos by finally migrating your company’s data from a more traditional on-premises data storage environment to a cutting-edge cloud warehouse.
Enhance and maintain existing data infrastructure and data pipelines.
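To get a feel for the scale in the second example, the short calculation below converts 10 billion transactions per day into an average per-second rate. It assumes traffic is spread evenly across the day, which is a simplification. Real workloads are usually bursty, so peak capacity would likely need to be higher than this average. The function name `average_per_second` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def average_per_second(transactions_per_day):
    """Average throughput, assuming load is spread evenly over the day."""
    return transactions_per_day / SECONDS_PER_DAY

rate = average_per_second(10_000_000_000)
print(f"{rate:,.0f} transactions per second on average")
# prints "115,741 transactions per second on average"
```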
Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.