We conclude this book with a discussion of ethics in data science: With great power comes great responsibility.
The authors express their deepest gratitude to Wiley for making the publication a reality.
El Paso, TX and Mahwah, NJ, USA
September 2021
Maria Cristina Mariani, Osei Kofi Tweneboah, Maria Pia Beccar‐Varela
1 Background of Data Science
1.1 Introduction
Data science is one of the most promising and high‐demand career paths for skilled professionals in the 21st century. Successful data professionals understand that they must advance beyond the traditional skills of analyzing large amounts of data, statistical learning, and programming. In order to explore and discover useful information for their companies or organizations, data scientists must have a good grasp of the full spectrum of the data science life cycle and enough flexibility and understanding to maximize returns at each phase of the process.
Data science is a “concept to unify statistics, mathematics, computer science, data analysis, machine learning and their related methods” in order to find trends and to understand and analyze actual phenomena with data. For example, due to the coronavirus disease (COVID-19), many colleges, institutions, and large organizations asked their nonessential employees to work virtually. The resulting virtual meetings have provided colleges and companies with plenty of data. Some aspects of the data suggest that virtual fatigue is on the rise. Virtual fatigue is defined as the burnout associated with overdependence on virtual platforms for communication. Data science provides tools to explore and reveal the best and worst aspects of virtual work.
In the past decade, data scientists have become necessary assets and are present in almost all institutions and organizations. These professionals are data‐driven individuals with high‐level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization. This is coupled with the experience in communication and leadership needed to deliver tangible results to various stakeholders across an organization or business.
Data scientists need to be curious and result‐oriented, with good domain‐specific knowledge and communication skills that allow them to explain very technical results to their nontechnical counterparts. They possess a strong quantitative background in statistics and mathematics as well as programming knowledge, with a focus on data warehousing, mining, and modeling to build and analyze algorithms. In fact, data scientists are a group of analytical data experts who have the technical skills to solve complex problems and the curiosity to explore how problems need to be solved.
1.2 Origin of Data Science
Data scientists are part mathematicians, part statisticians, and part computer scientists. Because they span both the business and information technology (IT) worlds, they are in high demand and well paid. Data scientists were not very popular some decades ago; their sudden popularity reflects how businesses now think about “Big data.” Big data is defined as a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data‐processing application software. That bulky mass of unstructured information can no longer be ignored and forgotten. It is a virtual gold mine that helps boost revenue as long as there is someone who explores and discovers business insights that no one thought to look for before. Many data scientists began their careers as statisticians, business analysts, or data analysts. However, as big data began to grow and evolve, those roles evolved as well. Data is no longer just an add‐on for IT to handle. It is vital information that requires analysis, creative curiosity, and the ability to translate high‐tech ideas into innovative ways to make a profit and to help practitioners make informed decisions.
1.3 Who is a Data Scientist?
The term “data scientist” was coined as recently as 2008, when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data. Data scientists are quantitative and analytical data experts who utilize their skills in both technology and social science to find trends and manage the data around them. With the growth of big data integration in business, they have emerged at the forefront of the data revolution. They are part mathematicians, statisticians, computer programmers, and analysts who are equipped with a diverse and wide‐ranging skill set, balancing knowledge of several computer programming languages with advanced experience in statistical learning and data visualization.
There is no definitive job description for the data scientist role. However, we outline here some of the things data scientists do:
Collecting and recording large amounts of unruly data and transforming it into a more usable format.
Solving business‐related problems using data‐driven techniques.
Working with a variety of programming languages and statistical software, including SAS, Minitab, R, and Python.
Having a strong background in mathematics and statistics, including statistical tests and distributions.
Staying on top of quantitative and analytical techniques such as machine learning, deep learning, and text analytics.
Communicating and collaborating with both IT and business.
Looking for order and patterns in data, as well as spotting trends that enable businesses to make informed decisions.
Some of the useful tools that every data scientist or practitioner needs are outlined below:
Data preparation: The process of cleaning and transforming raw data into suitable formats prior to processing and analysis.
Data visualization: The presentation of data in a pictorial or graphical format so it can be easily analyzed.
Statistical learning or machine learning: A branch of artificial intelligence based on mathematical algorithms and automation. Artificial intelligence (AI) refers to the process of building smart machines capable of performing tasks that typically require human intelligence. Such machines are designed to make decisions, often using real-time data. Real-time data is information that is passed along to the end user as soon as it is gathered.
Deep learning: An area of statistical learning research that uses data to model complex abstractions.
Pattern recognition: Technology that recognizes patterns in data (often used interchangeably with machine learning).
Text analytics: The process of examining unstructured data and drawing meaning out of written communication.
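As a small illustration of the first tool in the list above, data preparation, the following Python sketch cleans a handful of raw records before a simple summary is computed. The records, field names, and helper functions (`raw_records`, `parse_age`, `prepare`) are hypothetical and chosen only for this example; real projects would typically rely on dedicated libraries and far larger data sets.

```python
from statistics import mean

# Hypothetical raw records: inconsistent spacing, casing, and missing values.
raw_records = [
    {"name": "  Alice ", "age": "34", "city": "el paso"},
    {"name": "Bob", "age": "", "city": "Mahwah "},
    {"name": "carol", "age": "29", "city": "EL PASO"},
    {"name": "Dan", "age": "n/a", "city": "mahwah"},
]


def parse_age(text):
    """Return the age as an int, or None when the field is missing or not numeric."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def prepare(records):
    """Trim whitespace, normalize casing, and convert ages to numbers."""
    cleaned = []
    for record in records:
        cleaned.append({
            "name": record["name"].strip().title(),
            "age": parse_age(record["age"]),
            "city": record["city"].strip().title(),
        })
    return cleaned


cleaned_records = prepare(raw_records)

# Summarize only the rows that have a usable age.
known_ages = [r["age"] for r in cleaned_records if r["age"] is not None]
average_age = mean(known_ages)
```

In this sketch, unusable ages are kept as `None` rather than dropped, so the decision about how to treat missing values is left to the analysis step that follows.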
We will discuss all of the above tools in detail in this book. There are several scientific and programming skills that every data scientist should have. They must be able to utilize key technical tools and skills, including R, Python, SAS, SQL, Tableau, and several others. Because technology is constantly advancing, data scientists must always learn new and emerging techniques to stay on top of their game. We will discuss R and Python programming in Chapters