Becoming a Data Head. Alex J. Gutman.

Author: Alex J. Gutman
Publisher: John Wiley & Sons Limited
Genre: Economics
ISBN: 9781119741718
means. Note that not every dataset will have a header row. In such cases, the header row is implied, and the person working with the dataset must know what each feature means.
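
      The idea of an implied header row can be sketched in a few lines of Python. This is only an illustration: the file contents and the feature names Date, Location, and Spend are made up for the example, and the names must be supplied by someone who already knows what each column means.

```python
import csv
import io

# Hypothetical file contents with NO header row (all values are invented).
raw_text = "2021-01-04,Print,1200\n2021-01-05,Online,850\n"

# Assumed feature names, provided by a person who knows the dataset.
assumed_names = ["Date", "Location", "Spend"]

# With `fieldnames` given, DictReader treats every line as data
# instead of reading the first line as a header.
reader = csv.DictReader(io.StringIO(raw_text), fieldnames=assumed_names)
rows = list(reader)

for row in rows:
    print(row["Date"], row["Location"], row["Spend"])
```

      Without the supplied names, the first data row would be mistaken for the header, which is why the knowledge has to live with the person (or documentation) rather than in the file.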

      There are many ways to encode information, but data workers rely on a few specific encodings to store information and communicate results. The two most common data types are numeric and categorical.

      Numeric data is mostly made up of numbers but might use additional symbols to identify units. Categorical data is made up of words, symbols, phrases, and (confusingly) sometimes numbers, like ZIP codes. Numeric and categorical data both split into further subcategories.

      There are two main types of numeric data:

       Continuous data can take on any value on a number line. It represents a fundamentally uncountable set of values. Consider the weather. The outside temperature, if collected and turned into data, would represent a continuous variable. A local news station might measure a temperature of 65.62 degrees Fahrenheit. However, it may choose to report this number to you as 65 degrees, 66 degrees, or 65.6 degrees Fahrenheit.

       Count (or discrete) data, unlike continuous data, restricts the precision of the data to a whole number. For example, the number of cars you own can be 0, 1, 2, or more, but not 1.23. This reflects the underlying reality of the thing being measured.
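
      A small Python sketch can make the difference concrete. It uses the temperature from the text and shows three ways the same continuous measurement might be reported, then checks whether a value is a legitimate count. The helper is_valid_count is a hypothetical name introduced only for this illustration.

```python
import math

# Continuous: the measured temperature from the text, stored as a float.
measured_temp_f = 65.62

# Three ways the same measurement might be reported.
reported_floor = math.floor(measured_temp_f)   # drop the fractional part
reported_nearest = round(measured_temp_f)      # nearest whole degree
reported_tenth = round(measured_temp_f, 1)     # one decimal place

# Count: the number of cars owned is a whole number, stored as an int.
cars_owned = 2

def is_valid_count(value):
    """Return True only for non-negative whole numbers (bools excluded)."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0

print(reported_floor, reported_nearest, reported_tenth)
print(is_valid_count(cars_owned), is_valid_count(1.23))
```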

      Categorical data also has two main types:

       Ordered (or ordinal) data is categorical data with an inherent order. Surveys, for example, take advantage of ordinal data when they ask you to rate your experience from 1 to 10. While this looks like count data, it's not possible to say that the difference between survey ratings 10 and 9 is the same as the difference between 2 and 1. Of course, ordinal categorical data does not have to be encoded as numbers. Shirt size, for example, is ordinal: small, medium, large, extra-large.

       Unordered (or nominal) categorical data does not have an underlying order to follow. Table 2.1, for example, has a Location feature with values Print, Online, Television. Other nominal variables include Yes or No responses and Democrat or Republican party affiliation. Their order as presented is always arbitrary; it's not possible to say one category is “greater than” another.
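
      The two categorical types behave differently in code, as the following Python sketch suggests. The shirt sizes and Location values come from the text; the lists of observations and the helper size_rank are invented for illustration.

```python
from collections import Counter

# Ordinal: shirt sizes have an inherent order, listed smallest to largest.
SIZE_ORDER = ["small", "medium", "large", "extra-large"]

def size_rank(size):
    """Position in the ordering; the order is meaningful, the gaps are not."""
    return SIZE_ORDER.index(size)

shirts = ["large", "small", "extra-large", "medium"]  # made-up observations
sorted_shirts = sorted(shirts, key=size_rank)

# Nominal: Location values have no order, so we only compare for
# equality or count how often each value appears.
locations = ["Print", "Online", "Television", "Online"]  # made-up observations
location_counts = Counter(locations)

print(sorted_shirts)
print(location_counts["Online"])
```

      Sorting the shirt sizes is meaningful, while sorting the locations would only reflect alphabetical order, not anything about the data.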

      You'll notice Table 2.1 has a Date feature, which is an additional data type that is sequential and can be used in arithmetic expressions like numeric data.
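
      As a rough illustration of dates being both sequential and usable in arithmetic, here is a short Python sketch. The two dates are hypothetical and are not taken from Table 2.1.

```python
from datetime import date, timedelta

# Hypothetical dates chosen only for the example.
start = date(2021, 1, 4)
end = date(2021, 1, 25)

# Arithmetic: subtracting two dates gives an elapsed duration.
days_between = (end - start).days

# Arithmetic: adding a duration to a date gives another date.
one_week_later = start + timedelta(days=7)

# Sequence: dates can be compared and ordered.
is_in_order = start < one_week_later < end

print(days_between, one_week_later, is_in_order)
```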

      The preceding section talked about data types within a dataset, but there are larger categories for describing data that refer to how it was collected and how it's structured.

      Observational vs. Experimental Data

      Data can be described as observational or experimental, depending on how it's collected.

       Observational data is collected based on what's seen or heard by a person or computer passively observing some process.

       Experimental data is collected following the scientific method using a prescribed methodology.

      Most of the data in your company, and in the world, is observational. Examples of observational data include visits to a website, sales on a given date, and the number of emails you receive each day. Sometimes it's saved for a specific purpose; other times, for no purpose at all. We've also heard the phrase “found data” used for this type of data; it's often created as a byproduct of things like sales transactions, credit card payments, Twitter posts, or Facebook likes. In that sense, it's sitting in a database somewhere, waiting to be discovered and used for something. Sometimes observational data is collected because it's free and easy to collect. But it can also be deliberately collected, as with customer surveys or political polls.

      This setup spans industries, from drug trials to marketing campaigns. In digital marketing, web designers frequently experiment on us by designing competing layouts or advertisements for web pages. When we shop online, a coin flip happens behind the scenes to determine which of two advertisements, call them A and B, we are shown. After several thousand unknowing guinea pigs visit the site, the web designers see which ad led to more “click-throughs.” Because ads A and B were shown randomly, it's possible to determine which ad was better with respect to click-through rate: all other potentially confounding features (time of day, type of web surfer, etc.) have been balanced out through randomization. You might hear experiments like this called “A/B tests” or “A/B experiments.”
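
      The coin-flip assignment described above can be simulated with a short Python sketch. Everything here is an assumption made for illustration: the function name run_ab_test, the number of visitors, and the "true" click rates of 5% and 10% are invented, and a real analysis would also apply a statistical test before declaring a winner.

```python
import random

def run_ab_test(n_visitors, click_rate_a, click_rate_b, seed=0):
    """Simulate an A/B test where each visitor is assigned an ad by coin flip."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    shown = {"A": 0, "B": 0}
    clicks = {"A": 0, "B": 0}
    for _ in range(n_visitors):
        # The coin flip: each visitor has a 50/50 chance of seeing A or B.
        ad = "A" if rng.random() < 0.5 else "B"
        shown[ad] += 1
        # Simulated behavior: the visitor clicks with the ad's assumed rate.
        true_rate = click_rate_a if ad == "A" else click_rate_b
        if rng.random() < true_rate:
            clicks[ad] += 1
    # Observed click-through rate for each ad.
    rates = {ad: clicks[ad] / shown[ad] for ad in shown}
    return shown, clicks, rates

shown, clicks, rates = run_ab_test(20_000, click_rate_a=0.05, click_rate_b=0.10)
better_ad = max(rates, key=rates.get)
print(shown, rates, better_ad)
```

      Because assignment depends only on the coin flip, the two groups end up roughly equal in size and differ, on average, only in which ad they saw.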

      We will talk more about why this distinction matters in Chapter 4, “Argue with the Data.”

      Structured vs. Unstructured Data

      Data is also said to be structured or unstructured. Structured data is like the data in your spreadsheets or in Table 2.1. It's been presented with a sense of order and structure in the form of rows and columns.

      Unstructured data refers to things like text from Amazon reviews, pictures on Facebook, YouTube videos, or audio files. Unstructured data requires clever techniques to convert it into the structured data that analysis methods require (see Part III of this book).
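
      To give a feel for what such a conversion can look like, here is a minimal Python sketch that turns free-text reviews into rows and columns by counting words. Word counting is just one simple technique, chosen here for illustration and not necessarily the approach covered later in the book; the review texts are invented.

```python
import re
from collections import Counter

# Hypothetical unstructured data: free-text reviews (invented for the example).
reviews = [
    "Great book, great examples.",
    "Not great. Too long.",
]

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

# Columns: every distinct word across all reviews, in alphabetical order.
vocabulary = sorted({word for review in reviews for word in tokenize(review)})

# Rows: one per review, holding the count of each vocabulary word.
table = []
for review in reviews:
    counts = Counter(tokenize(review))
    table.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in table:
    print(row)
```

      The result has one row per review and one column per word, which is the rows-and-columns shape that structured analysis methods expect.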

      Is Data One or Many?

      The word data is actually the plural version of the word datum. (Like criteria—the plural of criterion. Or agenda—the plural of the word agendum.) If we were following proper rules of language, we would say “these data are continuous” and not “this data is continuous.”

      Data does not always look like a dataset or spreadsheet. It's often in the form of summary statistics. Summary statistics enable us to understand information about a set of data.

      The three most common summary statistics are mean, median,