How can we analyze our company's data? First steps

Date 4/12/2017
Category Big Data

Given a data set, we typically want its analysis to answer key questions for our business:

  • What are the factors that most influence the productivity of my employees?
  • Can I predict the loss or gain of a business opportunity in my company?
  • What is the probability of a shutdown in my production line?

Before going deeper into the subject of this post, and for basic background on the types of data analysis, I recommend reading the article "The level of maturity of organizations in data analysis", where my colleague Ivan Toda explains, among other ideas, why data has become the best raw material for companies.

For this reason, before applying detailed analysis techniques, we must ask ourselves whether we have the necessary raw material to obtain the right answers.

During this first phase, the analyst becomes familiar with the data set: the data are described in terms of structure, size, type and distribution of the variables. This first analysis is known as descriptive analysis.

Data types

This first exercise of detecting both the number of variables available and their type is essential. Depending on the types of data available, it will be possible to determine the analysis techniques best suited to these data. The classification of the variables according to the typology of the data is as follows:

  1. Categorical:
  • Dichotomous: a variable that can take only two values, which can be coded as 0 and 1. These are sometimes called flag or indicator variables. For example: man/woman.
  • Nominal: a variable whose values (often stored as numeric codes) represent a category or identify membership of a group, with no inherent order. For example: colors.
  • Ordinal: a variable whose values also represent a category. The difference from a nominal variable is that the values can be ordered on a scale. For example: mild/moderate/strong pain.
  2. Numerical:
  • Discrete: a numeric variable that has a countable number of values between any two values. For example: number of inhabitants in a city.
  • Continuous: a numeric variable that can take any value within a range. It is very important to know the units in which it is measured. For example: height of a person.
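
As a rough illustration of the classification above, the following Python sketch guesses the type of a variable from its observed values. The function name classify_variable and the ordered_levels hint are hypothetical, introduced only for this example. Note that a nominal variable stored as numeric codes would be reported as discrete, so the result is a starting point for the analyst, not a verdict.

```python
from numbers import Real


def classify_variable(values, ordered_levels=None):
    """Guess the type of a variable from its observed values.

    ordered_levels is a hypothetical hint: the analyst passes the ordered
    categories when the variable is ordinal, because the order cannot be
    inferred from the data alone.
    """
    distinct = set(values)
    if ordered_levels is not None and distinct <= set(ordered_levels):
        return "ordinal"
    if len(distinct) == 2:
        return "dichotomous"
    # bool is a subclass of int, so it is excluded explicitly.
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in distinct):
        if all(float(v).is_integer() for v in distinct):
            return "discrete"
        return "continuous"
    return "nominal"


print(classify_variable(["man", "woman", "woman"]))
print(classify_variable(["red", "green", "blue"]))
print(classify_variable(["mild", "strong", "moderate"],
                        ordered_levels=["mild", "moderate", "strong"]))
print(classify_variable([12000, 350000, 48000]))
print(classify_variable([1.75, 1.62, 0.96]))
```

The five calls correspond, in order, to the dichotomous, nominal, ordinal, discrete and continuous examples given in the list.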

Graphical representation

It is said that "a picture is worth a thousand words", a saying that becomes especially relevant when analyzing a data set.

Some types of representations are as follows:

  • Pie chart (sector chart): This chart is used to represent the proportion in which each value appears with respect to the total. The variables represented by this type of graph are categorical.


Representation of the values taken by the categorical variable with the classes: 0, 1, 2, 3, 4, 5, 6, 7.
 

  • Histograms: This type of representation gives an idea of the distribution of the values that a variable takes. For example, we can see whether the distribution of the values fits a theoretical normal distribution (as can be seen in the image below). Histograms are the usual choice when graphing continuous variables.


Blue columns: frequencies of the readings. Green line: theoretical normal distribution that best fits the data.
 

  • Bar graph: This graph represents how often each value appears, without necessarily expressing it relative to a total. It is mainly used to represent categorical variables.


Representation of the values taken by the categorical variable "Marital status".
 

  • Line graph: This type of graph represents evolution in the data. It is mainly useful when representing variables of continuous type for which values are available over time.


Representation of the temporal evolution of a variable per year-month.
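
Before drawing any of these charts, the underlying numbers have to be computed. The following Python sketch, using only the standard library, shows the quantities behind a pie chart (proportions per category) and a histogram (counts per interval). The function names and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def proportions(values):
    """Share of each category over the total (what a pie chart draws)."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical: they all belong to the first interval.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is placed in the last interval, not a new one.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample data.
marital_status = ["single", "married", "married", "divorced"]
print(proportions(marital_status))

readings = [1, 2, 2, 3, 9]
print(histogram_counts(readings, bins=4))
```

The raw counts from Counter are what a bar graph would display; dividing by the total gives the pie chart sectors.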

 

Detection and treatment of outliers

Outliers or anomalies are values that clearly stand out from the rest because they are very infrequent, for example values that are far too high or far too low.

These values may be caused by errors in data entry or collection, or they may be genuine readings. They are normally excluded from the analysis because they distort the results.

The most widely used techniques, alongside intuitive ones, are based on computing distances between the data points, since a point that differs from the others is, in effect, a point that lies far from them.


In green, the outlier that distorts the fit of a data set (points) to a linear model (red line).
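
The article does not name a specific detection method. As one common example of the "distance from the rest" idea, the Python sketch below applies the interquartile range (IQR) rule: a value is flagged when it lies more than k times the IQR below the first quartile or above the third. The function name, the default k = 1.5 and the sample readings are assumptions made for this illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # [95]
```

Whether a flagged value is then removed, corrected or kept depends on whether it is an error or a genuine reading, as discussed above.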

Detection and treatment of missing values

It is also important to detect whether there are missing values. Knowing why the readings are missing (device reading failure, wi-fi network disconnection, etc.) can help in deciding how to handle the gaps.

Removing them from the analysis is not always the best option: in some cases they carry useful information, and removing them may shrink the data set too much.
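
A minimal Python sketch of this decision follows, assuming missing readings are represented as None. It first measures how much data is missing and then shows two options: dropping the gaps or filling them with the median of the available readings. The function names and sample data are hypothetical, and median imputation is only one of several possible strategies.

```python
import statistics


def summarize_missing(readings):
    """Return the number and the fraction of missing readings."""
    missing = sum(1 for r in readings if r is None)
    return missing, missing / len(readings)


def drop_missing(readings):
    """Option 1: discard missing readings (shrinks the data set)."""
    return [r for r in readings if r is not None]


def fill_with_median(readings):
    """Option 2: replace each missing reading with the observed median."""
    median = statistics.median(drop_missing(readings))
    return [median if r is None else r for r in readings]


# Hypothetical sensor readings with two gaps.
readings = [20.5, None, 21.0, 19.5, None]
print(summarize_missing(readings))  # (2, 0.4)
print(drop_missing(readings))       # [20.5, 21.0, 19.5]
print(fill_with_median(readings))   # [20.5, 20.5, 21.0, 19.5, 20.5]
```

Here dropping the gaps would discard 40% of the data set, which is the kind of loss that makes filling the values worth considering.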

Basic statistics

Position measures

The purpose of these statistics is to summarize information as to where (around what value) the data are located.

To see how each of them is calculated, let's look at a practical example. Suppose we have the following data set with the heights of the members of a family, measured in meters:

(1.75, 1.62, 1.62, 0.96, 1.43)

  • Arithmetic mean: 1.476. It is obtained by dividing the sum of all values by the number of values.

(1.75 + 1.62 + 1.62 + 0.96 + 1.43) / 5 = 1.476

  • Median: 1.62. This is the value in the central position once the values of the data set are sorted.
  • Mode: 1.62. This is the most frequently repeated value in a data distribution.
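
The three position measures for the family heights can be checked with Python's standard statistics module. This is just one convenient way to compute them; the variable names are chosen for this example.

```python
import statistics

# Heights of the family members, in meters (data from the example above).
heights = [1.75, 1.62, 1.62, 0.96, 1.43]

mean = statistics.mean(heights)      # sum of the values divided by their count
median = statistics.median(heights)  # central value once the data are sorted
mode = statistics.mode(heights)      # most frequently repeated value

print(round(mean, 3))  # 1.476
print(median)          # 1.62
print(mode)            # 1.62
```

The mean (1.476) sits below the median (1.62) because the single low value, 0.96, pulls it down while leaving the median unchanged.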

Dispersion measures

Once we know the values around which the data are located, we must ask ourselves: how much do the data vary around those values? For this, there are a number of statistics known as dispersion measures. The purpose of these statistics is to summarize the information regarding variability.

Continuing with the family example:

  • Range: the interval within which the data vary, that is, the one defined by the minimum and maximum values of our set.

[0.96, 1.75]

  • Variance: in general, it is defined as the expected value of the squared difference between the variable and its mean.

  • Standard deviation: given by the square root of the variance.
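
The dispersion measures for the same family heights can be sketched in Python as follows. Following the definition above, the example uses the population variance (statistics.pvariance), which averages the squared differences over all five values. That choice is an assumption: if the family were treated as a sample of a larger population, statistics.variance, which divides by n - 1, would be used instead.

```python
import statistics

# Heights of the family members, in meters (data from the example above).
heights = [1.75, 1.62, 1.62, 0.96, 1.43]

# Range: interval defined by the minimum and maximum values.
data_range = (min(heights), max(heights))

# Population variance: mean of the squared differences from the mean.
variance = statistics.pvariance(heights)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(heights)

print(data_range)          # (0.96, 1.75)
print(round(variance, 4))  # 0.077
print(round(std_dev, 4))   # 0.2775
```

The standard deviation is expressed in the same units as the data (meters), which makes it easier to interpret than the variance.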

It is during this first exploratory analysis that the most evident relationships between the different observations, as well as possible errors in data entry, begin to be reflected.

This first source of knowledge is the one that allows us to continue investigating the data by means of more rigorous mathematical analysis. In short, you cannot extract value from a data set without first knowing it.

If you want more information about the different types of data analysis that exist and how to apply them to your company or business, you can download the recording of the following webinars:

Watson Analytics: Analyze and predict your business data

Predictive Analytics: Magic or Reality: Can we predict the future?

IBM Watson: tools for cognitive analysis and predictive analysis

Authors

Lidia Orellana
