The first step in data analysis is to obtain the data.
Raw data can be collected on paper or in web forms.
Advanced data analysis techniques require the data to be complete.
Different statistical techniques apply to different data types and amounts of data.
When large amounts of data are available, it is important to understand the structure of the data, how it is stored and how its parts are related.
Data cleaning is the process of removing empty or invalid lines from the data received.
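As a minimal sketch of this kind of cleaning, the Python snippet below drops empty and invalid lines from a small CSV text. The sample data, the column names and the validity rule (a non-empty name and a numeric age) are assumptions made for illustration only; real data would need its own rules.

```python
import csv
import io

# Hypothetical raw export: one empty line and one row with an invalid age.
RAW = """name,age
Ana,34

Rui,abc
Eva,51
"""


def is_valid(row):
    """Assumed rule: exactly two fields, a non-empty name and a numeric age."""
    return len(row) == 2 and row[0].strip() != "" and row[1].strip().isdigit()


def clean_rows(text):
    """Return the header and only the rows that pass is_valid."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # csv.reader yields an empty list for a blank line, which fails the length check.
    rows = [row for row in reader if is_valid(row)]
    return header, rows


header, rows = clean_rows(RAW)
print(header, rows)
```

Here the blank line and the row with the non-numeric age are removed, and the two valid rows are kept.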
In some cases, written text contains variations of the same word, which must be fixed first or they can compromise the results.
In special cases, cleaning must be done with particular care so that enough data remains to do the data analysis.
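One possible way to handle such word variations is to map each known variant to a single canonical form before analysis. The sketch below assumes a small, hand-made mapping for address abbreviations (`CANONICAL` is hypothetical) and also lowercases the text and strips accents; it is an illustration, not a complete normalizer.

```python
import unicodedata

# Hypothetical mapping from known variants to one canonical word.
CANONICAL = {
    "st": "street",
    "st.": "street",
    "str": "street",
    "av": "avenue",
    "av.": "avenue",
    "ave": "avenue",
}


def normalize_word(word):
    """Lowercase, strip accents, then replace known variants."""
    text = unicodedata.normalize("NFKD", word.lower())
    # After NFKD decomposition, accents are separate combining characters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return CANONICAL.get(text, text)


def normalize_text(text):
    """Normalize every whitespace-separated word in the text."""
    return " ".join(normalize_word(word) for word in text.split())


print(normalize_text("Main St."))
print(normalize_text("São Paulo Av"))
```

Variants that are not in the mapping pass through unchanged, so they can be collected and reviewed manually.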
For instance, if we have a sample of 15 patients with an unusual disease, we have to try to get as much as possible from the existing data.
For a telecommunications company, cleaning the addresses of 30 000 customers took 2 people 1 month. Only after that was it possible to do a complete data analysis.
Usually only part of the data needs manual cleaning, namely where the automatic procedures fail to produce results.
The way data is organized is very important for being able to analyze it.
Data taken from structured databases usually already has a good structure.
Data taken from written notes requires more work before it can be analyzed.
When the same fields are duplicated across several records for the same person, any data analysis related to persons will be affected.
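A simple way to avoid counting the same person twice is to keep one record per person before analyzing. The sketch below assumes each record has a `person_id` field (a hypothetical name) and keeps the first record seen for each person; which record to keep is a policy choice that depends on the data.

```python
# Hypothetical records: person 1 appears twice.
records = [
    {"person_id": 1, "name": "Ana", "city": "Lisbon"},
    {"person_id": 2, "name": "Rui", "city": "Porto"},
    {"person_id": 1, "name": "Ana", "city": "Lisbon"},
]


def deduplicate(rows, key="person_id"):
    """Keep the first record seen for each value of key (an assumed policy)."""
    seen = {}
    for row in rows:
        # setdefault stores the row only if this key has not been seen yet.
        seen.setdefault(row[key], row)
    return list(seen.values())


unique = deduplicate(records)
print(len(records), "records,", len(unique), "persons")
```

Without this step, a count or average per person would be biased towards the people who have more records.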
Data analysis is what we do when we want to obtain information from data.
Going a little further, it is possible to get knowledge from information, and wisdom from knowledge.
It is possible to apply many calculations to all the data, but the results will fall into 3 areas:
– Normal results
Results that everybody expects and that bring nothing new. For instance, older workers have usually worked in the organization for longer.
– Too detailed results
When there is a lot of data, we can get many pages of results filled with data that in most cases is not relevant.
– Interesting results
Only a part of the results provides an interesting outcome. For instance, patients with lower arterial pressure are more likely to die when they have a certain disease.
Descriptive statistics are usually obtained by calculating measures such as the average, maximum, minimum and standard deviation of the data.
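These measures can be computed with Python's standard `statistics` module, as in the sketch below. The `ages` list is invented for illustration, and the sample standard deviation (`statistics.stdev`) is used here as an assumption; `statistics.pstdev` would apply if the data were the whole population.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation (divides by n - 1); needs at least 2 values.
        "stdev": statistics.stdev(values),
    }


ages = [30, 40, 50]  # hypothetical data
summary = describe(ages)
print(summary)
```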
A step further is to compute the correlations among variables to understand which ones are most related.
To extract information from data, one needs an understanding of what lies behind it at a higher level. This understanding is important for getting more interesting results out of the data.
For instance, with medical data we may find a relation between 2 variables, one called 5 sec and another called 10 sec.
One has to know that these 2 variables result from the same measurement, read after 5 seconds and after 10 seconds.
As such, it is normal that their values are quite similar.
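The correlation step and the 5 sec / 10 sec example can be sketched with a hand-written Pearson correlation coefficient. The readings below are invented for illustration, and Pearson is only one possible correlation measure; the point is that a value close to 1 here reflects how the data was collected rather than a new finding.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings of the same measurement after 5 and after 10 seconds.
reading_5s = [120, 135, 110, 150, 128]
reading_10s = [122, 133, 112, 149, 127]

r = pearson(reading_5s, reading_10s)
print(round(r, 3))
```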