13 Preparing for analysis

Author

Samuel Blay Nguah MD

Practical data analysis requires some preparation before the analysis itself begins. This chapter considers the main steps involved.

13.1 Review data collection

The data to be analysed are only as good as the collection and entry process that produced them, so an analyst who was not part of that process should begin by reviewing it. The review should cover which variables were collected, how they were collected, how they were entered into electronic form, and how the data will be delivered to the analyst. Analysts are commonly presented with data that the investigator has already manipulated, especially in Excel. However, it is usually much better to receive the raw, unedited version, which is more informative.

13.2 Plan for analysis

The next stage is to plan the specific analysis to be done and the tools needed, such as statistical software. At this stage, the analyst should ensure that the planned analysis addresses the study objective and that he or she has the requisite competence to perform it. The planning process may involve the following:

Determining the specific statistical techniques necessary for the stated objective.
Drawing the dummy tables needed to summarise the analysis to be done.
Planning data processing and quality checks, which may include determining the proportion of missing and wrongly entered values in the data.
Planning the data analysis based on the observations above.
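A dummy table is simply the empty shell of a results table, drawn before any data are analysed. The sketch below is one illustrative way to produce such a shell in Python. The function name `dummy_table`, the table title, and the row and column labels are all hypothetical and not taken from any particular study.

```python
# A minimal sketch of a dummy (shell) table. All labels are hypothetical.

def dummy_table(title, row_labels, column_labels, placeholder="--"):
    """Return a plain-text table shell with a placeholder in every cell."""
    first_width = max(len(label) for label in row_labels + ["Characteristic"])
    col_widths = [max(len(c), len(placeholder)) for c in column_labels]

    # Header row: the row-label column followed by one column per group.
    header = "Characteristic".ljust(first_width) + "  " + "  ".join(
        c.ljust(w) for c, w in zip(column_labels, col_widths)
    )
    lines = [title, header, "-" * len(header)]

    # One line per planned summary, with empty (placeholder) cells.
    for label in row_labels:
        cells = "  ".join(placeholder.ljust(w) for w in col_widths)
        lines.append(label.ljust(first_width) + "  " + cells)
    return "\n".join(lines)


shell = dummy_table(
    "Table 1: Characteristics of study participants (illustrative)",
    ["Age in years, mean (SD)", "Sex: female, n (%)", "Sex: male, n (%)"],
    ["Group A", "Group B", "Total"],
)
print(shell)
```

Fixing the rows and columns in advance makes it clear which summaries, and therefore which statistical techniques, the analysis must deliver.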

13.3 Data migration

Data are often entered in one software package and transferred to another for analysis. Each data entry package has its own format for recording and storing data, so the entered data must be transferred in a format that the analysis software can read. Most packages can export and import data in Excel format, which makes Microsoft Excel the most common format for migrating data from one package to another.
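The export and import round trip can be sketched in a few lines of Python. Note the substitution: the text above names Excel, but this sketch uses CSV, another widely supported interchange format, because Python's standard library can read and write it without additional packages. The records and variable names are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical records as they might leave a data entry tool.
records = [
    {"study_id": "001", "age": 34, "sex": "F"},
    {"study_id": "002", "age": 41, "sex": "M"},
]

# Export: write the records as CSV text (the buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["study_id", "age", "sex"])
writer.writeheader()
writer.writerows(records)

# Import: read the same text back, as the analysis software would.
buffer.seek(0)
imported = list(csv.DictReader(buffer))

# CSV carries no type information, so every value comes back as text.
# Convert numeric variables explicitly, and leave identifiers as text
# so that leading zeros such as "001" are preserved.
for row in imported:
    row["age"] = int(row["age"])

print(imported)
```

Whatever format is used, it is worth confirming after migration that the number of records and the variable types match what was entered.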

13.4 Data analysis software

Data analysis software packages vary, with some significant differences between them. Generally, more sophisticated analysis requires correspondingly sophisticated software. Some of these packages are:

R: This is one of the most widely used packages. It is free, open source and very versatile, but has a very steep learning curve. The real power of this software stems from the over 10,000 packages written by various users worldwide.
Python: This is in a similar category to R. It is used mainly by writing code rather than through a point-and-click interface.
Stata: This is also one of the most widely used packages for data analysis. However, cost is one of the limiting factors for its use, especially by persons in developing countries.
SPSS: The biggest advantage of SPSS is its well-developed graphical user interface, which allows ease of use without writing commands. However, cost is also a limiting factor.

13.5 Data cleaning

After the data have been imported into the analysis software, the biggest task for a data scientist is to correct errors and prepare the data for the subsequent analysis. Unfortunately, this tends to be very extensive and could account for about 80% of all the work to be done on the data. My general approach is to:

Check for duplicate records in the data
Check for duplicate study IDs
Check for and evaluate missing data. This is crucial, as it could significantly affect the analysis method applicable to the data.
Ensure all variables are recorded as they should be. For instance, categorical and numeric variables should be recorded as such.
Label the variables for a better description.
Check for outliers, especially in numeric variables. An outlier can be a true value or an error, so each should be dealt with on an individual basis. Outliers in numeric data can be detected by simply sorting the values. Categorical data, on the other hand, can be tabulated to determine the categories and their frequencies.
Check for erroneous but valid entries. An example is a recorded last menstrual date for a male student in a study involving adolescents. Taken alone, this might appear to be appropriate data, but cross-tabulation against sex will reveal it as erroneous.
Re-categorize variables as needed for the planned analysis. Here, age, for instance, could be re-categorized into age groups.
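Several of the checks listed above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a prescribed workflow: the records, the variable names (`study_id`, `sex`, `age`, `lmp_date`) and the age-group boundaries are all hypothetical, and `None` is used to mark a missing value.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
fields = ["study_id", "sex", "age", "lmp_date"]
rows = [
    {"study_id": "A01", "sex": "F", "age": 15, "lmp_date": "2023-01-10"},
    {"study_id": "A02", "sex": "M", "age": 16, "lmp_date": "2023-02-03"},
    {"study_id": "A03", "sex": "F", "age": None, "lmp_date": None},
    {"study_id": "A03", "sex": "F", "age": 17, "lmp_date": "2023-01-22"},
    {"study_id": "A04", "sex": "M", "age": 160, "lmp_date": None},
    {"study_id": "A01", "sex": "F", "age": 15, "lmp_date": "2023-01-10"},
]

# Duplicate records: rows that are identical in every field.
record_counts = Counter(tuple(r[f] for f in fields) for r in rows)
duplicate_records = [rec for rec, n in record_counts.items() if n > 1]

# Duplicate study IDs: the same ID on more than one row.
id_counts = Counter(r["study_id"] for r in rows)
duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

# Missing data: number of missing values per variable.
missing = {f: sum(r[f] is None for r in rows) for f in fields}

# Outliers: sort the numeric values and inspect the extremes.
ages = sorted(r["age"] for r in rows if r["age"] is not None)

# Erroneous but valid entries: cross-tabulate sex against whether
# a last menstrual date was recorded.
crosstab = Counter((r["sex"], r["lmp_date"] is not None) for r in rows)


def age_group(age):
    """Re-categorize age into hypothetical groups; missing stays missing."""
    if age is None:
        return None
    if age < 15:
        return "10-14"
    if age < 20:
        return "15-19"
    return "20+"


print("Duplicate records:", duplicate_records)
print("Duplicate IDs:", duplicate_ids)
print("Missing per variable:", missing)
print("Sorted ages:", ages)
print("Sex by LMP recorded:", dict(crosstab))
print("Age groups:", [age_group(r["age"]) for r in rows])
```

In this made-up data, sorting exposes an age of 160, and the cross-tabulation exposes a male participant with a last menstrual date. Both would need to be resolved against the source documents before any re-categorization is trusted.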

13.6 Data analysis

This is the main focus of this section. It will therefore be treated in detail in subsequent chapters. As previously mentioned, we will be using the Jamovi statistical software for this.