Statistics Essays | Analysis of Data

Published Date: 23 Mar 2015 Last Modified: 24 Apr 2017

Consider and discuss the required approach to analysis of the data set provided.

As part of this, explore also how you would test the hypotheses below and explain the reasons for your decisions. Hypothesis 1: Male children are taller than female children. Null hypothesis: There is no difference in height between male children and female children. Hypothesis 2: Taller children are heavier. Null hypothesis: There is no relationship between how tall children are and how much they weigh.

Analysis of data set

The data set is a list of 30 children's gender, age, height, weight, upper and lower limb lengths, eye colour, like of chocolate or not and IQ.

There are two main things to consider before analysing the data. These are the types of data and the quality of the data as a sample.

Types of data could be nominal, ordinal, interval or ratio. Nominal is also known as categorical. Coolican (1990) gives more details of all of these and his definitions have been used to decide the types of data in the data set.

It is also helpful to distinguish between continuous numbers, which could be measured to any number of decimal places, and discrete numbers such as integers, which have finite jumps like 1, 2 etc.

Gender

This variable can only distinguish between male and female. There is no order to this and so the data is nominal.

Age

This variable can take integer values. It could be measured to decimal places, but is generally only recorded as integer. It is ratio data because, for example, it would be meaningful to say that a 20 year old person is twice as old as a 10 year old.

In this data set, the ages range from 120 months to 156 months. This needs to be consistent with the population being tested.

Height

This variable can take values to decimal places if necessary. Again it is ratio data because, for example, it would be meaningful to say that a person who is 180 cm tall is 1.5 times as tall as someone 120 cm tall. In this sample it is measured to the nearest cm.

Weight

Like height, this variable could be measured to decimal places and is ratio data. In this sample it is measured to the nearest kg.

Upper and lower limb lengths

Again these variables are like height and weight and are ratio data.

Eye colour

This variable can take a limited number of values which are eye colours. The order is not meaningful. This data is therefore nominal (categorical).

Like of chocolate or not

As with eye colour, this variable can take a limited number of values, which are the sample members' preferences. In distinguishing merely between liking and disliking, the order is not meaningful. This data is therefore nominal (categorical).

IQ

IQ is a scale measurement found by testing each sample member. As such it is not a ratio scale because it would not be meaningful to say, for example, that someone with a score of 125 is 25% more intelligent than someone with a score of 100. It is therefore treated here as interval data.

There is another level of data mentioned by Coolican into which none of the data set variables fit. That is ordinal data. This means that the data have an order or rank which makes sense. An example would be if 10 students tried a test and you recorded who finished quickest, 2nd quickest etc, but not the actual time.
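The classification above could be recorded in a small lookup so that later analysis steps can check which summaries make sense for each variable. This is only a sketch: the variable names are illustrative (the data set's actual column names are not given), IQ is taken as interval on the reasoning above, and the list of summaries per level follows the usual textbook convention rather than anything stated in the essay.

```python
# Measurement level assigned to each variable in the discussion above.
# Names are hypothetical; the original column names are not known.
LEVELS = {
    "gender": "nominal",
    "age_months": "ratio",
    "height_cm": "ratio",
    "weight_kg": "ratio",
    "upper_limb_length": "ratio",
    "lower_limb_length": "ratio",
    "eye_colour": "nominal",
    "likes_chocolate": "nominal",
    "iq": "interval",  # assumption: not ratio and not merely ordinal
}

# Each level permits its own summaries plus those of the levels before it.
ORDER = ["nominal", "ordinal", "interval", "ratio"]
SUMMARIES = {
    "nominal": ["counts", "mode"],
    "ordinal": ["median", "ranks"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["ratios of values"],
}


def allowed_summaries(variable):
    """Return the summaries that are meaningful for the given variable."""
    upto = ORDER.index(LEVELS[variable]) + 1
    return [s for level in ORDER[:upto] for s in SUMMARIES[level]]


print(allowed_summaries("gender"))
print(allowed_summaries("iq"))
```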

The data is intended to be a sample from a population about which we can make inferences. For example, in the hypothesis tests we want to know whether the sample results are indicative of population differences. The results can only be inferred for the population from which the sample is drawn; inference to any other population would not be valid.

Details of sampling methods were found in Bland (2000). To accomplish the required objectives, the sample has to be representative of the defined population. It would also be more accurate if the sample is stratified by known factors like gender and age. This means that, for example, the proportion of males in the sample is the same as the proportion in the population.
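As an illustration of what stratifying by gender would involve, the sketch below allocates a sample of 30 across strata in proportion to their population sizes. The population counts are invented for the example, and the largest-remainder rounding is just one reasonable way of getting whole numbers that add up to the sample size.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their population sizes."""
    total = sum(stratum_sizes.values())
    # Ideal (fractional) number of sample members per stratum
    exact = {k: sample_size * v / total for k, v in stratum_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    # Give any remaining places to the strata with the largest fractional parts
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical population of 1000 children
population = {"male": 520, "female": 480}
print(proportional_allocation(population, 30))  # {'male': 16, 'female': 14}
```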

Sample size is another consideration. In this case it is 30. Whether this is adequate for the hypotheses being tested is examined below.

Hypothesis 1: Male children are taller than female children.

Swift (2001) gives a very readable account of the hypothesis testing process and the structure of the test.

The first step is to set up the hypotheses:

The Null hypothesis is that there is no difference in height between male children and female children.

If the alternative were, as Coolican describes it, one where "we do not predict in which direction the results will go", then it would be a two-tailed test. In this case the alternative is that males are taller, which is a specific direction, and so a one-tailed test is required.

To test the hypothesis we need to set up a test statistic and then either match it against a pre-determined critical value or calculate the probability of achieving the sample value based on the assumption that the null hypothesis is true.

The most commonly used significance level is 0.05. According to Swift (2001) the significance level must be decided before the data is known. This is to stop researchers adjusting the significance level to get the result that they want rather than accepting or rejecting objectively.

If the probability associated with the test statistic is less than 0.05 we would reject the null hypothesis that there is no difference between males and females in favour of males being taller on the one-sided basis.

However it is possible for the test statistic to be in the rejection zone when in fact the null hypothesis is true. This is called a Type I error.

It is also possible for the test statistic to be in the acceptance zone when the alternative hypothesis is true (in other words the null hypothesis is false). This is called a Type II error. Power is 1 minus the probability of a Type II error and is therefore the probability of correctly rejecting a false null hypothesis. Whereas the probability of a Type I error is set at the desired level, the probability of a Type II error depends on the actual value under the alternative hypothesis.

Coolican (1990) sets out the possible outcomes in the following table:

	In acceptance zone	In rejection zone
NULL Hypothesis TRUE	OK	Type I error
NULL Hypothesis FALSE	Type II error	OK
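To show how the Type II error rate depends on the true size of the difference, the sketch below approximates the power of a one-sided comparison of two means. It uses a normal approximation rather than the t distribution, and the standard deviation, group size and true differences are made-up values, not figures from the data set.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_power(true_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample test (normal approximation)."""
    z = NormalDist()
    # Standard error of the difference between two group means
    se = sd * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha)
    # Probability that the test statistic falls in the rejection zone
    return 1 - z.cdf(z_crit - true_diff / se)


# Hypothetical: sd of 7 cm, 15 children per group
for diff in (0, 2, 5, 8):
    power = one_sided_power(diff, sd=7, n_per_group=15)
    print(f"true difference {diff} cm: power {power:.2f}, Type II error {1 - power:.2f}")
```

When the true difference is zero the rejection probability equals the significance level, which is the Type I error rate; as the true difference grows, the Type II error rate falls.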

Test method

The data for gender is categorical and for height the data is ratio. The sample is effectively split into 2 sub-sets for male and female.

Most books give the independent samples t-test as the main method for testing this hypothesis, e.g. Curwin and Slater (2001), Swift (2001).

Bland (2000) states that in order to use this test the samples must both be from normal populations and, additionally, the distributions must have the same variance. Bland also suggests modifications to the test when the variances cannot be assumed to be the same. Programs like SPSS will calculate results for both equal and unequal variances. SPSS also gives a test for equality of variances.

When the assumptions of normality and independence are met then the t-test is the best test according to Bland because it has higher power than the equivalent non-parametric test, which is the Mann-Whitney U-test. However, the Mann-Whitney test is more robust in that it does not assume that the data is normally distributed.

It is a matter of weighing up the pros and cons. If normality can be assumed then the independent samples t-test is best. If not, then the U-test should be used. Graphical checks such as a histogram or Q-Q (quantile-quantile) plot can be used to check normality to help the decision (Bland, 2000).
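The U statistic itself is simple to compute, which shows why the test depends only on the ordering of the values. The following is a minimal sketch using invented heights; it gives the statistic only, and the probability would still need to come from tables or a statistics package.

```python
def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: the number of (a, b) pairs with a > b, ties counting one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical heights in cm
males = [150, 147, 155, 152, 149]
females = [146, 151, 144, 148]

u_males = mann_whitney_u(males, females)
u_females = mann_whitney_u(females, males)
# The two U values always add up to n1 * n2
print(u_males, u_females, len(males) * len(females))
```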
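The points of a Q-Q plot can be calculated without plotting software, as sketched below. The heights are invented, and the plotting positions (i - 0.5)/n are one common convention among several. If the data is close to normal the pairs lie near a straight line.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pairs of (theoretical normal quantile, observed value) for a Q-Q plot."""
    n = len(values)
    # Normal distribution with the same mean and standard deviation as the sample
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]


# Hypothetical heights in cm
heights = [138, 142, 145, 147, 150, 151, 153, 156]
for theoretical, observed in qq_points(heights):
    print(f"{theoretical:7.1f} {observed:5d}")
```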

Because the test is one-sided we would be looking for the male mean to be higher and the critical value to come from 0.05 in the one tail of the distribution. For the t-test this would be looked up with n1 + n2 - 2 degrees of freedom, where n1 and n2 are the numbers of males and females respectively.
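The calculation of the test statistic for the equal-variance case can be sketched as follows. The heights are invented for illustration. The function returns the t value and its degrees of freedom, and the result would then be compared with the one-tailed critical value from tables.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(males, females):
    """Independent samples t statistic assuming equal variances, with its df."""
    n1, n2 = len(males), len(females)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance
    pooled_var = ((n1 - 1) * variance(males) + (n2 - 1) * variance(females)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Positive t means the male mean is the higher one
    return (mean(males) - mean(females)) / se, df


# Hypothetical heights in cm
males = [150, 147, 155, 152, 149]
females = [146, 151, 144, 148]
t, df = pooled_t(males, females)
print(f"t = {t:.3f} on {df} degrees of freedom")
```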

It is also useful to work out a 95% confidence interval for the population mean. This gives an idea of the precision of the estimate. Larger sample sizes will narrow the confidence interval.
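A confidence interval for a mean could be worked out along the following lines. The heights are invented, and the critical value is passed in by the user. The figure of 2.365 is the two-sided 5% point of the t distribution with 7 degrees of freedom, taken from standard tables.

```python
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(values, t_crit):
    """Confidence interval for the mean, given the t critical value for n - 1 df."""
    n = len(values)
    half_width = t_crit * stdev(values) / sqrt(n)
    centre = mean(values)
    return centre - half_width, centre + half_width


# Hypothetical heights in cm; 8 values, so 7 degrees of freedom
heights = [138, 142, 145, 147, 150, 151, 153, 156]
low, high = mean_confidence_interval(heights, t_crit=2.365)
print(f"95% confidence interval: {low:.1f} to {high:.1f} cm")
```

Because the half-width is divided by the square root of n, quadrupling the sample size would roughly halve the width of the interval.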

It was mentioned above that the inferences made are only valid for the population being sampled and only so if the sample is representative, which means selecting the sample from the whole population such that each member has equal probability of selection.

On reliability, Coolican (1990) says that if a research finding can be repeated it is reliable. So, if the sampling is repeated and gives the same result, this would indicate reliability.

Hypothesis 2: Taller children are heavier.

The null hypothesis is that there is no relationship between how tall children are and how much they weigh. The alternative hypothesis is that taller children are heavier, which is a one-sided test. That is, the alternative is not simply that there is a relationship, which would be two-sided.

Both heights and weights are ratio data. This enables the data to be examined by tests where normality is an underlying assumption.

In order to visually check the relationship a scatter graph is pretty well essential. This would give an idea of the strength and nature of the relationship. The relationship may not be linear as is often assumed. If so then the scatter should show indication of a curve.

The strength of the relationship can be tested by using the Pearson correlation coefficient (r). This is closely related to a regression analysis, which would be fitting a straight line equation to the data with height being the independent (x) variable and weight being dependent (y).

The correlation coefficient can be tested using a one-sided t-test. This has n - 2 degrees of freedom, 28 in this case. The value of r would need to be positive to indicate that taller children are heavier.
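The correlation coefficient and its t statistic can be sketched as below, using the standard formula t = r * sqrt((n - 2) / (1 - r^2)). The heights and weights are invented and there are only eight of them, so the degrees of freedom here are 6 rather than the 28 that the real data set would give.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic and degrees of freedom for testing r against zero."""
    return r * sqrt((n - 2) / (1 - r ** 2)), n - 2


# Hypothetical heights (cm) and weights (kg)
heights = [138, 142, 145, 147, 150, 151, 153, 156]
weights = [32, 35, 34, 38, 40, 39, 43, 45]
r = pearson_r(heights, weights)
t, df = t_for_r(r, len(heights))
print(f"r = {r:.3f}, t = {t:.3f} on {df} degrees of freedom")
```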

Analysis of the regression residuals can give us a lot more information than simply carrying out a correlation calculation; see Bland (2000). They can be plotted to see whether they are normally distributed using a histogram or Q-Q plot. Also, any non-linearity in the relationship should be apparent from a pattern in the residuals.

If the data shows a non-linear relationship then it would be necessary to transform the data using logs or other mathematical functions. The transformed variables would then need to be analysed for normality and linearity.
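The residuals are simply the differences between the observed weights and those predicted by the fitted line. A minimal sketch using least squares and invented data is given below. The residuals could then be passed to a histogram or Q-Q plot.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def residuals(xs, ys):
    """Observed y minus fitted y for each point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical heights (cm) and weights (kg)
heights = [138, 142, 145, 147, 150, 151, 153, 156]
weights = [32, 35, 34, 38, 40, 39, 43, 45]
for h, e in zip(heights, residuals(heights, weights)):
    print(f"{h} cm: residual {e:+.2f} kg")
```

A run of residuals with the same sign at both ends of the height range and the opposite sign in the middle would suggest a curve rather than a straight line.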
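As one example of such a transformation, taking logs of both variables turns a power relationship y = a * x^b into the straight line log y = log a + b * log x. The sketch below assumes that both variables are logged, which is only one of the possible choices, and uses artificial data constructed to follow an exact power law.

```python
from math import log
from statistics import mean


def log_log_fit(xs, ys):
    """Fit a straight line to (log x, log y); returns slope b and intercept log a."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mx, my = mean(lx), mean(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return slope, my - slope * mx


# Artificial data following y = 3 * x^2 exactly
xs = [1, 2, 4, 8]
ys = [3 * x ** 2 for x in xs]
slope, intercept = log_log_fit(xs, ys)
print(f"slope {slope:.3f}, intercept {intercept:.3f}")
```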

According to Bland there is an alternative to the Pearson correlation coefficient which does not assume that the data is normally distributed. This is the Spearman Rank Correlation Coefficient. This is based on the ranks of the data and not the data itself. This makes it more robust in terms of departures from assumptions; however, it is less powerful. In other words there is more chance of making a Type II error.

Again, if the sampling is repeated and the same result is obtained, this would indicate reliability.
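The rank-based calculation can be sketched by ranking each variable (tied values sharing the average rank) and then applying the ordinary correlation formula to the ranks. The data below is artificial and chosen to show that a perfectly increasing but curved relationship still gives a coefficient of 1.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman coefficient: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))
```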

Summary

The stages that need to be gone through in order to test hypotheses such as those above are as follows.

Define the population about which inferences are to be made. This acts as a basis from which to obtain a sample. The results will only apply to this population.
Decide how to obtain a representative sample from the population. Decide the sample size required and whether the sample can be stratified to make it more accurate. There is a large amount of literature available to help with sampling.
Set up null and alternative hypotheses, giving particular attention to whether the alternative should be one or two sided.
Decide on the significance level before calculating the test statistic. This is usually 0.05 but sometimes 0.01. To be objective the value must be set before carrying out the analysis.
Choose the testing method. Particular emphasis should be placed on the assumptions that the test requires, particularly the assumption of normality which is needed for tests like t-tests. The pros and cons of the various alternatives should be weighed up. This often boils down to power versus robustness.
Check normality and linearity where appropriate by drawing graphs like Q-Q plots and histograms for normality and scatter graphs for exploring relationships.
Calculate the test statistic and compare with the critical value. Also calculate the probability of obtaining the sample value. Reject the null hypothesis if the sample value is outside the critical range, or beyond the single critical value if it is a one-sided test.
Be aware of the possibility of a Type II error, which is accepting the null hypothesis when it is in fact false.

References

Bland, J. Martin (2000) An Introduction to Medical Statistics, 3rd Edition, Oxford, Oxford Medical Publications

Curwin, Jon and Slater, Roger (2001) Quantitative Methods for Business Decisions, London, Thomson Learning

Coolican, Hugh (1990) Research Methods and Statistics in Psychology, London, Hodder and Stoughton

Swift, Louise (2001) Quantitative Methods for Business, Management and Finance, Basingstoke, Palgrave
