Module 4: Test and its terms (Notes)
In general, a test is defined as a series of questions through which some information is sought.
From a psychological or educational point of view, a test is a standardized procedure to measure, quantitatively or qualitatively, one or more aspects of a trait by means of a sample of verbal or non-verbal behaviour.
Simply, it is an instrument or systematic procedure for measuring a sample of behaviour by posing a set of questions in a uniform manner. A test answers the question: "How well does the individual perform, either in comparison with others or in comparison with a domain of performance tasks?"
General Characteristics of a Test
It is a set of stimuli, which means that the stimuli (popularly known as items) in the test are organized in a certain sequence and are based upon some principles of test construction.
Usually, the items are placed in increasing order of difficulty, and the procedure of administration is standardized to ensure maximum objectivity.
Both quantitative and qualitative measurements are possible through psychological and educational tests (Singh, 1998, p. 14).
Purpose of Testing
To compare the same individual on two or more aspects of a trait, or to compare two or more persons on the same trait.
To provide information for grading, reporting to parents and promoting students.
To motivate the students by evaluating their current status
To select, classify, certify and place students by diagnosing their strengths and weaknesses
To collect information for effective educational and vocational counseling
Characteristics of Good Test
a) Objectivity
It must be free from the subjective element.
There should be complete interpersonal agreement among experts regarding the meaning of the items and scoring of the test.
It relates to two aspects of the test: objectivity of the test items and objectivity of the scoring system.
By objectivity of items is meant that the items should be phrased in such a manner that they are interpreted in exactly the same way by all those who take the test.
Items should have uniformity of order of presentation (either ascending or descending order).
By objectivity of scoring is meant that the scoring method of the test should be a standard one, so that complete uniformity can be maintained when the test is scored by different experts at different times.
b) Reliability
Reliability means the consistency of scores obtained by the same individual when re-examined with the same test, with different sets of equivalent items, or under other variable examining conditions. Mainly, it refers to the self-correlation of the test.
It is the extent to which the results obtained are consistent when the test is administered more than once to the same sample with a reasonable time gap.
Reliability includes both internal as well as temporal consistency.
Consistency of scores or results obtained from two sets of items of a single test after a single administration is the index of internal consistency of the test scores.
Consistency in results obtained upon testing and retesting is an index of temporal consistency of the test scores.
Guilford defined reliability as the proportion of true variance in obtained test scores. According to Guilford, the score of each individual consists of two components: a true score (T) and an error score (E). Adding the true score to the error score gives the total score (X). Mathematically, this can be expressed as T + E = X
Where T = true score
E = error score (may be plus or minus)
X = the total or actual score.
Thus, statistically, reliability as stated by Guilford is the variance of true scores divided by the variance of total scores:
r_tt = σt²/σx² = 1 − σe²/σx²
Where r_tt = reliability coefficient
σt² = variance of true scores
σe² = variance of error scores
σx² = variance of total scores
and σx² = σt² + σe²
A worked example (assuming deviation scores for N = 5 examinees), using SD = √(∑d²/N):
σt = √(216/5) = √43.2 = 6.572
σe = √(46/5) = √9.2 = 3.033
σx = √(282/5) = √56.4 = 7.509
True variance: σt² = 6.572 × 6.572 = 43.191
Error variance: σe² = 3.033 × 3.033 = 9.199
Total variance: σx² = σt² + σe² = 43.191 + 9.199 = 52.39
r_tt = true variance ÷ total variance = 43.191 ÷ 52.39 = .82. This means that 82% of the variance in the obtained scores is attributable to true variance, i.e. there is a close relationship between the obtained scores and the corresponding true scores.
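As a minimal sketch (using the hypothetical deviation sums from the worked example above), the same computation in Python:

```python
# Guilford's definition: reliability = true variance / total variance.
# The sums of squared deviations are the hypothetical figures from the
# worked example above (N = 5 examinees).
N = 5
sum_d2_true, sum_d2_error = 216, 46

var_true = sum_d2_true / N        # σt² = 43.2
var_error = sum_d2_error / N      # σe² = 9.2
var_total = var_true + var_error  # σx² = σt² + σe²

r_tt = var_true / var_total
print(round(r_tt, 2))             # 0.82
```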
Characteristics of Reliability
Reliability is the consistency of test scores
It is the measure of variable error or measurement error
It is a function of test length
It refers to the stability of a test for a certain population
It is the temporal stability of a measuring instrument
It is the coefficient of stability
It is the coefficient of internal consistency
It is the self-correlation of the test
It is the reproducibility of scores
Importantly, it refers to the accuracy or precision of a measuring instrument
It does not always ensure the validity of the test
Methods of Estimating Reliability
There are three most common methods of estimating the reliability coefficient of test scores.
These methods are
a) Test-retest method is the simplest method of estimating the reliability of test scores. In this method, a single form of the test is administered twice to the same sample with a reasonable time gap. The two administrations of the same test thus yield two independent sets of scores. The two sets, when correlated, give the value of the reliability coefficient. This coefficient is known as the coefficient of stability or temporal stability, which indicates to what extent the examinees retain their relative positions, as measured in terms of test scores, over a given gap of time.
A high test-retest reliability coefficient indicates that an examinee who obtains a low score on the first administration tends to score low on the second administration and, on the other hand, an examinee who scores high on the first administration tends to score high on the second administration.
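A minimal sketch of this calculation, assuming hypothetical scores for five examinees; the coefficient of stability is simply the Pearson correlation between the two administrations:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores of five examinees on the same test,
# administered twice with a fortnight's gap.
first = [12, 15, 9, 20, 17]
second = [14, 16, 10, 19, 18]

# Coefficient of stability: Pearson r between the two administrations.
print(round(correlation(first, second), 2))  # approx. 0.98
```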
Assumptions of the method
The number of items in the test should be large, so that memory, practice and carry-over effects do not affect the retest scores.
The innate ability of an individual should remain constant, so that growth and maturity do not affect the retest scores.
The most appropriate and convenient time gap between the two administrations is a fortnight, which is considered neither too short nor too long.
Limitations of the method
This method is less accurate than the other methods.
Memory, practice and carry-over effects are observed when the test is repeated immediately.
If the interval between tests is long (six months or more), growth and maturity affect the retest scores and tend to lower the reliability index.
There is no agreement among psychometricians regarding the time gap between the two administrations.
The individual's physical and mental health and emotional and motivational conditions do not remain the same across the two administrations.
It is a time-consuming method of estimating reliability.
b) Alternative forms reliability or Coefficient of Equivalence
Alternative forms reliability is known by various names such as the parallel-forms reliability, equivalent forms reliability and the comparable-forms reliability.
It is an improvement over the earlier method and one way of overcoming the problems of memory, practice, carry-over and recall factors.
This method requires that the test be developed in two forms which are comparable or equivalent.
The two forms of the test are administered to the same sample of subjects, either on the same day or after a considerable time interval.
Pearson’s method of correlation is used for calculating the coefficient of correlation between two sets of scores obtained by administering the two forms of the test. Such a coefficient is known as the coefficient of equivalence.
Assumptions of the method
The number of items in both forms should be equal.
The two forms of a test should be alike with reference to: content and type of items, the range of difficulty and discrimination indices, the mean and variance of both forms, and the time of administration.
Limitations of the method
Practice and carry-over factors cannot be fully controlled; scores on the second form of the test are generally higher.
It is difficult to construct parallel forms of a test and satisfy all the conditions mentioned.
There is no agreement among psychometricians about the interval between the two forms of the test.
The interval between administering the two forms should not be more than two weeks.
It is not possible to provide identical situations for measuring the same pattern of behaviour.
The Split-half Method of Reliability
This method of reliability is an improvement over the earlier two methods (coefficient of stability and coefficient of equivalence), as it involves the characteristics of both stability and equivalence.
This method particularly determines the internal consistency of the test and internal consistency reliability indicates the homogeneity of the test. If all the items of the test measure the same function or trait, the test is said to be a homogeneous one and its internal consistency reliability would be high.
Specifically, in this method the test is divided into two equal or nearly equal halves. The common way of splitting the test is the odd-even method: all odd-numbered items (1, 3, 5, 7, 9, etc.) constitute one part of the test and all even-numbered items (2, 4, 6, 8, 10, etc.) constitute the other part.
In this way, each examinee receives two scores: a score on the odd-numbered items and a score on the even-numbered items. Thus, from a single administration of a single form of the test, two sets of scores are obtained, and Pearson's method of correlation can be used to calculate the coefficient of correlation between the two halves. The coefficient of reliability of the whole test is then estimated with the help of the Spearman-Brown prophecy formula:
r_tt = (2 × reliability of the half test) / (1 + reliability of the half test)
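A minimal sketch of the whole procedure, assuming hypothetical 0/1 item scores: the function splits the items odd/even, correlates the two half scores, and applies the Spearman-Brown correction.

```python
from statistics import correlation  # Python 3.10+

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per examinee."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = correlation(odd, even)
    # Spearman-Brown prophecy formula for the full-length test
    return (2 * r_half) / (1 + r_half)

# Hypothetical responses of five examinees to a six-item test
scores = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```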
Assumptions of the method
The test should be divided into two equal or nearly equal halves.
All the items of the test should measure the same trait or ability
All the items of the test should have the same difficulty value
The assumption of Pearson's method, i.e. linearity, applies to this method
Limitations of the method
Chance errors may affect scores on the two halves of the test in the same way; this tends to make the reliability index too high
A test can be divided into two parts in a number of ways, so the reliability coefficient is not a unique value
This method cannot be used with speed tests or heterogeneous tests
It is not always possible to split the test items into two equivalent halves, because the items of a test may measure different aspects of the same trait or ability
Factors influencing reliability
The reliability of test scores is influenced mainly by two factors: extrinsic and intrinsic.
Extrinsic factors are those factors which lie outside the test itself and tend to make the test reliable or unreliable. For example, variability in the range of ability of a group, environmental conditions, guessing by the examinee etc.
Intrinsic factors refer to those factors which lie within the test itself and influence the reliability of the test. For example, characteristics of items, total score, length of the test etc.
Important extrinsic factors affecting the reliability of a test may be enumerated as follows:
a) Group variability: When the group of examinees being tested is homogeneous in ability, the reliability of the test scores is likely to be lowered. But when the examinees vary widely in their range of ability, that is, when the group is heterogeneous, the reliability of the test scores is likely to be high.
b) Guessing by the examinees: Guessing in a test is an important source of unreliability. With two-alternative response options there is a 50% chance of answering an item correctly on the basis of a guess. With multiple-choice items the chance of getting the answer correct purely by guessing is reduced.
Guessing has two important effects upon the total test scores. First, it tends to raise the total score and thereby makes the reliability coefficient spuriously high.
Second, guessing contributes to the measurement error since the examinees differ in exercising their luck over guessing the correct answer.
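As an illustration with hypothetical figures: on a 50-item two-option (true-false) test, blind guessing yields an expected chance score of 50 × 1/2 = 25, whereas a four-option multiple-choice version reduces the expected chance score to 50 × 1/4 = 12.5.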
c) Environmental conditions: As far as possible, the testing environment should be uniform. Arrangement should be such that light, sound and other comforts are equal and uniform to all the examinees, otherwise it will tend to lower the reliability of the test scores.
d) Momentary fluctuations in the examinee influence the test score: for example, a broken pencil, momentary distraction by the sudden sound of an aeroplane flying overhead, anxiety regarding non-completion of homework, or making a mistake in an answer and knowing no way to change it.
Intrinsic factors
Length of the test: A longer test tends to yield a higher reliability coefficient than a shorter test. If the length of a test is increased by adding items of the same difficulty value and content, the reliability index will increase. Thus, lengthening the test, or averaging the total test scores obtained from several repetitions of the same test, tends to increase the reliability. It has even been demonstrated that averaging the test scores of several applications gives essentially the same result as increasing the length of the test. The Spearman-Brown formula may be used to estimate the reliability of the lengthened test. The formula is:
r_nn = (n × r_tt) / (1 + (n − 1) × r_tt)
Where r_nn = reliability coefficient of the lengthened test; n = the number of times the test is lengthened; r_tt = reliability coefficient of the original test
Example 1: Suppose an intelligence test of 100 items has a reliability coefficient of 0.80. If the test is increased to four times its present length, that is, 300 more items are added so that the test now has 400 items, what would the reliability index of the test be?
r_nn = (4 × .80) / (1 + (4 − 1) × .80)
= 3.2 / (1 + 2.4)
= 3.2 / 3.4 = .94
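The same formula as a small Python sketch, reproducing Example 1:

```python
def spearman_brown(r_tt, n):
    """Predicted reliability when a test is lengthened n times."""
    return (n * r_tt) / (1 + (n - 1) * r_tt)

print(round(spearman_brown(0.80, 4), 2))  # 0.94, as in Example 1
```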
Example 2: Suppose the reliability of an intelligence test is .60. By how many times should the test be lengthened in order to reach a reliability coefficient of .90?
n = r_nn(1 − r_tt) / (r_tt(1 − r_nn))
Where n = the number of times the test is to be lengthened; r_nn = the level of reliability coefficient required; r_tt = the reliability of the existing test
n = (.90 × (1 − .60)) / (.60 × (1 − .90))
= (.90 × .40) / (.60 × .10)
= .36 / .06 = 6
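And the inverse calculation as a sketch, reproducing Example 2:

```python
def lengthening_factor(r_tt, r_nn):
    """Number of times a test must be lengthened to reach reliability r_nn."""
    return (r_nn * (1 - r_tt)) / (r_tt * (1 - r_nn))

print(round(lengthening_factor(0.60, 0.90)))  # 6, as in Example 2
```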
Range of the total scores: If the obtained total scores on the test are very close to each other, that is, if there is less variability among them, the reliability of the test is lowered. On the other hand, if the total scores on the test vary widely, the reliability of the test is increased. Statistically, it can be said that when the standard deviation of the total scores is high, the reliability is also high, and vice versa.
Homogeneity of items: The concept of homogeneity of items includes two things: item reliability (inter-item correlation) and the homogeneity of the function or trait measured from one item to another. When the items measure different functions and the inter-correlations of items are zero or near zero (that is, when the test is a heterogeneous one), the reliability is zero or very low. When all items measure the same function or trait and the inter-item correlation is high, the reliability of the test is also high.
Difficulty value of items: In general, items with an index of difficulty at 0.5, or close to it, yield higher reliability than items of extreme difficulty. In other words, when items are too easy or too difficult, the test yields very poor reliability (because such items do not contribute to the reliability) compared with items of moderate difficulty value.
Discrimination value: When the test is composed of discriminating items, the item-total correlation is likely to be high, and then the reliability is also likely to be high. But when items do not discriminate well between superior and inferior examinees, that is, when items have poor discrimination values, the item-total correlation is lowered and the reliability of the test decreases.
Scorer reliability: By scorer reliability is meant how closely two or more scorers agree in scoring or rating the same set of responses. If they do not agree, the reliability is likely to be lowered.
Suggestions for improving reliability
The group of examinees should be heterogeneous, that is, the examinees should vary widely in their ability or trait being measured.
Items should be homogeneous
The items should be of moderate difficulty value
Items should have a high discrimination index
Validity
Validity means truthfulness. Therefore, it can be defined as the extent to which the test measures what it intends to measure.
Validity is not the self-correlation of the test; rather, it is the correlation of the test with some outside independent criterion, which is regarded by experts as the best measure of the trait being measured by the test.
In a broad sense, validity is concerned with generalizability: when a test is a valid one, its conclusions can be generalized to the general population.
Generally, a test which yields inconsistent results (poor reliability) is ordinarily not expected to correlate well with an outside independent criterion. In other words, a test which has poor reliability is not expected to yield high validity. Thus, validity depends upon reliability.
This prediction holds for homogeneous tests only. If a test is heterogeneous, validity may be high even without high reliability, because in a heterogeneous test each part measures an independent function. Thus, reliability can be said to be a sufficient but not a necessary condition for validity.
Characteristics of validity of a test scores
It is one of the most important characteristics of a measuring instrument.
It is an index of external correlation. The test scores are correlated with external criterion scores.
It relates to the purpose or objective of a test scores.
Validity ensures the reliability of a test. If a test is valid, it must be reliable.
It is also the function of a test length.
Types of validity
Validity may be of different types: content validity, criterion-related validity and construct validity.
Content or Curricular Validity
Content validity is the degree to which a test measures an intended content area. More simply, it measures how well the examinee has mastered the specific skills of a certain course of study.
Psychometricians are of the view that content validity requires both sampling validity and item validity. Item validity is concerned with whether the test items represent measurement in the intended content area, and sampling validity is concerned with the extent to which the test samples the total content area.
For example, a test designed to measure knowledge of biology might have good item validity because all the items indeed deal with biology facts, but might have poor sampling validity because all the items deal only with vertebrates. Thus a test with good content validity also samples the appropriate content area. This becomes important because we cannot possibly measure each and every aspect of a certain content area; otherwise, inferences about performance in the whole content area may not be judged correctly.
Judgment of content validity
Content validity of a test is examined in two ways:
a) By the expert’s judgment
For example, an investigator wants to examine the content validity of a test on Tanzanian history. For this purpose, the content or items of the test will be submitted to a group of subject-matter experts. These experts will judge whether or not the items represent all the important events of Tanzanian history, whether or not some additional items should be added for complete coverage, what the relative weights of the items of a particular event should be, etc. The validity of the contents or items will depend upon a consensus judgment of the majority of the subject experts.
b) Statistical analysis: In this technique, scores on two independent tests, both of which are said to measure the same thing, are correlated.
Suppose one wants to know the content validity of an English spelling test. The teacher can correlate the scores on the said test with scores on another, similar English spelling test. A high correlation coefficient would provide an index of content validity.
Although a high correlation coefficient can easily be demonstrated between two sets of scores obtained from two similar tests, it does not fully guarantee content validity, because the high correlation may be due to the fact that both tests measure the same incorrect things.
Therefore the test developer should specify:
The area of content explicitly, so that all major portions are adequately covered by the items in equal proportion.
The content area fully, in clear words, including the objectives.
The relevance of the contents or items, established in the light of the examinees' responses to those items.
Criterion related Validity
Criterion-related validity is obtained by comparing (or correlating) the test scores with scores on a criterion available at present or to be available in the future.
There are two subtypes of criterion related validity
i. Predictive validity
Predictive validity is also designated as empirical validity or statistical validity. In predictive validity, a test is correlated against a criterion that will become available at some time in the future.
In other words, test scores are obtained, a time gap of months or years is allowed to elapse, and then the criterion scores are obtained. Subsequently, the test scores and the criterion scores are correlated, and the obtained correlation becomes the validity coefficient.
Example: suppose one wants to predict success in an M.A. class in terms of grades A, B, C, D and E, A being the best grade and E the worst. The investigator may administer an intelligence test at the time of admission to the M.A. class and thus obtain a set of scores. After two years, on the basis of classroom performance, the students are graded according to the above categories; these grade points constitute the criterion. The intelligence test scores and the criterion scores are then correlated, and if the correlation is high we can say with reasonable certainty that scores on the intelligence test predict the future performance of students in the M.A. class.
ii. Concurrent Validity: It is another subtype of criterion-related validity. Concurrent validity is very similar to predictive validity except that there is no time gap in obtaining test scores and criterion scores. The test is correlated with a criterion which is available at the present time.
For example, scores on newly constructed intelligence test may be correlated with scores obtained on an already standardized test of intelligence.
The resulting coefficient of correlation will be an indicator of concurrent validity. If the correlation is high then validity of the new test will be high. Likewise, an intelligence test may be validated (or correlated) against the marks obtained in the previous examination.
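A minimal sketch of concurrent validation, assuming hypothetical scores on a newly constructed test and on an already standardized criterion test:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores of six examinees: a newly constructed
# intelligence test against an already standardized one (the criterion).
new_test = [55, 62, 47, 70, 58, 65]
criterion = [58, 60, 50, 72, 55, 68]

# Concurrent validity coefficient: Pearson r between test and criterion.
print(round(correlation(new_test, criterion), 2))  # approx. 0.94
```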
Construct Validity
Construct validity is also known by other names, such as factorial validity and trait validity. In construct validity, the meaning of the test is examined in terms of a construct.
What is a construct? A construct is a non-observable trait, such as intelligence or anxiety, which explains our behaviour.
Anastasi (1968) has defined it as 'the extent to which the test may be said to measure a theoretical construct or trait'.
It may be defined as the degree to which the individual possesses some hypothetical trait, ability or quality (construct) presumed to be reflected in the test performance.
Norms
Norms may be defined as the average performance on a particular test made by a standardization sample. By a standardization sample is meant a sample which is truly representative of the population.
Simply, norms refer to the average performance of a representative sample on a given test.
There are four common types of norms: age norms, grade norms, percentile norms and standard score norms.
These norms help in the interpretation of the scores and in making value judgements.
Age Norms
Age norms are defined as the average performance of a representative sample of a certain age level on the measures of a certain trait or ability.
For example, if we measure the weights of a representative sample of 10-year-old girls of the Dodoma region and find the average of the obtained weights, we can determine the age norm for the weight of 10-year-old girls.
Age norms are most suited to those traits or abilities which increase systematically with age: height, weight, and cognitive abilities like general intelligence, emotional intelligence and so on.
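As a minimal sketch with hypothetical data, an age norm is just the average performance of the standardization sample at that age:

```python
from statistics import mean

# Hypothetical weights (kg) of a representative sample of
# 10-year-old girls; the age norm is the group average.
weights_10yr = [31.5, 29.0, 33.2, 30.8, 32.1, 28.7]
print(round(mean(weights_10yr), 1))  # the age norm for weight at age 10
```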
Practicability
A test must be practicable from the point of view of the time taken in its completion, length, scoring, etc.
The test should not be lengthy, and the scoring method must be neither difficult nor one which can be carried out only by highly specialized persons.
Classification of Test
On the basis of the criterion of administrative conditions
a) Individual Test (Kohs Block Design Test)
b) Group Test (Bell Adjustment Inventory or Achievement Test for measuring cognitive skills)
On the basis of the criterion of scoring
i. Objective Test
ii. Subjective Test
On the basis of the criterion of time limit in producing the response
i. Power Test
ii. Speed Test
On the basis of the criterion of the nature or contents of items
i. Verbal Test
ii. Nonverbal Test
iii. Performance Test
iv. Non-language Test
On the basis of the criterion of purpose or objective
i. Intelligence Test
ii. Aptitude Test
iii. Personality Test
iv. Achievement Test
On the basis of the criterion of standardization
i. Standardized Test
ii. Teacher-made Test
Steps of Test Construction
Planning of the Test
In the planning stage, a test constructor needs to plan the following:
Specify the broad and specific objectives of the test in clear terms.
Decide upon the nature of the content or items to be included.
Decide the type of instructions to be included.
Decide the methods of sampling.
Make detailed arrangements for the preliminary and final administration.
Estimate the probable length and time limit for completion of the test.
Decide the total number of reproductions of the test to be made.
Decide the probable statistical methods to be adopted.
Plan the preparation of the manual for the test.
Writing Items of the Test
Item writing should start from the planning done earlier.
Only the pre-planned types of items should be included in the test.
Items should be framed keeping in mind the stipulated purpose.
The simplest possible language has to be used.
The item writer needs thorough knowledge of, and complete mastery over, the subject matter.
The item writer must be fully aware of the persons for whom the test is meant.
The item writer must be able to avoid irrelevant clues to the correct response.
The item writer must be familiar with different types of items, along with their advantages and disadvantages.
The items should be submitted to a group of subject experts for their criticism and suggestions, and then duly modified.
Items must be arranged in an increasing order of difficulty, and those having the same form and dealing with the same content should be placed together.
Try out
The try-out helps to:
Find out the major weaknesses, omissions, ambiguities and inadequacies of the items (non-functioning distracters in multiple-choice items, very difficult and very easy items, and the like).
Determine the difficulty value of each item, which helps in selecting items for their proper distribution in the final form.
Determine the discriminatory power of each individual item.
Determine a reasonable time limit for the test.
Determine the appropriate length of the test.
Determine the inter-correlations of items so that overlapping can be avoided.
Identify weaknesses in the directions or instructions of the test.
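A minimal sketch of the item analysis done at try-out, assuming hypothetical 0/1 responses and total scores; the discrimination index used here is the classical upper-group minus lower-group proportion, with the common 27% split:

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return sum(responses) / len(responses)

def item_discrimination(responses, totals, k=0.27):
    """Proportion correct in the top k of total scorers
    minus the proportion correct in the bottom k."""
    n = max(1, round(k * len(totals)))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    low, high = order[:n], order[-n:]
    return (sum(responses[i] for i in high) - sum(responses[i] for i in low)) / n

# Hypothetical try-out data: one item's 0/1 responses, plus total scores
item = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
totals = [38, 41, 22, 35, 18, 44, 25, 20, 39, 42]
print(item_difficulty(item))              # 0.6 (moderate difficulty)
print(item_discrimination(item, totals))  # 1.0 (discriminates well)
```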
Reliability of the Final Test
Compute the reliability coefficient by administering the final test to a fresh sample.
The size of the sample for this purpose should not be less than 100.
Common ways of calculating the reliability coefficient are the test-retest, split-half and equivalent-forms methods.
Validity of the Final Test
If the test measures well the trait it intends to measure, we can say the test is a valid one.
After estimating the reliability coefficient, the test constructor validates the new test against some outside independent criterion (such as a standardized test).
The types of validity are: content validity (sampling validity and item validity); criterion-related validity, comprising predictive validity (criterion scores obtained after a time gap) and concurrent validity (no time gap between obtaining test scores and criterion scores); and construct validity. Validity can be judged by experts or statistically by Pearson's r.
Norms of the Final Test
Norms are prepared so that the scores obtained on the test can be meaningfully interpreted.
The obtained scores on the test by themselves convey no meaning regarding the ability or trait being measured.
When the test scores are compared with the norms, a meaningful inference can immediately be drawn.
The common types of norms are the age norms, the grade norms, the percentile norms and standard score norms.
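As a minimal sketch with hypothetical data, a percentile norm can be read off by computing an examinee's percentile rank within the standardization sample:

```python
def percentile_rank(score, norm_scores):
    """Percentage of the standardization sample scoring below `score`,
    counting half of those scoring exactly `score`."""
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical standardization sample of ten total scores
norm = [22, 25, 28, 30, 31, 33, 35, 38, 40, 44]
print(percentile_rank(33, norm))  # 55.0: outscores about 55% of the sample
```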
Preparation of Manual and Reproduction of the Test
It gives clear directions regarding:
Norms of the test
Procedures of test administration
Scoring method and time limits
Orders for printing of the test and the manual