© 2005 Paul Cooijmans
Norming a test means generalizing its scores. But there are levels of generalization. If you standardize raw scores the usual way (that is, express them in standard deviations away from the mean), you have the first level of generalization. That is how I.Q. is normed.
This first level is only useful if all tests you ever want to compare are taken by the same group of testees. The standard scores are valid within that group, but not across groups. Because a second group of testees has a different intelligence level (unless it is the same by coincidence), any particular standard score of that group reflects a different intelligence level than that particular standard score in the first group.
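A minimal Python sketch of first-level generalization may make the limitation concrete. The two groups and their raw scores are invented for illustration, and the population standard deviation is assumed as the measure of spread.

```python
from statistics import mean, pstdev

def standard_scores(raw_scores):
    """Express each raw score in standard deviations away from the group mean."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD of the norming group (assumption)
    return [(x - m) / sd for x in raw_scores]

# Hypothetical raw scores of two groups on the same test.
group_a = [10, 12, 14, 16, 18]
group_b = [20, 22, 24, 26, 28]  # a stronger group with the same spread

z_a = standard_scores(group_a)
z_b = standard_scores(group_b)

# Both groups yield identical standard scores although their raw levels differ,
# so a given standard score means something different in each group.
print(z_a == z_b)
```

The standard scores carry no trace of the difference in level between the groups, which is why they are only comparable within one group.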
This has the following implications:
We need a second level of generalization, at which test scores can be compared even if the norming groups have different intelligence levels. A Golden Standard of intelligence, so to speak.
High-range tests have attempted this in the past decades by means of "anchored norming": raw scores are mapped to scores on previously taken tests. Although unmistakably better than regular psychology's norming methods, this has downsides:
Effectively this means high-range tests currently are not fully generalized at the second level; at best, they have been anchored cleverly to first-level generalizations. In addition, they have the advantage of offering scores independent of nationality, age and sex, they do not punish gifted persons by excluding hard items, and they do not suffer from the Flynn effect. Like gold, high-range intelligence is inert: unaffected by environmental factors like improved medical care and nutrition that (in my interpretation) cause the Flynn effect. If there is to be a golden standard of intelligence, it must be cast at the high end.
I have observed the following, ordered from most to least obvious:
[Remark as of August 2008: What follows below is speculative, and I am not yet certain whether it is the best possible way to go about it in practice. Meanwhile I have invented the method of protonorms, which is somewhat more conventional, and I am relying on that rather than on the method described below, which is nevertheless interesting theoretically. For achieving a Golden Standard of intelligence, the protonorm method currently seems more promising.]
Thanks to these tendencies, it is possible to go from a test's internal statistics to second-level generalization without the need for anchoring to prior scores.
The scale I choose to use is one with a generalized mean of 50 and a generalized standard deviation of 10. This is called "t-scores" in statistics, but my implementation is one level of generalization higher than normal t-scores, so I call them Generalized T-scores, or GT. One may also say "General intelligence T-scores", or "Golden T-scores".
The following internal test statistics are needed:
"Active items" are test problems that have been solved by at least one testee AND missed by at least one testee. For multiple-choice items that can easily be guessed right this criterion must be more stringent, e.g. demanding that some percentage of the testees have solved it.
An example of active hardness: if there are 30 active items, and the mean is 12 right, the active hardness is (30 - 12)/30 = .6
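The selection of active items and the active hardness computation can be sketched as follows. The response matrix is invented, and the sketch assumes that the mean number right is counted over the active items only; the function names are hypothetical.

```python
def active_item_indices(responses):
    """Indices of items solved by at least one testee and missed by at least one."""
    n_testees = len(responses)
    n_items = len(responses[0])
    active = []
    for i in range(n_items):
        solved = sum(1 for row in responses if row[i])
        if 0 < solved < n_testees:
            active.append(i)
    return active

def active_hardness(n_active, mean_right):
    """Proportion of active items that the average testee missed."""
    return (n_active - mean_right) / n_active

# Hypothetical responses: one row per testee, one column per item.
responses = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
active = active_item_indices(responses)
# Assumption: the mean is taken over active items only.
mean_right = sum(sum(row[i] for i in active) for row in responses) / len(responses)
ah = active_hardness(len(active), mean_right)

print(active, ah)
print(active_hardness(30, 12))  # the example from the text
```

The stricter criterion for easily guessed multiple-choice items is not implemented here; it would replace the `0 < solved` test with a minimum percentage.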
Normalized z-scores are obtained from raw scores by computing the centile for each score and converting it into a z-score via the normal distribution. The z-scores are thus forced into a normal distribution, at the possible expense of outlying points being dumbed down. How centiles are computed lies outside the scope of this article, and is explained in books on statistics and psychometrics.
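As an illustration, normalized z-scores can be computed along these lines. Since the computation of centiles is left to the textbooks, the sketch assumes the common mid-rank convention (scores below, plus half of the tied scores, divided by the number of testees); other conventions give slightly different values. The raw scores are invented.

```python
from statistics import NormalDist

def normalized_z_scores(raw_scores):
    """Map each distinct raw score to a z-score via its centile in the group."""
    n = len(raw_scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    result = {}
    for score in sorted(set(raw_scores)):
        below = sum(1 for x in raw_scores if x < score)
        equal = sum(1 for x in raw_scores if x == score)
        # Mid-rank centile as a proportion (assumption, see lead-in).
        centile = (below + 0.5 * equal) / n
        result[score] = std_normal.inv_cdf(centile)
    return result

# Hypothetical raw scores with one outlier at 20.
scores = [3, 5, 5, 7, 8, 8, 8, 10, 12, 20]
z = normalized_z_scores(scores)
for score, value in z.items():
    print(score, round(value, 2))
```

The outlier at 20 receives the same distance from the mean as the lowest score at 3, which shows how outlying points are pulled in by the normalization.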
Hybrid z-scores are, for each raw score, the average of the actual z-score and the normalized z-score. They allow outliers to be expressed while still paying proper respect to characteristic irregularities in the raw score distribution (using actual z-scores derived directly from raw scores results in incorrect norms wherever there are local bumps and valleys in the raw score distribution).
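The averaging step itself is simple; a short sketch, with invented z-scores for three raw scores, could look like this. The function name is hypothetical.

```python
def hybrid_z_scores(actual_z, normalized_z):
    """Average the actual and the normalized z-score for each raw score."""
    return [(a + b) / 2 for a, b in zip(actual_z, normalized_z)]

# Hypothetical values; the last raw score is an outlier.
actual = [-1.0, 0.0, 3.2]      # z-scores derived directly from raw scores
normalized = [-0.8, 0.1, 1.6]  # z-scores obtained via centiles
hybrid = hybrid_z_scores(actual, normalized)
print(hybrid)
```

The outlier keeps part of its distance from the rest of the group instead of being fully compressed by the normalization.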
Only when a test has been constructed so that it contains equal numbers of items from each level of difficulty, which results in a near normal raw score distribution, can actual z-scores be used instead of normalized or hybrid z-scores.
If a test yields scaled scores through a method of carefully chosen item weights, normalization is also not needed because item weighting itself has a normalizing effect.
Now two things need to be done: the mean must be mapped to a Generalized Mean (GM) for that test on the GT scale, and the SD must be mapped to a Generalized SD (GSD) for that test on the GT scale. The following formulas, amazingly, accomplish this:
Then the norm (Generalized T-score) for each raw score is given by:
GT = GM + GSD × z-score
"normal_range(1/n)" does the following: take 1/n (that is, the reciprocal of the number of testees) as a frequency in a normal distribution and find the corresponding z-score. Multiply by two (because the frequency occurs at both tails of the distribution) and by ten (to convert it to GT points).
The resulting range of GT points (to be visualized as centered over GT 50) is the possible range for the Generalized Mean. It cannot but be inside that range. Exactly where depends on the active hardness; when AH goes toward 1, GM goes toward the top of the range. When AH goes toward 0, the opposite.
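The description of normal_range, and the way the Generalized Mean moves within the resulting range, can be sketched in Python. `normal_range` follows the verbal description, reading "frequency" as a tail proportion of the normal distribution. `generalized_mean` is only an assumption: it places GM linearly within the range according to active hardness, which agrees with the stated behaviour as AH goes toward 0 or 1 but is not confirmed as the exact formula.

```python
from statistics import NormalDist

def normal_range(frequency):
    """Width in GT points of the possible range, centered over GT 50.

    frequency is 1/n, with n the number of testees (n must be at least 2).
    """
    z = abs(NormalDist().inv_cdf(frequency))  # z-score at which that frequency occurs
    return 2 * 10 * z  # both tails, then 10 GT points per standard deviation

def generalized_mean(active_hardness, range_width):
    """ASSUMPTION: linear placement of GM within the range, not the confirmed formula."""
    return 50 + (active_hardness - 0.5) * range_width

n = 100  # hypothetical number of testees
width = normal_range(1 / n)
low, high = 50 - width / 2, 50 + width / 2
gm = generalized_mean(0.6, width)  # active hardness .6, as in the earlier example
print(round(width, 1), round(low, 1), round(high, 1), round(gm, 1))
```

Under this reading, a larger group of testees widens the possible range, and an active hardness of .5 leaves the Generalized Mean at GT 50.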
The formula for GSD sets the Generalized Standard Deviation at 10 plus or minus a correction to account for the shrinking of the GSD as the Generalized Mean goes up. Analysis of existing data has resulted in an estimated shrink of 0.25 GT point for every 1 point rise of the GM. This may be fine-tuned later. The GSD shrinks to 0 when the GM approaches the highest expected score, which is somewhere between GT 81 and 90.
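A sketch of how the shrinking GSD feeds into the final norm, GT = GM + GSD × z-score. The `generalized_sd` function is an assumption built from the prose alone: 10 minus 0.25 GT point for every point that GM lies above 50 (and plus 0.25 for every point below), floored at 0. Under that reading the GSD reaches 0 at GM 90, the upper end of the interval mentioned. The function names and the example GM are hypothetical.

```python
def generalized_sd(gm, shrink_per_point=0.25):
    """ASSUMPTION: GSD = 10 - 0.25 * (GM - 50), never below 0."""
    return max(0.0, 10 - shrink_per_point * (gm - 50))

def gt_score(gm, gsd, z):
    """Generalized T-score for a z-score: GT = GM + GSD * z."""
    return gm + gsd * z

gm = 58.0                  # hypothetical Generalized Mean of a test
gsd = generalized_sd(gm)   # 8.0 under the assumed formula
norms = {z: gt_score(gm, gsd, z) for z in (-1.0, 0.0, 1.0, 2.0)}
print(gsd, norms)
```

The shrink rate is a parameter here because the text states that it may be fine-tuned later.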
One may wonder how I can analyze existing data while this method has only just been invented; actually it is possible to get an approximation of the GT scale in a different way, and I have been doing that for several years already to obtain the high-range centiles in this table. It involves combining the achieved I.Q. scores (normed conventionally on prior scores) of a representative selection of high-range tests and computing the centile for each I.Q. inside that combination.
The resulting centiles are converted to GT scores via the normal distribution. This way it is possible to analyze how the GT scale lies relative to I.Q., and how each test's mean and SD on the GT scale relate to the test's internal statistics. The challenge was to eliminate the intermediate step of anchor-norming to I.Q.s and go directly from internal statistics to GT scores. At first sight this seemed an obvious impossibility. As it happens, I was not destined to accept "impossible".
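The conversion from centile to GT via the normal distribution can be sketched as below, assuming centiles expressed as percentages strictly between 0 and 100. The function name is hypothetical.

```python
from statistics import NormalDist

def centile_to_gt(centile):
    """Convert a centile (0 < centile < 100) to the GT scale (mean 50, SD 10)."""
    z = NormalDist().inv_cdf(centile / 100)
    return 50 + 10 * z

for centile in (50, 84.13, 97.72):
    print(centile, round(centile_to_gt(centile), 1))
```

These three centiles lie roughly 0, 1 and 2 standard deviations above the mean, so they land close to GT 50, 60 and 70.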
With high-range tests, extrapolation of norms is typically needed because the test ceiling is (far) above the highest achieved score to allow for exactly those outliers these tests are meant for. My rule of thumb is that linear extrapolation is safe over half the square root of the actual score range. E.g., if the scores range from 6 to 42, it is safe to extrapolate 3 points in both directions. In practice I extrapolate somewhat further, but am aware that those norms are uncertain.
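The rule of thumb for extrapolation, written out as a small Python sketch with the example from the text. The function name is hypothetical.

```python
import math

def safe_extrapolation(lowest, highest):
    """Half the square root of the actual score range, in raw score points."""
    return 0.5 * math.sqrt(highest - lowest)

margin = safe_extrapolation(6, 42)       # 3.0
extended = (6 - margin, 42 + margin)     # norms considered safe from 3 to 45
print(margin, extended)
```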
A related phenomenon is the edge effect near the test floor and ceiling. Scores do not behave linearly there, but tend to "taper", that is, take bigger steps as they approach the floor or ceiling. This reflects that someone who scores close to the floor (ceiling) may really be below (above) it, and has entered the test's score range as a result of its measurement error. My rule of thumb is that each edge is half the square root of the total possible score range wide. E.g., a test with 100 items has edges of 5 on both sides. If you know the test's standard error (which is the standard deviation of the expected error) you can think about edges in a sophisticated way, like: if someone scores one standard error above the floor, there is a 16% chance (according to the normal distribution) he is really below the floor. However, it is uncertain if a test's standard error is constant over its full range (I think not), hence the rule of thumb.
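Both ways of thinking about edges can be sketched in a few lines. The second function assumes normally distributed measurement error with a constant standard error, which is exactly the assumption doubted above; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def edge_width(total_score_range):
    """Rule of thumb: half the square root of the total possible score range."""
    return 0.5 * math.sqrt(total_score_range)

def chance_truly_below_floor(standard_errors_above_floor):
    """Chance that the true level is below the floor, given the observed distance.

    ASSUMPTION: normally distributed error with a constant standard error.
    """
    return NormalDist().cdf(-standard_errors_above_floor)

print(edge_width(100))                          # 5.0 for a test with 100 items
print(round(chance_truly_below_floor(1), 2))    # about 0.16
```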
The relation between GT and I.Q. according to current data can be derived from this table.
Future analysis may fine-tune this relation. The I.Q. scale meant is the one used for my tests during the past years, with a standard deviation of 15, and in calibration roughly in accordance with other tests accepted by I.Q. societies. My intention is to regard GT as the true standard, and I.Q. as mere additional information. Of course GT will need time to prove itself before we can rely on it as a standard. Then, I.Q. (when still desired for informative reasons) will be normed on GT, not the other way around.
A next step will be to apply principles explained in this article to individual test items, such that a GT score can be derived from one's score on any selected combination of items, without the need for that combination of items being normed itself. I can then put a number of items online (probably from my existing tests), let people choose which combination thereof to take, and still arrive at a standard score for that Kit Intelligence Test (KIT). This is also possible in "item-response theory" in psychometrics, via "adaptive testing", but my version differs from that in that it (my version) relies on the tendencies of self-selection as explained in this article.
Note added later: I put the idea of a KIT described above into practice for several months in 2005. It worked well in that the resulting scores seemed accurate and valid, but it was too labour-intensive to score, so I stopped using it for the time being.