"Out of India"

Wed Dec 4 22:12:26 UTC 1996

Lars Martin Fosse <l.m.fosse at internet.no> wrote:

>>In cluster analysis, the aim is to study how the different variables
>>or different subjects group together. The results can be displayed
>>as a tree diagram. Closest neighbors are grouped together
>>and joined to a single vertex. Something not more closely related to
>>some than to the others is left as an isolated twig. Then this
>>process is repeated with each group obtained at this stage. This is the
>>kind of diagram that is found in write-ups of the `African Eve' theory,
>>for example.

>It should be added that there are several algorithms for cluster analysis
>that yield somewhat different results. It would be foolhardy simply to
>produce a cluster analysis and accept the result as authoritative; several
>attempts with different sets of criteria and different algorithms would have
>to be made.

That creates its own set of problems.

Large scale migrations are rare events. The burden of proof is on
those who assert that such an event occurred. If you do several
such analyses, you must report all of them and >explain< any negative
results; for example, that they failed to take into account all
measurable traits.

And anyway, this thread started with Joe passing on a hearsay report that
Lukacs had `shown' evidence of population discontinuity. I fail to see
that being questioned.

>>It may be worthwhile if Indology courses did require a course in
>>understanding statistics. For example, a rank correlation test on
>>the different attempts to arrange the books of Rgveda in a
>>chronological sequence would be eye opening.

>This has been attempted by Walter Wuest, but the result is not quite as
>eye-opening as one would wish. 

I guess I must stop being so cryptic. I was talking about comparing the
results arrived at by different methods, but was avoiding mentioning
specific names.

I have in mind a statement by Witzel (in ``The Indo-aryans of South
Asia'', see p.96) that the chronological ordering of the books of Rgveda
by Wuest and by Hoffmann agree ``more or less''. I took Wuest's
ranking as given by Witzel (I don't have ready access to Wuest's monograph)
and Hoffmann's ranking, leaving out Book 1 (which is missing from the
ordering quoted by Witzel), and computed the rank correlation. It comes out
to be a mere 0.07 (p-value 0.440). [Witzel quotes Wuest's ranking in four
groups separated by vertical bars. Treating membership in the same group as
ties, and ignoring within-group order, improves the rank correlation to
only 0.22 (p-value 0.290).] [In this example, the p-value is the probability
that two random orderings of the books would produce a correlation
coefficient at least this large.]
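
A minimal Python sketch of this kind of rank-correlation check follows. The
two orderings in it are made-up placeholders, NOT Wuest's or Hoffmann's
rankings, and the function names are hypothetical. Ties get the average of
the positions they span, and the one-sided p-value is found by enumerating
every reordering, which is feasible for nine books (9! = 362,880).

```python
from itertools import permutations


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Rank correlation: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def one_sided_p(x, y):
    """Share of all reorderings whose coefficient is >= the observed one.

    Means and variances of the ranks do not change under reordering, so
    comparing sums of products is equivalent to comparing coefficients.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    observed = sum(a * b for a, b in zip(rx, ry))
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        if sum(a * b for a, b in zip(rx, perm)) >= observed - 1e-9:
            hits += 1
    return hits / total


# Placeholder positions for nine books (assumed, for illustration only).
ordering_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ordering_b = [3, 1, 2, 6, 4, 5, 9, 7, 8]
rho = spearman(ordering_a, ordering_b)
p_value = one_sided_p(ordering_a, ordering_b)
print(f"rank correlation = {rho:.3f}, one-sided p = {p_value:.3f}")
```

To treat a ranking given in groups as ties, give every book in a group the
same number (say 1, 1, 2, 2, 2, ...) before calling spearman().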

>Since I wrote my thesis on the use of
>statistics in Indology, I second the opinion that such methods are valuable,
>but they are fraught with a large number of theoretical and practical
>difficulties that have to be solved. If anything is to be gained by using
>statistical methods in the study of Sanskrit texts, that "anything" will be
>gained with a great deal of very hard, painstaking drudgery, not to mention
>the problem of communicating the result to one's non-statistical colleagues
>afterwards!

I would like to read your thesis. Please let me know where you published it.

There is one issue that I am sure you must have addressed, but which
I cannot resist talking about.
This is the use of controls. Let me share an experiment
I performed a few months back. I took the Mahabharata text from
John Smith's files, and looked at the frequency of different types of
vipulas in the various parvans. Just for fun, I looked at the cantos of
Kumarasambhava (Kale edition) and Raghuvamsa (Nirnayasagar edition)
that were in anushtub. To my great surprise (and horror), I found that
Kumarasambhava was closer to Mahabharata (Bhishma and Drona parvans;
I did not try this with others) than it was to Raghuvamsa, and the
difference was fairly significant: I don't remember the p-values,
but they were close to 0.05. [This does not prove that Kumarasambhava and
Raghuvamsa were composed by different persons. The most serious objection
would be that I did not look at Trishtub and Jagati patterns, where
Kalidasa conforms to the traditional poetical theory, but Mahabharata does
not. Then there is the question of a critical edition of Kalidasa.]

The point is that without controls, there is no good way of knowing
if the statistical significance says anything about practical
significance. In particular, the frequency of vipulas in short
works would remain suspect in my eyes as a valid means of
comparison, till someone explains the strange case of Kalidasa.
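
One way to set up such a control comparison is sketched below in Python. It
is only an illustration: the test actually used for the vipula counts is not
stated above, so the Pearson chi-square test of homogeneity is an assumption,
and every count and name in the example is invented. The idea is to compute
the same statistic for a pair that ought to agree (two works by one author)
and for a pair that ought to differ, and to see whether the first really
comes out smaller.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two texts.

    counts_a[i] and counts_b[i] are how often pattern type i occurs in
    each text. Types absent from both texts are skipped.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both texts need counts for the same pattern types")
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    used = 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue
        used += 1
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, used - 1


# Invented counts of four pattern types in three texts (not real data).
work_one = [40, 25, 20, 15]      # hypothetical work by author X
work_two = [30, 35, 15, 20]      # hypothetical second work by author X
control_text = [45, 20, 25, 10]  # hypothetical text by someone else

same_author = chi_square_homogeneity(work_one, work_two)
cross_author = chi_square_homogeneity(work_one, control_text)
print("same author : chi2 = %.2f, df = %d" % same_author)
print("cross author: chi2 = %.2f, df = %d" % cross_author)
```

If the same-author statistic is not clearly the smaller of the two, a
"significant" difference between other pairs of texts says little by itself.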

Girish Beeharry <gkb at ast.cam.ac.uk> wrote:
>This adds another item to the list of 'I don't understand'; which
>algorithms are more 'objective'? Has non-parametric statistics
>(eg maximum likelihood) been used in this area?

I have not personally done cluster analysis. So, don't take my word
on this. And I got interested in cluster analysis because of
the `Out of Africa' theory, where the problems are different in
nature.

As I understand it, we can't quite call one method more objective
than the other. Some methods look for clusters of some particular
shape. Others do not look for clusters as such, but build trees by
joining nearest neighbors into one branch, then replacing them
by their average, etc. [Out of Africa analysis has the problem of
large data-sets, which should not be a problem in the HLK case.] There does
not seem to be any clear agreement as to which method suits a given
situation better than others.
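
The tree-building idea (join the two nearest items, replace them by their
average, repeat) can be sketched in a few lines of Python. This is a rough
illustration rather than any particular published algorithm: squared
Euclidean distance and a size-weighted average are my assumptions, and the
sample labels and measurements are invented.

```python
def build_tree(points):
    """Join nearest neighbors repeatedly; return the tree as nested tuples.

    points maps a label to a tuple of measurements. Each node is stored
    as (subtree, averaged measurements, number of original items).
    """
    nodes = [(label, tuple(vec), 1) for label, vec in points.items()]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                dist = sum((a - b) ** 2
                           for a, b in zip(nodes[i][1], nodes[j][1]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        (tree_i, vec_i, n_i), (tree_j, vec_j, n_j) = nodes[i], nodes[j]
        # Replace the joined pair by its size-weighted average.
        merged = tuple((a * n_i + b * n_j) / (n_i + n_j)
                       for a, b in zip(vec_i, vec_j))
        nodes = [node for k, node in enumerate(nodes) if k not in (i, j)]
        nodes.append(((tree_i, tree_j), merged, n_i + n_j))
    return nodes[0][0]


# Invented one-dimensional measurements for four samples.
samples = {"A": (0.0,), "B": (1.0,), "C": (10.0,), "D": (11.0,)}
tree = build_tree(samples)
print(tree)
```

Changing the distance or the averaging rule gives a different member of the
same family of methods, which is one reason results can differ between them.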

But I doubt that in the case of small data-sets the different methods
would give widely divergent answers. I don't believe that reanalyzing the
HLK data would produce the conclusion that there was a discontinuity
between 2000 BCE and 1000 BCE.

-Nath
Nath Rao (nathrao+ at osu.edu)		614-366-9341