Vreleksá Forum Index Vreleksá
The Alurhsa Word for Constructed: Creativity in both scripts and languages
 

Quantifying the similarities between words.

 
eldin raigmore
Admin


Joined: 03 May 2007
Posts: 1621
Location: SouthEast Michigan

Posted: Thu Jun 12, 2008 8:25 pm    Post subject: Quantifying the similarities between words.

In a Yahoo! group I'm no longer in, the question of how to quantify the similarity of two words was discussed.

Here are two methods based on things professionals have written, that were not discussed on that group.

------------------------------------

(1)
(This method is original with me AFAIK, but the ideas on which I've based it are not original with me; unfortunately I can't remember the originator at the moment. I probably will in a few days.)

The segments (that is, phonemes) closer to the beginning and closer to the end of a word make more difference to how similar two words are felt to be than the segments far from either end.

Those nearer the beginning are somewhat more important than those nearer the end. (Think of the word as a guy sitting in his bathtub; his head sticks out at one end and his feet stick out at the other, but his head sticks out further than his feet.)

A book I've read (whose author I unfortunately can't recall at the moment) showed research on how likely two words were to be confused with each other based on the first, first two, first three, or first four phonemes, and based on the last, last two, last three, or last four phonemes.

Concentrating on the beginning for the moment:
Every segment that's one of the first four segments of both words, makes them a bit similar; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
Every segment that's one of the first three segments of both words, makes them a bit more similar than if it's the fourth segment in one of them; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
Every segment that's one of the first two segments of both words, makes them a bit more similar than if it's the third segment in one of them; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
If the two words have the same first segment, that makes them a bit more similar than if the first segment of one is the second segment of the other.

Concentrating on the end for the moment, we have similar results:
Every segment that's one of the last four segments of both words, makes them a bit similar; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
Every segment that's one of the last three segments of both words, makes them a bit more similar than if it's the fourth-from-last segment in one of them; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
Every segment that's one of the last two segments of both words, makes them a bit more similar than if it's the antepenultimate segment in one of them; if the segment shows up in the same position in both words it makes them a bit more similar than if it doesn't.
If the two words have the same last segment, that makes them a bit more similar than if the last segment of one is the penultimate segment of the other.

In all cases, similarity at the beginning is a bit more important than similarity at the end.

(A fact I shall ignore for the purposes of this post:
If the first syllable of a word is unstressed, its influence is somewhat shared with the first stressed syllable. If the last syllable of a word is unstressed, its influence is somewhat shared with the last stressed syllable.)

So, here's my suggestion (and I'm just going to assume that both words have the first and last syllables stressed):
Give the pair 14 points if they have the same first phoneme.
Give the pair 13 more points if they have the same last phoneme.
Give the pair 12 more points if they have the same second phoneme.
Give the pair 11 more points if they have the same penultimate phoneme.
Give the pair 10 more points if the first word's 2nd phoneme is the 2nd word's 1st phoneme.
Give the pair 10 more points if the 2nd word's 2nd phoneme is the 1st word's 1st phoneme.
Give the pair 9 more points if the first word's penultimate phoneme is the 2nd word's last phoneme.
Give the pair 9 more points if the 2nd word's penultimate phoneme is the 1st word's last phoneme.
Give the pair 8 more points if they have the same 3rd phoneme.
Give the pair 7 more points if they have the same antepenultimate phoneme.
Give the pair 6 more points if the 1st word's 3rd phoneme is the 1st or 2nd phoneme of the 2nd word.
Give the pair 6 more points if the 2nd word's 3rd phoneme is the 1st or 2nd phoneme of the 1st word.
Give the pair 5 more points if the 1st word's antepenultimate phoneme is the last or penultimate phoneme of the 2nd word.
Give the pair 5 more points if the 2nd word's antepenultimate phoneme is the last or penultimate phoneme of the 1st word.
Give the pair 4 more points if they have the same 4th phoneme.
Give the pair 3 more points if they have the same 4th-from-last phoneme.
Give the pair 2 more points if the 1st word's 4th phoneme is the 1st or 2nd or 3rd phoneme of the 2nd word.
Give the pair 2 more points if the 2nd word's 4th phoneme is the 1st or 2nd or 3rd phoneme of the 1st word.
Give the pair 1 more point if the 1st word's 4th-from-last phoneme is the last or penultimate or antepenultimate phoneme of the 2nd word.
Give the pair 1 more point if the 2nd word's 4th-from-last phoneme is the last or penultimate or antepenultimate phoneme of the 1st word.

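If both words have at least 8 segments, a word compared with itself scores 72 (=14+13+12+11+8+7+4+3), as long as no phoneme is repeated within its first four segments or within its last four; I take that as the maximum possible score. (If phonemes do repeat there, the rules that compare different positions also fire, and the total can go above 72.) The similarity can be a fraction, actual-score-divided-by-maximum-possible-score.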

Obviously if either word is shorter than four segments, the maximum possible score will be different; and if either is shorter than eight segments, it could complicate things.

Also, if both words are longer than eight segments, they could get the maximum score and still not be the same.
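The point scheme above can be sketched in Python roughly as follows. This is only a sketch under some assumptions that are not in the original scheme: each character of a string stands for one phoneme, a position that doesn't exist in a short word simply never matches, and the names `score` and `similarity` are made up for illustration. The default maximum of 72 is the figure given above.

```python
from fractions import Fraction

# Indices count from the start when non-negative (0 = first phoneme)
# and from the end when negative (-1 = last phoneme).

# (points, index): both words have the same phoneme at this index.
SAME_POSITION = [
    (14, 0), (13, -1), (12, 1), (11, -2),
    (8, 2), (7, -3), (4, 3), (3, -4),
]

# (points, index, other_indices): one word's phoneme at `index` equals the
# other word's phoneme at any of `other_indices`. Checked in both directions.
CROSS_POSITION = [
    (10, 1, (0,)), (9, -2, (-1,)),
    (6, 2, (0, 1)), (5, -3, (-1, -2)),
    (2, 3, (0, 1, 2)), (1, -4, (-1, -2, -3)),
]


def _at(word, i):
    """Phoneme at index i, or None if the word is too short (assumption)."""
    return word[i] if -len(word) <= i < len(word) else None


def score(a, b):
    """Total points for the pair (a, b) under the 20 rules above."""
    total = 0
    for points, i in SAME_POSITION:
        x = _at(a, i)
        if x is not None and x == _at(b, i):
            total += points
    for points, i, others in CROSS_POSITION:
        for w1, w2 in ((a, b), (b, a)):
            x = _at(w1, i)
            # Awarded once per direction, even if several positions match.
            if x is not None and any(x == _at(w2, j) for j in others):
                total += points
    return total


def similarity(a, b, maximum=72):
    """Score as a fraction of the assumed maximum."""
    return Fraction(score(a, b), maximum)


print(score("florist", "florits"))       # 66
print(similarity("florist", "florits"))  # 11/12, i.e. 66/72 reduced
```

Keeping the rules in two small tables rather than twenty `if` statements makes it easy to try other weights.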

Here are some examples:
"florist" and "florits": 14 + 12 + 9 + 9 + 8 + 7 + 4 + 3 = 66/72
"florist" and "florsit": 14 + 13 + 12 + 8 + 5 + 5 + 4 + 3 = 64/72
"florist" and "floirst": 14 + 13 + 12 + 11 + 8 + 1 + 1 = 60/72
"florist" and "flroist": 14 + 13 + 12 + 11 + 7 + 2 + 2 = 61/72
"florist" and "folrist": 14 + 13 + 11 + 7 + 6 + 6 + 4 + 3 = 64/72
"florist" and "lforist": 13 + 11 + 10 + 10 + 8 + 7 + 4 + 3 = 66/72

Can anyone tell whether I got those right?

-------------------------------------------------

(2) "Wickelphones": named after the psychologist Wayne Wickelgren.

(Note: this is probably not original with me; but I don't know who it is original with.)

A "Wickelphone" is a sequence of three consecutive phonemes.
For the sake of Wickelphones, a word boundary is regarded as a "special" phoneme; it can be the first "phoneme" of a Wickelphone (showing the word starts with the other phoneme(s) in the Wickelphone), or the last "phoneme" of a Wickelphone (showing the word ends with the other phoneme(s) of the Wickelphone), but it can't be the middle "phoneme" of a Wickelphone.

Here's the method.
Find all the Wickelphones that occur in either word; also find all those that occur in both words.
Make a fraction whose numerator is the number of Wickelphones that occur in both words, and whose denominator is the number of Wickelphones that occur in one word or the other or both.

Example:
"haplology" vs "haplogy".
In "haplology" we have:
/$ha/, /hap/, /apl/, /plo/, /lol/, /olo/, /log/, /ogy/, /gy$/.
In "haplogy" we have:
/$ha/, /hap/, /apl/, /plo/, /log/, /ogy/, /gy$/.

Every one of the seven wickelphones that occur in "haplogy" also occurs in "haplology"; but of the nine wickelphones that occur in "haplology", two (namely /lol/ and /olo/) don't occur in "haplogy". So the similarity fraction is 7/9.

A couple of notes: this method doesn't care what order the wickelphones occur in; so, for instance, "bafedbaged" and "bagedbafed" would be 100% similar.

Also, it only cares whether or not a wickelphone occurs in a given word, not how many times it occurs; so, for instance, "badbad" and "badbadbad" would be 100% similar.
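The Wickelphone method amounts to comparing two sets (shared triples divided by all triples), so it can be sketched in a few lines of Python. The function names are made up for illustration, and the sketch assumes one character per phoneme, "$" as the word-boundary marker, and non-empty words.

```python
from fractions import Fraction


def wickelphones(word, boundary="$"):
    """Set of all three-phoneme windows, with a boundary marker at each end."""
    padded = [boundary, *word, boundary]
    return {tuple(padded[i:i + 3]) for i in range(len(padded) - 2)}


def wickel_similarity(a, b):
    """Shared Wickelphones divided by Wickelphones in either word.

    Assumes both words are non-empty (otherwise the denominator is zero).
    """
    wa, wb = wickelphones(a), wickelphones(b)
    return Fraction(len(wa & wb), len(wa | wb))


print(wickel_similarity("haplology", "haplogy"))       # 7/9
print(wickel_similarity("bafedbaged", "bagedbafed"))   # 1
print(wickel_similarity("badbad", "badbadbad"))        # 1
```

Because sets ignore both order and repetition, the two quirks noted above fall out of the data structure automatically.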

------------------------------------------------

There may be many pairs of distinct words in natlangs which would be judged "maximally similar" by both of these methods. These pairs of words are likely to get confused with each other by persons who are hearing one or both of them for the first time; for instance, children or foreigners, or even non-specialists or persons untrained in a specialty that one of the words belongs to. Consider "empathic" and "emphatic", for instance; or "realty" and "reality".

The same is true of the fictional speakers of your conlangs. If you want to make sure your words can't be confused with each other, make sure that distinct words don't come across as, maybe, more than 90% (or 98% or whatever fraction you choose) similar to each other by one or both of the above methods (or another method if you have one). On the other hand, if you have two words which are very similar by the above criteria, consider the likelihood some of your speakers will confuse them, and the others will make special efforts to distinguish them.
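Checking a whole lexicon against a threshold could look something like the sketch below. It uses the Wickelphone fraction as the similarity measure purely as an example; any other measure returning a number between 0 and 1 could be swapped in. The function names, the tiny word list and the 0.75 threshold are all invented for illustration.

```python
from itertools import combinations


def wickelphones(word, boundary="$"):
    """Set of three-phoneme windows, assuming one character per phoneme."""
    padded = [boundary, *word, boundary]
    return {tuple(padded[i:i + 3]) for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared Wickelphones over all Wickelphones, as a float in [0, 1]."""
    wa, wb = wickelphones(a), wickelphones(b)
    return len(wa & wb) / len(wa | wb)


def confusable_pairs(lexicon, threshold=0.9):
    """All pairs of distinct entries whose similarity exceeds the threshold."""
    return [
        (a, b, s)
        for a, b in combinations(lexicon, 2)
        if (s := similarity(a, b)) > threshold
    ]


# Made-up lexicon; a real one would be the conlang's word list.
lexicon = ["badbad", "badbadbad", "haplology", "haplogy"]
for a, b, s in confusable_pairs(lexicon, threshold=0.75):
    print(f"{a} / {b}: {s:.2f}")
```

This compares every pair, which is fine for a few thousand words but grows quadratically with the size of the lexicon.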

-------------------------------------------------------------

(3) A third method, not original with me, that I saw on a Yahoo! group I can't get to anymore, and so I can't find out the originator's ID.

Both of the above methods care whether two phonemes are identical or not; neither cares whether they are similar if they aren't the same.

The methods discussed on the Yahoo! group I mentioned, did rate phonemes according to how similar they were.

The language in question had a CV syllable structure; it had 15 consonants, if I remember correctly, and I don't remember how many vowels.

The conlanger numbered the consonants from 1 to 15 such that the more different the assigned numbers were, the easier it was to distinguish the two consonants from each other; and the closer the numbers were, the likelier the two consonants were to get confused with each other.

Separately, the vowels were also numbered to the same effect.

Given two words of the same length, the absolute value of the difference between the two corresponding phonemes' numbers was calculated; in order to avoid zeroes, a 1 was added to each such absolute value. Then the product of those numbers was calculated, and that was the "distance" between the two words. (Note that this couldn't be less than one; the "distance" between a word and itself was 1, not zero, which is why I've put "distance" in quotes.)

So if two words differed by only one phoneme (they were a "minimal pair"), and the phoneme in one word was very similar to the corresponding phoneme in the other, the "distance" might be calculated as 2. But if the phonemes were as different as possible, the "distance" would be much larger: for consonants numbered 1 and 15 it would be |1 - 15| + 1 = 15.

It would take four pairs of minimally-different phonemes to make two words' distance be 16; if there were only three, the distance would be 8, less than the distance between two words differing in only one phoneme, if that pair of phonemes happened to be very, very different.
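Here is a rough Python sketch of this product "distance". The original numbering of the consonants and vowels isn't available, so the two tables below are invented placeholders (15 consonants and 5 vowels in an arbitrary order) and say nothing about which sounds are really easy to confuse. The sketch also assumes one character per phoneme and that both words follow the same CV pattern, so a consonant is only ever compared with a consonant.

```python
from math import prod

# Hypothetical numberings: closer numbers are meant to be easier to confuse.
CONSONANT_RANK = {c: i for i, c in enumerate("pbmfvtdnszkgxhl", start=1)}
VOWEL_RANK = {v: i for i, v in enumerate("ieaou", start=1)}


def distance(a, b):
    """Product over positions of (|rank difference| + 1); never less than 1."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    factors = []
    for x, y in zip(a, b):
        # Pick the table by the first word's phoneme; the second word's
        # phoneme is assumed to be of the same type (consonant or vowel).
        table = CONSONANT_RANK if x in CONSONANT_RANK else VOWEL_RANK
        factors.append(abs(table[x] - table[y]) + 1)
    return prod(factors)


print(distance("pata", "pata"))  # 1: a word and itself
print(distance("pata", "bata"))  # 2: one pair of adjacent consonants
print(distance("pata", "lata"))  # 15: consonants numbered 1 and 15
print(distance("papa", "baba"))  # 4: two minimally different pairs
```

With numbers running from 1 to 15, the largest factor a single position can contribute is |1 - 15| + 1 = 15, while four minimally different positions give 2 * 2 * 2 * 2 = 16.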
_________________
"We're the healthiest horse in the glue factory" - Erskine Bowles, Co-Chairman of the deficit reduction commission