Karthik Durvasula: Near mergers

Background

I was recently at the University of Kansas for a talk. I thoroughly enjoyed the thought-provoking conversations with both faculty and students there. In one of the meetings I had with graduate students, the super interesting topic of near mergers came up. In what follows, I have fleshed out my thinking on near mergers in order to elaborate on some of the things we discussed in that meeting.

Definition of near merger: There is a small difference in the phonetic realisation of (two) categories that is (nearly) imperceptible to the speaker.

There are multiple facts that we have to pay attention to in the case of near mergers. First, what does `imperceptible’ mean? It often means something like the speaker thinks that they are not producing a difference in the two categories. This will become important for us later.

Second, and related to the previous point, in near mergers, the productions get pretty close, and are claimed to be below the level of awareness — actually, this latter point is quite difficult to establish, because this is equivalent to showing that the null hypothesis is true — one always needs to worry about the power of the test, and whether there is sufficient sensitivity in the experimental probe.

Third, near mergers are often accompanied with a lot of variation in the relevant population. In such cases, some speakers are said to have full (phonetic) merger while others are said to have little to no (phonetic) merger.¹ As we will see, this is not an appropriate inference without far more work than just the observation of a cline of differences between the exponences of the two categories.

Fourth, speakers are able to undo the near mergers for the historical appropriate cases. So, cases of hypercorrection are far and few between. See the careful discussion of multiple such cases in Labov (1994) (starting on page 372).²

Near mergers seem to often be viewed as a problem for any view of categorical phonological representations. For example, at the beginning of Chapter 13, Labov (1994) states ‘[T]his chapter will confront discoveries that make it difficult to maintain the categorical view without modification.’ However, I just don’t see where the problem is — the definition of near mergers is in terms of phonetic proximity, yet the conclusion is about the merger of ‘categories’. If anything, the putative argument itself is problematic as the term ‘categories’ refers to cognitive objects that are manipulable by a computational system, while phonetic proximity is about exponence. That is, the two parts of the argument are simply talking about different things.³ It is entirely possible for there to be phonetic proximity while still keeping the categories distinct. There is no tension here at all; and the tension is largely because of an unclarity between exponence and categories. In what follows, I will try to show this through very simple formalisation and simulations.

Though Labov (1994) writes that he is generally arguing against categorical representations, including the generativist position, it is clear in many places that he is actually arguing against a structuralist position (particularly a Bloomfieldian position). And in reading others talking about near mergers being a problem for categorical representations, I get the same feeling that they are arguing against a specific Bloomfieldian position. However, it is far from fair to take an inconsistency with the structuralist phonology position and say this applies to all views of categorical representations, particularly because generative phonology rose as a reaction to structuralist phonology!

Finally, throughout, I consistently avoid talking about the social factors or the possible presence/knowledge of multiple dialects within speakers, as those issues are tangential to the issue of categories that near mergers are seen to be a problem for.

Formalisation

In order to simulate near mergers, we need to make clear what we are assuming first. Here I am going to assume a few things that are mostly vanilla (however, some do disagree with them):

A category is abstract.
An exponence of a category is represented as a parametric distribution.
The parametric distribution is a Gaussian distribution.
So only a mean and a variance (or standard deviation) need to be kept track of.

Consequently, each category can be formally modelled as a random variable. Let’s assume there are two categories (random variables) under discussion, C₁ and C₂, where

C₁ \(\sim\) N(\(M_1,\sigma{}^2_1\))

C₂ \(\sim\) N(\(M_2,\sigma{}^2_2\))

We can now keep track of the phonetic difference between the two categories as a new random variable, Diff:

Diff = C₂ - C₁

Note, in all of this, the two categories are distinct random variables, and are independently manipulable, and we can use them to identify another random variable representing the phonetic difference between the two.

Given the formalisation, three important facts emerge:

What is the expected value of the new random variable?
E[Diff] = E[C₂ - C₁]

= E[C₂] - E[C₁]

= \(M_2 - M_1\)

(C₁ and C₂ needn’t be independent variables)
What is the variance of the new random variable?
Var(Diff) = Var(C₂ - C₁)

= Var(C₂) + Var(C₁) - 2Cov(C₂,C₁)

If C₁ and C₂ are independent, then we can say,

Var(Diff) = Var(C₂) + Var(C₁)

= \(\sigma{}^2_2 + \sigma{}^2_1\)
What is the shape of the distribution?
The new random variable, Diff, will be a Normally Distributed too.
(see here for proof).

From the first result, we can see that the (small) value of Diff doesn’t tell us anything about the merger of the two categories — one can still have two separate categories underlyingly generating a small difference. The last result is crucial for us, namely, that Diff will be normally distributed. That is, there will be a range of values of differences, that have a mean equal to \(M_2 - M_1\).

And if the category means are small enough or the standard deviations are large enough, you could get a difference between two categories that is zero or even less than zero in a particular instance. We should however not conclude from this that some values indicate no difference and others indicate a difference. The underlying categories (C₁ and C₂) are always the same; however, some exponences/manifestations of the two categories might be similar in value. In short, the exponence of two values doesn’t tell us about the underlying categories.

Simulating production

I will try to make the same argument using simulations in R (R Core Team 2021) using tidyverse functions (Wickham 2017) as necessary. In the previous section, I modelled just a single repetition per speaker; in what follows, I will extend the reasoning to cases where there are multiple repetitions per speaker.

First, let’s define the category means and standard deviations for the two categories. Let’s also model 5000 speakers (arbitrary number) who each produce a single production of each category and have the same category means and standard deviations.

cat1_mean = 500
cat1_sd = 75
cat2_mean = 575
cat2_sd = 75
N = 5000

From the above parameters, we can now generate ‘pronunciations’ of C₁ and C₂ for each of the 5000 speakers each drawn from a normal distribution, and then calculate the difference between the pronunciations in a new column (Diff). Finally, the differences are sorted in increasing order. The first 20 values of the generated data (before sorting) are shown to give a feel for the data.

#Generating single repetitions for multiple speakers
Data1=data.frame(Sub = 1:N) %>% 
  mutate(Cat1 = rnorm(N,cat1_mean,cat1_sd),
         Cat2 = rnorm(N,cat2_mean,cat2_sd),
         Diff = Cat2 - Cat1)

head(Data1,20)

   Sub     Cat1     Cat2      Diff
1    1 516.0626 649.4511 133.38855
2    2 468.8909 548.1435  79.25266
3    3 542.8546 516.2234 -26.63118
4    4 433.7486 633.5654 199.81673
5    5 330.5797 593.6950 263.11529
6    6 539.6835 471.2541 -68.42934
7    7 540.7770 758.8203 218.04334
8    8 449.8911 581.1121 131.22101
9    9 619.4359 538.1048 -81.33104
10  10 534.8108 550.9567  16.14594
11  11 524.1666 603.9318  79.76516
12  12 368.0541 539.8668 171.81267
13  13 411.8424 534.1064 122.26399
14  14 371.7333 555.7575 184.02427
15  15 479.5597 608.3841 128.82442
16  16 493.6906 552.5116  58.82099
17  17 524.2553 646.8881 122.63280
18  18 600.0570 663.0832  63.02615
19  19 367.2121 551.1945 183.98240
20  20 573.6916 513.1454 -60.54624

# Sorting
Data1 = Data1 %>%
  arrange(Diff) %>% 
  mutate(Rank = rank(Diff))

If we now plot the sorted difference values, we immediately see that there is a cline of differences from very small-to-0 differences to much larger differences.

#There will necessarily be a cline of differences
Data1 %>% 
  ggplot(aes(x=Rank,y=Diff))+
  geom_point()

The cline is of course an automatic consequence of the fact that the new random variable, Diff, is normally distributed, as mentioned in the previous section on Formalisation. This can be seen in the density plot below too, which is normal-ish.

Data1 %>% 
  ggplot(aes(x=Diff))+
  geom_density()

The lesson to learn here is that, even though every speaker had the same underlying categories, some show more of a difference and some shown a much smaller difference. Therefore, we can’t see the differences in exponence and infer that the categories are merged for some and not merged for others. Such variation is expected with sufficient proximity of the means or sufficiently larger standard deviations.

OK, the immediate reaction might be that this is not what happens in such experiments. Usually, there are multiple pronunciations for each category, and so we should be looking at the means of the pronunciations for each category for each speaker. As you will see, that is not the issue. Thanks to the Law of Large Numbers, the range will be compressed, but there will still be a cline. Furthermore, if the means are close enough for the two categories or if the standard deviations are large enough, the difference in the means for participants can still vary between 0 and some large value.

Let’s generate 10 repetitions of each category for each speaker, and then calculate the average for each category for each speaker, and then get a difference value. Finally, the differences are sorted in increasing order. The first 20 values of the generated data (before averaging by subject and then sorting) are shown to give a feel for the data.

#Generating multiple repetitions for multiple speakers
NumRepsPerSub=10
Data2=data.frame(Sub = rep(1:N,each=NumRepsPerSub)) %>% 
  mutate(Cat1 = rnorm(N*NumRepsPerSub,cat1_mean,cat1_sd),
         Cat2 = rnorm(N*NumRepsPerSub,cat2_mean,cat2_sd),
         Diff = Cat2 - Cat1)

head(Data2,20)

   Sub     Cat1     Cat2       Diff
1    1 603.4363 618.0949  14.658582
2    1 456.0245 477.5881  21.563567
3    1 478.6785 610.8138 132.135252
4    1 524.4032 519.0791  -5.324131
5    1 528.5261 643.6761 115.150081
6    1 565.6977 584.5048  18.807135
7    1 496.4418 684.1046 187.662825
8    1 518.9374 525.3967   6.459315
9    1 568.3594 524.4862 -43.873226
10   1 583.5812 648.7491  65.167866
11   2 580.2276 662.7875  82.559860
12   2 449.5817 666.4188 216.837035
13   2 426.3650 495.8050  69.439994
14   2 534.0713 509.4989 -24.572463
15   2 342.6936 689.1340 346.440347
16   2 564.1594 558.8703  -5.289097
17   2 477.9123 581.7566 103.844309
18   2 492.4456 541.5309  49.085244
19   2 307.8589 610.7909 302.932026
20   2 544.0438 588.3908  44.346967

# By-subject averages and sorting
Data2 = Data2 %>% 
  group_by(Sub) %>% 
  summarise(meanDiff=mean(Diff)) %>% 
  arrange(meanDiff) %>% 
  mutate(Rank = rank(meanDiff))

As you can see below, there is still a cline that is quite large in range, and still includes values close to 0. Furthermore, though not shown here, the distribution of the values will tend towards normal (now thanks to the Law or Large Numbers and the the Central Limit Theorem).

#There will necessarily be a cline of differences
Data2 %>% 
  ggplot(aes(x=Rank,y=meanDiff))+
  geom_point()

The lesson to learn through the simulations is the same as what we saw through the formalisation — in general, phonetic differences or phonetic measurements don’t directly tell us about categories (unless you are in exemplar land, perhaps), particularly in single experiments. I made a nearly identical claim in Durvasula (2024).

Following up on that blog post, I’d say that one way to try to establish that there is more than random variation is to see if the pattern of behaviour is consistent across experiments that are substantially separated in time and in style. We want them to be separated in time/style because if the participant is recorded in the same type of experiments on the same day or close days, then one could say that whatever non-linguistic influences were present in the first experiment might be present in the second experiment too. To take the trivial example presented in the previous blog post, say someone is quite tired on day 1 because of how their life is structured, perhaps they are similarly tired on day 3 too, and therefore if you studied vowel durations, you might find that they are correlated and even longer than others’ productions — but this doesn’t automatically tell us that the person’s vowel categories/representations are necessarily different. Similarly, a participant might behave the same way in a task across a few days for any number of non-linguistic reasons; again, one can see the difficulty of trying to infer that different speakers have different (phonological) categories for what appear to be the same words based purely on the observations in a single experiment or closely related experiments.⁴ In short, one needs quite some care and very precise auxiliary hypotheses in establishing that speakers have different (phonological) categories, if (phonetic) exponence is used to infer categories.

A second way to establish phonological categories is to look at the behaviour of these sounds. For example, do the two categories trigger different phonological patterns? Has there been change over time in how the two categories trigger the same/different phonological processes?

We can now understand why near mergers can sometimes result in ‘reversals’, wherein the original contrast is returned for the original words (without hyper-correction) . There is no paradox here. As far as the computations are concerned, the two categories were always separate for the speaker, and so the parameters (means, in this case) can be moved further apart or brought close together. Ironically, this was effectively the solution that Labov (1994) gave at the end of Chapter 13, despite suggesting that near mergers were a problem for categorical phonology earlier.

In contrast, if the categories were truly merging, it is difficult to explain how the reversal can identify which words were historically with which categories to be able to undo them without any hyper-corrections in other words.

Simulating perception

This leads us to the last part related to perception. Note, a typical observation with near mergers is that the speakers don’t usually distinguish between the categories perceptually. However, as I pointed out in the introduction, ‘cannot distinguish’ is tested somewhat informally. Furthermore, as Labov (1994) himself notes that, with some reflection, some near-merger speakers are sometimes actually able to tell the difference between the two categories. This latter point should immediately raise the question of, ‘what is even the issue/definition of near mergers?’

OK, let’s ignore the latter issue and focus on the former issue wherein accuracy in identifying two categories is low to non-existent. The point should be obvious: if the means of the categories are close enough, or if the standard deviation of the categories are large enough, then identifying them will indeed be difficult. There is no surprise here at all.

To formalise this intution, one can model the perceiver as a simple Bayesian perceiver. The perceiver uses the category means and variances in a trivial Bayesian calculation to guess which category an input belong to. Let’s further assume that the perceiver attributes equal priors to the two categories (which we will set at 0.5 each).⁵

cat1_prior = 0.5
cat2_prior = 0.5
BayesianPerceiver = function(input,cat1_mean,cat2_mean,
                             cat1_sd,cat2_sd,
                             cat1_prior,cat2_prior){
  probabilityOfCat1 = cat1_prior*dnorm(input,cat1_mean,cat1_sd) / 
    (cat1_prior*dnorm(input,cat1_mean,cat1_sd) 
     + cat2_prior*dnorm(input,cat2_mean,cat2_sd))
}
BayesianPerceiver=Vectorize(BayesianPerceiver)

Now, if we feed the category means to the perceiver, the calculation shows that the perceiver will not be super confident about either category. In either case, the probability associated with the inferred category is not far from 0.5 (or chance, assuming no other source of bias/prior).

#Getting an input value close to/at Cat 1 mean.
BayesianPerceiver(input=cat1_mean,cat1_mean,cat2_mean,
                  cat1_sd,cat2_sd,cat1_prior,cat2_prior)

[1] 0.6224593

Above, I used the C₁ mean as the input to be perceived, but the same is true if we use the C₂ mean as the input to be perceived. The probability assigned to it being from C₂ won’t be far from 0.5, and therefore, the probability for the input stemming from C₁ would also be close to 0.5.

#Getting an input value close to/at Cat 2 mean.
1-BayesianPerceiver(input=cat2_mean,cat1_mean,cat2_mean,
                    cat1_sd,cat2_sd,cat1_prior,cat2_prior)

[1] 0.6224593

In fact, you can now use a whole range of input values from C₁ and C₂ and see that the average category probability won’t be far from 0.5 (depending on the means and standard deviations of the categories in the production, of course). Below, I simply use the multiple productions in the first set of simulated data in the previous section on Simulating production, and then calculate the average probability assigned to each category by the Bayesian perceiver.

#Overall perception across a range of values
Data1 %>% 
  mutate(ProbCat1ForInputCat1 = BayesianPerceiver(input=Cat1,cat1_mean,cat2_mean,
                                                  cat1_sd,cat2_sd,cat1_prior,cat2_prior),
         
         ProbCat2ForInputCat2 = 1-BayesianPerceiver(input=Cat2,cat1_mean,cat2_mean,
                                                    cat1_sd,cat2_sd,cat1_prior,cat2_prior)) %>% 
  summarise(meanCat1Probability = mean(ProbCat1ForInputCat1),
            meanCat2Probability = mean(ProbCat2ForInputCat2))

  meanCat1Probability meanCat2Probability
1           0.5999337           0.5989643

You can see above that, whether we look at the probabilities associated with perceiving the mean exponence values of the categories, or of the mean of the probabilities for a range of exponence values from each categories, under the right circumstances, the probabilities are not that far from chance (0.50 assuming no other source of bias/prior). Such a low accuracy can easily be confused as no difference in perception if one doesn’t run a careful study.

Furthermore, as Labov (1994) notes on p. 360, near mergers are also quite difficult for non near-merger listeners to identify. In discussing the <fool>\(\sim\)<full> near merger in Abuquerque (represented to the participants as ‘double-O’ vs. ‘double-L’), Labov (1994) notes ‘[T]hey had a great deal of trouble deciding which of Dan’s words were “double-O” and which were “double-L”, although they were ultimately correct in 83% of their judgements. Since 100% correct is normally the criterion for a passing grade in a commutation test, these results must be considered marginal.’ (boldfacing added by me)

The assumption that anything less than 100% accuracy on distinguishing words/sounds is rather extreme. Practically all segmental contrasts result in non-100% identification accuracy in experiments — variable or probabilistic percepts are true even in the case of clear contrasts in a language, as is known since the classic work by Miller and Nicely (1955). Their study shows that there is a lot of variability in responses even in the case of (licit) monosyllabic Cɑ nonce words (even in the high signal-noise ratio condition), with some CV sequences showing far more variable identification than others. For example, they observed that [θ] is often asymmetrically confused with [f] at extremely high rates (a phenomenon that is well-known today). So, a real word like [θɪk] might sometimes be perceived as the nonceword [fɪk]. Therefore, it is unclear to me what one actually learns about (phonological) categories from the fact that in near mergers people have incredible difficulty in teasing apart categories, apart from the fact that the two categories are likely auditorily proximal.

Discussion

The relevant facts related to near mergers:

The exponences of the two categories are phonetically close.
The two categories are difficult to distinguish perception.
The near merger can sometimes be reversed, crucially with few cases of hyper-correction.
Sometimes, some speakers in the community seem to show full merger, while others don’t.

All of these can be explained if we are clear about the distinction between (phonological) categories and phonetic exponence/manifestations. There is simply no issue for categorical representations, in general.

One can’t infer (phonological) catgories from the phonetic exponence directly
One upshot of the discussion above is that one simply can’t infer phonological categories directly from the phonetic manifestations. Similarly, in Du and Durvasula (2022) and Du and Durvasula (2024), we observe in Huai’an Mandarin that a derived Tone 3, which is often much closer to the underlying source (either Tone 1 or Tone 4) triggers Tone 3 sandhi while a phonetically proximal Tone 1 or Tone 4 sandhi doesn’t, and phonetically distal underlying Tone 3 does! Note, tone sandhis in Huai’an Mandarin are perfectly regular and happen at the post-lexical level, and so can’t be chalked up to lexical idiosyncracies of otherwise phonologically identical representations. Again, this suggests that one can’t and should not be inferring (phonological) categories directly from phonetic exponence.

What is a judgement in such perception experiments?
In such studies there appear to be two different kinds of ‘perception tests’, and the results are collapsed across them as if the task was the same. In one kind, other speakers are presented with recordings of a near-merger speaker and asked what word was uttered — this smells like a regular word perception/identification task.

In contrast, when the near-merger speaker themself is tested, it is often the case that they are simply asked if two words sound the same, and no recordings are typically played to them. In such cases, it is fair to ask, what is the task the listener is performing? We normally assume that the speaker can just inspect their own lexical representations and say that they are the same or different; and so, if they say the two words are the same, it must be that they think they have the same representation; hence the paradox.

I don’t see why this is obvious actually. Perhaps, the listener is simulating the pronuciation in their heads and then trying to see if they can ‘hear’ a difference, and respond based on that. We already have independent reasons to believe that such internal forward models are perhaps used in perception (see Poeppel and Monahan (2011) for discussion), so this is really not much of a stretch. Bottomline, a lot more careful scrutiny of the task is needed along with much higher power in such experiments. Given how low the accuracy can be (see simulations of the perceptual process above), one simply can’t simply ask the participant for a few pairs and infer clearly from that, as has often been done in the past.

So, when is there a real full merger?
The simple answer is when the learner posits a single category instead of two. Now, there are a variety of cases where a learner might do this:

The categories don’t trigger different behaviours in any way.
They are not involved in different phonological patterns.
The difference in categories along all relevant phonetic dimensions⁶ is below
the Just Noticeable Difference; then the learner could (if there is no other pattern
to the contrary) infer that there is just a single category.

In the case of true category mergers, there can be no reversals (without hypercorrections, that is) because the speaker simply has no clue how the historical relevant contrast plays on in their lexicon, since they represent the historical contrast with a single category. This is in fact case as argued for by Labov (1994), and is a reflection of what he calls Garde’s priciple of irreversibility (Garde 1961).

Du, Naiyan, and Karthik Durvasula. 2022. “Phonetically Incomplete Neutralization Can Be Phonologically Complete: Evidence from Huai’an Mandarin.” Phonology.

———. 2024. “Psycholinguistics and Phonology: The Forgotten Foundations of Generative Phonology.” Cambridge Elements on Phonology.

Durvasula, Karthik. 2024. “Karthik Durvasula: Individual Variation?” https://karthikdurvasula.gitlab.io/posts/2024-02-14-individual-variation/.

Garde, Paul. 1961. “Réflexions Sur Les Différences Phonétiques Entre Les Langues Slaves.” Word 17 (1): 34–62.

Labov, William. 1994. Principles of Linguistic Change. Volume i: Internal Factors (Language in Society 20). Vol. Volume I. Oxford, UK: Blackwell.

Maguire, Warren, Lynn Clark, and Kevin Watson. 2013. “Introduction: What Are Mergers and Can They Be Reversed?” English Language and Linguistics 17 (2): 229–39. https://doi.org/10.1017/S1360674313000014.

Miller, George A., and Patricia E. Nicely. 1955. “An Analysis of Perceptual Confusions Among Some English Consonants.” The Journal of the Acoustical Society of America 27 (2): 338–52. https://doi.org/10.1121/1.1907526.

Nunberg, Geoffrey. 1975. A Falsely Reported Merger in Eighteenth Century English. Vol. 1. 2. US Regional Survey.

Poeppel, David, and Phillip J Monahan. 2011. “Feedforward and feedback in speech perception: Revisiting analysis by synthesis.” Language and Cognitive Processes 26 (7): 935–51.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.

Senn, Stephen. 2014. “Mastering Variation: Variance Components and Personalised Medicine.” Statistics in Medicine 35 (7): 966–77. https://doi.org/10.1002/sim.6739.

Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the “Tidyverse”. https://CRAN.R-project.org/package=tidyverse.

‘Watt (1998a, 1998b), Watt & Milroy (1999) and Maguire (2008) suggest that, for some speakers in Tyneside at least, complete merger in production was typical’. See p. 235 of Maguire, Clark, and Watson (2013)↩︎
The discussion of the complex observations/claims of Nunberg’s related to /aj/ and /oj/ in eighteenth century English (Nunberg 1975) was particularly interesting.↩︎
That is, in fancy talk, it’s a non-sequitur.↩︎
The same point is established by Senn (2014) in the context of personalised medicine.↩︎
Note, the observed distributions become interestingly complex when we play around with the prior probabilities, but that takes us away from our main point.↩︎
What are relevant phonetic dimensions is a minefield that I will leave for anotehr day.↩︎

Comment on this article Share:

Near mergers

Background

Formalisation

Simulating production

Simulating perception

Discussion

References

Citation