In the previous blog post, I discussed how perception can be viewed as repeated sampling with categorical representations — the view boils down to claiming that observed gradience in perception is really an averaging artefact over multiple categorical percepts.
Such a view derives a few interesting things. First, it derives the fact that there can be gradience in judgments, even within a single speaker. Second, it derives the fact that behavioural experiments with greater task complexity result in a more linear (or gradient) categorisation function. Third, it derives the fact that reaction times are longest for stimuli close to the category boundary. These are quite exciting results, as far as I am concerned.
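For readers who have not seen the previous post, here is a toy sketch of the first point in Python. All the parameters (a boundary at 20 ms, a noise standard deviation of 8, 100 samples per VOT value) are illustrative choices of mine, not the previous post's exact values: every individual percept is categorical, yet the averaged proportions come out gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each percept is categorical ("voiced" vs. "voiceless"), decided by
# comparing the stimulus VOT plus noise to a category boundary.
# Averaging over many such categorical percepts yields a
# gradient-looking proportion.
boundary = 20.0              # category boundary along the VOT continuum (ms)
noise_sd = 8.0               # trial-to-trial noise around the boundary
vots = np.arange(0, 45, 5)   # 0-40 ms in 5 ms steps, as in Toscano et al.

for vot in vots:
    # 100 categorical percepts for this VOT value
    percepts = (vot + rng.normal(0, noise_sd, size=100)) > boundary
    print(f"VOT {vot:2d} ms: proportion 'voiceless' = {percepts.mean():.2f}")
```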
Continuing the discussion in that blog post, there is one sort of observation that researchers have used to claim that perception really involves gradient representations. The observation comes from electrophysiological work looking at specific components of Event Related Potentials (ERPs), namely, the N100/N1 and the P3/P300 (McMurray 2022; Toscano et al. 2010). For example, Toscano et al. (2010) presented participants with two synthesised VOT continua, beach~peach and dart~tart, where the VOT was increased in 5 ms increments (range: 0–40 ms). The task was a standard mismatch paradigm, where participants had to press a button for the target and non-target stimuli.¹ The targets were presented, on average across participants, in 25% of the instances. Participants’ ERPs were recorded from the onset of the stimulus, and multiple electrode sites² were employed to get an average ERP response.
The average ERP response they observed showed that the N100 component varied linearly with VOT. Their figure and corresponding caption are reproduced below as Figure 1.
Figure 1: N1 results. (A) Grand average ERP waveforms, averaged across frontal electrodes, for each VOT condition. (B) Mean N1 amplitude as a function of the nine VOT conditions and two stimulus continuum conditions (beach/peach and dart/tart). Error bars represent standard error. (C) Mean N1 amplitude as a function of VOT and target voicing (voiced or voiceless) for trials where participants made target responses. The size of each data point in the figure is proportional to the number of trials for that condition.
The above observation was taken as evidence that perception at its source depends on gradient or continuous representations. As I will argue here, this inference is not appropriate given the experimental probe, namely, ERP components.
Toscano et al. (2010) observed that the N1 response to VOT changes was linear, while participants’ responses were sigmoidal in a two-alternative forced choice task (as is typical in such behavioural experiments). Crucially, for their claim that perception is underlyingly based on gradient representations to go through, they need the neural data to be a better and more direct measure of the underlying perceptual system than the behavioural data. However, it is difficult to argue that the N1 is such a measure. A priori, it certainly feels like directly measuring the brain should get us closer to the underlying perception, but this is far from clear, as I will try to show below. In fact, the measure is likely extremely noisy, a fact that will become quite important to us later.
First, ERPs are based on scalp voltages, and so are not a direct measure of the processing in the auditory cortex. In contrast to their results, Steinschneider et al. (1994) and Sharma and Dorman (1999), who measured Cortical Auditory Evoked Potentials (a more direct measure of the computation in the auditory cortex), do find more categorical results for the VOT continuum. Now, Toscano et al. (2010) attribute this difference in results to the fact that “differences in the construction of stimuli allowed us to observe effects that may have been masked in previous studies”. However, given that the measurement in their experiment was quite indirect compared to that in Steinschneider et al. (1994) and Sharma and Dorman (1999), the difference in experimental probes could itself explain the difference in results, as we will see later.
Second, the N1 they measured in the ERP signal is simply assumed to be purely auditory, and this assumption is crucial for their reverse inference about the underlying computation to go through. But we have long known that ERP components don’t have such a simple one-to-one relationship to neural computation. For example, as noted in the Wikipedia article on the N100, the N1 has also been observed to be sensitive to changes in visual (Warnke, Remschmidt, and Hennighausen 1994), olfactory (Pause et al. 1996), heat (Greffrath, Baumgärtner, and Treede 2007), pain (Greffrath, Baumgärtner, and Treede 2007), balance (Quant, Maki, and McIlroy 2005), respiration-blocking (Chan and Davenport 2008), and somatosensory (Wang et al. 2008) stimuli. All of these are observed in the fronto-central region of the scalp in EEGs, exactly the region that Toscano et al. (2010) measured. This sets up a reverse inference problem similar to the one that Poldrack (2006) noted. As Poldrack pointed out, correctly in my opinion, the general tendency to infer cognitive processes from neuroimaging data is very often inappropriate, because such reverse inference logically depends on the measure or area being highly selective, which is rarely the case.
While Poldrack (2006) was specifically focussed on the problem of inferring cognitive processes from neuroimaging data (activation of specific neural regions), the logical point he made extends beyond that case: it applies to all (neural) measures. Any measure has to be sufficiently selective for it to be useful in understanding the underlying truth, and neural measures rarely meet this criterion. It is clear from the above discussion that the criterion is definitely not met by the N100. Therefore, there is no sense in which the N100 is a better and more direct measure of perceptual computation than human behaviour itself. Relatedly, given that neural measures are typically sensitive to so many different inputs, they are likely to be much noisier.
So, while studying neural responses and observing correlations with behaviour is (hopefully) helpful in further understanding how the brain works, one should be very cautious in using such data as an argument for specific cognitive computations or representations.
To summarise: the linear relationship between the N100 and VOT differences that Toscano et al. (2010) observed should not be taken as an argument against categorical representations in perception.
As I have ranted many times to my students and in some of my more recent work, consistency is a very weak result — in this case, all we have said so far is that categorical representations are consistent with the linearity observed in the N100 response, largely because the latter is not particularly informative of the underlying computational system. But, can we go further than that? Can we actually understand why there is a linear relationship between the N100 and VOT in the task?
Here, I will point to an issue I raised a couple of times above: the N100 is likely a very noisy measure. As shown in the previous blog post, when perception is seen as repeated sampling of categorical representations, additional noise mathematically leads to a more linear response curve.
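To make the intuition concrete, suppose (as a simplifying assumption on my part; the previous post’s simulations are the actual reference point) that each sampled percept is “voiceless” exactly when the perceived VOT, corrupted by Gaussian noise with standard deviation σ, exceeds a boundary β. Then the expected proportion of “voiceless” percepts is

$$
P(\text{voiceless} \mid \text{VOT}) = \Phi\!\left(\frac{\text{VOT} - \beta}{\sigma}\right),
$$

where Φ is the standard normal cumulative distribution function. The slope of this curve at the boundary is $\frac{1}{\sigma\sqrt{2\pi}}$, so as σ grows, the function flattens and looks increasingly linear over a fixed VOT range, even though every individual percept is categorical.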
I show this using a simulation very similar to the one I presented for the high/low memory load comparison in the previous blog post. Let’s say that the standard deviation associated with the categorical boundary increases by 15 units in a higher-noise experimental probe compared to a lower-noise one; then the perceptual function is going to be less steep for the former, as in the figure below. We know that the underlying system in these simulations uses discrete categories (because we designed it that way), yet the higher-noise experimental probe yields a more linear response.
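A minimal sketch of this kind of simulation is below. The baseline standard deviation of 8 and the boundary at 20 ms are hypothetical choices of mine (the text only specifies the increase of 15 units); the previous post’s actual parameters may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def proportion_voiceless(vots, noise_sd, n_samples=100):
    """For each VOT value, draw n_samples categorical percepts
    (voiceless iff the perceived VOT exceeds the boundary) and
    return the proportion of 'voiceless' percepts."""
    boundary = 20.0
    noise = rng.normal(0, noise_sd, size=(len(vots), n_samples))
    return ((vots[:, None] + noise) > boundary).mean(axis=1)

vots = np.arange(0, 45, 5).astype(float)

low_noise = proportion_voiceless(vots, noise_sd=8.0)          # lower-noise probe
high_noise = proportion_voiceless(vots, noise_sd=8.0 + 15.0)  # SD increased by 15

for v, lo, hi in zip(vots, low_noise, high_noise):
    print(f"VOT {v:2.0f} ms: low-noise {lo:.2f}   high-noise {hi:.2f}")
```

Both conditions use exactly the same discrete categories; only the noise differs, and the high-noise condition comes out less steep.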
Furthermore, given that the noise is inherent to the measure itself, more samples of the same measure will not reveal the categorical nature of the underlying system: the case with more inherent variance will always be more linear, so a larger sample size will not address the issue. In the previous simulation, the sample size was 100 samples per input value. In what follows, I keep all the parameters the same, but change the sample size to 10,000. As can be seen in the figure below, the larger sample size makes the contours smoother (as expected), but the condition with more inherent variance still shows the same, more linear relationship, despite the fact that the underlying perceptual system uses categorical representations.
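Reusing `proportion_voiceless` from the sketch above, the only change needed is the sample size:

```python
# Same (hypothetical) parameters as before, but 10,000 samples per input value.
low_noise = proportion_voiceless(vots, noise_sd=8.0, n_samples=10_000)
high_noise = proportion_voiceless(vots, noise_sd=23.0, n_samples=10_000)
# The estimated curves become smoother, but the high-noise condition
# remains the more linear one: the extra variance is in the measure
# itself, so more samples of it cannot recover the underlying
# categorical system.
```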
In my opinion, there is very little evidence that speech perception uses gradient representations, and in fact, there is much to gain in understanding if we think of speech perception as based on categorical representations.
Furthermore, neural data, which is potentially helpful for gaining a better long-term understanding of how cognition is implemented, should, in my opinion, be used with extreme care in making arguments about the underlying computational system. Using neural data to make claims about the computational system relies heavily on the measure being highly selective to the relevant computation, a condition that is rarely met in reality. As far as I can see, given what we know currently, actual human behaviour is a much better tool for understanding the underlying computations than neural measures. As Chomsky has often reminded us, we have the full wiring diagram for C. elegans, an extremely simple neural system with roughly 300 neurons, yet we have no idea how the system actually works or how the non-trivial computations the organism is clearly capable of are implemented.
¹ The details about the target and non-target stimuli were actually not clear in the paper, at least not to me.
² The following electrodes were used: F3, F4, Fz, C3, Cz, C4, P3, Pz, P4, T3, T4, T5, and T6. They were referenced to the left mastoid during recording and re-referenced offline to the average of the left and right mastoids.