Appendix A: The Feature Sets and Decisions for Pooling
Le Roux and Rouanet (2010) advise that very infrequent features (e.g. those that occur in <5 per cent of the data) should either be pooled with other related features or discarded, because infrequent features can overly influence the axes, as they contribute more to the overall variance. Table A1 presents the features occurring in fewer than 5 per cent of the turns in the TLC, the decisions that were made with respect to pooling or deleting them from the final dataset, and the justifications for these decisions. Infrequent features that were specific types of a broader part-of-speech category were pooled into the broader part-of-speech or ‘other’ category. For example, the feature set distinguishes between different kinds of adverbs (e.g. place, time, downtoner, amplifier, quantifying adverbs), and any adverbs that are not one of these types are tagged as ‘other adverb’. Quantifying adverbs do not occur in more than 5 per cent of the tweets. Therefore, this feature was pooled with the ‘other adverb’ category. Essentially, because it does not occur frequently, the feature is dropped from the feature set; if quantifying adverbs had not been included in the tagger, any occurrence of a quantifying adverb would have been classed as an instance of ‘other adverb’. Thus, this is the logical category in which to place it.
When there was more than one option (deleting the feature, or pooling it into one of several categories), many of these options were tested by running several MCAs on different feature sets representing the different pooling options. For example, copular verbs other than be as a main verb did not occur in more than 5 per cent of the tweets, whereas be as a main verb did. Both features are part of the broader category of ‘stative forms’, and so they could be pooled together into one broad category, or copular verbs could be deleted from the feature set.
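The thresholding step and the two options described above (pooling or deleting) can be illustrated with a minimal Python sketch. It assumes each turn is stored as a dictionary of presence/absence values; the function names, feature labels and toy data are hypothetical and are not taken from the TLC or from the tagger used in the study.

```python
# Minimal sketch, assuming presence/absence coding (1 = feature present).
# All names and data below are illustrative assumptions.

def infrequent_features(turns, threshold=0.05):
    """Return features present in fewer than `threshold` of the turns."""
    n = len(turns)
    return sorted(
        f for f in turns[0]
        if sum(t[f] for t in turns) / n < threshold
    )

def pool_feature(turns, rare, broad):
    """Fold `rare` into `broad`: a turn has `broad` if it had either one."""
    pooled = []
    for t in turns:
        new = {f: v for f, v in t.items() if f != rare}
        new[broad] = max(t[broad], t[rare])
        pooled.append(new)
    return pooled

def delete_feature(turns, feature):
    """Drop `feature` from every turn without crediting another category."""
    return [{f: v for f, v in t.items() if f != feature} for t in turns]

# 40 toy turns: 'quantifying_adverb' occurs in 1 turn (2.5 per cent).
turns = [
    {
        "quantifying_adverb": int(i == 0),
        "other_adverb": int(1 <= i <= 10),
        "be_main_verb": int(i % 2 == 0),
    }
    for i in range(40)
]

rare = infrequent_features(turns)
pooled = pool_feature(turns, "quantifying_adverb", "other_adverb")
deleted = delete_feature(turns, "quantifying_adverb")
```

In this toy data, pooling raises the count of ‘other adverb’ turns by one, whereas deletion leaves it unchanged. That difference is what the alternative feature sets are designed to capture.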
To test the effect of either decision, two data matrices were created and each was subjected to MCA: one with all other linguistic features but with copular verbs deleted, and the other with copular verbs pooled with be as a main verb into the new category of ‘stative forms’. Although the active variables in each MCA are different, the individual turns are the same, meaning that they can be compared. Consequently, the coordinates and contributions of the individuals in each MCA were correlated with those in the other to observe whether there was a substantial difference between the two feature sets. For the most part, the decision to delete a feature or pool it with other categories or broader features made little difference to the position of the turns: the dimensions (at least the first ten) from one MCA were strongly positively correlated with the corresponding dimensions in the other MCA with regard to the contributions and coordinates of the individual tweets.
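The comparison described above can be sketched as a dimension-by-dimension correlation of the individuals' coordinates from the two analyses. The sketch below assumes the coordinates have already been exported from whatever MCA software was used, as one row per turn and one column per dimension. Pearson correlation is an assumption here, since the text does not name the coefficient, and the coordinate values are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def compare_dimensions(coords_a, coords_b, n_dims=10):
    """Correlate matching dimensions of two MCAs run on the same turns.

    coords_a and coords_b hold one row per turn, in the same turn order,
    and one column per dimension.
    """
    n_dims = min(n_dims, len(coords_a[0]), len(coords_b[0]))
    return [
        pearson([row[d] for row in coords_a], [row[d] for row in coords_b])
        for d in range(n_dims)
    ]

# Invented coordinates for five turns on two dimensions.
coords_deleted = [[0.9, -0.2], [0.1, 0.4], [-0.5, 0.3], [-0.7, -0.6], [0.2, 0.1]]
coords_pooled = [[0.8, -0.1], [0.2, 0.5], [-0.4, 0.2], [-0.8, -0.5], [0.2, 0.0]]

correlations = compare_dimensions(coords_deleted, coords_pooled)
```

The same function can be applied to the individuals' contributions in place of their coordinates. Values close to 1 on each dimension would indicate that the choice between pooling and deleting barely moves the turns.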
Features <5 per cent of the TLC | Decision | Justification |
---|---|---|
Adj+that complement clause | Deleted | All specific types of complement clauses occurred in fewer than 5 per cent of tweets. Even if the specific types were combined to form one broad category of complementation, they still did not occur frequently enough. As a result they were deleted. |
Adj+to complement clause | Deleted | Even if the specific types were combined to form one broad category of complementation, they still did not occur frequently enough. As a result they were deleted. |
Adverbs of frequency/usuality | Pooled with general adverbs | Adverbs are divided into different types and all other adverbs not specified are grouped into a broader ‘other adverbs’ category. If the specific type of adverb occurs infrequently then each instance can be recombined with the ‘other adverb’ feature. |
Agentless passives | Deleted | Passive constructions were divided into different types, yet even by recombining to form the broader category of passives, they did not occur frequently enough and so they were deleted. |
By-passive | Deleted | Passive constructions were divided into different types, yet even by recombining to form the broader category of passives, they did not occur frequently enough and so they were deleted. |
Comparative | Deleted | Comparatives could be pooled with superlatives to form a broader ‘gradation’ category. However, they do not occur enough times when combined and so they were deleted. |
Concessive subordinator | Pooled with general subordinators | Subordinators are divided into different types and all other subordinators are grouped into an ‘other subordinator’ feature. Therefore, if a specific type does not occur frequently it can rejoin the ‘other subordinator’ feature category. |
Conditional subordinator | Pooled with general subordinators | Subordinators are divided into different types and all other subordinators are grouped into an ‘other subordinator’ feature. Therefore, if a specific type does not occur frequently it can rejoin the ‘other subordinator’ feature category. |
Gerund | Deleted | No applicable broader category. |
Indefinite/quantifying pronoun | Deleted | Quantifying pronouns could have also been grouped with other quantifiers of different parts of speech (e.g. quantifying-determiners, quantifying-pre-determiners, quantifying-adverbs). They were not grouped this way because all instances did not meet the 5 per cent turn threshold. Additionally, there was no broader pronoun category without losing the distinction between other pronouns, such as first/second/third. |
Initial verb | Deleted | Whatever the verb is, it would also be classified either as one of the verb types or in the ‘other verb’ category. Therefore it does not need to be pooled with a broader verb category. We could have combined it with other initial verbs. However, we tested this by running the analysis on one feature set where the feature was combined with other initial verbs, as well as another where this feature was deleted. Overall, the new initial verb feature influenced the dimensions too substantially and made the dimensions far less interpretable. |
Initial verb | Deleted | This feature is already counted as third-person singular verb form regardless of initial position. We could have combined it with other initial verbs. However, we tested this by running the MCA on one feature set where the feature combined all initial verbs, as well as another feature set where this feature was deleted. Overall, the new initial verb feature influenced the dimensions too substantially and made the dimensions far less interpretable. |
Initial verb be | Deleted | We could have combined this into a broader category of initial verbs with other initial verb instances. However, we tested this by running the MCA on one feature set where the feature combined all initial verbs, as well as another feature set where this feature was deleted. Overall, the new initial verb feature influenced the dimensions too substantially and made the dimensions far less interpretable. |
Reflexive pronoun | Deleted | No applicable broader category other than ‘pronouns’. Reflexive pronouns are already counted according to first, second or third person, or it. |
Relative clause object gap | Deleted | Not enough instances of either kind of relative clause to combine into a broader category of relatives. |
Relative clause subject gap | Deleted | Not enough instances of either kind of relative clause to combine into a broader category of relatives. |
Split infinitive | Pooled with infinitives | Split infinitives were separated from infinitives as a particular type and so were recombined with the broader category. |
Suasive verb | Pooled with general verbs | Different types of verbs were distinguished from general verbs and therefore infrequent types can be recombined with the broader verb category. |
Subordinator with ellipted subject | Deleted | No applicable broader category. If it is a specific type of subordinator it will be classified as such as well. |
Superlative | Deleted | Comparatives could be pooled with superlatives to form a broader ‘gradation’ category. However, they do not occur enough times when combined and so they were deleted. |
Synthetic negation | Deleted | No applicable broader category other than ‘negation’, meaning that we could have combined analytic negation with synthetic negation. However, we did not want to conflate this distinction as previous research has found this to be an important feature (e.g. Biber, 1988; Clarke and Grieve, 2017; Clarke, 2018). |
Time adverb | Pooled with general adverbs | Adverbs were divided into different types and all other adverbs that do not fall in these particular categories are grouped into a category called ‘other adverbs’. Therefore, if a specific type does not occur frequently it can rejoin the ‘other adverbs’ category. |
Time subordinator | Pooled with general subordinators | Subordinators are divided into different types and all other subordinators are grouped into an ‘other subordinator’ feature. Therefore, if a specific type does not occur frequently it can rejoin the ‘other subordinator’ feature category. |
Table A1. Features occurring in fewer than 5 per cent of the turns of the TLC, listed with the decisions and justifications for their inclusion in or exclusion from the final feature set.
After this pooling process was completed, each turn was analysed for the presence or absence of the following linguistic features.