1. Motivation - concept maps: diversity and challenges
Concept maps are structured (Chang, 2007; Tergan et al., 2006; West et al., 2002), graph-based visual tools (Grundspenkis & Strautmane, 2010; Novak & Cañas, 2006; Patel et al., 2024; Strautmane, 2012) that depict relationships between concepts (McClure et al., 1999; Sun, 2004), attributes (Besterfield-Sacre et al., 2004), and ideas (Coffey et al., 2006; Cornwell, 1996; Kinchin et al., 2019), offering a comprehensive framework for understanding and analysing mental models (Chang, 2007; Moon et al., 2018). The vertices of a map hold words that can be considered factual data (Henry, 1974), and the edges connecting these vertices, which may carry words or be plain lines, can be considered information, that is, a flow of organized data (Gamble & Blackwell, 2001). Together they create a complex system of information flow that gives meaning and purpose to the original intent and mental model of participants, and that can finally be considered their knowledge (Dewey & Bentley, 1949). This approach facilitates the visualization of complex systems (Khajeloo & Siegel, 2022; Tergan et al., 2006), enabling users to map out how various elements interact and relate to one another (Schroeder et al., 2018).
Concept maps are usually administered with a purpose (Torre et al., 2007) and accompany experiments (Gurupur et al., 2015; Markham et al., 1994) and studies (Rosas & Ridings, 2017; Ruiz-Primo, 2004; van Bon-Martens et al., 2014) to help researchers understand their participants and their thought processes. Administering and analysing them is therefore a complex process, much like engineering design, which is often termed a complex socio-technical process (Milne & Leifer, 1999). In such processes, several variables and uncertainties present themselves as diverse challenges that must be coded accurately (Lachner et al., 2018; Yin et al., 2005) and consistently (Schau et al., 2001). The problem is compounded when thousands of unique concept maps are to be coded by multiple researchers from diverse backgrounds. This need for a robust coding scheme also warrants a digital concept map tool that can automate the coding process and deliver further analysis of concept maps for research purposes.
Concept maps are complex tools to use and analyse (Tergan et al., 2006; van Bon-Martens et al., 2014). Because this research deals with concept maps hand-drawn by the participants, each participant was found to have a unique way of representing, writing, drawing, connecting and expressing. Prior work has established that even when the same topic is presented to participants, they create different concept maps according to their understanding (Patel et al., 2024). This highlights that concept maps are like fingerprints and can be unique from person to person (Conceição et al., 2017; Jackson & Trochim, 2002; Lopez et al., 2014). A concept map can also represent a particular person’s thought process at a particular time, and hence concept maps from a single person can evolve (D’Antoni et al., 2009; Ruiz-Primo et al., 2001; van Bon-Martens et al., 2014; Van Zele et al., 2004).
The goal of this research is to use concept maps drawn by the same participant over time as a simulated longitudinal study, which can reveal trends that can be used to evaluate the participant’s journey through a program, a course of skills training, or a series of focussed interventions, such as a semester in an undergraduate engineering degree program (Reichherzer & Leake, 2006; Roberts & Johnson, 2015; Tergan et al., 2006; Zanting et al., 2003). These concept maps from the individual participants can be measured and analysed to understand the progress or evolution of the participant (Rosas & Ridings, 2017; Stoyanov et al., 2017; Trochim, 1989; Zanting et al., 2003). These concept maps can provide the required insight into the participants’ understanding of what is needed for their purpose (Rosas & Ridings, 2017; Trochim, 1989). The concept maps can also be used as an augmented tool to support student evaluations beyond mastery assessment measured through grades (Jackson & Trochim, 2002; Lopez et al., 2014; Maker & Zimmerman, 2020; Talbert et al., 2020).
2. Study design and deployment
This section presents a detailed overview of the study design and deployment process used in this research. It outlines the instruments deployed, the target sample size, and the administration process. The subsequent section elaborates on the coding scheme deployment, describing the collaborative efforts with a team of researchers to validate its implementation.
2.1. Study design
The study deployed concept maps as part of a larger study that also collected surveys from participants. This study is designed to measure the engineering identity of undergraduate engineering students at an R1 university in the suburban area of a metropolitan city in the U.S. Two instruments are deployed in classes twice, at the beginning and the end of a semester, to pre-service engineers enrolled in classes in the first, second, third, and fourth year of the engineering curriculum. The first instrument is a survey to capture the student’s engineering identity and the various influencers that affect their identity (Kumar & Summers, 2024). The instrument uses standard engineering identity items. Additionally, the survey asks the pre-service engineers to define “engineer”. This definition is augmented by the second research instrument, a concept map of the pre-service engineer’s understanding of “engineer”. The instruments were deployed in Fall 2024, with approximately 5,000 responses collected. The concept map template shown in Figure 1 is the canvas on which the in-class activity was held, in person, for a duration of six minutes after the engineering students received an overview of concept maps.

Figure 1. Concept map template
The concept maps are collected during an in-class activity. As a result, they are all hand drawn. It is these hand-drawn concept maps that need to be computationally measured for their topological complexity (Mathieson & Summers, 2010; Mathieson & Summers, 2009). As a critical first step, these concept maps are reconstructed into bipartite graphs following a systematic coding approach. The students were instructed to create a concept map to define “engineer”, with additional prompts that include: What does an engineer do? What should engineers focus on? What do you anticipate as the future for engineers? What does it mean for an engineer to be successful? What skills or characteristics does an engineer possess? They were provided a bounded sheet with “engineer” as a concept node at the centre of the page. The time limit of six minutes was calibrated in pilot studies to ensure that students had sufficient time to build a concept map complex enough to be of interest (Patel et al., 2024).
2.2. Coding scheme deployment process
The concept map coding scheme established in prior work is used to capture the mental models that participants express through their concept maps and to transfer them into a computational format. The output from this process is then processed further computationally. This method enables researchers to convert the tool from one format (concept maps as visual representation models) to another (computational models) using a guidebook and training manual. Hence, it is important to present all aspects of the coding scheme for a thorough investigation that captures the potential complexities arising through types, scenarios and cases, which enables early mitigation or a failure-modes strategy to avoid future pitfalls. To validate the objective robustness of this coding scheme, 22 undergraduate researchers were presented with six concept maps and asked to translate them into the spreadsheet format to capture the bipartite graphs (concepts are nodes, connecting edges are relations). These bipartite graphs are then analysed for structural complexity using a 29-item complexity metric (Mathieson & Summers, 2010; Owensby & Summers, 2014). The specific terms included in the concept maps are captured and analysed through a separate semantic coding approach.
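As a rough illustration of the translation step described above, the following sketch turns spreadsheet-style rows into a bipartite graph in which concepts form one set of nodes and relations the other. The three-column row layout (source concept, relation label, target concept), the example terms, and the name `build_bipartite_graph` are assumptions made for illustration only; they are not taken from the guidebook or training manual.

```python
from collections import defaultdict


def build_bipartite_graph(rows):
    """Build a bipartite graph from (source, relation label, target) rows.

    Concept nodes form one partition and relation nodes the other, so every
    link joins a concept to a relation and never two nodes of the same kind.
    """
    concepts = set()
    relations = []
    adjacency = defaultdict(set)
    for index, (source, label, target) in enumerate(rows):
        # Each drawn edge becomes its own relation node, even if labels repeat.
        relation_id = f"r{index}:{label}"
        relations.append(relation_id)
        concepts.update((source, target))
        adjacency[relation_id].update((source, target))
        adjacency[source].add(relation_id)
        adjacency[target].add(relation_id)
    return concepts, relations, dict(adjacency)


# Hypothetical rows, as a coder might enter them in the spreadsheet.
rows = [
    ("engineer", "solves", "problems"),
    ("engineer", "uses", "mathematics"),
    ("problems", "", "society"),  # an unlabelled line between two concepts
]
concepts, relations, adjacency = build_bipartite_graph(rows)
print(len(concepts), "concept vertices;", len(relations), "relation edges")
```

Giving every drawn line its own relation node keeps unlabelled or identically labelled edges distinct, which matters when vertex and edge counts are later compared across coders.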
3. Use case and validation
The method developed for the analysis of the data collected for this research was given to 22 undergraduate researchers who had not worked on concept maps or their analysis before. The guidebook, training manuals, videos and other hands-on training were provided over a period of two weeks. The coders were tested once after the training period and once again a couple of weeks later to check whether more time and effort spent working on several concept maps would increase their understanding of the coding scheme and the process. All researchers were given the same material, information and time for the two interrater reliability (IRR) tests (D’Antoni et al., 2009; Rye & Rubba, 2002; Watson et al., 2016; West et al., 2002) conducted to check the validity of the method and tools of the coding scheme. The consistency of interpretation and reconstruction of the concept graphs needs to be sufficient to allow for parallel processing of the concept maps without a loss of information. Thus, inter-rater reliability and intra-rater reliability tests are conducted to verify the consistency. As the concept maps could have multiple concepts within a single node, it is important to be able to split these consistently. Tests are run on the objectivity of the coding for vertex splitting and edge construction. Six concept maps (P1-AX05DA00, P2-AS22BL16, P3-YA25RE54, P4-GE04CO71, P5-ON25JE07, and P6-ES18TR44) are randomly selected for testing, and two of these maps are presented in Figure 2 and Figure 3. P1, shown in Figure 3, is rated as a high-complexity concept map among the six maps.

Figure 2. Original concept map “6” (complete, with the template cropped)

Figure 3. Original concept map “1”
3.1. Validation of coding scheme
This section presents the analysis of the coding scheme with respect to the vertices and the edges. Three concept maps (P1, P2, and P3) were selected from the seven concept maps in the first test, and three concept maps (P4, P5, and P6) were selected from the fifteen concept maps in the second test. These six concept maps were chosen randomly using the random function in Excel. The results are presented using the averages of the vertices and edges coded by the participants with the coding scheme for these six concept maps. The analysis was done to find the standard deviation and coding consistency amongst the 22 coders, looking specifically at the number of vertices and edges they chose to code by applying the specific types, scenarios and cases in the coding scheme.
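The per-map summary described above can be sketched as follows, using the P1 vertex counts reported in Section 3.2.1 (one coder with 13 vertices, nineteen with 14, one with 15, and one with 17). Whether the original analysis used the sample or the population standard deviation is not stated, so the sample form is an assumption here, and `summarize_counts` is a hypothetical helper name.

```python
from statistics import mean, stdev


def summarize_counts(counts):
    """Return the mean and sample standard deviation of one map's counts."""
    return mean(counts), stdev(counts)


# P1 vertex counts across the 22 coders, as listed in Section 3.2.1.
p1_vertex_counts = [13] + [14] * 19 + [15] + [17]

average, deviation = summarize_counts(p1_vertex_counts)
print(f"P1 vertices: mean={average:.2f}, sd={deviation:.2f}")
```

The same helper can be applied to the edge counts of each map; swapping `stdev` for `pstdev` would give the population form instead.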
3.1.1. Analysis of vertices
Figure 4 shows the standard deviation of the number of vertices coded for each concept map. P1 (Figure 3) and P6 (Figure 2) were the most complex maps and hence had the highest deviation of the six, with P5 having the lowest.

Figure 4. Standard deviation of average of vertices for various concept maps
It shows that, except for P6, the maps were coded with around the same number of vertices. Figure 5 shows that, except for G, T, and U, the coders all fall within the tight grouping of the histogram. The 22 coders are represented by the letters A to V, sorted alphabetically by first name. Coders G, T, and U on average added one additional vertex, while the remaining nineteen coders did not. The nineteen coders were within approximately 0.8 vertices of each other, showing that they arrived at nearly the same number of vertices when applying the coding scheme. The trendline for the 22 coders presented in this format has a slight positive slope.
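The per-coder view described above can be sketched as a per-coder average followed by an ordinary least-squares trendline over the coders in their listed order. The counts below are invented for illustration and are not the study’s data; `coder_averages` and `trend_slope` are hypothetical names, and the assumption that the trendline in Figure 5 is a simple linear fit is not confirmed by the text.

```python
from statistics import mean


def coder_averages(counts_by_coder):
    """Average the counts each coder recorded across the selected maps."""
    return {coder: mean(counts) for coder, counts in counts_by_coder.items()}


def trend_slope(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical vertex counts for three coders over six maps.
counts_by_coder = {
    "A": [14, 9, 11, 10, 8, 16],
    "B": [14, 9, 11, 10, 8, 17],
    "C": [15, 10, 12, 11, 9, 18],
}
averages = coder_averages(counts_by_coder)
slope = trend_slope(list(averages.values()))
print(averages, slope)
```

A slope near zero would indicate that the coder ordering, here alphabetical, carries no systematic effect on the counts.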

Figure 5. Coding consistency of various coders through average of vertices
3.1.2. Analysis of edges
As with the vertices, Figure 6 shows the standard deviation for the edges. On average, edges were the more difficult element of the concept maps to code, and hence a larger standard deviation is recorded in their coding. The largest deviation is for P2, which was coded with a deviation of 1.6 edges. Figure 7 shows the coding consistency for edges across the 22 coders, and the trendline shows no discernible pattern or ordering across the coders.

Figure 6. Standard deviation of average of edges for various concept maps

Figure 7. Coding consistency of various coders through average of edges
Another interesting observation in Figure 6 is that, for the second set of concept maps taken for this analysis (P4, P5, and P6), the standard deviation is the same for all three maps, suggesting that the coders might have become used to the coding scheme and were able to stay within a particular deviation range. This also highlights the consistency of the coding scheme and the coders. It is also seen that P3 had zero deviation in the coding of the edges. In Figure 7 we can see that, except for coders M and U, the remaining twenty coders are consistent, with averages between 15.67 and 16.50, a spread of less than one edge. This suggests that the coding scheme is consistent and sufficient.
3.2. Inter-Rater Reliability (IRR) - Fleiss’ Kappa
The final validation of the coding scheme developed for concept maps can be shown with IRR testing. This section focuses on the analysis and validation of the coding scheme and on testing the hypothesis that it is robust, with reasonable agreement between the coders that is not based on chance, as assumed by the IRR null hypothesis. The testing includes IRR for coding the vertices and the edges. As this research had 22 coders, it uses Fleiss’ Kappa (FKappa) (Falotico & Quatto, 2015; Moons & Vandervieren, 2023), a statistical measure used to evaluate the agreement between multiple coders (more than two) when assigning categorical codes to a fixed set of items. It is an extension of Cohen’s Kappa. FKappa is the main statistic that shows the agreement between coders, corrected for the agreement expected by chance. Its value typically lies between zero and one, where zero indicates no agreement beyond chance and one indicates complete agreement; negative values are possible and indicate agreement below the chance level. FKappa is widely used in research fields where IRR among multiple observers or evaluators needs to be quantified, such as healthcare, education, and social sciences.
It is also noted that the p-value in FKappa is preferred to be zero (or very close to zero), indicating that the null hypothesis (which assumes agreement by chance) can be rejected with very high confidence. In other words, the observed level of agreement among coders is statistically significant and not due to random chance. The research also acknowledges that a p-value of zero does not necessarily mean perfect agreement; it only means that the observed agreement is statistically significant. The strength of the agreement is still determined by the kappa value itself. A kappa value between 0.75 and one indicates excellent agreement, a value between 0.4 and 0.75 indicates moderate to substantial agreement, and a value between zero and 0.4 indicates slight to fair agreement. Thus, a p-value of zero means the level of agreement among coders is real and not due to chance, but the extent of this agreement still depends on the kappa value.
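For readers who want to reproduce the statistic, the sketch below implements the standard Fleiss’ kappa formula, kappa = (pa - pe) / (1 - pe), and maps the result onto the three bands described above. It is a minimal illustration rather than the tool used in this study: the function names and the toy table are hypothetical, and the handling of values exactly on a band boundary (0.4 and 0.75) is an assumption, since the text does not specify it.

```python
def fleiss_kappa(table):
    """Return Fleiss' kappa for a table of category counts.

    table[i][j] is the number of coders who assigned item i to category j.
    Every row must sum to the same number of coders m. The result is
    undefined (division by zero) if all ratings fall into one category.
    """
    n_items = len(table)
    m = sum(table[0])
    if any(sum(row) != m for row in table):
        raise ValueError("every item must be rated by the same number of coders")
    n_categories = len(table[0])
    # pa: per-item agreement, averaged over all items.
    p_a = sum(
        (sum(c * c for c in row) - m) / (m * (m - 1)) for row in table
    ) / n_items
    # pe: chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in table) / (n_items * m) for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)
    return (p_a - p_e) / (1 - p_e)


def interpret(kappa):
    """Map kappa onto the bands used in this section (boundaries assumed)."""
    if kappa >= 0.75:
        return "excellent"
    if kappa >= 0.4:
        return "moderate to substantial"
    return "slight to fair"


# Hypothetical toy table: 3 items, 5 coders, 3 categories.
toy_table = [[5, 0, 0], [0, 4, 1], [1, 0, 4]]
toy_kappa = fleiss_kappa(toy_table)
print(round(toy_kappa, 3), interpret(toy_kappa))
```

When applied to vertex or edge counts, each distinct count value (for example 13, 14, 15 or 17 vertices) acts as one category, and each concept map is one item.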
3.2.1. Vertices IRR
This section presents the IRR between all coders for the vertices of the concept maps. The vertices are the nodes presented in the concept maps, which need to be coded in the Excel file as closely to the original concept map as possible while ensuring accuracy and consistency by applying the rules in the guidebook and the training manual. As shown in Table 1 and Table 2, the FKappa value for the coding of vertices is 0.6787, showing that most of the coders are in substantial agreement with each other and supporting the effectiveness of the coding scheme and process (method and tool) for vertices.
Table 1. FKappa IRR calculation for coding of vertices

Table 2. FKappa final IRR results for coding of vertices

It can also be observed from the coding that P1 and P6 have a wider spread of values. Considering P1 in the first row, of the 22 coders, one coded 13 vertices, nineteen coded 14 vertices, one coded 15 vertices, and one coded 17 vertices. Each row therefore adds up to 22 coders, which completes the calculation section of this table. Table 2 lists the final parameters of the FKappa test: m, the number of coders; n, the number of concept maps selected for the IRR test; pa, the average proportion of coders that agree for each item; pe, the expected agreement by chance; s.e., the standard error, which measures the variability of the kappa statistic; z, a statistic that tests the null hypothesis; p, the probability of observing a z-value at least as extreme as the computed one under the null hypothesis, which has to be less than alpha, the significance level for confidence intervals, often set to 0.05; and finally lower and upper, the bounds of the confidence interval.
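To make the row arithmetic and the derived parameters concrete, the sketch below computes the agreement for the P1 row described above and shows how z, p and the confidence bounds are commonly obtained from a kappa value and its standard error. The function names are hypothetical, the two-sided normal test is an assumption about how the values in Table 2 were produced, and using a single standard error for both the test and the interval is a simplification, since some implementations use different standard errors for each.

```python
from math import erfc, sqrt
from statistics import NormalDist


def item_agreement(counts):
    """Proportion of agreeing coder pairs for one item (one table row)."""
    m = sum(counts)
    return (sum(c * c for c in counts) - m) / (m * (m - 1))


def z_and_p(kappa, standard_error):
    """z statistic and two-sided p-value under a normal approximation."""
    z = kappa / standard_error
    return z, erfc(abs(z) / sqrt(2))


def confidence_interval(kappa, standard_error, alpha=0.05):
    """Lower and upper bounds of the (1 - alpha) confidence interval."""
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return kappa - z_critical * standard_error, kappa + z_critical * standard_error


# P1 row: coders who recorded 13, 14, 15 and 17 vertices respectively.
p1_row = [1, 19, 1, 1]
print(sum(p1_row), round(item_agreement(p1_row), 4))
```

Averaging `item_agreement` over all six rows gives pa; the standard error itself is taken as an input here because its formula varies between implementations.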
3.2.2. Edges IRR
As with the IRR of vertices, the research obtained an FKappa value of 0.4562 for edges, with an alpha of 0.05 and the other values pertaining to the FKappa test shown in Table 3. By the bands given above, this indicates moderate to substantial agreement on the coding scheme applied to the edges as well. The research acknowledges that the edges are the elements that create the complexity in concept maps, since they capture the flow of information by connecting the facts or data points. Coding the edges is further complicated when the writing utensil or the scan of the concept map has issues, which justifies the need for a digital tool that can eliminate such human-induced errors. Again, this FKappa value for the coding of edges according to the process and tool supports the validity of the coding scheme developed for concept maps.
Table 3. FKappa final IRR results for coding of edges

4. Conclusion and limitations
The final analysis and IRR validation demonstrate the effectiveness and the consistency of the coding scheme. Fleiss’ Kappa values of approximately 0.68 for vertices and 0.46 for edges show that there is moderate to substantial agreement in the use of the coding scheme to code complex, hand-drawn concept maps. This paper presents an evidence-based approach to validate the coding scheme for concept maps. The coding scheme also deployed another set of parameters to be coded from the concept maps as part of the ‘Summary’, which can help researchers quickly bracket the concept maps into excellent, good, moderate, and bad. This can aid the further analysis of the maps and doubles as a filter to catch maps that may be overly complex. This classification could also be useful for connecting with the larger research associated with these concept maps, in which they are being investigated as a surrogate for measuring engineering identity.
The important identity metrics from the survey and the definitions (also obtained from the survey) can be mapped against the data, information, and knowledge from these concept maps, thus completing the loop of using concept maps as an augmented tool for engineering identity. Furthermore, inspecting excellent concept maps can help us understand why a map scored high and what it relates to in the survey, for example whether an excellent map comes from a high-, moderate-, or low-performing student. This gives the evaluators an opportunity to dive deep into root-cause analysis and provide the required support and systems for improvement for the student.
The use of such concept maps has been established by earlier research as a tool to measure learning and as an augmented evaluation or assessment tool in academia. This research establishes the validity of the coding scheme, without which the concept maps are prone to coding errors, as the complexity of concept maps is compounded when they are coded incorrectly by multiple researchers. Concept maps can now be processed with an acceptable level of confidence, aiding the downstream analysis methods: the 29 complexity metrics that provide the topology analysis, the vocabulary and keywords that provide sentiment and trends, and finally a comprehensive analysis to subsequently rate or score the concept maps for meaning and purpose.
The results from these various types of analysis on the concept maps reveal micro and macro trends, which can then guide decision making for stakeholders by creating a science for cohorts that can be relied upon. This can effectively form the basis of a cohort science that can help investigate the effectiveness and impact of programs and interventions. The established concept map coding scheme can now be used (with some further refinement) at scale to process the larger data set of thousands of concept maps. The analysis from these maps can now augment the original study looking at undergraduate students’ engineering identity. Future work can also investigate the use of concept maps as a replacement candidate or surrogate for the objective measurement of engineering identity, thus eliminating the subjective nature of self-efficacy-based measurement as established by prior work.
The full value of concept maps can be extracted when we are able to clearly extract the data, information and knowledge trails from participants and map them longitudinally. This can enable a wide variety of use cases for concept maps beyond a surrogate for evaluation. Concept maps can further be used in conjunction with artificial neural networks or other AI tools to create a predictive tool, method, and practice to understand existing trends and map future trends, which can help in planning and resource allocation to meet projected demands or forecasts. Recall, retention, evaluation and projection are important metrics for several sectors and domains, and the ability to measure them makes concept maps a potential tool for measurement and design.