# NTTS paper

# Integrating multimode survey data with VTL

Keywords: data integration, multimode surveys, VTL

A.1 Mixed-mode and web data collection

H.1 - Open frameworks for replicability and reproducibility

## 1. Introduction

### 1.1. Context: diversification of survey modes

For several years, Insee has been renovating its information system for collecting surveys of businesses and individuals [1]. Whereas collection modes and protocols at Insee were previously simple, the collection information system must now support a growing shift towards multimode collection and towards more elaborate and diversified protocols.

Developing multimode surveys requires delicate and rigorous orchestration. Among other things, it involves building Internet questionnaires (CAWI) that are consistent with interviewer-administered questionnaires (CAPI/CATI) or with paper questionnaires (PAPI). Flagship Insee surveys (LFS, Housing, Resources and Living Conditions, etc.) have been or are being redesigned to adopt this type of complex protocol.

### 1.2. Issues/challenges on protocols and process

- Data collection. Beyond the delicate statistical and methodological questions raised by such protocols, experience shows that multimode collection brings operational complexity of its own, posing real technical challenges for the collection platforms that operate the questionnaires (GSBPM: from "Design/Design collection" to "Collect/Run collection"). Insee has recently renovated its collection information system for surveys of individuals (the "Metallica" program) around the concept of active metadata: a single questionnaire specification (expressed in DDI) generates several collection instruments (multimode) within the platforms. Several surveys were thus run by Internet and paper in 2021 (for example the "Daily life and health" survey, VQS) and by Internet and telephone in 2022 (for example the "Housing" survey) as the program's work on these collection platforms progressed.

- Data integration. Once collected in different modes, the data must be processed so that they can be integrated into the downstream statistical process. The Metallica program has therefore continued its technical investment to reconcile the data from the different modes (GSBPM: "Process/Integrate data") and to start the first processing steps (GSBPM: "Process/Classify and code").

## 2. Methods

To integrate multimode survey data, Insee uses the Validation and Transformation Language (VTL) proposed by the SDMX initiative.

### 2.1. Before VTL at Insee

Before the tools were redesigned around metadata-driven collection, with formal specifications (questionnaires, variables, processing) feeding the collection process, each survey had its own dedicated tool. The survey designer wrote the specifications (questionnaire model, dictionary of variables, initial processing), and teams of developers then implemented them: in Blaise to build the collection instruments, and as a set of statistical programs (in SAS or R) to extract the data and carry out the initial processing (data tabulation, multimode reconciliation, recoding, etc.).

This organization, which remains in place for some surveys until they migrate to the new system, is costly because the whole chain must be developed and tested from start to finish, even for a change in the questionnaire from one survey edition to the next. Surveys with complex protocols (panel, sequencing, etc.) rely on processing chains developed in SAS that are themselves complex and very difficult to maintain over time.

### 2.2. VTL - Validation and transformation language

VTL is a standard language [2] for defining validation and transformation rules (a set of operators, with their syntax and semantics) for various kinds of statistical data. It is intended for statisticians and sits at the "business" level rather than the technical level.

In the tools of the new Metallica collection system, VTL processing rules are interpreted by Java and JavaScript implementations (Trevas [3]). The designer can write these rules directly, and they are integrated into the system as the specification of the survey-specific processing to apply to the data, without additional development. VTL is already used in the questionnaire design tool Pogues [4] to specify logical expressions within the questionnaire (conditional expressions, checks and filters).

## 3. Results

For the VQS survey, a paper questionnaire and an Internet questionnaire, both generated from the same specification, were offered to almost 240,000 respondents. The resulting data had to be reconciled because the response formats were not exactly the same.

For example, a single-choice question can be implemented on the web as a set of options with a control that enforces a single answer (usually radio buttons), but on paper it is a set of checkboxes where uniqueness cannot be enforced. Reconciling the data from the two modes then consists in specifying, for the paper response, what to do when several boxes are ticked: retain none, retain the first, retain the one consistent with other responses, etc.

Beyond the specification of the questionnaire in DDI, a new type of metadata must therefore be added to specify these processing phases, which are specific to the survey and to the question.

## 4. Conclusions

### 4.1. Assessment of the solution

While the redesign of the collection information system has involved a great deal of work on standardizing processing, each survey still has a number of specific features that must be taken into account. The use of VTL, a processing language aimed at the designer and interoperable with the rest of the highly standardized system, has already made it possible to optimize the implementation and renovation of certain household surveys while preserving the specific features of each one. All Insee household surveys will be migrated to this system within the next 3-4 years.

The VTL grammar covers the vast majority of survey-specific post-collection processing needs, even in the case of complex protocols.

### 4.2. Prospects

The next step will be to develop the concept further within complex panel and multimode processes (e.g., using VTL rules for the post-collection processing needed for re-collection or a change of mode, including through the use of paradata). A further step will be to develop a tool dedicated to the designer's work: a simplified way of specifying VTL rules for post-collection processing, in a working environment integrated with the one that already exists for specifying questionnaires (Pogues).

## References

[1] E. Sigaud and B. Werquin, "La mise en musique d’enquêtes multimodes", Courrier des statistiques n°7 (2022) (English version to come), https://www.insee.fr/fr/information/6035936?sommaire=6035950

[2] "Validation and Transformation Language (VTL)", on the official site for the SDMX community, a global initiative to improve Statistical Data and Metadata eXchange, https://sdmx.org/?page_id=5096

[3] "Transformation engine and validator for statistics (Trevas)", on github.com, https://github.com/InseeFr/Trevas

[4] F. Cotton and T.
Dubois, "Pogues, a questionnaire design tool", Courrier des statistiques n°3 (2019), https://www.insee.fr/en/information/5014167?sommaire=5014796
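## Appendix: illustrative reconciliation rule

The reconciliation rule discussed in Section 3 (what to do when several boxes are ticked on paper for a single-choice question) can be made concrete with a small sketch. According to the paper, such rules are written in VTL at Insee; the Python below is only an illustration of the logic, not the production rule. The function name `reconcile_single_choice`, the strategy labels and the `is_consistent` callback are all hypothetical, and the sketch assumes the ticked codes are listed in questionnaire order.

```python
from typing import Callable, Optional, Sequence


def reconcile_single_choice(
    ticked: Sequence[str],
    strategy: str = "none",
    is_consistent: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Reduce the boxes ticked on paper to one answer code, or None.

    ticked        -- codes of the ticked boxes, assumed to be in questionnaire order
    strategy      -- hypothetical labels for the options named in Section 3:
                     "none" (retain nothing), "first" (retain the first box),
                     "consistent" (retain the box consistent with other responses)
    is_consistent -- hypothetical callback telling whether a code is consistent
                     with the respondent's other answers
    """
    if len(ticked) == 0:
        return None  # no box ticked: nothing to retain
    if len(ticked) == 1:
        return ticked[0]  # already a valid single choice
    if strategy == "none":
        return None
    if strategy == "first":
        return ticked[0]
    if strategy == "consistent":
        if is_consistent is None:
            raise ValueError("the 'consistent' strategy needs an is_consistent callback")
        candidates = [code for code in ticked if is_consistent(code)]
        # Retain an answer only if exactly one ticked box passes the check.
        return candidates[0] if len(candidates) == 1 else None
    raise ValueError(f"unknown strategy: {strategy}")


# Two boxes ticked on a paper single-choice question.
paper_answer = ["2", "3"]
print(reconcile_single_choice(paper_answer, "none"))   # None
print(reconcile_single_choice(paper_answer, "first"))  # 2
```

Which strategy applies is a per-survey and per-question decision, which is why the paper treats it as additional metadata alongside the DDI questionnaire specification rather than as hard-coded logic.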