The Effects of a CLIL Programme on Linguistic Progress at Two Different Points in Time

In an attempt to explore the effects of different kinds of English as a Foreign Language (EFL) learning contexts, content and language integrated learning (CLIL) has been at the centre of FL acquisition research over the past decade. Studies have focused on the features and gains this setting brings, whether content is learnt at the same level of success as when taught in the learners’ L1, and whether that L1 is negatively affected by CLIL. However, to our knowledge, very little attention has been paid to how the seniority of the programme affects learner progress in the target language. This study aims to fill that gap, on the understanding that the programme will have developed and improved in terms of quality of exposure and interaction, and that learners’ EFL performance will be higher. To do that, we measured the efficacy of a long-standing CLIL programme in Barcelona twelve years after it was launched and examined the reading, writing, and lexico-grammatical abilities of CLIL EFL learners aged 8, 11, and 14 compared with results obtained by learners measured at the onset of the programme in 2005. The results showed that the quality of the programme has increased over the last decade, guaranteeing a higher level of EFL student proficiency when raw scores are considered, but not in terms of linguistic gains, in which only improvement in older students’ grammar and reading skills can be observed.


Introduction
Over the last few decades, instructed second language acquisition has experienced a change towards more meaningful activities in classrooms in order to generate a more natural and authentic learning context. In light of such developments, Content and Language Integrated Learning (CLIL) has gained ground in Europe,¹ and in this respect, Catalonia, in Spain, is a case in point. It is believed that through CLIL approaches, in which a curricular subject is taught through the medium of a foreign language (FL), learners are immersed in more meaningful communication than in conventional formal instruction (FI). Within CLIL, English as a FL (EFL) learners are exposed to a higher quantity and quality of input, and have more opportunities for interaction. Even if most research shows that CLIL programmes result in benefits for learners (APAC, 2006; Brinton, Wesche, & Snow, 1989; Lasagabaster, 2008; Pérez-Vidal, 2013; Roquet, 2011; Van de Craen & Pérez-Vidal, 2001), more research is needed to assess their real impact in different settings.
Therefore, the goal of the present study is to contribute to this area of research with data obtained from a CLIL programme with English as the medium of instruction in a mainstream educational institution in the city of Barcelona. The study seeks to examine the impact of the school's CLIL programme twelve years after it was launched to analyse its effects as a programme that is no longer at its pilot stage, but that has been running for over a decade. This date is important for the main objective of the study: the evaluation of the programme's efficacy in bringing about higher levels of proficiency in EFL. In this study, we examine CLIL learners' abilities from 8 to 14 years of age, with data collected in 2017, and cross-sectionally contrast it with data collected at the onset of the programme in 2005 (Roquet, 2011). This should allow us to gauge the expected beneficial impact of the programme and assess its development over the years. The present study thus represents a unique opportunity to examine in detail a CLIL programme that is no longer in its infancy (Pérez-Vidal, Escobar, & Roquet, 2013).
¹ CLIL programmes have developed more recently in South American countries and Australia but fall outside the scope of the current study.
Thus, the present research aims at providing new and meaningful data regarding a specific CLIL programme, namely in terms of the effects of programme length, as seen through the students' level and gains in reading, writing, and lexico-grammatical skills. Although CLIL is a branch of EFL acquisition that has been largely researched in recent decades, many studies have stated the infancy of the CLIL programmes analysed as a research limitation. Therefore, the main aim of this paper is to fill this gap and compare one CLIL programme at two points in time: a couple of years after it was implemented, and twelve years later, when the programme is more established and stable.

Literature Review
The present section is devoted to briefly presenting an overview of the existing CLIL literature to provide a context for the relevance of the current research, starting with a definition of the term CLIL, followed by a description of its main features. We then briefly sketch out the specific context of the study and present a description of the onset of CLIL programmes in Europe. We close this section by reviewing the effects of CLIL on linguistic progress and the possible reasons explaining them, with a special emphasis on how long programmes have been running.

Definition and Features of CLIL
CLIL can be defined as an approach in which 'a language different from the domestic languages is used as the medium of instruction for curricular subjects' (Pérez-Vidal & Roquet, 2015). This way the target language is used in a context of real and meaningful communication (Llinares & Morton, 2010; Lorenzo, Casal, & Moore, 2010). While the FL is the medium of instruction, students are 'learning language through language use' (Brinton, Wesche, & Snow, 1989) by engaging in meaningful and critical interactions (Darvin, Yi Lo, & Lin, 2020), while the curricular subjects taught through CLIL are also an object of attention (Pérez-Vidal, 2013). What needs to be highlighted here is that the main goal in CLIL is to teach both the FL and the curricular content in an integrated way (Lasagabaster, 2008). Therefore, a CLIL approach should include both content and language objectives.
As a result, the limited number of hours of L2 exposure in conventional educational curricula is increased and schools become a multilingual environment, where languages are to be presented as a whole (San Isidro & Lasagabaster, 2019). Therefore, through CLIL approaches, languages are handled in an integrated way in schools. Instead of languages being taught and used separately and independently from each other, CLIL is intended to raise students' general language awareness (Navarro-Pablo & López Gándara, 2019) and promote translanguaging (García, Aponte, & Le, 2020), that is, the alternation from one language to another (Garcia, 2009) and their integration for different pragmatic purposes, as well as pluriliteracies, as each different language is used for a specific curricular subject in parallel to developing learners' L1s (Nikula & Moore, 2015). In sum, CLIL ultimately promotes plurilingualism with a wide scope (Dalton-Puffer, 2008; Nikula & Moore, 2015).
There is a debate, which is not yet resolved, regarding the extent to which the CLIL subject adopts the L1 cultural approach and pedagogy, considering that usually both students and the teacher share the same L1. Critical to CLIL is the fact that the teachers' and students' L2 proficiency tends to be limited. CLIL programmes also show some variable characteristics that make each CLIL programme slightly different, such as timing, continuation, support for students, resources for teachers, institutional implication, social spreading, and the status of the L2 language and culture (Pérez-Vidal, 2013). Another critical variable feature of CLIL programmes is their seniority: much CLIL research has been conducted with programmes that were still in their infancy, that is, in their pilot phases (Moore, 2009). However, more research is needed to determine whether CLIL programmes that are no longer in their pilot phase can lead to higher benefits for the students in terms of their L2 learning process.

CLIL: from Europe to Spain
Since the onset of the European Union in 1949, the European Commission has shown great concern for Europe's linguistic heritage and cultural diversity, promoting citizens' mobility, linguistic diversity, and multilingualism. With the aim of guaranteeing that citizens in Europe become functionally proficient in their mother tongue and in two other European languages (Cenoz, 2015; Llinares & Morton, 2010; Pérez-Vidal, 2007), it promoted an early starting age of instruction in a second language, an intensive and transdisciplinary approach starting in primary school, the addition of a third language in secondary education taught intensively and transdisciplinarily, and a university system guaranteeing mobility through ERASMUS (Pérez-Vidal, 2013).
As a consequence, the role and relevance of 'context of learning' began to gain ground in the shape of interdisciplinary teaching approaches (Gené Gil, 2016; Llinares & Morton, 2010; Pérez-Vidal & Juan-Garau, 2012). Thus, CLIL appeared to be the most suitable approach since it provides a real, student-centred, and meaningful context for language learning. Consequently, the CLIL approach clearly gained ground at all education levels in many countries (Lorenzo, Casal, & Moore, 2010; Pérez-Vidal, 2007; Van de Craen & Pérez-Vidal, 2001).
Concerning the specific context of the present research, CLIL programmes have been launched and implemented in the majority of Spanish autonomous territories (Fernández-Sanjurjo, Fernández-Costales, & Arias Blanco, 2017; Pérez-Vidal, Lorenzo, & Trenchs, 2015). This is indeed the case in Catalonia, where the current study takes place: a bilingual community where semi-immersion programmes have gained ground by using English as a FL as the medium of instruction for other curricular subjects. Semi-immersion programmes, such as CLIL, allow for more real and meaningful contact with the FL within the classroom while still learning the language in a FL context where there is little or no contact with the language outside of the educational setting. Against this background, in recent years, the number of schools implementing CLIL programmes in Catalonia has grown considerably.

Effects of CLIL Programmes
It has been stated that CLIL brings many benefits to learners, and many different CLIL programmes have been analysed to see which linguistic skills benefit most from such programmes (APAC, 2006; Brinton, Wesche, & Snow, 1989; Lasagabaster, 2008; Pérez-Vidal, 2013; Roquet, 2011; Van de Craen & Pérez-Vidal, 2001). There is, however, one critical aspect in any CLIL programme, the length of time a programme has been running, that has been scarcely considered when interpreting data. Moore (2009) already stressed the fact that much of the research on the impact of CLIL was being conducted with programmes that were barely in their pilot phases and that we needed to give them time to settle in schools and be refined.
In contrast with the pilot programmes that have been analysed in many studies in the last several decades, some of which will be summarized later on, a well- and fully-established CLIL programme would show more stability and complexity, which is a potential key factor for the programme's success (Navés, 2009). Developing and implementing a new CLIL programme is a complex task that requires the teachers to be trained and new materials to be created. After this pilot stage in which the programme is first set and designed, CLIL programmes that have been running for some years will have more experienced teachers who bring consistency to the programme, as well as experience when designing and refining all aspects of its actual implementation (Navés, 2009; Navés & Muñoz, 1999). Therefore, a CLIL programme that has been running for many years in the same school is expected to be more robust and have more experienced teachers, which should potentially grant a more stable and complex design and application of the programme. In their study, Navarro-Pablo and López Gándara (2019) suggest that the CLIL group's higher scores may be attributed to the years that the programme has been implemented, the experience of the teachers, and the time students have been part of a CLIL programme. However, as mentioned above, most CLIL research has been conducted with CLIL programmes in their pilot phase. Thus, there is a need for more research to analyse the effects of a well-established CLIL programme as compared to a pilot phase one. This is precisely what the present study seeks to do: to analyse a long-standing CLIL programme by measuring its efficacy in terms of students' linguistic gains correlated with the programme's maturity and growth.
One study, Roquet (2011), later summarized in Pérez-Vidal and Roquet (2015) and Roquet and Pérez-Vidal (2017), drew attention to this fact when the lack of statistically significant results was attributed to the pilot stage of the newly established programme. Indeed, these authors analysed CLIL learners compared to non-CLIL learners from the same school synchronically. The longitudinal study allowed for an analysis of the degree to which the learners' level of English progressed. All learners were administered five tasks: a written composition, a reading cloze, a dictation,² a grammar and vocabulary error correction test, and a grammaticality judgement test. Roquet's (2011) results partially confirmed the hypothesis: the group with CLIL instruction yielded significantly better results in all the tests except the dictation test as compared to their peers of the same age and grade receiving formal instruction (FI) only. It was, therefore, concluded that a FI+CLIL learning context had proved to be more beneficial than a FI learning context only. However, the lack of statistically significant results in some of the tests administered led the authors to state that an academic year may not have been sufficient and, equally importantly, that the programme was still in its infancy when data was collected, at a 'pilot phase' stage, so that it might have still been too early for it to prove to be fully efficient.
To our knowledge, no study thus far has actually tried to prove that a well-established programme should yield better results when learners' proficiency gains are tested than at its onset.³ In the wake of such an idea, the present study seeks to explore whether the fully established CLIL programme offered at the school in 2017 yielded significantly higher results, in the form of more advanced EFL abilities, than the same programme at its onset in 2005 (Roquet, 2011).
As mentioned above, most research regarding CLIL programmes and their effects on linguistic skills and language development focused on newly-established programmes, comparing the results of students involved in CLIL classes with the students attending only FI EFL lessons. It has been stated that CLIL brings many benefits to learners (APAC, 2006; Brinton, Wesche, & Snow, 1989; Lasagabaster, 2008; Pérez-Vidal, 2013; Roquet, 2011; Van de Craen & Pérez-Vidal, 2001). However, there is still no consensus regarding which areas or language abilities benefit most from it. Lasagabaster (2008) found that CLIL Basque students showed significantly higher oral and written proficiency than their FI peers. Moore (2009) also proved that certain competences are more likely to be improved with CLIL: receptive skills, vocabulary, morphology, fluency, creativity, risk-taking, and emotive-affective factors. Lorenzo, Casal, and Moore (2010) found that CLIL learners outperformed their FI peers in the four main linguistic skills both productively and receptively.
In contrast, in a longitudinal study conducted with secondary education students, Admiraal, Westhoff, and de Bot (2005) did not find any significant differences between the CLIL and the FI groups in terms of receptive vocabulary. However, the CLIL students in the study did have higher results in speaking and reading, which contradicts the findings of Pladevall-Ballester and Vallbona (2016), who did not find any significant differences in reading in primary school students. Surprisingly, the same study also showed that the non-CLIL group outperformed the CLIL group in listening, although significant progress was seen in both groups.
With regard to oral productive skills, Rallo Fabra and Juan-Garau (2011) found that CLIL students were understood better and showed less accented pronunciation than their FI peers. Such findings are in line with Ruiz de Zarobe's (2008) previous longitudinal study in which better results were seen in the CLIL group not only in terms of pronunciation, but also in vocabulary, grammar, and fluency. Nevertheless, in a later longitudinal study with secondary education students, Rallo Fabra and Jacob (2015) did not find any statistically significant differences between the CLIL and the FI groups in fluency or pronunciation.
In terms of content learning, previous research on CLIL around Europe has reported similar results between groups, or even favourable outcomes for the CLIL cohort in both the learning of the subject content (Hughes & Madrid, 2019) and the languages used as the medium of instruction, leading therefore to state that CLIL programmes do not seem to have a detrimental effect on the learners' L1 or on their achievement when However, in their study, Fernández-Sanjurjo, Fernández-Costales, and Arias Blanco (2017) reported opposite results, since non-CLIL students performed slightly better than their CLIL peers when assessing their science knowledge, the subject taught through CLIL. In another study by Hughes and Madrid (2019), primary and secondary CLIL and non-CLIL students' science knowledge was examined, showing slightly higher results in science content in the case of the non-CLIL group in primary education when compared to a CLIL group the same age, a trend which was however changed as the CLIL group reached secondary education. The authors claimed that this might have been the consequence of the younger primary-level learners having a lower FL level as they start following a CLIL programme, an effect which is then mitigated later on (Hughes & Madrid, 2019). Similar results were reported in a longitudinal study by De Dios Martínez Agudo (2019) in which CLIL students showed better results than their peers in all linguistic domains and this tendency kept growing until the end of secondary education.
In sum, it can be claimed that CLIL programmes have been shown to bring about benefits for the students enrolled on them when compared to their FI peers in terms of vocabulary and morphology as well as motivation and creativity (Dalton-Puffer, 2008; Lasagabaster, 2008; Lorenzo, Casal, & Moore, 2010; Pérez-Vidal, 2011). However, there are some other competences for which contradictory results have been found; these include syntax, writing, informal language, pronunciation, fluency, and pragmatics (Dalton-Puffer, 2008; Pérez-Vidal, 2011; Roquet & Pérez-Vidal, 2017). Therefore, more research is needed, namely in the form of longitudinal studies (Lasagabaster & Doiz, 2017). Besides the abovementioned linguistic features, some benefits regarding the students' entire educational process (Lorenzo, Casal, & Moore, 2010) and their knowledge and usage of their mother tongue have also been identified (APAC, 2006), such as greater problem-solving skills, more independence as learners and linguistic spontaneity (DeKeyser, 2000; Derakhshan & Karimi, 2015), and positive attitudes and an interest in learning (Lasagabaster & Sierra, 2009).
Such benefits probably come from the nature of the CLIL environment, where students are exposed to high-quality input and opportunities for output practice and meaning negotiation are present (Roquet & Pérez-Vidal, 2017). Generally, the CLIL hours are extra hours of exposure and interaction in the FL (APAC, 2006; Dafouz Milne & Guerrini, 2009). In CLIL classrooms, learners are immersed in transdisciplinary lessons, usually including science, physical education, informatics, or arts and crafts content (Dafouz Milne & Guerrini, 2009; Pérez-Vidal & Roquet, 2015), which allows them to reduce the pressure to learn the language (Auerbach, 2006; Dalton-Puffer, 2008; Lasagabaster & Sierra, 2009; Van de Craen & Pérez-Vidal, 2001).
It is also important to note that, as pointed out by Bruton (2011) after reviewing several studies on CLIL, it seems that in many cases the students in the CLIL groups are self-selected, that is to say students who choose the CLIL option are more motivated, have a higher socio-economic family status, or have a higher linguistic level. Therefore, those CLIL benefits stated in the previous paragraphs may not only come from the communicative and more meaningful nature of such programmes, but also from the students' sociolinguistic background and initial proficiency level in the L2. However, Hüttner and Smit (2014) claimed that an approach such as CLIL, which increases student engagement in content and language learning simultaneously, attempts to diminish any discriminatory differences in learners' linguistic knowledge.
Finally, one other feature of CLIL pedagogy is that teachers' roles change from the FI conventional ones: the teacher stops being either the FL expert or the subject expert to actually become a combination of both (APAC, 2006), although sometimes the role of the content and the language teacher is still unclear (Darvin, Yi Lo, & Lin, 2020). In this respect, it has been claimed that teacher education in CLIL is a primary requirement. Information about CLIL as a multilingual approach, its challenges and advantages, the existing tools, and well-tested pedagogical practices should be subject to training. CLIL should spur collaboration between the programme's language and content teachers (Darvin, Yi Lo, & Lin, 2020; Jaén Campos, 2016; Papaja, 2013), as they will need to work together.
In sum, the previously mentioned conditions taken together, namely the meaningful context that a fully established CLIL programme provides, can be said to be optimal for learners' language acquisition and nonlinguistic positions vis-à-vis their target language in their language acquisition process (Dalton-Puffer, 2008; Kersten & Rohde, 2013). Nevertheless, many of the abovementioned studies were based on pilot programmes, which are naturally less stable since they are still developing. Therefore, further research is needed that analyses CLIL programmes that have been running for some years and that are more stable and robust. That is precisely what the present research aims to do: analyse a fully developed CLIL programme and contrast how it affects students' learning of the FL in comparison with the same programme in its pilot phase twelve years before.

The Present Study
The present study was conducted in a school in Barcelona, Catalonia, which had launched a well-planned CLIL programme in 2004-2005. Through the programme, learners were exposed to one extra subject, science, in English, in addition to the English FI hours. After designing the programme in the 2002-2003 school year, training the teachers, and informing the families, the programme was launched the next year with the third and fifth graders. In the following school years, the programme was extended to secondary education and now it stretches up to the first year of the Baccalauréat, the first of two years of preparation for university entry.
Parallel to the implementation of the programme, the school increased the number of EFL conventional sessions from three to four or five sessions per week, starting at the age of four. As a result, when pupils reached the third grade of primary, they had been exposed to 420 hours of English, which was considered enough to follow a science course in English. The school also made provisions to include the vocabulary needed to understand a science course in English during the English language sessions.
Our study took advantage of the fact that the impact of this same programme on learners' EFL proficiency two years after its implementation had already been analysed in 2005 (see Roquet, 2011), and full access to this data was granted. Thus, the data collected in 2005 were used and compared with the new data collected for an appraisal of the linguistic effects of the CLIL programme more than a decade later.
With that goal in mind, the following research questions were set:
RQ1. Did the 2017 CLIL learners display significantly higher linguistic abilities than the 2005 learners from the same academic grades and ages, that is, at the third and sixth grades of primary and at the second grade of secondary?
We hypothesized that the 2017 cohort would yield better results because they would have benefited from an improved, more mature, and better-established CLIL programme.

RQ2. What are the differential gains in linguistic abilities (reading, writing, grammar, and vocabulary) between the 2005 cohort and the 2017 cohort, when comparing cross-sectional gains between the third and sixth grades in primary and between the sixth grade in primary and the second grade in secondary?
We hypothesized that gains between the third and sixth grades in primary and between the sixth grade in primary and the second grade in secondary would be greater in the 2017 cohort than in the 2005 cohort due to the growth the programme would have experienced.

RQ3. What are the differential cross-sectional gains shown in the learners' linguistic abilities (reading, writing, grammar, and vocabulary) between the youngest group (third grade in primary) and the oldest group (second grade in secondary) when comparing the 2005 and the 2017 programmes?
We hypothesized that, as a result of the programme having evolved over the last decade, the 2017 cohort would show higher gains from the third grade of primary school to the second grade of secondary school when compared to the 2005 cohort.

Materials and Methods
In this section, the design is presented, the participants are described, and the materials and procedures for data collection and analyses are given.

Research Design
The current study cross-sectionally measured and contrasted CLIL learners following a CLIL programme consisting of a science subject taught in English, in addition to FI in that language. Students' linguistic abilities were measured in order to identify possible improvements in the programme.
As displayed in Table 1 below, learners were tested in two different years: 2005 and 2017. In both testing times, students were administered the tests in their third trimester, namely between April and May. The 2005 data is actually secondary data in this study as we are reanalysing Roquet's (2011) corpus, which has been contrasted with the new 2017 data for this study.
To that end, learners were tested at three different grades and ages coinciding with the onset, the middle, and the end of the CLIL programme. Data from the third grade in primary education serves as pre-test data, and data from the sixth grade in primary serves both as post-test data for the third graders and as pre-test data for learners in the second grade of secondary education.
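The cross-sectional logic described above can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis: the function names, the grade labels (`P3`, `P6`, `S2`), and the scores are hypothetical placeholders, and a gain is assumed here to be a simple difference between group means.

```python
from statistics import mean


def cross_sectional_gain(pre_scores, post_scores):
    """Difference between the mean of the older group and the mean of the younger group."""
    return mean(post_scores) - mean(pre_scores)


def gains_by_cohort(scores):
    """scores maps cohort -> grade label -> list of test scores for one skill."""
    gains = {}
    for cohort, by_grade in scores.items():
        gains[cohort] = {
            # third grade of primary serves as pre-test for sixth grade of primary
            "P3_to_P6": cross_sectional_gain(by_grade["P3"], by_grade["P6"]),
            # sixth grade of primary serves as pre-test for second grade of secondary
            "P6_to_S2": cross_sectional_gain(by_grade["P6"], by_grade["S2"]),
            # youngest versus oldest group
            "P3_to_S2": cross_sectional_gain(by_grade["P3"], by_grade["S2"]),
        }
    return gains


# Placeholder numbers for illustration only; they are NOT data from the study.
toy_scores = {
    "2005": {"P3": [4, 6], "P6": [9, 11], "S2": [14, 16]},
    "2017": {"P3": [8, 10], "P6": [12, 14], "S2": [18, 20]},
}
toy_gains = gains_by_cohort(toy_scores)
```

With these toy numbers the 2017 group has higher raw means at every grade while its gain from `P3` to `P6` is smaller, which shows why raw scores and gains have to be examined separately.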

Participants
The sample of participants analysed included a total of 90 students from the 2004-2005 cohort, and 90 students from the 2017 cohort. The participants in both groups were Catalan and Spanish bilingual learners from the same school and they were all following the school's CLIL programme, having contact with the English language not only in their FI class, but also in their science CLIL lessons. Thus, students tested in both testing times had had the same exposure to English in the school as their same-age peers from the other data collection cohort.  Table 1 below shows the distribution of the participants in groups. The team of English teachers and CLIL teachers involved in the programme in 2005 and in 2017 was a stable group. Furthermore, the main parameters of the programme were also kept the same, namely the CLIL programme coordinator was the same and so was the pedagogical approach, as confirmed by the school's management team, guaranteeing the programme's stability and development over the years.
Data were collected from all learners in those three grades at the two testing times, that is from 120 students per grade in primary and 90 students in the second grade of secondary. However, only a randomly selected sample of 30 students from each grade at each testing time was analysed, providing a stratified sample that included students from all the different groups in each school grade. Written consent to participate in the study was requested from all of the students' families through a form in which details regarding the research were presented. All students were tested, as the tests were also included as classroom activities of the school's English subject.
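One way to draw such a stratified random sample is sketched below. The source does not report how the selection was carried out, so everything here is an assumption made for illustration: the function name, the proportional allocation rule, the roster of 120 pupils, and the four class groups are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(students, n_total, seed=0):
    """Draw n_total student ids, allocated proportionally across class groups.

    students: list of (student_id, class_group) pairs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for student_id, class_group in students:
        strata[class_group].append(student_id)

    population = len(students)
    quotas = {}
    remainders = []
    for group, members in sorted(strata.items()):
        exact = n_total * len(members) / population
        quotas[group] = int(exact)  # whole part of the proportional share
        remainders.append((exact - int(exact), group))

    # Hand out the remaining places to the largest fractional shares
    # (ties broken by group label).
    leftover = n_total - sum(quotas.values())
    remainders.sort(key=lambda item: (-item[0], item[1]))
    for _, group in remainders[:leftover]:
        quotas[group] += 1

    sample = []
    for group, members in sorted(strata.items()):
        sample.extend(rng.sample(members, quotas[group]))
    return sample


# Hypothetical roster: 120 pupils in one primary grade, split into four class groups.
roster = [(f"S{i:03d}", "ABCD"[i % 4]) for i in range(120)]
chosen = stratified_sample(roster, n_total=30)
```

Fixing the seed makes the draw reproducible, and allocating places per class group ensures that every group in the grade is represented in the 30 students analysed.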

Table 1
Distribution of the participants in groups

Materials
A series of tests were administered to the participants at both testing times: a writing test, a grammar and vocabulary test, and a reading test. Those four main skills were chosen to be the focus of the testing since they provide an overall picture of the students' linguistic competence. However, speaking and listening were not included in the testing due to time and space constraints. The tests were kept the same for both testing times and all three grades tested to guarantee that the scores were comparable across levels and between both testing times. The instructions for all tests were written in Catalan to help understanding.
All tests were created by the researchers in 2005, following guidelines from the schoolteachers to ensure that they corresponded to the educational levels. The tests were piloted in 2005 before the first testing time and were kept the same in 2017. Thus, the tests were criterion-referenced and scored on an absolute scale.
The writing test was based on a picture that showed two policemen taking a statement from a woman and a young boy inside a flat. Students were asked to write three compositions in the space provided: one imagining a dialogue for the picture, another explaining what they thought had triggered the situation, and, finally, another guessing how the situation would end. The three questions that were given to the students were the following (they have been translated from Catalan into English):
Write a dialogue imagining the situation involving the policemen, the boy, and his mother.
Write a short paragraph describing why you think the situation you imagined in the dialogue actually took place.
Write a short paragraph explaining how you think this situation will end.
The grammar and vocabulary test consisted of two separate tasks. The first one was an error correction task. It included 30 statements grouped into three sections of 10 sentences each, with an increasing degree of difficulty, which needed to be corrected in a multiple-choice answer format by choosing either the correct option to fill in the blank in the sentence or selecting the correct sentence out of the three options given. Some examples of the questions included in the error correction test are given below:
The reading test consisted of a cloze activity with a multiple-choice answer format. The learners were administered a text dealing with a topic studied in the science course with 20 missing words and four options for each of the blanks. The second paragraph (of three in total) of the reading test, with its corresponding possible answers, is presented below:
Tsunami is a Japanese word that (6) 'harbour wave'. But why do tsunamis (7)? Tsunamis are usually caused by earthquakes at the bottom of the sea. At first, the (8) in the sea is quite small, but it moves very (9). When the wave gets close to the coast, the ocean floor makes it grow enormously. By the time it reaches the (10) it has become huge. Some tsunamis can be 30 meters (11). These giant waves can hit Japan, Indonesia, Central (12)

Procedure
Data collection procedures were kept basically the same in 2017 as in 2005. In 2005, the participants were tested in their classrooms over one two-hour session. During the first hour, three tests were administered: the error correction test (15 minutes), the grammaticality judgement test (10 minutes), and the writing test (20 minutes). In the second hour, the reading test (15 minutes)⁴ was completed. All tests were pen and paper, and teachers were given prior instructions, such as to strictly control the time and to guarantee an exam-like environment.
At the second testing time, in 2017, the tests were administered over two separate sessions. The first one took place in the classrooms and included the writing test (20 minutes), which was a pen-and-paper test, as in 2005. The second session took place in the computer room and included the rest of the tests: the error correction test (15 minutes), the grammaticality judgement test (10 minutes), and the reading test (15 minutes), which were digitized and administered through three different Google Forms questionnaires. Teachers were also given precise instructions for test administration. The change introduced in this data collection reflected the well-established use of computers both in the school and in the learners' individual work outside school hours, which made 'pen-and-paper only' tests obsolete to a certain extent, and perhaps less valid as a measure of proficiency.
In addition to the abovementioned tests, students were administered a sociolinguistic background questionnaire, tapping into the participants' linguistic background, such as their L1 and their knowledge of other FLs. The results obtained from this questionnaire allowed for controlling external factors such as the students' mother tongue and their participation in English extracurricular lessons. Thus, students with English as their mother tongue or enrolled in English extracurricular activities were excluded from the sample of participants. However, even though those two main factors were controlled, it is important to underline that in 2017, as compared to 2005, there was more extramural contact with English; that is, students had more access to resources in English in their daily lives.

Analyses
The analysis proceeded as follows. On the grammar and vocabulary tests and the reading test, each correct answer scored a point and incorrect answers scored none. As for the writing test, learners' compositions were evaluated on the basis of Friedl and Auer's (2007) holistic rating model, which includes four main areas (task fulfilment, organization, grammar, and vocabulary) rated along a scale where 0 is the lowest grade and 5 the highest. Interrater reliability was calculated on the basis of 10% of the compositions, which were corrected by two different raters.
With the raw results obtained from the grading procedure explained above, statistical analyses were conducted, namely t-tests. All data were statistically analysed with Version 16 of IBM's SPSS Statistics package, and the significance level was set at p=0.05.
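As an illustration of the kind of comparison involved, the sketch below computes the statistic for an independent-samples t-test with pooled variance. The text states only that t-tests were run in SPSS, so the choice of this variant is an assumption, and the scores are hypothetical rather than data from the study. Obtaining the p value additionally requires the t distribution, which is not computed here.

```python
import math
from statistics import mean, variance


def pooled_t_test(group_a, group_b):
    """Return (t, df) for an independent-samples t-test with pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (sample variances, n - 1 denominator)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(group_a) - mean(group_b)) / std_err
    return t, df


# Hypothetical scores for illustration only, not data from the study
scores_2005 = [1, 2, 3, 4, 5]
scores_2017 = [2, 3, 4, 5, 6]
t_stat, dof = pooled_t_test(scores_2005, scores_2017)
print(f"t = {t_stat:.2f}, df = {dof}")
```

A negative t here simply means that the first group's mean is lower than the second's.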

Results
The first research question enquired into the comparative results obtained from the participants in 2005 and 2017 at the three educational levels analysed. It was hypothesized that the 2017 cohort would obtain significantly higher results than the 2005 cohort. When comparing the results in all tests, as shown in Table 2 and Figure 1

Regarding the results obtained by the sixth-grade primary learners, as presented in Table 3 and Figure 2 below, higher grades were reported in 2017 on the writing test and the grammaticality judgement test. On the writing test, the mean score in 2005 was 4.22, while in 2017 it was 4.63. On the grammaticality judgement test, the mean in 2005 was 5.34, while it was 6.04 in 2017. The difference was not statistically significant for the writing test (p=0.128), whereas it was significant for the grammaticality judgement test (p=0.026).
In contrast, lower results were reported in 2017 than in 2005 on the error correction and reading tests. On the error correction test, the mean score in 2005 was 6.82, while in 2017 it was 6.70, a difference that was not statistically significant (p=0.729). On the reading test, the mean in 2005 was 6.62, while in 2017 it was 5.82, a difference that was not statistically significant either (p=0.106).

Figure 1
Test results for the third year of primary at both testing times

Figure 2
Test results for the sixth year of primary at both testing times

When analysing the results of participants in the second grade of secondary, as displayed in Table 4 and Figure 3 below, higher grades were reported on all four tests in 2017, although in the case of the error correction test the difference was very small: the mean in 2005 was 8.17, while in 2017 it was 8.19, a slight increase that was not statistically significant (p=0.909). On the writing test, the mean score in 2005 was 5.62, while in 2017 it was 6.50, a difference that was statistically significant (p=0.032). Concerning the grammaticality judgement test, the mean reported in 2005 was 7.48, while in 2017 it was 7.75, the difference not being statistically significant (p=0.353). Finally, for the reading test, the mean reported in 2005 was 7.72, while in 2017 it was 8.53, a statistically significant difference (p=0.013).

Figure 3
Test results for the second year of secondary at both testing times

In sum, the previous results indicate that the first hypothesis can only be partially confirmed. The 2017 cohort showed higher results than the 2005 cohort on most tests. Two exceptions must be noted: the sixth graders in primary school, whose error correction and reading results were higher in 2005; and the second graders in secondary, whose error correction scores were almost the same at both data collection times.
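The comparisons reported above for the sixth grade of primary and the second grade of secondary can be tabulated in a few lines. The sketch below uses only the means and p values given in the text (the third-grade figures are not reproduced in this section and are therefore omitted); the names `REPORTED` and `summarise` are illustrative and do not come from the study.

```python
ALPHA = 0.05  # significance level stated in the Analyses section

# (grade, test): (mean 2005, mean 2017, p value), as reported in the text
REPORTED = {
    ("6th primary", "writing"): (4.22, 4.63, 0.128),
    ("6th primary", "grammaticality judgement"): (5.34, 6.04, 0.026),
    ("6th primary", "error correction"): (6.82, 6.70, 0.729),
    ("6th primary", "reading"): (6.62, 5.82, 0.106),
    ("2nd secondary", "error correction"): (8.17, 8.19, 0.909),
    ("2nd secondary", "writing"): (5.62, 6.50, 0.032),
    ("2nd secondary", "grammaticality judgement"): (7.48, 7.75, 0.353),
    ("2nd secondary", "reading"): (7.72, 8.53, 0.013),
}


def summarise(reported, alpha=ALPHA):
    """Return (grade, test, 2017 minus 2005 difference, significant?) rows."""
    rows = []
    for (grade, test), (mean_2005, mean_2017, p_value) in reported.items():
        diff = round(mean_2017 - mean_2005, 2)
        rows.append((grade, test, diff, p_value < alpha))
    return rows


summary = summarise(REPORTED)
higher_2017 = [(g, t) for g, t, diff, _ in summary if diff > 0]
significant = [(g, t) for g, t, _, sig in summary if sig]

for grade, test, diff, sig in summary:
    print(f"{grade:14} {test:26} {diff:+.2f} {'significant' if sig else 'n.s.'}")
```

Read this way, six of the eight comparisons favour 2017, but only three reach significance, which is why the hypothesis is only partially confirmed.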
The second research question enquired into the gains shown cross-sectionally between the third and sixth graders in primary, and between those same sixth graders and the second graders in secondary. It was hypothesized that gains in the 2017 cohort would be higher than those reported for the 2005 cohort.
When comparing third and sixth graders in primary, significantly larger gains were obtained by the 2005 cohort on the error correction test and the grammaticality judgement test, and a statistical tendency was shown in writing 5 . Indeed, the results for the writing test, as displayed in Table 5 and Figure 4 below, reveal that the gains obtained in 2005 were 3.17 out of 10, while in 2017 they were 2.71, but the difference was not statistically significant (p=0.208). On the error correction test, the gains in 2005 were 3.54 points, while they were 2.10 in 2017, a statistically significant difference (p=0.014). On the grammaticality judgement test, the gains in 2005 were 1.44 points, while in 2017 they were 0.28, also a statistically significant difference (p=0.027).

Figure 4
Gain scores between third grade and sixth grade in primary, for both cohorts (2005 and 2017)

We find the reverse pattern when contrasting those same sixth graders and second graders in secondary.

By examining the previous results, it can be stated that the second hypothesis cannot be confirmed. Significantly larger gains were obtained from the third to sixth grade of primary by the 2005 cohort on the error correction and the grammaticality judgement tests, and a statistical tendency towards significance was shown in writing (reading was not measured). In contrast, the pattern seemed to change when we compared the sixth graders in primary and the second graders in secondary: significantly larger gains were found for the 2017 cohort in the case of reading, and a tendency in the case of writing and error correction. Nevertheless, it has to be noted that in 2017, the sixth graders in primary showed lower results on the error correction and reading tests than those reported in 2005 (see Table 3). This may explain why, in the case of reading, they showed higher gains in this comparison: they had more room for improvement given their lower initial level.
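The gain scores for the sixth grade of primary to second grade of secondary comparison can be approximated from the group means reported earlier in this section. The sketch below assumes that a gain is simply the difference between the two cross-sectional group means; the resulting figures are derived values, not figures quoted from the study's tables, and the names `MEANS` and `gain` are illustrative.

```python
# test: ((6th primary 2005, 2nd secondary 2005), (6th primary 2017, 2nd secondary 2017))
# Means as reported in the Results section
MEANS = {
    "writing": ((4.22, 5.62), (4.63, 6.50)),
    "error correction": ((6.82, 8.17), (6.70, 8.19)),
    "grammaticality judgement": ((5.34, 7.48), (6.04, 7.75)),
    "reading": ((6.62, 7.72), (5.82, 8.53)),
}


def gain(lower_grade_mean, upper_grade_mean):
    """Assumed definition: gain = upper-grade mean minus lower-grade mean."""
    return round(upper_grade_mean - lower_grade_mean, 2)


gains = {
    test: {"2005": gain(*cohort_2005), "2017": gain(*cohort_2017)}
    for test, (cohort_2005, cohort_2017) in MEANS.items()
}

for test, by_cohort in gains.items():
    print(f"{test:26} 2005: {by_cohort['2005']:.2f}  2017: {by_cohort['2017']:.2f}")
```

Under this assumption, the 2017 gains exceed the 2005 gains on every test except the grammaticality judgement test, and the largest difference is in reading, where the 2017 sixth graders started from a lower mean.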

Figure 5
Gain scores between the sixth graders in primary and the second graders in secondary for both cohorts (2005 and 2017)

The third and last research question enquired into the possible gains between the initial testing time (third grade in primary) and the last testing time (second grade in secondary). It was hypothesized that the 2017 cohort would show larger gains than the 2005 cohort. Table 7 and Figure 6 below display the results, which do not confirm our hypothesis. Gains were similar at both testing times for writing: in 2005 they reached 4.57, while in 2017 they were 4.58, and the difference was not statistically significant (p=0.972). On the error correction test, higher gains were obtained in 2005 (4.89 points) than in 2017 (3.59 points), a difference that was statistically significant (p=0.003). The same was true of the grammaticality judgement test, for which the gains obtained in 2005 reached 3.58, higher than the 1.99 obtained in 2017, a difference that was also statistically significant (p=0.006). As for the reading test, third graders did not take it in 2005, so no gains could be computed. In conclusion, the third hypothesis cannot be confirmed either, since on both the error correction test and the grammaticality judgement test significantly larger gains were obtained in 2005 than in 2017, while writing gains were virtually identical in the two cohorts.
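Because the overall gain spans both stages, subtracting the reported primary-stage gain (third to sixth grade) from the reported overall gain (third grade of primary to second grade of secondary) shows how much of the progress falls in the later stage. The sketch below assumes that gains are additive differences of group means; the function name is illustrative and the implied values are derived, not reported.

```python
# test: (gain 2005, gain 2017), as reported in the text
PRIMARY_GAINS = {  # third to sixth grade of primary
    "writing": (3.17, 2.71),
    "error correction": (3.54, 2.10),
    "grammaticality judgement": (1.44, 0.28),
}
OVERALL_GAINS = {  # third grade of primary to second grade of secondary
    "writing": (4.57, 4.58),
    "error correction": (4.89, 3.59),
    "grammaticality judgement": (3.58, 1.99),
}


def implied_later_stage_gains(primary, overall):
    """Subtract the primary-stage gain from the overall gain, per cohort."""
    return {
        test: tuple(
            round(total - early, 2)
            for early, total in zip(primary[test], overall[test])
        )
        for test in primary
    }


implied = implied_later_stage_gains(PRIMARY_GAINS, OVERALL_GAINS)
for test, (g_2005, g_2017) in implied.items():
    print(f"{test:26} 2005: {g_2005:.2f}  2017: {g_2017:.2f}")
```

On this reading, the 2005 cohort made most of its writing and error correction progress during primary, whereas a larger share of the 2017 cohort's progress falls between the sixth grade of primary and the second grade of secondary.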

Figure 6
Gain scores between initial testing time and the last testing time for both cohorts (2005 and 2017)

Discussion
The current study has aimed to contrast the benefits of a FI+CLIL EFL programme at two different points in time. It sought to examine whether learners' linguistic proficiency proved to be higher at a time when the programme was no longer at its onset, that is, twelve years after it was launched. It was assumed that more than a decade of operation should have resulted in a well-established learning context, no longer at the pilot stage and, as a result, potentially more efficient.
Most studies analysing the benefits of CLIL have been conducted with programmes that are in their pilot stages (Moore, 2009). It is therefore of utmost interest to analyse whether the infancy of the programme may negatively impact its outcomes by comparing the same programme years after its launch. A well-established and robust programme may be expected to provide better learning conditions by guaranteeing greater stability and complexity, leading to higher results in the students' linguistic proficiency (Moore, 2009; Navarro-Pablo & López Gándara, 2019).
Therefore, the first research question in the present study enquired into the EFL linguistic benefits of the programme by comparing learner proficiency at three different learner ages with a cross-sectional population. It was hypothesized that learners enrolled in the 2017 programme would show higher proficiency than their peers in the 2005 programme on the four tests administered, tapping into writing, grammar and vocabulary, and reading. This first hypothesis has been only partially confirmed, since participants in the 2017 cohort showed higher grades on 8 out of 11 tests, although not all of these differences were statistically significant: all tests in the third grade in primary; writing and the grammaticality judgement test in the sixth grade of primary; and writing, the grammaticality judgement test, and reading in the second grade in secondary. On the remaining three occasions, the 2017 cohort showed no advantage: primary sixth graders scored higher in 2005 on the error correction and reading tests, and secondary second graders' error correction scores were virtually identical at both testing times.
Previous CLIL research has also reported contradictory results regarding the linguistic benefits of the programmes scrutinized with respect to syntax and writing (Dalton-Puffer, 2008; Pérez-Vidal, 2011). In the present study, the 2017 cohort yielded higher results in writing than the 2005 cohort for the three age groups tested, with a statistically significant difference for two of them. Such results may be due to the longevity and robustness of the programme, which may allow students to benefit from a more complex programme and more experienced teachers. As for syntax, out of the six grammar tests that were administered in total (error correction and grammaticality judgement in all three grades), higher results were reported in the 2017 cohort in four, the exceptions being the error correction tests for sixth graders in primary and second graders in secondary. Finally, vocabulary has been found to be a skill benefitting from CLIL (Dalton-Puffer, 2008; Pérez-Vidal & Roquet, 2015). As for reading, higher results were found in the present study in older students (2nd secondary), while younger learners (6th primary) had higher results in the 2005 programme.
Against this backdrop, it seems evident that although most of the previous research in the field has focused on comparing CLIL and FI learners' language development, most of the skills that were previously found to benefit from CLIL have also shown higher results in the well-established programme analysed in the current study. Therefore, such results may mean that a more robust and stable CLIL programme could potentially lead to greater benefits in said skills.
Thus, these findings allow us to state that when groups from the early and the well-established programmes are compared at the ages of 8-9, 11-12, and 13-14, and the same conditions are kept in terms of number of hours of instruction and pedagogical approach, learners' results were higher in 2017 than in 2005 on most of the tests administered (8 out of the 11 tests). Indeed, when raw scores are analysed, the programme seems to have had a greater positive impact on learners' linguistic progress once fully established, twelve years after it was launched, on most tests, with the exception of the error correction and reading tests mentioned above. These results are in line with a previous study (Navarro-Pablo & López Gándara, 2019), in which higher results in the CLIL group were attributed to the seniority of the programme and teachers' expertise.
The second and third research questions approached the comparison of the impact of the CLIL programme when it was first launched and twelve years later, by measuring the gains in linguistic abilities between the three different grades tested. When comparing learners' gains between the third and sixth grades in primary at both testing times (2005 and 2017), higher gains were seen in the 2005 cohort, with significant differences on the error correction and grammaticality judgement tests. Hence the programme's length of time in operation does not seem to have yielded extra benefits in terms of linguistic gains. The opposite results were obtained when analysing the gains between the sixth grade in primary and the second grade in secondary, namely higher gains in 2017 for all tests except for the grammaticality judgement test, with a significant difference in reading and a statistical tendency towards significance in writing. Considering these findings, it cannot be claimed that learners will reap higher benefits from a well-established CLIL programme in terms of linguistic gains between the beginning, middle, and end points of the programme. However, there is a tendency towards greater improvement in the older students, who showed higher gains in 2017 than in 2005 in the areas of writing, grammar, and reading.
Finally, when comparing the gains between the younger and the older groups, almost equal gains were reported on the writing test and higher gains were found in 2005 on the error correction and grammaticality judgement tests. As a result, the second and third hypotheses cannot be confirmed.
To our knowledge, no previous studies on the influence of programme length have analysed linguistic gains; however, one study (Navarro-Pablo & López Gándara, 2019) focused on analysing the effects of programme seniority using raw scores in English proficiency. Therefore, more research is needed in this domain to further examine the effects of programme length.

Conclusion
Overall, it can be stated that the quality of the programme has indeed improved, since learners in the 2017 cohort had better results on most of the tests at all three educational levels when raw results are compared. Such improvement in the programme can be attributed to the programme's seniority, since the programme has been kept within the same parameters in terms of the number of hours, the CLIL coordinator, and the pedagogical approach followed. After twelve years of running, the programme in 2017 was more robust and stable, the teachers were more experienced, and its design and application more complex. This allowed students to benefit from a much more developed programme and attain higher levels in English as a FL.
In contrast, the analysis of gains does not seem to show an effect of the programme's seniority, either when contrasting the development between the three educational levels or between the initial and final testing times. Thus, it cannot be confirmed that a fully established CLIL programme will lead to higher results in terms of the linguistic gains of the students enrolled in the programme, as compared to the programme in its pilot phase. The absence of higher gains in 2017 may, however, be attributed to a ceiling effect: initial scores in 2017 were slightly higher on most tests both in the third and sixth grades of primary school, which were used as pre-test scores to analyse gains; therefore, starting levels of English were higher in 2017. Thus, higher initial language levels may have left less room for improvement and, as a result, explain the lower gains.
However, this study has some limitations that may partially explain its results. The first and possibly the largest limitation is that the data used were not longitudinal, but cross-sectional. Although students' exposure to and contact with English inside and outside the school was controlled through the sociolinguistic background questionnaire at both testing times, the cross-sectional nature of the study poses a limitation, since the raw scores and gains compared were from different samples of participants. Second, individual differences are considered to have an impact on FL learning; however, we did not take them into account due to time and space constraints. Only the mother tongue of the subjects and their contact with the English language outside of school were considered, to make sure that none of the learners had English as their native language or had more hours of contact with the language beyond the school sessions. In addition, it must be noted that the amount of extramural exposure to English in 2017 was much higher than in 2005, as access to English resources from home has increased in the last decade. Therefore, we cannot be completely sure that some of the higher results reported in 2017 were not also enhanced by the fact that children nowadays have more access to the English language.
This study has sought to contribute new, meaningful, and updated data to our understanding of the effects of a CLIL programme in mainstream education. In a nutshell, our results reveal that the programme's quality has indeed changed over the years, guaranteeing a higher level of EFL proficiency when raw scores are considered, but not when gains are analysed. This calls for further studies that add longitudinal data to the cross-sectional data used in this study.