3RD INTERNATIONAL CONFERENCE ON LINGUISTIC AND CULTURAL DIVERSITY IN CYBERSPACE
28 June - 3 July, 2014, Yakutsk, Russia

Daniel Pimienta, pimienta@funredes.org
Networks & Development Foundation, http://funredes.org
Observatory of languages & cultures in the Internet, http://funredes.org/lc
Executive Committee Member of http://maaya.org

A methodology for exploring the situation of French & languages of France in the Internet which could apply to other groups of languages.
Daniel Pimienta and Daniel Prado
MAAYA, May 2014

CREDITS
The methodology merges the products of two independent studies carried out in 2013 by the team D. Prado/D. Pimienta on behalf of MAAYA:
- a study mandated by OIF about the space of French on the Internet
- a study mandated by the General Delegation to French and languages of France (DGLFF) of the Ministry of Culture about the space of the languages of France on the Internet

TWO COMPLEMENTARY APPROACHES
- FRENCH, a language ranked in position 8 in terms of speakers (L1+L2)
- OTHER MINORITY LANGUAGES spoken in the territories of France

BACKGROUND
- Difficulties in the production of indicators
- DILINET project
- DILINET project status
- Meanwhile

LINGUISTIC DIVERSITY INDICATORS PARADOX
[Figure: timeline 1988-2014 contrasting INTEREST in linguistic diversity indicators with the CAPACITY to produce them. Producers shown: FUNREDES/UL, LOP, ALIS/ISOC, OCLC, XEROX, IDESCAT.]

WHAT INDICATORS DO WE HAVE?
- Internet users per language (source: InternetWorldStats), until 2011
- Web pages per language (not all!), until 2008
- Other indicators per country (FUNREDES/UL), until 2008

WHERE IS THE BOTTLENECK?
The two main indicator-building activities rely:
- on crawling ccTLDs for languages in Asia, Africa and the Caribbean and applying recognition algorithms (LOP);
- on using the counting capacity of search engines and their large percentage of web coverage (FUNREDES/UNION LATINA).
But:
- The web is getting too large for traditional crawling (close to infinite!).
- Search engines no longer index a substantial part of it (coverage down from 80% to 5%).
- Search engine counts have become unreliable.
And anyway, all we got was static data, mostly focused on the number of web pages per language.

A RESEARCH PROJECT
- Collaboration between UNESCO, OIF and UNION LATINA, with participation of ITU.
- High-profile partners: ERCIM, MAAYA, UNESCO, OIF, FUNREDES, EXALEAD, UPC, DIALOGIC, CNRS/LIMSI, FRAUNHOFER, CWI, VOCAPIA, NIELSEN
- Significant investment (estimated at 300 k euros, direct and indirect)

PROCESS
Proposals submitted to 2 EU FP7 calls:
- Jan. 2012: Integrated Project of 7 M euros for ICT-2011.4.4 Intelligent Information Management
- Jan. 2013: Specific Targeted Research Project of 3 M euros for ICT-2013.4.1 Content analytics and language technologies - Cross-media content analytics
2 near misses, reflecting low EU interest in the theme.
New attempt in process with Qatar partners, with LOP on board.

MEANWHILE
- InternetWorldStats stopped updating 3 years ago.
- A new, interesting player, but limited to the top 10 million sites (2% of the sites): W3Techs.
- The web is evolving towards dynamic pages, video and social networks.
- The context calls for alternative approaches.

PART 1 : MEASURING FRENCH
1. Defining a large set of spaces and applications to get data from.
2. Searching for a large number of Internet sites which offer linguistic or country data for those spaces/applications.
3. Applying appropriate selection criteria to this set of sites.
4. Collecting, compiling and organizing data.
5. Crossing Internet data with reliable demo-linguistic data.
6. Putting results in perspective.

P1 : SPACES & APPLICATIONS
UP TO 100, split into the following categories:
Applications
- Office applications
- Web 2.0
- Search engines
- Email
- P2P
Spaces
- Infrastructure
- Online library
- Smartphones
- VOIP/Chat
- Operating systems
- Browsers

P1 : SOURCES
- Traditional sources (UN, UNESCO, ITU, OECD, EU) have few linguistic data but plenty of country data.
- Most non-traditional sources are either:
  - marketing companies offering free glimpses of expensive data, or
  - experts showing their capacity through reports.
- The lifetime of non-traditional sources is often short.

SOURCE SELECTION CRITERIA
Sources are discarded if:
- the scope is too small
- they are too biased
- they have not been recently updated
- the methodology is not reliable

SELECTED SOURCES
- More than 200 sources found, fewer than 100 sources selected
- Rating: 10 = excellent; < 5 = not used but kept for future checks

SOURCES PARAMETERS
- Title
- URL
- Publication year
- Rating (0 to 10)
- Focus (worldwide, Europe, France, USA, OECD)
- Frequently updated (y/n)
- Type of source (meta, general, space, application, book, report, paper, webpage)
- Application or space concerned
- Language specific (y/n)
- Comments

DEMO-LINGUISTIC DATA
- No institutional support -> low data quality
- Large and diverse geography -> divergent data
- Main demo-linguistic sources -> divergent data
- Language typology -> boundary dilemma
- L2 counting

DEMO-LINGUISTIC CHOICES
- ETHNOLOGUE FOR L1 (-> homogeneity)
- DIVERSE SOURCES FOR L2 (-> reliability)
- WIKIPEDIA FOR COUNTRY DEMOGRAPHICS
- INTERVAL DATA FOR SOME SPACES/APPLICATIONS

PUT IN PERSPECTIVE
I = A x B x C x D / 1000
A = Level of world relevance (0 to 10)
B = Level of reliability of source (0 to 10)
C = Level of trust for French (0 to 10)
D = Level of relevance for French (0 to 10)
P = Direct weighting
L1, L12 = position of French as first language (L1) and as first plus second language (L1+L2)

ELEMENT       A  B  C   D   I   L1  L12  P  L1xI  L12xI  L1xP  L12xP  TYPE
Viadeo        2  5  7  10   7    -    1  6     0      7     0      6  RS
Tumblr        6  6  7   6  15    4    2  4    60     30    16      8  RS
Hotmail       5  5  6   6   9    -    2  4     0     18     0      8  APP
Open office   9  9  9   8  58    -    2  5     0    117     0     10  APP
Blogs.com     6  7  7   5  15    -    2  5     0     29     0     10  BLOG
Ning          7  7  7   8  27    6    -  5   165      0    30      0  RS
Msn           7  7  7   6  21    6    -  5   123      0    30      0  APP
Wordpress     8  7  7   7  27    7    -  5   192      0    35      0  BLOG
AVERAGE                        6.8  4.2      7.4    4.3   7.2    4.2

ANALYSIS PER TYPE
Type of space: L1 / L1+L2 (* = only one source)
- BOOKS: 3*
- BLOGS: 6.5 / 3.3
- APPLICATIONS: 6.7 / 3.6
- SOCIAL NETWORKS: 7 / 4
- INFRASTRUCTURES: 7.9 / 4
- USERS: 9 / 4*
- CONTENTS: 8 / 4.1
- VIDEO: 7 / 6*
- P2P: 6.3

CONCLUSION P1
- French, as first language, can be considered to rank above, but close to, position 7 on the Internet, all elements combined.
- French, as first and second language, can be considered to rank above, but very close to, position 4.
- French, in spite of its lower demographic strength, is in close competition on the Internet, depending on the space/application, with Spanish, German, Japanese and Portuguese, and to some extent with Russian and Arabic.

CONCLUSION P1: TRENDS
- Strongly emerging languages (competing with English): Chinese (will overtake English), Spanish
- Emerging languages (competing with French): Hindi, Bengali, Russian, Arabic
- New players: Urdu, Indonesian

Most elements of the applied methodology should work for other languages of large worldwide scope, such as Arabic, Portuguese, Spanish or Russian.

PART 2 : LANGUAGES OF FRANCE

SELECTION OF LANGUAGES OF FRANCE FOR THAT STUDY
- Alsatian
- Basque
- Breton
- Catalan
- Corsican
- Creole (*)
- Flemish
- Frankish
- Franco-Provençal
- Futunan
- Languages of Mayotte (*)
- Oïl languages (*)
- Kanak languages (*)
- Occitan (*)
- Tahitian
- Wallisian
(*) : family of languages

SELECTION CRITERIA
- Territory-based languages (no immigration languages)
- Subset with higher probability of Internet presence:
  - more than 50,000 speakers, or
  - used in official teaching

Language families
- Creole: Martinique, Guadeloupe, Guyane, la Réunion
- Occitan: auvergnat, gascon, languedocien, limousin, provençal, vivaro-alpin
- Kanak: ajië, drehu, nengone, paicî, xârâcùù (+ 24 more not studied)
- Languages of Mayotte: kibushi and shimaoré

Languages terminology
- Alsacien: alemannic, alemannisch, alsacien, elsaessisch, elsässisch, etc.
- Basque: biscayan, gipuzkera, gipuzkoan, guipuzcoan, guipuzcoano, euskera, euskara, roncalese, vasco, vascuense, vizcaino, etc.
- Catalan: Aiguavivan, Algherese, Aragonais oriental, Balear, Català, Catalán, Catalan-Valencian-Balear, Eivissenc, Mallorqui, Menorqui, Menorquin, Lleidatà, Pallarese, Ribagorçan, Valencià, Valenciano, etc.
- Corse: corsu, corsican, corsi, corso, sartenais, venaco, vico-ajaccio, etc.
- Francique mosellan: lothrnger ditsch, lothringer deutsch, lothringer plattm, lothrnger deitsch, lothrnger deitsch, lothrnger platt, francique luxembourgeois, francique mosellan, platt, etc.
- Futunian: fakafutuna
- Francoprovençal: arpetan, arpian, arpitan, arpitano, brass, burgondan, burgonds, dauphinois, delfinese, dialetto, faetar, francoprovenl, friborgs, fribourgeois, genevois, harpitan, lyon, lyonnais, mcons, neuchatelais, neuchâtelois, patois, patoua, patous, romand, savoiardo, savoyard, savoyrd, tot-parier, valaisan, valdostano, valdôtain, valdtn, valsan, vaudois, vdous
- Langue d'oïl: angevin, berrichon, bourbonnais, bourguignon-morvandiau, brionnais-charolais, champenois, frain-comtou, franc-comtois, gallo, langue comtoise, lorrain, mâconnais, manceau, maraîchin, mayennais, normand, normand méridional, picard, poitevin, poitevin-saintongeais, saintongeais, wallon, etc.
- Occitan: béarnais, aspois, girondin, lemozin, limousin, médocain, mondin, monegasque, neugue, niçois, nissard, nissart, occitanien, occitanique, parler d'oc, romans, patois, proensal, raimondin, rouergat, etc.
- Shibushi: malgache de Mayotte, kibushi kimaore, kibushi kiantalaoutsi, kibushi, kibuki, bushi
- Tahitien: reo tahiti
- Wallisian: fakauvea, faka uvea, ouva

DIFFERENT METHODOLOGY
The same method cannot apply, because most of these languages do not have any Internet references offering space/application data as French does (or Spanish, English or Russian).
What would be the alternative, knowing that the Internet space of most of these languages is quite small compared to French?

LoF METHODOLOGY
- We cannot restrict the search to Internet references giving data on the situation of those languages on the Internet.
- We cannot search all references related to those languages.
BUT
- How about searching references closely related to those languages?

SCOPE OF THE SEARCH
- References closely related to one of the languages of the study (not the territory!)
- Also references offering data on all the languages of France, or on all languages including the ones studied.
What is the definition of "closely related"?

Close relationship to language
Best choice:
- site/book/paper discussing the situation of the language on the Internet and/or offering data about it
Good choices:
- meta reference about the language (database, clearinghouse, linguistic organization, ...)
- linguistic resources (dictionary, ...)
- reference discussing the language
- cultural reference, if it has an indirect relation with the language (literature, poetry or songs)
- reference offering serious language learning
- blogs in or about the language
Bad choices:
- touristic resources (except excellent presentations of the language)
- reference looking good but not public domain
- reference copying another source (go to the original source)

RATING
WARNING: this is not a value judgment about the reference. What is rated is only the level of contribution and proximity to the theme of language on the Internet.
TARGET: the theme of language on the Internet, or meaningful data about that theme.
9: Exceptional contribution to the theme or meaningful data
8: Strong contribution or interesting data
7: Interesting contribution or original data
6: Average contribution
5: Indirect relation
4: Indirect relation but not much content
3: Not accessible but kept in memory because of special interest

COLLECTED DATA
- YEAR
- UPDATED (Y/N)
- SECTOR: GOV, EDU, ORG, COM, PER
- TYPE: Article / Blog / Portal / Linguistic Resource / Social Network / META / Data Base / Library
- LANGUAGE: Local, French, English, German, Spanish
- DATA: Y/N
- COMMENTS

SEARCH METHOD
1. Simple search with the most common name of the language, to find the main sites in the first 100 answers
2. Go to the external links page, if possible, and note all the links
3. Systematic analysis of the links
4. Back to step 2 until it is clear that no new links appear
5. Complete with more sophisticated searches (Google Scholar, books, blogs, other languages, other language terminology)

RESULTS
- A total of more than 1,000 references (4 languages still missing)
- This obviously cannot be taken as an exhaustive search, but we have enough data to use statistics and extract meaning useful for public policies.
[Figure: NUMBER OF REFERENCES]
[Figure: RATING SPLIT]

STATISTICS
Some key indicators are observed:
- the rate of broken links, which indicates the vitality of the language on the Internet (example: a Creole rate > 20% reveals problems)
- the split between ORG, PER, EDU, COM
- the split between reference types

SECTOR SPLIT
ORG = organized civil society, EDU = academia, PER = citizenship, GOV = government (often local), COM = tourism

                     ORG   EDU   PER   GOV   COM   OTHER
General              27%   49%    0%    8%    7%    8%
Languages of France  20%   48%    7%   23%    2%    0%
Breton               52%   17%    6%    3%   22%    0%
Corsican             15%   24%   27%   19%   14%    0%
Creoles              24%   31%   14%    5%   26%    0%
Francoprovençal      44%   17%   35%    4%    1%    0%
Futunian             28%   56%   16%    0%    0%    0%
Kanak                21%   48%   12%    7%    6%    6%
Mayotte              34%   37%   14%    2%   14%    0%
Occitan              39%   19%   25%    7%    9%    1%
Tahitien             28%   28%    6%    9%   30%    0%
Wallisien            19%   39%   26%    0%   16%    0%
TOTAL                31%   30%   17%    8%   12%    2%

TYPE SPLIT
TYPES                 Gen   LDF   Breton  Corse  Créole  Francoprovençal  Kanak  Occitan  Tahitian  TOTAL
PUBLICATIONS          23%   40%   15%     22%    25%     21%              40%    23%      13%       24%
DATA BASE              4%    2%    0%      4%     0%      7%               2%     2%       4%        3%
BLOGS                  0%    2%    6%     22%     2%      6%               8%    15%       0%        9%
MEDIA                  0%    2%    0%      2%     1%      2%               0%     3%       0%        1%
META                  14%    7%   16%      2%    10%      2%               5%     2%      19%        9%
PORTAL                10%   10%   44%     24%    28%     25%              14%    24%      30%       23%
LINGUISTIC RESOURCES  48%   38%   18%     23%    31%     28%              26%    30%      28%       29%
SOCIAL NETWORK         1%    0%    1%      0%     2%     10%               3%     0%       4%        2%
TOTAL                  7%    6%    8%      9%    10%     10%              12%    24%       4%      100%

LANGUAGE SPLIT
                              MEAN  Group with higher %  Group with lower %
% in English                  10%   General              Occitan
% in French                   48%   LoF                  Tahitian
% in local language            7%   Corsican             LoF
% in French & local language  19%   Breton & Corsican    LoF
% multilingual                18%   Tahitian             Corsican

EMERGING PATTERNS
A1 - Not much spoken, Internet presence pushed by citizenship & multistakeholder involvement, including local government: Corsican
A2 - Not much spoken, Internet presence pushed by citizenship but with low government involvement: Occitan & Franco-Provençal
A3 - Not much spoken, Internet presence pushed by civil society organizations but with low government involvement: Breton
B - Spoken language but low Internet presence, except academic: Creole, Kanak, Futunian and Wallisian

CONCLUSION P2
- First interesting results in a field not yet systematically explored
- The next step will be to create a public clearinghouse and invite players to contribute and promote dialogue across languages
- The approach should be applicable to other countries with a variety of minority languages (such as Italy, Spain, Germany or Russia).

GENERAL CONCLUSION
The methodology presented could probably be reused without much modification for other language families.

MERCI
Thank you, Gracias, Obrigado, Amesegnalhu, Shukran, Dhonnyobaad, Orkun, Doh jeh, Dekuji, Adjarama, Abhar, Toda raba, Ngue pen, Tack
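
WORKED EXAMPLE: PUT IN PERSPECTIVE (PART 1)
As an illustration of the PUT IN PERSPECTIVE step, here is a minimal sketch of the indicator I = A x B x C x D / 1000 and of an average position of French weighted by I or by P. The sample values are read from the eight rows shown on that slide. The class and function names are hypothetical, and the weighted mean is an assumption about how the averages were combined: the slides do not spell out the averaging formula.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Element:
    """One space or application, rated as on the PUT IN PERSPECTIVE slide."""
    name: str
    a: int  # level of world relevance (0 to 10)
    b: int  # level of reliability of source (0 to 10)
    c: int  # level of trust for French (0 to 10)
    d: int  # level of relevance for French (0 to 10)
    p: int  # direct weighting
    rank_l1: Optional[int] = None   # position of French as L1, if known
    rank_l12: Optional[int] = None  # position of French as L1+L2, if known

    @property
    def indicator(self) -> float:
        # Formula as stated on the slide.
        return self.a * self.b * self.c * self.d / 1000


def weighted_mean_rank(
    elements: Iterable[Element],
    rank_of: Callable[[Element], Optional[int]],
    weight_of: Callable[[Element], float],
) -> Optional[float]:
    """Weighted mean of the known positions (assumed aggregation rule)."""
    pairs = [(rank_of(e), weight_of(e)) for e in elements if rank_of(e) is not None]
    total = sum(w for _, w in pairs)
    if total == 0:
        return None
    return sum(r * w for r, w in pairs) / total


# Sample rows from the slide (only 8 of the elements used in the study).
SAMPLE = [
    Element("Viadeo", 2, 5, 7, 10, p=6, rank_l12=1),
    Element("Tumblr", 6, 6, 7, 6, p=4, rank_l1=4, rank_l12=2),
    Element("Hotmail", 5, 5, 6, 6, p=4, rank_l12=2),
    Element("Open office", 9, 9, 9, 8, p=5, rank_l12=2),
    Element("Blogs.com", 6, 7, 7, 5, p=5, rank_l12=2),
    Element("Ning", 7, 7, 7, 8, p=5, rank_l1=6),
    Element("Msn", 7, 7, 7, 6, p=5, rank_l1=6),
    Element("Wordpress", 8, 7, 7, 7, p=5, rank_l1=7),
]

if __name__ == "__main__":
    print("L1 weighted by I:", weighted_mean_rank(SAMPLE, lambda e: e.rank_l1, lambda e: e.indicator))
    print("L1 weighted by P:", weighted_mean_rank(SAMPLE, lambda e: e.rank_l1, lambda e: e.p))
    print("L1+L2 weighted by P:", weighted_mean_rank(SAMPLE, lambda e: e.rank_l12, lambda e: e.p))
```

Because a weighted mean does not change when all weights are multiplied by the same constant, the result is the same whether I is divided by 1000 or by any other fixed factor. The averages on the slide cover the full set of elements, so this eight-row sample is not expected to reproduce them.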
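
WORKED EXAMPLE: SEARCH METHOD AND STATISTICS (PART 2)
The loop in the SEARCH METHOD (follow external links, go back to step 2 until no new links appear) and two of the STATISTICS indicators (rate of broken links, sector split) can be sketched as below. This is only an illustration under stated assumptions: the study describes a manual process, the function names are hypothetical, and the link graph is a made-up in-memory dictionary rather than real web data.

```python
from collections import Counter, deque
from typing import Callable, Dict, Iterable, List, Set


def snowball(seeds: Iterable[str], get_links: Callable[[str], Iterable[str]]) -> Set[str]:
    """Follow links from the seed references until no new reference appears."""
    seen: Set[str] = set()
    queue: deque = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    while queue:
        current = queue.popleft()
        for link in get_links(current):
            if link not in seen:  # only unseen references are queued
                seen.add(link)
                queue.append(link)
    return seen


def broken_link_rate(reachable: Dict[str, bool]) -> float:
    """Share of references whose link no longer works (0.0 if nothing was checked)."""
    if not reachable:
        return 0.0
    broken = sum(1 for ok in reachable.values() if not ok)
    return broken / len(reachable)


def sector_split(sectors: List[str]) -> Dict[str, float]:
    """Share of references per sector code (ORG, EDU, PER, GOV, COM, ...)."""
    counts = Counter(sectors)
    total = sum(counts.values())
    return {sector: n / total for sector, n in counts.items()}


# Hypothetical link graph standing in for the "external links" pages.
FAKE_LINKS = {
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-c", "site-d"],
    "site-c": [],
    "site-d": ["site-a"],
}

if __name__ == "__main__":
    found = snowball(["site-a"], lambda ref: FAKE_LINKS.get(ref, []))
    print(sorted(found))
    rate = broken_link_rate({"site-a": True, "site-b": False, "site-c": True, "site-d": True})
    # The slides use a rate above 20% as a sign of problems (Creole example).
    print(rate, "problem" if rate > 0.20 else "ok")
    print(sector_split(["ORG", "ORG", "EDU", "PER"]))
```

Keeping the link retrieval behind a `get_links` callable means the same loop works whether the links are noted by hand, as in the study, or collected by a script.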


