3RD INTERNATIONAL CONFERENCE ON LINGUISTIC AND CULTURAL DIVERSITY IN CYBERSPACE, 28 June - 3 July 2014, Yakutsk, Russia
Networks & Development Foundation http://funredes.org
Observatory of languages & cultures in the Internet http://funredes.org/lc
Executive Committee Member of
A methodology for exploring the situation of French & languages of France
in the Internet which could apply to other groups of languages.
Daniel Pimienta and Daniel Prado MAAYA, May 2014
CREDITS
The methodology results from merging the products of two independent studies carried out in 2013 by the team D. Prado/D. Pimienta, on behalf of MAAYA:
- A study mandated by the OIF on the space of French on the Internet
- A study mandated by the General Delegation for the French Language and the Languages of France (DGLFLF) of the Ministry of Culture on the space of the languages of France on the Internet
TWO COMPLEMENTARY APPROACHES
FRENCH, a language ranked 8th in number of speakers (L1+L2)
OTHER, MINORITY LANGUAGES spoken in the territories of France
ANTECEDENTS
DIFFICULTIES IN PRODUCTION OF INDICATORS
DILINET PROJECT STATUS
LINGUISTIC DIVERSITY INDICATORS PARADOX
[Figure: timeline from 1988 to 2014 contrasting INTEREST and CAPACITY; initiatives labelled on the timeline: FUNREDES/UL, LOP, ALIS/ISOC, OCLC, XEROX, IDESCAT]
WHAT INDICATORS DO WE HAVE?
- Internet users per language (source: InternetWorldStats), until 2011
- Web pages per language (not all!), until 2008
- Other indicators per country (FUNREDES/UL), until 2008
WHERE IS THE BOTTLENECK?
The two main indicator-building activities rely:
- on crawling ccTLDs for languages in Asia, Africa and the Caribbean and applying language recognition algorithms (LOP).
- on using the counting capacity of search engines and their large percentage of web coverage (FUNREDES/UNION LATINA).
WHERE IS THE BOTTLENECK?
But
- The size of the web is getting too large for traditional crawling (close to infinite!).
- Search engines no longer index a substantial part of it (80% → 5%).
- Search engine counting has become unreliable.
- And anyway, all we got was static data, mostly focused on the number of web pages per language.
A RESEARCH PROJECT
Collaboration between UNESCO, OIF and UNION LATINA, with participation of ITU.
High-profile partners: ERCIM, MAAYA, UNESCO, OIF, FUNREDES, EXALEAD, UPC, DIALOGIC, CNRS/LIMSI, FRAUNHOFER, CWI, VOCAPIA, NIELSEN
Significant investment (estimated at 300 K euros, direct and indirect)
PROCESS
Proposals submitted to 2 EU/FP7 calls:
- Jan. 2012: Integrated Project of 7 M euros for ICT-2011.4.4 Intelligent Information Management
- Jan. 2013: Specific Targeted Research Project of 3 M euros for ICT-2013.4.1 Content analytics and language technologies - Cross-media content analytics
2 near misses, reflecting low EU interest in the theme
New attempt in progress with Qatar partners, with LOP on board
MEANWHILE
- InternetWorldStats stopped updating 3 years ago
- A new and interesting player, but limited to the top 10 million sites (2% of all sites): W3Techs
- Web evolution towards dynamic pages, video, social networks
The context calls for alternative approaches
PART 1 : MEASURING FRENCH
- Defining a large set of spaces and applications to get data from.
- Searching for a large number of Internet sites which offer linguistic or country data for those spaces/applications.
- Applying appropriate selection criteria to this set of sites.
- Collecting, compiling and organizing data.
- Crossing Internet data with reliable demo-linguistic data.
- Putting results in perspective.
P1 : SPACES & APPLICATIONS
UP TO 100, split into the following categories:
Spaces: Infrastructure, Online library, Smartphones, VOIP/Chat, Operating systems, Browsers
Applications: Office applications, Web 2.0, Search engines, Email, P2P
P1: SOURCES
Traditional sources (UN, UNESCO, ITU, OECD, EU) have little linguistic data but plenty of country data.
Most non-traditional sources are either:
- Marketing companies offering free glimpses of expensive data
- Experts showing their capacity through reports
The lifespan of non-traditional sources is often short.
SOURCE SELECTION CRITERIA
- Too small scope
- Too biased
- Not recently updated
- Methodology not reliable
SELECTED SOURCES
More than 200 sources examined, fewer than 100 sources retained
Rating: 10 = excellent; < 5 = not used but kept for future checking
SOURCES PARAMETERS
- Title
- URL
- Publication year
- Rating (0 to 10)
- Focus (worldwide, Europe, France, USA, OECD)
- Frequently updated (y/n)
- Type of source (meta, general, space, application, book, report, paper, webpage)
- Application or space concerned
- Language specific (y/n)
- Comments
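The source parameters and the rating cut-off could be held in a small record type. What follows is only a minimal sketch in Python: the class name `Source`, the field names, the helper `split_by_rating` and the example entries (including their URLs) are illustrative assumptions, not part of the study. Only the list of parameters and the rule that sources rated below 5 are kept but not used come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One catalogued source; field names are hypothetical."""
    title: str
    url: str
    publication_year: int
    rating: int               # 0 to 10, 10 = excellent
    focus: str                # worldwide, Europe, France, USA, OECD
    frequently_updated: bool
    source_type: str          # meta, general, space, application, ...
    concerns: str             # application or space concerned
    language_specific: bool
    comments: str = ""

    def __post_init__(self):
        if not 0 <= self.rating <= 10:
            raise ValueError("rating must be between 0 and 10")


def split_by_rating(sources, threshold=5):
    """Return (used, kept_for_later): ratings below the threshold are kept but not used."""
    used = [s for s in sources if s.rating >= threshold]
    kept_for_later = [s for s in sources if s.rating < threshold]
    return used, kept_for_later


# Hypothetical entries, for illustration only.
examples = [
    Source("Example browser statistics", "http://example.org/a", 2013, 8,
           "worldwide", True, "application", "Browsers", False),
    Source("Example blog survey", "http://example.org/b", 2009, 3,
           "France", False, "report", "Blogs", True),
]
used, kept_for_later = split_by_rating(examples)
```

Keeping the low-rated entries in a separate list, rather than deleting them, mirrors the "kept for future check" note on the slide.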
DEMO-LINGUISTIC DATA
- No institutional support → low data quality
- Large and diverse geography → divergent data
- Main demo-linguistic sources → divergent data
- Language typology → boundary dilemma
DEMO-LINGUISTIC CHOICES
- ETHNOLOGUE FOR L1 (→ homogeneity)
- DIVERSE SOURCES FOR L2 (→ reliability)
- WIKIPEDIA FOR COUNTRY DEMOGRAPHICS
- INTERVAL DATA FOR SOME SPACES/APPLICATIONS
PUT IN PERSPECTIVE
I = A × B × C × D / 1000
A = Level of world relevance (0 to 10)
B = Level of reliability of source (0 to 10)
C = Level of trust for French (0 to 10)
D = Level of relevance for French (0 to 10)
P = Direct weighting
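ELEMENT      A  B  C  D   I    L1  L1+L2 (L12)  P  L1×I  L12×I  L1×P  L12×P  TYPE
Viadeo       2  5  7  10  0.7  -   1            6  0     0.7    0     6      RS
Tumblr       6  6  7  6   1.5  4   2            4  6.0   3.0    16    8      RS
Hotmail      5  5  6  6   0.9  -   2            4  0     1.8    0     8      APP
Open office  9  9  9  8   5.8  -   2            5  0     11.7   0     10     APP
Blogs.com    6  7  7  5   1.5  -   2            5  0     2.9    0     10     BLOG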
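The weighting I = A × B × C × D / 1000 can be checked against the sample rows with a few lines of Python. This is a sketch under stated assumptions: the A to D scores and the L1+L2 positions are the values read from the sample table, the function names are illustrative, and the aggregation into a weighted mean position is an assumption, since the slides only show the product columns (L1×I, L12×I) and not how they are combined.

```python
def indicator(a, b, c, d):
    """I = A x B x C x D / 1000, each factor scored from 0 to 10, so I lies in 0..10."""
    for score in (a, b, c, d):
        if not 0 <= score <= 10:
            raise ValueError("each score must be between 0 and 10")
    return a * b * c * d / 1000


def weighted_mean_position(rows):
    """ASSUMPTION: combine (position, weight) pairs as a weighted mean.

    The slides do not state the aggregation rule; this is one plausible reading.
    """
    total_weight = sum(weight for _, weight in rows)
    return sum(position * weight for position, weight in rows) / total_weight


# (A, B, C, D) scores as read from the sample table.
SCORES = {
    "Viadeo": (2, 5, 7, 10),
    "Tumblr": (6, 6, 7, 6),
    "Hotmail": (5, 5, 6, 6),
    "Open office": (9, 9, 9, 8),
    "Blogs.com": (6, 7, 7, 5),
}
# Position of French as first + second language (L12), as read from the same table.
L12_POSITION = {"Viadeo": 1, "Tumblr": 2, "Hotmail": 2, "Open office": 2, "Blogs.com": 2}

weights = {name: indicator(*abcd) for name, abcd in SCORES.items()}
mean_l12 = weighted_mean_position(
    [(L12_POSITION[name], weights[name]) for name in SCORES]
)
```

A source that scores high on all four criteria (Open office, I close to 5.8) weighs several times more in such a mean than a weakly rated one (Viadeo, I = 0.7).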
ANALYSIS PER TYPE
* = Only one source
Type of space: L1 / L1+L2
BOOKS: 3*
BLOGS: 6.5 / 3.3
APPLICATIONS: 6.7 / 3.6
SOCIAL NETWORKS: 7 / 4
INFRASTRUCTURES: 7.9 / 4
USERS: 9 / 4*
CONTENTS: 8 / 4.1
VIDEO: 7 / 6*
P2P: 6.3
CONCLUSION P1
French, as a first language, can be considered to rank above, but close to, position 7 on the Internet, all elements combined.
French, as a first and second language, can be considered to rank above, but very close to, position 4.
CONCLUSION P1
French, in spite of its lower demographic strength, is in close competition on the Internet, depending on the space/application, with Spanish, German, Japanese, Portuguese and, to some extent, Russian and Arabic.
CONCLUSION P1: TRENDS
Strongly emerging languages (competing with English): Chinese (will overtake English), Spanish
Emerging languages (competing with French): Hindi, Bengali, Russian, Arabic
New players: Urdu, Indonesian
CONCLUSION P1
Most elements of the applied methodology should also work for other languages of wide international scope, such as Arabic, Portuguese, Spanish or Russian.
PART 2 : LANGUAGES OF FRANCE
[Map; label: MAYOTTE]
SELECTION OF LANGUAGES OF FRANCE FOR THAT STUDY
Alsatian, Basque, Breton, Catalan, Corsican, Creole (*), Flemish, Frankish, Franco-Provençal, Futunan, Languages of Mayotte (*), Oïl languages (*), Kanak languages (*), Occitan (*), Tahitian, Wallisian
(*) : family of languages
SELECTION CRITERIA
Territory-based languages (no immigration languages)
Subset with higher probability of Internet presence:
- more than 50,000 speakers
- or used in official teaching
Language families
Creole: Martinique, Guadeloupe, Guyane, La Réunion
Occitan: auvergnat, gascon, languedocien, limousin, provençal, vivaro-alpin
Kanak: ajië, drehu, nengone, paicî, xârâcùù (+ 24 more not studied)
Languages of Mayotte: kibushi and shimaoré
Languages terminology
Alsatian: alemannic, alemannisch, alsacien, elsaessisch, elsässisch, etc.
Basque: biscayan, gipuzkera, gipuzkoan, guipuzcoan, guipuzcoano, euskera, euskara, roncalese, vasco, vascuense, vizcaino, etc.
Catalan: Aiguavivan, Algherese, Aragonais oriental, Balear, Català, Catalán, Catalan-Valencian-Balear, Eivissenc, Mallorqui, Menorqui, Menorquin, Lleidatà, Pallarese, Ribagorçan, Valencià, Valenciano, etc.
Corsican: corsu, corsican, corsi, corso, sartenais, venaco, vico-ajaccio, etc.
Languages terminology
Moselle Frankish: lothrnger ditsch, lothringer deutsch, lothringer plattm, lothrnger deitsch, lothrnger deitsch, lothrnger platt, francique luxembourgeois, francique mosellan, platt, etc.
Languages terminology
Francoprovençal: arpetan, arpian, arpitan, arpitano, brass, burgondan, burgonds, dauphinois, delfinese, dialetto, faetar, francoprovenl, friborgs, fribourgeois, genevois, harpitan, lyon, lyonnais, mcons, neuchatelais, neuchâtelois, patois, patoua, patous, romand, savoiardo, savoyard, savoyrd, tot-parier, valaisan, valdostano, valdôtain, valdtn, valsan, vaudois, vdous
Languages terminology
Langue d'oïl: angevin, berrichon, bourbonnais, bourguignon-morvandiau, brionnais-charolais, champenois, frain-comtou, franc-comtois, gallo, langue comtoise, lorrain, mâconnais, manceau, maraîchin, mayennais, normand, normand méridional, picard, poitevin, poitevin-saintongeais, saintongeais, wallon, etc.
Languages terminology
Occitan: béarnais, aspois, girondin, lemozin, limousin, médocain, mondin, monegasque, neugue, niçois, nissard, nissart, occitanien, occitanique, parler d'oc, romans, patois, proensal, raimondin, rouergat, etc.
Shibushi: malgache de Mayotte, kibushi kimaore, kibushi kiantalaoutsi, kibushi, kibuki, bushi
Tahitian: reo tahiti
Wallisian: fakauvea, faka uvea, ouva
DIFFERENT METHODOLOGY
The same method cannot be applied because, for most of these languages, there are no Internet references offering space/application data as there are for French (or Spanish, English or Russian).
What would be the alternative, knowing that the Internet space of most of those languages is quite small compared to that of French?
LoF METHODOLOGY
We cannot search only for Internet references giving data on the situation of those languages on the Internet.
We cannot search for all references related to those languages.
BUT
How about searching for references closely related to those languages?
SCOPE OF THE SEARCH
References closely related to one of the languages of the study (not the territory!)
Also references offering data on all the languages of France, or offering data on all languages including the ones studied.
What is the definition of "closely related"?
Close relationship to language
Best choice: site/book/paper discussing the situation of the language on the Internet and/or offering data about it
Good choices:
- Meta reference about the language (database, clearinghouse, linguistic organization, ...)
- Linguistic resources (dictionary, ...)
- Reference discussing the language
- Cultural reference if it has an indirect relation with the language (literature, poetry or songs)
- Reference offering serious language learning
- Blogs in or about the language
Close relationship to language
Bad choices:
- Touristic resources (except excellent presentations of the language)
- Reference looking good but not in the public domain
- Reference copying another source (go to the original source)
RATING
WARNING: This is not a value judgment about the reference; what is rated is only its level of contribution and proximity to the theme of language on the Internet.
TARGET: The theme of language on the Internet, or bringing meaningful data about that theme
RATING
9: Exceptional contribution to the theme or meaningful data
8: Strong contribution or interesting data
7: Interesting contribution or original data
6: Average contribution
5: Indirect relation
4: Indirect relation but not much content
3: Not accessible but kept in memory because of special interest
COLLECTED DATA
YEAR
UPDATED (Y/N)
SECTOR: GOV, EDU, ORG, COM, PER
TYPE: Article / Blog / Portal / Linguistic Resource / Social Network / META / Data Base / Library
LANGUAGE: Local, French, English, German, Spanish
DATA: Y/N
COMMENTS
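The rating scale and the collected fields map naturally onto a small validated record. The following Python sketch is illustrative only: the names `Reference`, `Sector` and `RATING_LABELS`, the field names and the example URL are assumptions made for the example; the scale values (9 down to 3), the sector codes and the list of fields are the ones given on the slides.

```python
from dataclasses import dataclass
from enum import Enum


class Sector(Enum):
    GOV = "GOV"
    EDU = "EDU"
    ORG = "ORG"
    COM = "COM"
    PER = "PER"


# Rating scale as given on the slide (no values above 9 or below 3).
RATING_LABELS = {
    9: "Exceptional contribution to the theme or meaningful data",
    8: "Strong contribution or interesting data",
    7: "Interesting contribution or original data",
    6: "Average contribution",
    5: "Indirect relation",
    4: "Indirect relation but not much content",
    3: "Not accessible but kept in memory because of special interest",
}


@dataclass
class Reference:
    """One collected reference; field names are hypothetical."""
    url: str
    year: int
    updated: bool
    sector: Sector
    ref_type: str        # Article, Blog, Portal, Linguistic Resource, ...
    language: str        # Local, French, English, German, Spanish
    has_data: bool
    rating: int
    comments: str = ""

    def __post_init__(self):
        if self.rating not in RATING_LABELS:
            raise ValueError("rating must be one of 3..9")


# Hypothetical example entry.
ref = Reference("http://example.org/dictionary", 2012, True, Sector.ORG,
                "Linguistic Resource", "Local", False, 7)
label = RATING_LABELS[ref.rating]
```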
SEARCH METHOD
1. Simple search with the most common name of the language to find the main sites in the first 100 answers
2. Go to the external links page if possible and note all of them
3. Systematic analysis of links
4. Back to 2 until it is clear that no new links appear
5. Complete with more sophisticated searches (Google Scholar, books, blogs, other languages, other language terminology)
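The loop "follow external links, analyze them, repeat until no new links appear" is a snowball (breadth-first) traversal. The Python sketch below illustrates that reading only; it is not the authors' tooling, which the slides describe as a manual process. The function names, the callbacks `get_outlinks` and `is_relevant`, and the toy link graph are all hypothetical. The initial search and the complementary searches are represented simply by the seed list.

```python
from collections import deque


def snowball(seed_sites, get_outlinks, is_relevant):
    """Follow external links from the seeds until no new relevant site appears."""
    seen = set(seed_sites)
    queue = deque(seed_sites)
    while queue:
        site = queue.popleft()
        for link in get_outlinks(site):
            # Keep only links not yet collected and judged closely related.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return seen


# Toy link graph standing in for real "external links" pages (hypothetical).
TOY_LINKS = {
    "a.example": ["b.example", "tourism.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": [],
}

found = snowball(
    ["a.example"],
    get_outlinks=lambda site: TOY_LINKS.get(site, []),
    is_relevant=lambda site: not site.startswith("tourism"),
)
```

The traversal stops by itself once every reachable relevant site has been visited, which corresponds to the stopping rule "until it is clear that no new links appear".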
RESULTS
A total of more than 1000 references (4 languages still missing)
This obviously cannot be taken as an exhaustive search, but we do have enough data to use statistics and draw conclusions useful for public policies.
NUMBER OF REFERENCES
STATISTICS
Some key indicators are observed:
- The rate of broken links, which indicates the vitality of the language on the Internet (for example, a rate above 20% for Creole reveals problems)
- The split between ORG, PER, EDU, COM
- The split between reference types
                     ORG  EDU  PER  GOV  COM  OTHER
General              27%  49%  0%   8%   7%   8%
Languages of France  20%  48%  7%   23%  2%   0%
Breton               52%  17%  6%   3%   22%  0%
Corsican             15%  24%  27%  19%  14%  0%
Creoles              24%  31%  14%  5%   26%  0%
Francoprovençal      44%  17%  35%  4%   1%   0%
Futunian             28%  56%  16%  0%   0%   0%
Kanak                21%  48%  12%  7%   6%   6%
Mayotte              34%  37%  14%  2%   14%  0%
Occitan              39%  19%  25%  7%   9%   1%
Tahitian             28%  28%  6%   9%   30%  0%
Wallisian            19%  39%  26%  0%   16%  0%
TOTAL                31%  30%  17%  8%   12%  2%
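Both indicators, the sector split and the rate of broken links, are simple counts over the collected references. A minimal Python sketch follows, assuming each reference is reduced to a (sector, reachable) pair; the function names and the toy records are hypothetical and do not reproduce the study's data. The 20% alert level is taken from the Creole example on the slide.

```python
from collections import Counter


def sector_split(records):
    """Percentage of references per sector, rounded to whole percent."""
    counts = Counter(sector for sector, _ in records)
    total = sum(counts.values())
    return {sector: round(100 * n / total) for sector, n in counts.items()}


def broken_link_rate(records):
    """Share of references whose link no longer resolves."""
    broken = sum(1 for _, reachable in records if not reachable)
    return broken / len(records)


# Toy records as (sector, link still reachable?) pairs; purely illustrative.
records = (
    [("ORG", True)] * 4
    + [("EDU", True)] * 3
    + [("PER", False)] * 2
    + [("COM", True)]
)

split = sector_split(records)
rate = broken_link_rate(records)
needs_attention = rate > 0.20   # alert level from the Creole example
```

Because each percentage is rounded separately, a row can add up to 99% or 101%, as some rows of the table do.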
SECTOR SPLIT
ORGANIZED CIVIL SOCIETY
ACADEMIA
CITIZENSHIP
GOVERNMENT (OFTEN LOCAL)
TOURISM
                     ORG  EDU  PER  GOV  COM  OTHER
General              27%  49%  0%   8%   7%   8%
Languages of France  20%  48%  7%   23%  2%   0%
Breton               52%  17%  6%   3%   22%  0%
Corsican             15%  24%  27%  19%  14%  0%
Creoles              24%  31%  14%  5%   26%  0%
Kanak                21%  48%  12%  7%   6%   6%
Occitan              39%  19%  25%  7%   9%   1%
TOTAL                31%  30%  17%  8%   12%  2%
                              MEAN  Group with higher %  Group with lower %
% in English                  10%   General              Occitan
% in French                   48%   LoF                  Tahitian
% in local language           7%    Corsican             LoF
% in French & local language  19%   Breton & Corsican    LoF
% multilingual                18%   Tahitian             Corsican
EMERGING PATTERNS
A1- Not much spoken; Internet presence pushed by citizens and multiple stakeholders, including local government: Corsican
A2- Not much spoken; Internet presence pushed by citizens but low government involvement: Occitan & Franco-Provençal
A3- Not much spoken; Internet presence pushed by civil society organizations but low government involvement: Breton
B- Spoken language but low Internet presence, except academic: Creole, Kanak, Futunian and Wallisian
CONCLUSION P2
- First interesting results in a field not yet systematically explored
- The next step will be to create a public clearinghouse, invite players to contribute and promote dialogue across languages
- The approach should be applicable to other countries with a variety of minority languages (such as Italy, Spain, Germany or Russia).
GENERAL CONCLUSION
The methodology presented could probably be reused without much modification for other language families.
MERCI Thank you Gracias Obrigado Amesegnalhu Shukran Dhonnyobaad Orkun Doh jeh Dekuji Adjarama Abhar Toda raba Ngue pen Tack