Quotations from my reading

"There is no history of France. There is only a history of Europe." Marc Bloch
"There is no history of Europe, there is a history of the world." Fernand Braudel

Civilisation: "That which, through successions of economies and successions of societies, persists in living while letting itself be bent only barely and little by little." Fernand Braudel

Monday, 6 December 2010

Cloud analysis of texts with TagCrowd

Summary:
TagCrowd is an online tool that measures word frequency in a text. It is practical and free, and its settings can be adjusted. More options are available for English-language texts. For a historian, among the many uses touted by its designer, two stand out as appealing:
- a first pass at clearing the ground in long textual data, from a comparative perspective;
- illustrating one's argument with an automatically generated, visually effective cloud of keywords.
In short, it is a fun tool. Here is a more detailed analysis:

Faced with increasingly massive amounts of text, more and more of it available as something other than page images, researchers in the humanities may be tempted to enlist the computer's help.
A very basic way to approach a text with a computer is simply to run a keyword search, i.e. to search for the words I think, or hope, will be in the text. That is exactly like browsing the index at the end of a book.
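To make the "index at the end of a book" comparison concrete, here is a minimal Python sketch of such a keyword search. The function name `keyword_hits` and the sample sentence are made up for illustration; this has nothing to do with how TagCrowd works internally.

```python
import re

def keyword_hits(text, keywords):
    """For each keyword, list the character offsets where it occurs as a whole word."""
    hits = {}
    for kw in keywords:
        # \b keeps "mission" from matching inside "missionary"
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        hits[kw] = [m.start() for m in pattern.finditer(text)]
    return hits

# Hypothetical sample sentence, not taken from either book.
sample = "The mission opened a school. The temple stood near the mission."
for word, offsets in keyword_hits(sample, ["mission", "temple", "church"]).items():
    print(word, len(offsets), offsets)
```

The limitation is visible at once: "church" comes back empty, and I only learn about the words I thought of asking for.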
But software can help us do just the opposite: faced with huge amounts of text, we can generate indexes on the basis of word frequency, and so have our own tags generated automatically.
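The reverse approach, letting the text propose its own index, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the tokenising rule (letters only, apostrophes kept inside a word) and the name `word_frequencies` are mine, and the sample phrase is invented.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens; an apostrophe stays inside its token (e.g. "d'un")."""
    # [^\W\d_]+ matches runs of letters, accented ones included
    tokens = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    return Counter(tokens)

# Hypothetical sample phrase, not a quotation from Granet.
sample = "Le culte des ancêtres et le culte du ciel"
freqs = word_frequencies(sample)
for word, count in freqs.most_common(3):
    print(word, count)
```

Note that without any filtering the most frequent "tags" are likely to be articles and prepositions.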
Computerised text analysis is a very advanced field and offers many different approaches to what researchers most urgently need: ways to simplify the analysis of ever larger amounts of text that a single human being could not process in a day, a year, or a lifetime. Text analysis can be based on proximity (geometrical algorithms), on frequency, on meaning, or on a combination of these.
TagCrowd is one of these tools: it is a web application that creates clouds of tags selected on the basis of frequency in a given text.
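How a frequency table turns into a cloud can be sketched as follows. I am assuming a simple linear scaling from counts to font sizes and an alphabetical layout; I do not know how TagCrowd actually computes its sizes, and the function name, point sizes and counts below are all hypothetical.

```python
def font_sizes(counts, top_n=50, min_pt=10, max_pt=36):
    """Map the top_n most frequent words to font sizes, scaled linearly by count."""
    # Sort by descending count, then alphabetically to break ties
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    sizes = {}
    for word, n in top:
        # If every word has the same count, give them all the largest size
        scale = (n - lowest) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    # Return the words in alphabetical order (an assumed layout)
    return dict(sorted(sizes.items()))

# Hypothetical counts, for illustration only.
print(font_sizes({"china": 9, "mission": 5, "school": 1}))
```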

Let's test it with two pieces on China and religion:

- La Religion des Chinois, by Marcel Granet (one of the founding fathers of modern Sinology in France), 1922:


[Word cloud image, created at TagCrowd.com]
This was obtained with the option to remove the most common words in French switched on. It is not perfect, since "d'un" and "d'une" were not removed.
But it is much better than this:

[Word cloud image, created at TagCrowd.com]

To solve the problem of unwanted words or compounds, one can create a stoplist. I did so to improve the quality of the cloud, and obtained this (while raising the number of words to 50):

[Word cloud image, created at TagCrowd.com]
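The stoplist step can be sketched in Python as below. This is a rough illustration, not TagCrowd's implementation: the function name `top_words`, the tokenising rule and the sample phrase are my own assumptions. Because an elided form such as "d'un" is kept as a single token here, it has to be listed explicitly in the stoplist, which is the kind of leftover I ran into.

```python
import re
from collections import Counter

def top_words(text, stoplist, limit=50):
    """Return the `limit` most frequent words that are not in the stoplist."""
    tokens = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    stop = {w.lower() for w in stoplist}
    kept = [t for t in tokens if t not in stop]
    return Counter(kept).most_common(limit)

# Hypothetical sample phrase and stoplist, for illustration only.
sample = "Le culte d'un ancêtre, le culte d'une famille, le culte du ciel"
custom_stoplist = ["le", "du", "d'un", "d'une"]
print(top_words(sample, custom_stoplist, limit=3))
```

The `limit` parameter plays the role of the "number of words" setting, which I raised to 50 for the cloud above.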
- A Typical Mission in China, by William Soothill (who later became Professor of Chinese at Oxford), 1907:

[Word cloud image, created at TagCrowd.com]
Because this cloud comes from an English text, it is also supposed to group words by proximity (an extra function available for English).
Here is the list of unwanted words I removed to get a "cleaner" cloud with more "meaningful" words: ago, already, away, become, became, found, mr, name, number, soon, whose, brought.

As we can see, this process would probably not satisfy a demanding cliometrician, linguist, or historian (on what basis do I eliminate the verb "become", which belongs to the lexical field of change?), and other software exists that would. But this web app is free, reliable, adaptable and easy to use, and it produces pretty, visually striking clouds. When will a Chinese version be available?


PS: The tags for this post were generated with TagCrowd.

3 comments:

  1. Funny, it never occurred to me to use it that way... I thought of the tag cloud more as a navigation tool. Well spotted! (or rather, poorly spotted on my part...)

  2. Actually, it is because the word "tag" is misleading. A tag on a site or a blog necessarily points to a set of tagged posts. Here, it is a search for word frequencies. It is a process of visual assistance in finding a major theme in a text. And it's pretty...

  3. Quite agree. And so, once again, well spotted.
