Study on the effectiveness of NLP tools for syntactic analysis of French texts.

Link to full Paper

NLP (natural language processing) is the field of artificial intelligence that deals with the processing, analysis and understanding of natural language. It emerged in the 1950s and has since become indispensable for document translation, text indexing, and text mining.

NLP tools have developed remarkably: they started as automatic translators, became assistants that structure information into computer-understandable data, and since the 2000s have taken a more developed and complex form, helped by advances in hardware for artificial intelligence, big data, and deep learning. NLP now underpins search engines, text filters (spam detectors) and even speech recognition technology.

The project “Study on the efficiency of French text parsing tools” follows on from the project “Study of the performance of POS-taggers applied to the French language”, which re-ran comparisons of the POS-tagging outputs of several NLP tools on a French-language corpus (the French TreeBank). Similarly, this project compares the same NLP tools, but on a more complex task, namely syntactic dependency parsing.

This research was applied to a corpus of French texts, the French TreeBank (FTB), which consists of articles from multiple newspapers published between 1990 and 1993. The corpus is not vast: it contains over 21,500 sentences and 664,500 tokens (with POS tags), but it was suitable for a medium-sized analysis.

The analysis was a syntactic analysis, which in simple terms means finding the relations between the words of a given sentence. The steps taken were simple and clear:

  • Clean and format the corpus.
  • Adapt the texts to each tool.
  • Analyse the texts and aggregate the outputs.
  • Cross-validate over the same texts.
  • Plot the results for each tool.
  • (Bonus) Compare and plot the inference time for each tool.
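
To make the analysis and timing steps above more concrete, here is a minimal Python sketch of how one tool's dependency output could be scored against the gold annotations and timed. This summary does not say which metric the study used; the unlabeled and labeled attachment scores (UAS and LAS) are a common choice for dependency parsing and are assumed here. The example sentence, the `toy_parser` function and all other names are hypothetical and stand in for a real tool and for the FTB data.

```python
import time
from typing import Callable, List, Tuple

# One token of a dependency parse: (word form, head index, relation label).
# Heads are 1-based token positions; 0 means the token is the sentence root.
Token = Tuple[str, int, str]
Sentence = List[Token]


def attachment_scores(gold: List[Sentence], predicted: List[Sentence]) -> Tuple[float, float]:
    """Return (UAS, LAS) over all tokens, as fractions between 0 and 1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    total = correct_heads = correct_labeled = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("tokenization mismatch between gold and predicted sentence")
        for (_, g_head, g_label), (_, p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_heads += 1          # head is right: counts for UAS
                if g_label == p_label:
                    correct_labeled += 1    # head and label are right: counts for LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return correct_heads / total, correct_labeled / total


def timed_parse(parser: Callable[[List[str]], Sentence],
                sentences: List[List[str]]) -> Tuple[List[Sentence], float]:
    """Run a parser over tokenized sentences and return (parses, elapsed seconds)."""
    start = time.perf_counter()
    parses = [parser(words) for words in sentences]
    elapsed = time.perf_counter() - start
    return parses, elapsed


def toy_parser(words: List[str]) -> Sentence:
    """Hypothetical stand-in for a real tool: attach each word to the next one,
    and make the last word the root. Every non-root label is the generic 'dep'."""
    last = len(words)
    return [
        (word, 0, "root") if i == last else (word, i + 1, "dep")
        for i, word in enumerate(words, start=1)
    ]


# Hypothetical gold annotation for "Le chat dort" ("The cat sleeps").
gold = [[("Le", 2, "det"), ("chat", 3, "nsubj"), ("dort", 0, "root")]]

predicted, elapsed = timed_parse(toy_parser, [["Le", "chat", "dort"]])
uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.3f} LAS={las:.3f} time={elapsed:.6f}s")
```

In a setup like this, each tool would be wrapped in a function with the same signature as `toy_parser`, so that the scores and the elapsed times can be collected per tool and plotted side by side.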