Monday, August 24, 2015

Algorithm: “Software developed in São Paulo filters data …” – Diario de Pernambuco

For many people, organizing folders and files on a computer is a task that is continually postponed. The same goes for reading and sorting the messages piling up in the email inbox, a chore that becomes more unbearable with each new message that arrives. Imagine, then, how difficult it would be to examine all the content published on websites such as news portals, blogs and social networks. It is an impossible challenge for humans. For machines, however, it is a job that can be accomplished without difficulty. Software under development at the Institute of Mathematics and Computer Sciences (ICMC) of the University of São Paulo (USP), in São Carlos, can automatically sort large volumes of digital text.

It is an algorithm that identifies the terms used in each type of text and analyzes the relationships between words in order to classify a new document. Everything is done according to examples provided by humans. For a virtual library holding various types of scientific papers, for example, it would suffice to register in the program a few works related to each subject. From a few examples of each category, the program completes the organization on its own.
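To make the idea of learning from a few human-labelled examples concrete, here is a minimal sketch in Python. It is not the ICMC algorithm: it simply sums the word counts of each category's examples and assigns a new text to the category whose word profile is closest by cosine similarity. The category names, the example texts and the function names (`tokenize`, `train`, `cosine`, `classify`) are all invented for illustration.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Sum the word counts of the example texts given for each category."""
    profiles = {}
    for category, text in examples:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(profiles, text):
    """Return the category whose word profile is most similar to the text."""
    words = Counter(tokenize(text))
    return max(profiles, key=lambda category: cosine(profiles[category], words))


# Hypothetical labelled examples: two short texts per category.
EXAMPLES = [
    ("databases", "query optimization for relational database indexes"),
    ("databases", "transaction logging in a distributed database"),
    ("vision", "image segmentation with convolutional filters"),
    ("vision", "detecting objects in a photo using image features"),
]

profiles = train(EXAMPLES)
label = classify(profiles, "a new index structure for database queries")
print(label)
```

With only two examples per category the sketch can already route a new title to the right shelf, which is the workflow the article describes; a real system would need far more text and a better model of the words.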

Most automatic text classification programs consider how often certain keywords appear in documents. The algorithms developed by ICMC doctoral student Rafael Rossi, however, can also interpret the networks of associations between terms, which allows the computer to identify patterns that other types of representation do not capture, making the software more efficient. Through machine learning, the system can improve at its task, imitating human discernment without having to be specifically programmed for it.

“What we propose is to consider the similarity between terms in a collection of documents. If I have the words ‘bank’ and ‘data’ in the same document, they will be similar. If, on the other hand, we have the terms ‘photo’ and ‘network’, which come from distinct areas, they are not similar. Thus, we define what we call the relevance value. It would be the weight or strength that a term has for a particular document. The goal is to use the similarity relationships to set this relevance,” explains Rafael Rossi.
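The quote does not say how similarity or relevance are actually computed, so the following Python sketch rests on two assumptions of its own: similarity between two terms is taken to be the Jaccard overlap of the sets of documents they appear in, and the relevance of a term for a document is its own count plus a share (`alpha`, a hypothetical parameter) of the counts of similar terms in that document. The toy collection and the names `term_documents`, `similarity` and `relevance` are illustrative only.

```python
from collections import Counter

# Toy collection: each document is a list of terms (invented data).
DOCS = [
    ["bank", "data", "query"],
    ["bank", "data", "index"],
    ["photo", "camera", "lens"],
    ["network", "router", "data"],
]


def term_documents(docs):
    """Map each term to the set of indices of the documents containing it."""
    postings = {}
    for i, doc in enumerate(docs):
        for term in set(doc):
            postings.setdefault(term, set()).add(i)
    return postings


def similarity(postings, a, b):
    """Assumed measure: Jaccard overlap of the two terms' document sets."""
    docs_a = postings.get(a, set())
    docs_b = postings.get(b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def relevance(postings, doc, term, alpha=0.5):
    """Assumed weight: the term's own count plus alpha times the
    similarity-weighted counts of the other terms in the document."""
    counts = Counter(doc)
    neighbours = sum(
        similarity(postings, term, other) * n
        for other, n in counts.items()
        if other != term
    )
    return counts[term] + alpha * neighbours


postings = term_documents(DOCS)
sim_bank_data = similarity(postings, "bank", "data")
sim_photo_network = similarity(postings, "photo", "network")
rel_bank = relevance(postings, DOCS[0], "bank")
print(sim_bank_data, sim_photo_network, rel_bank)
```

In this toy collection ‘bank’ and ‘data’ share two of the three documents either one appears in, so they come out as similar, while ‘photo’ and ‘network’ never co-occur and score zero. The weight of ‘bank’ in the first document then rises above its raw count because a similar term sits beside it.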
