At The Neon Project, we enjoy working on projects that apply up-to-date technologies such as artificial intelligence and, more specifically, machine learning. These technologies help us extract and analyze large amounts of data to make better decisions.
Natural language processing in our last project
In our most recent project, for an organization in the USA, we were tasked with helping to reduce and ultimately eliminate mistakes in a disaster prevention system through natural language processing (NLP).
The organization currently uses a system that relies on a traditional parser to extract relevant data from informative documents. These documents are written manually by people at trusted organizations. In this scenario, a badly formatted document that the parser fails to process is a critical problem for the rest of the infrastructure.
Because the system is so sensitive to false negatives (documents that are ignored by the parser but are actually relevant and critical), the organization requires engineers to review many of the documents the parser rejects, at whatever time they come in.
Technically, the source of the problem is that the data is heterogeneous and reports don’t necessarily follow a standard or template. Typing mistakes are to be expected, since the reports are written in critical and tense contexts. In this scenario, a traditional parser has major technical limitations.
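To illustrate why a template-bound parser produces false negatives, here is a minimal TypeScript sketch. The report layout, the field names, and the `parseReport` function are all hypothetical and are not the organization's actual parser or document format.

```typescript
// Hypothetical fixed layout: "EVENT: <type>; LOCATION: <place>; SEVERITY: <1-5>"
interface ParsedReport {
  event: string;
  location: string;
  severity: number;
}

// A rigid parser: it returns null whenever the text deviates from the layout,
// no matter how relevant the content is.
function parseReport(text: string): ParsedReport | null {
  const match = /^EVENT: (.+); LOCATION: (.+); SEVERITY: ([1-5])$/.exec(text.trim());
  if (match === null) {
    return null;
  }
  return {
    event: match[1] ?? "",
    location: match[2] ?? "",
    severity: Number(match[3]),
  };
}

// A well-formed report is parsed into its three fields.
const wellFormed = parseReport("EVENT: flood; LOCATION: Riverside; SEVERITY: 4");

// A single typo in a field name ("SEVERTY") makes the parser reject the report,
// so withTypo is null even though the content is just as critical.
const withTypo = parseReport("EVENT: flood; LOCATION: Riverside; SEVERTY: 4");
```

In this sketch the second report is dropped because of one misspelled field name, which is exactly the kind of false negative that someone would then have to catch by hand.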
Our approach was to design a natural language processing model that can identify a set of entities within a huge set of documents. After a few training and validation iterations, this model can replace the current parser and identify relevant documents more reliably. This reduces the risk of false negatives and the need for human intervention to correct them.
Adding Google Cloud AutoML to our tech stack
Last April, Google launched a new set of services for machine learning development under the name Google Cloud AutoML. We were very excited to try them in a real-world use case, and this project was a great chance to do so.
AutoML offers a simple interface where we could easily upload sample documents to create a dataset. We started with documents shared by our client and supplemented this dataset with similar documents we scraped from trusted sources on the Internet.
Once the documents were uploaded, we had to label the entities we wanted our model to recognize. Only once the dataset had a minimum number of samples for each label could we begin training the model. Once training finished, our model was able to identify the entities in a new document that the system had not seen before.
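The per-label minimum can be checked locally before uploading, as in the rough sketch below. The `TaggedEntity` shape, the label names, and the threshold passed as `minPerLabel` are assumptions made for illustration; the actual minimum should be taken from the AutoML documentation.

```typescript
// Hypothetical shape of one labeled entity in a training document.
interface TaggedEntity {
  label: string;
  text: string;
}

// Count the tagged entities per label and return the labels that have fewer
// than `minPerLabel` examples, together with their current counts.
function underrepresentedLabels(
  entities: TaggedEntity[],
  expectedLabels: string[],
  minPerLabel: number
): Record<string, number> {
  const counts: Record<string, number> = {};
  // Start every expected label at zero so labels with no samples are reported too.
  for (const label of expectedLabels) {
    counts[label] = 0;
  }
  for (const entity of entities) {
    counts[entity.label] = (counts[entity.label] ?? 0) + 1;
  }
  const lacking: Record<string, number> = {};
  for (const label of Object.keys(counts)) {
    const count = counts[label] ?? 0;
    if (count < minPerLabel) {
      lacking[label] = count;
    }
  }
  return lacking;
}
```

An empty result means every expected label has reached the chosen threshold; otherwise the returned counts show which labels still need more tagged samples.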
As soon as we had validated that the model was properly designed and trained, we started calling it through the Node.js library for the AutoML API. In addition, we built a proxy API service together with a simple web interface. In the interface, users can upload documents and see how confident the system is that a report is important, based on the content and context of the document and not on its format.
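How a proxy service might turn entity predictions into a single confidence value can be sketched as follows. This is only one possible scoring rule, assumed for illustration: the `EntityPrediction` shape, the `reportConfidence` function, and the idea of averaging the best score per required label are hypothetical and are not taken from the AutoML API or from the project itself.

```typescript
// Hypothetical shape of one entity returned by the model for a document.
interface EntityPrediction {
  label: string;
  text: string;
  score: number; // model confidence for this entity, between 0 and 1
}

// Assumed scoring rule: take the best score found for each required label
// (0 if the label was not found at all) and average over the required labels.
function reportConfidence(
  predictions: EntityPrediction[],
  requiredLabels: string[]
): number {
  if (requiredLabels.length === 0) {
    return 0;
  }
  let total = 0;
  for (const label of requiredLabels) {
    let best = 0;
    for (const prediction of predictions) {
      if (prediction.label === label && prediction.score > best) {
        best = prediction.score;
      }
    }
    total += best;
  }
  return total / requiredLabels.length;
}
```

Under this rule, a document in which every required entity is found with a high score gets a value close to 1 regardless of its layout, while a document that is missing required entities scores lower.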
Building the model and the user interface took days instead of weeks thanks to Google Cloud AutoML, which helped us make our clients happy. We found AutoML an exciting platform for creating machine learning solutions. Getting things done was very straightforward, and the results were pretty good. We will continue using it in the future!