As developers, we manipulate data every day. But one kind of data remains notoriously hard to process and structure: human language. The challenge only grows in a country like Switzerland, with its four official languages! Fortunately, computer science has devoted decades of effort to exactly this problem, under the name Natural Language Processing (NLP). In this talk, we cover many of the components of a modern NLP pipeline: from basic tasks such as tokenization and lemmatization to more advanced techniques such as Named Entity Recognition (NER), coreference resolution, and dependency parsing. Furthermore, we show where and how LLMs, such as GPT, can be plugged in to (possibly) enhance a pipeline. To provide a real-world, and distinctly Swiss, context, our target dataset is the Swiss Commercial Registry. This complex, multilingual public database is central to an expansive interdisciplinary research project in economics and political science, for which we are building the software engineering backbone using cutting-edge NLP technology.
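To give a flavor of the most basic task mentioned above, here is a minimal sketch of tokenization using only a naive regular expression; this is an illustration, not the talk's actual toolchain, and real pipelines (spaCy, NLTK, Stanza, etc.) use far more robust, language-aware tokenizers.

```python
import re

def tokenize(text):
    # Naive tokenizer: a token is either a run of word characters
    # (Unicode-aware, so umlauts are kept intact) or a single
    # punctuation mark. Real tokenizers also handle abbreviations,
    # hyphenation, clitics, etc.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical company name, in the spirit of the Swiss Commercial Registry
print(tokenize("Die Firma Müller AG sitzt in Zürich."))
# → ['Die', 'Firma', 'Müller', 'AG', 'sitzt', 'in', 'Zürich', '.']
```

Even this toy version shows why tokenization matters: every downstream component, from lemmatization to NER, operates on these token boundaries.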