SUMMER JOB IN DATA ANONYMIZATION – students only

by Elena Callegari

Tuesday, 17 May 22 - 8:43 am

About the position

This is a 3-month, paid position reserved for Icelandic undergraduate or graduate students, or international BA or MA students enrolled at an Icelandic university.

The candidate will be developing a pipeline for the automatic anonymization of scientific articles and academic texts.
This pipeline will be used in conjunction with, and to facilitate the creation of, an annotated corpus of scientific texts. This corpus is a joint project between the Icelandic company SageWrite and the Language & Technology lab at the University of Iceland. The text extraction and text anonymization algorithms will make it easier to obtain clean data that can be manually annotated and then used to train machine-learning algorithms.
By working on this project, the student will gain advanced skills that can be used to pursue a career in natural language processing (NLP). In 2021, NLP was one of the 7 most in-demand tech skills. Specialists in anonymization in particular have been in high demand since the introduction of the GDPR.

What will you do?

In recent years, laws such as the GDPR, which standardize the way services collect private information, have come into force. This has brought privacy to the attention of every company and increased investment in handling and anonymizing private data.
In partnership with the University of Iceland, SageWrite ehf. is developing a corpus of annotated academic texts. This corpus will be used to train text-generation and text-classification machine learning algorithms. For the corpus to be fully GDPR-compliant, it must not include any personally identifiable information (PII), e.g. names, email addresses or phone numbers. Manually eliminating such information can be extremely time-consuming, which is why the student will be involved in creating a pipeline for anonymizing PII entities that can work on both structured and unstructured data.

The student will:

  • familiarize themselves with existing key literature on data anonymization and GDPR law;
  • manually review a selection of academic articles to identify what PII academic texts generally contain;
  • devise possible NLP strategies to remove this information;
  • be trained to use tools such as spaCy and other relevant NLP anonymization techniques;
  • create a pipeline for extracting and anonymizing academic text;
  • test the developed pipeline, analyze the output and use that information to improve the pipeline.
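
To give a rough idea of what one step of such a pipeline could look like, here is a minimal sketch in Python. It is only an illustration, not the project's actual method: the function name, the placeholder tags and the two regular expressions are assumptions made for this example, and names would most likely need a named-entity recognizer (such as the one in spaCy) rather than patterns.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real academic
# texts would need broader coverage and careful evaluation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Deliberately loose: an optional "+", then at least seven digits, spaces or
# hyphens in total. It can over-match, e.g. year ranges such as "2021-2022".
PHONE_RE = re.compile(r"\+?\d[\d\s-]{5,}\d")


def anonymize(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)   # emails first, so their digits are not read as phones
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(anonymize("Contact jon@example.is or +354 555 1234."))
```

Testing a step like this on real articles and inspecting what it misses or wrongly removes is the kind of analysis that would feed back into improving the pipeline.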

Requirements

  • you are either an Icelandic BA or MA student, or an international BA or MA student who is enrolled at an Icelandic university;
  • you have used Python before;
  • you have some experience with basic NLP tasks;
  • you are interested in natural language processing;
  • you can work during the summer months.

We particularly encourage students enrolled in a Language & Technology/Computational Linguistics program to apply for this position.

Compensation

ISK 423,000 per month, plus food allowance.

This is a temporary position (3 months).

How do I apply?

Send your CV to info@sagewrite.com. In the body of the email, please specify why you are interested in the position and why you think you'd be a good candidate.

We aim to fill the position as soon as possible, so early applications will be given priority.

