
Interpretation of Contractual Agreements using Machine Learning

Related Work

3.2 Interpretation of Contractual Agreements using Machine Learning

Over the past years, the Machine Learning (ML) revolution has brought significant advances in Natural Language Processing (NLP). Legal text processing, as a sub-field of NLP, has likewise drawn researchers’ attention [54–57]. Given the abundance of previous ML studies on law and legal texts, in this section we focus on those efforts that specifically address contractual agreements and regulatory documents.

3.2.1 Linear Classification Methods

Linear classifiers are a family of machine learning models that use an object’s features to identify which class it belongs to. They make this decision based on the value of a linear combination of the features. Below, we provide an overview of studies that applied linear classification algorithms to analyze legal documents.
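To make the notion of a decision based on a linear combination of features concrete, the following minimal sketch scores a feature vector as w · x + b and thresholds the result. The feature meanings, weights, and bias are hypothetical values chosen for illustration; they are not taken from any of the cited studies.

```python
from typing import Sequence


def linear_score(weights: Sequence[float], features: Sequence[float], bias: float = 0.0) -> float:
    """Return the linear combination w . x + b."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights: Sequence[float], features: Sequence[float], bias: float = 0.0) -> int:
    """Binary decision: class 1 if the score is positive, otherwise class 0."""
    return 1 if linear_score(weights, features, bias) > 0 else 0


# Hypothetical bag-of-words features: counts of ("must", "may", "free of charge").
# The weights are made up; a trained model (e.g. a linear SVM) would learn them.
weights = [1.5, -2.0, 0.5]
bias = -0.25

print(predict(weights, [1, 0, 0], bias))  # score 1.25 -> class 1
print(predict(weights, [0, 1, 0], bias))  # score -2.25 -> class 0
```

A linear SVM, as used in several of the studies below, learns the weights and bias from labeled data but makes its prediction in this same form.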

The NLL2RDF framework exploits machine learning and Support Vector Machine (SVM) to generate RDF expressions of license agreements [58], targeting open linked data as its primary use case. The authors used the ODRL and CC REL vocabularies to manually annotate the dataset and build a gold standard.

Similarly, NLL2RDF is primarily concerned with Permissions (derive, reproduce, modify, copy, sell), Prohibitions (commercialize), and Duties (shareAlike, attachPolicy, attribute). However, the framework covers only a limited number of rights and conditions. Furthermore, although the dataset covers 37 licenses, the most frequent class has only 28 occurrences.

This low number might be related to their training data: after going through their publicly available dataset (footnote 8), we found only a small number of annotations in the 4-5 page licenses.

A prominent group in privacy policy analysis is the Usable Privacy Policy Project (footnote 9). They provided OPP-115, the first comprehensive dataset with fine-grained annotations on the paragraph level [59]. The project aims to extract important information for the benefit of regular and expert end-users. To do so, a corpus containing 115 privacy policies from 115 US companies was annotated on the paragraph level by experts (10 experts in total, three per document). Along with the creation of the dataset, the authors built different ML models for the prediction of high-level categories. The gold standard for evaluating the methods was compiled by majority vote: if two or more experts agreed on a category, it was included in the final gold standard. The best reported micro-average F1, 66%, was achieved with a Support Vector Machine (SVM).
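The majority-vote rule and the micro-averaged F1 mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of [59]: the function names are hypothetical, and the labels "A", "B", "C" are placeholders for high-level categories.

```python
from collections import Counter
from typing import List, Set


def majority_gold(annotations: List[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep a category if at least `min_votes` annotators assigned it."""
    votes = Counter(label for labels in annotations for label in labels)
    return {label for label, count in votes.items() if count >= min_votes}


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all paragraphs and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold standard
        fp += len(p - g)   # labels predicted but not in the gold standard
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Three annotators label one paragraph; only "A" reaches two votes.
print(majority_gold([{"A", "B"}, {"A"}, {"C"}]))  # {'A'}

# Two paragraphs: pooled TP=2, FP=1, FN=1, so F1 = 4 / 6.
gold = [{"A"}, {"B", "C"}]
pred = [{"A", "B"}, {"B"}]
print(round(micro_f1(gold, pred), 4))  # 0.6667
```

Because micro-averaging pools counts over all labels, frequent categories dominate the score, which is one reason micro- and macro-averages can differ noticeably on imbalanced corpora.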

A few approaches trained supervised ML models to measure the completeness of privacy policies [60,61]. The training dataset is labeled with a set of pre-defined categories based on privacy regulations and guidelines, and the trained model predicts a category for an unseen paragraph. Once again, none of the corpora were created with the full support of experts, which is an essential prerequisite in legal text processing.

8 http://www.airpedia.org/nll2rdf/dataset-licenses/

9 https://usableprivacy.org/


(a) The user control score, based on 10 privacy concerns. (b) The GDPR compliance score.

Figure 3.4: Evaluation of the ResearchGate privacy policy using the PrivacyCheck Chrome extension.

PrivacyCheck is an approach for automatic summarization of privacy policies using data mining [62]. It answers ten pre-defined questions concerning the privacy and security of users’ data and is also available as a Chrome browser extension. Figure 3.4 illustrates the automatic analysis of the PrivacyCheck Chrome plugin, applied to the ResearchGate privacy policy. To train the model, a corpus containing 400 privacy policies was compiled, and seven privacy experts manually assigned risk levels (Green, Yellow, Red) to the ten factors. First, a pre-processing step finds those paragraphs that contain at least one keyword related to one of the ten factors. The keywords were selected largely manually. Then, the selected paragraphs are sent to a data mining server hosting 11 trained data mining models: one for checking whether the corresponding page is a privacy policy and one for each of the ten questions. The authors claim that, on average, PrivacyCheck finds the correct risk level 60% of the time. A limitation of PrivacyCheck is that no Inter-Annotator Agreement (IAA) is reported for the annotators. According to the paper, quality control was performed by assigning each policy to two team members. However, only 15% of the privacy policies were compared and had their discrepancies resolved, which makes the training dataset less reliable.
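The keyword-based pre-processing step described above can be sketched as a simple filter that keeps, per factor, the paragraphs containing at least one of that factor's keywords. The factor names and keyword lists below are hypothetical placeholders; the actual keywords used by PrivacyCheck are not given here, and its implementation may differ.

```python
import re
from typing import Dict, List, Set

# Hypothetical factors and keywords, for illustration only.
FACTOR_KEYWORDS: Dict[str, Set[str]] = {
    "email": {"email", "e-mail"},
    "location": {"location", "gps"},
}


def select_paragraphs(
    paragraphs: List[str], factor_keywords: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each factor to the paragraphs that mention one of its keywords."""
    selected: Dict[str, List[str]] = {factor: [] for factor in factor_keywords}
    for paragraph in paragraphs:
        # Lowercase and tokenize; hyphens are kept so "e-mail" stays one token.
        tokens = set(re.findall(r"[a-z0-9-]+", paragraph.lower()))
        for factor, keywords in factor_keywords.items():
            if tokens & keywords:
                selected[factor].append(paragraph)
    return selected


policy = [
    "We collect your e-mail address.",
    "Cookies are used for analytics.",
    "We may record GPS location data.",
]
result = select_paragraphs(policy, FACTOR_KEYWORDS)
print(result["email"])     # ['We collect your e-mail address.']
print(result["location"])  # ['We may record GPS location data.']
```

Only the selected paragraphs would then be passed on to the per-question classifiers, so the quality of a manually curated keyword list directly bounds what those models can see.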

PrivacyGuide is another summarization tool, inspired by the GDPR, that classifies a privacy policy into 11 categories using NLP and machine learning and further measures the associated risk level of each class [63]. Figure 3.5 shows a snapshot of the PrivacyGuide automatic summarization applied to the Unilever privacy policy. The red icons illustrate a high-risk prediction for the corresponding category. Similar to previous studies, PrivacyGuide uses a three-level risk scale for classification (i.e., Green, Yellow, Red). The 11 criteria and their associated risk levels were defined by GDPR experts. Based on these criteria, a privacy corpus was compiled with the help of 35 university students. Each participant assigned a privacy category to text snippets and classified them with a risk level. The authors report a weighted average accuracy of 74% for classifying a privacy policy into one of the 11 classes and an accuracy of 90% for risk level detection. Although the results are encouraging, the dataset was not annotated by experts, which is a fundamental criterion in legal text analysis.

(a) The welcome screen. (b) After applying a sample policy.

Figure 3.5: PrivacyGuide snapshots.

3.2.2 Deep Neural Networks

As introduced in Section 2.4, deep neural networks are artificial neural networks with multiple layers between the input and output layers. Their non-linear activation layers allow them to model non-linear functions. Because deep learning is widely used for legal text analysis, this part presents a brief literature review.

Leveraging Recurrent Neural Networks (RNNs), [64] extracts obligations and prohibitions from contracts. The goal of this study is to help law firms and legal departments automatically identify sentences (or clauses) specifying obligations and prohibitions in order to monitor the compliance of each party.

The gold standard was compiled from the main bodies (excluding introductions, covers, recitals) of 100 randomly selected English service agreements. The NLTK splitter (footnote 10) was applied to the 100 documents, and 31 545 training, 8 036 development, and 5 563 test sentences/clauses were identified. Five law students were selected to manually annotate the sentences/clauses with five pre-defined classes: Obligation, Prohibition, Obligation List Intro, Obligation List Item, and Prohibition List Item. The results show that the best performance is achieved with a hierarchical BiLSTM classifier, which produces an embedding vector for each sentence and then predicts a class from the sentence embeddings. Despite the promising results, the major limitation of this study is having only one expert opinion per class, which makes the dataset unreliable.

Neill et al. [65] employed Convolutional and Recurrent Neural Networks to classify deontic modalities in regulatory documents. The annotations were carried out by Subject Matter Experts (SMEs) using the General Architecture for Text Engineering [66]. The final training set consists of 1 297 SME-annotated sentences, including 596 obligations, 94 prohibitions, and 607 permissions. Furthermore, the test set consists of held-out documents from sub-domains of the financial regulations (e.g., Anti-Money Laundering, EU Markets in Financial Instruments Directive, Consolidated Accounts and Markets in Financial Instruments Regulation, etc.) and includes 312 obligations, 248 permissions, and 62 prohibitions.

10 http://www.nltk.org/

According to the paper, the inter-annotator agreement between the two SMEs is 0.74, with only a few disagreements. The results demonstrate that the NN model that combines domain-specific legal distributional semantic model (DSM) representations with a general DSM representation (Google News) achieved the best performance. Although this study inspired us to combine domain-specific embeddings with general ones, the presented dataset was not useful for the contractual agreements domain.
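Inter-annotator agreement between two annotators is commonly quantified with a chance-corrected statistic. The text above does not state which statistic underlies the reported 0.74, so the sketch below assumes Cohen's kappa purely for illustration; the label sequences are invented and the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1 - expected)


# Invented labels for six sentences: O = obligation, P = permission, F = prohibition.
sme_1 = ["O", "O", "P", "P", "F", "O"]
sme_2 = ["O", "O", "P", "O", "F", "O"]
print(round(cohens_kappa(sme_1, sme_2), 3))  # 0.714
```

In this toy example the annotators agree on five of six sentences (raw agreement 0.83), yet kappa is lower because part of that agreement is expected by chance.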

Leveraging OPP-115 and deep learning, Polisis extracts segments from privacy policies and presents them to users in a visualized format [67]. According to the paper, a union-based gold standard was used for the experiments, i.e., all experts’ annotations were included in the gold standard (as opposed to majority votes). Out of 115 privacy policies, 65 were used for training, and 50 were kept for the test set. The authors claim that a successful multi-label classifier should predict not only the presence of a label but also its absence (footnote 11). They report only macro-averages and further average F1 and F1-absence, which yields 81% on the test set. Despite the encouraging work done in Polisis, we believe that the paper lacks two fundamental elements: no validation set is involved in the training phase, and no micro-averages are reported.
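The following sketch illustrates a union-based gold standard and the averaging of F1 with F1-absence for a single label. It reflects our reading of the description above, not the Polisis code: F1-absence is assumed to be the F1 obtained when "label absent" is treated as the positive class, and all function names and example values are hypothetical.

```python
from typing import List, Set, Tuple


def union_gold(annotations: List[Set[str]]) -> Set[str]:
    """Union-based gold standard: keep every label assigned by any annotator."""
    return set().union(*annotations)


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def presence_absence_f1(gold: List[bool], pred: List[bool]) -> Tuple[float, float]:
    """Return (F1 for label presence, F1 for label absence) for one label."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    tn = sum(not g and not p for g, p in zip(gold, pred))
    # For absence, the roles flip: true negatives become true positives,
    # missed labels (fn) become false "absent" predictions, and vice versa.
    return f1(tp, fp, fn), f1(tn, fn, fp)


print(sorted(union_gold([{"A", "B"}, {"A"}, {"C"}])))  # ['A', 'B', 'C']

# One label over five segments: TP=1, FP=1, FN=1, TN=2.
gold = [True, True, False, False, False]
pred = [True, False, False, False, True]
f1_presence, f1_absence = presence_absence_f1(gold, pred)
print(round((f1_presence + f1_absence) / 2, 4))  # 0.5833
```

A macro-average would repeat this computation for every label and take the unweighted mean, so rare and frequent labels contribute equally to the final score.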

It is worth mentioning that none of the above studies on privacy policy classification provided their dataset splits, and therefore there is no standardized benchmark for this task. As a result, in Section 5, we first show how we successfully reproduce the Polisis results (though with different data splits) and then present two transformer models that significantly outperform Polisis.