
4.4 Framings of Machine Learning

Our analysis explored how different framings of the concept of ML are manifested in the tutorials that we analysed. As a first step, we reviewed the 41 tutorials and searched for definitions or operationalisations of the term machine learning. Surprisingly, only 21 of the 41 tutorials (51%) explained the term machine learning. In half of the tutorials, the authors did not define or operationalise what ML ‘is’.

In those tutorials that defined the term ML, we found two dominant definitions, as well as a long tail of other definitions that were each cited only once. The most widely cited definition is that of Samuel (1959), who defined ML as a ‘field of study that gives computers the ability to learn without being explicitly programmed’ (T4, T3, T9, T27, T30, T39). This definition is also sometimes repeated without explicitly referring to Samuel, describing ML as ‘the kind of programming which gives computers the capability to automatically learn from data without being explicitly programmed’ (T35). The second most commonly cited author is Mitchell (1997) (T4, T3, T10, T27, T39), who defined ML as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

This definition is also used without referring to Mitchell, for instance in Tutorials 15 and 37. In addition to these two dominant definitions, there are a variety of other definitions and descriptions that focus on other aspects. Tutorial 5, for instance, compares ML to pattern recognition, while Tutorials 6 and 8 regard ML as ‘learning based on experience’, without further detailing what learning and experience mean.

Tutorial 13 distinguishes ‘traditional programming’ from ML, arguing that ‘traditional programming’ relies on hard-coded rules, while ML relies on ‘learning patterns based on sample data’. Other definitions are more general, describing ML as ‘a technology design to build intelligent systems’ (T20) or as ‘based on the idea of giving machines access to data and allowing them [the system] to learn for themselves’ (T22). Tutorial 48 defined the term ‘learning’ as ‘figuring out an equation to solve a specific problem based on some example data’.

Tutorial 36 states that ‘instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data’, and that this generic algorithm ‘can tell you something interesting about a set of data without you having to write any custom code specific to the problem’. According to these tutorials, ‘the algorithm/machine builds the logic based on the given data’ (T36), hence ascribing agency to the data. This focus on data is even more explicitly described in Tutorial 39, which defines ML as ‘a part of AI [artificial intelligence] that learns from the data’ (T23). Tutorial 24 defines ML as ‘generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem’. Tutorial 4 describes ML as ‘feel[ing] its way to the answer’ without explaining further what this means. In Tutorial 21, ML is described as ‘the brain where all the learning takes place’, later comparing ML to how humans learn from experience.
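To make Mitchell’s (1997) definition more tangible, the following minimal Python sketch (our own illustration, not taken from any of the tutorials, and using hypothetical toy data and an arbitrarily chosen classifier) maps T, E and P onto a simple spam-classification setting:

```python
# Illustration of Mitchell's T, E, P (hypothetical toy example, not from the tutorials).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# T: the task -- classifying e-mails as spam or not spam
emails = ["win money now", "cheap pills online", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam (made-up labels)

# E: the experience -- labelled example e-mails the program learns from
X = CountVectorizer().fit_transform(emails)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
model = MultinomialNB().fit(X_train, y_train)

# P: the performance measure -- accuracy on e-mails that were not part of E
print(accuracy_score(y_test, model.predict(X_test)))
```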

Types of Machine Learning

In addition to how the term ML is described, defined, and/or operationalised, our analysis also revealed different types of ML that are commonly recognised. The dominant types of ML are supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning is commonly framed as a type of ML that is based on data and labels corresponding to the data (T1, T35, T19, and T40). Tutorial 12 describes supervised learning as learning a general rule that maps inputs to outputs. Other definitions describe the data and its labels as input and target pairs (T37), inputs and desired outputs (T20), or as ‘feedback from the humans’ (T21). One metaphor describes supervised learning as ‘the computer’ being presented with example inputs and desired outputs by a ‘teacher’ (T12). This imagined actor is also sometimes called an ‘external supervisor’ (T33) or ‘the scientist’ (T32).

In other tutorials, supervised learning is framed as working backwards from the solution. Surprisingly, even though supervised ML is only possible if aligned pairs of input and output data are available, the laborious process of data labelling, which has to be done ‘by a human being beforehand’, is only mentioned in Tutorial 35.
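As a minimal illustration of this framing of supervised learning – paired inputs and ‘teacher’-provided labels – the following sketch (our own, using hypothetical data and an arbitrarily chosen classifier) fits a model on labelled examples and predicts a label for a new input:

```python
# Minimal supervised-learning sketch (our illustration, hypothetical data).
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labelled pairs: [height_cm, weight_kg] -> species label
X = [[20, 2.0], [22, 2.4], [45, 8.0], [50, 9.5]]   # inputs
y = ["cat", "cat", "dog", "dog"]                    # desired outputs (the "teacher's" labels)

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(model.predict([[48, 9.0]]))                   # -> ['dog']
```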

Unsupervised learning is another commonly mentioned type of ML that is often contrasted with supervised learning. Tutorial 5, for instance, regards supervised learning as using labelled data, and unsupervised learning as finding patterns in unlabelled data. A variety of tutorials refer to the lack of labels (T12, T20, and T35), which leaves the ‘algorithm [...] on its own to find structure in its input’ (T12). Tutorials frame unsupervised learning as discovering similarities or regularities in the input data. For Tutorial 4, ‘the program is given a bunch of data and must find patterns and relationships therein’. Similarly, Tutorial 8 regards unsupervised learning as a task where it is up to the machine ‘to determine the relationship between the entered data and any other hypothetical data’. Tutorial 19 describes unsupervised learning as an approach to data ‘where you do not have any information about inner interrelations in advance’. This is similar to Tutorial 33, which regards the main task in unsupervised learning as finding ‘the underlying patterns rather than the mapping’. More broadly, Tutorial 37 describes unsupervised learning as characterising the unknown distribution.
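A minimal sketch of this idea (our own illustration, using hypothetical synthetic data and k-means as an arbitrarily chosen algorithm) shows an algorithm grouping unlabelled points purely by the structure of the input:

```python
# Minimal unsupervised-learning sketch (our illustration): no labels are given;
# the algorithm is left "on its own to find structure in its input".
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical unlabelled data: two blobs of points in 2-D.
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print(clusters[:5], clusters[-5:])   # points are grouped purely by similarity
```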

A third type of ML, reinforcement learning (RL), is defined by Tutorial 35 as a computer program that dynamically interacts with its environment while receiving positive and/or negative feedback to improve its performance. The definitions in Tutorials 1, 12, 20 and 33 focus on an agent that is interacting with a dynamic environment while stressing the importance of ‘a certain goal’. Tutorial 22 strongly focuses on the agency of ‘the machine’ which ‘continuously trains itself using trial and error’ in relation to a specific environment. This anthropomorphisation can also be observed in Tutorial 36, which states that when an RL-based system ‘makes a wrong prediction[,] it will update its rule by itself’. Other reinforcement learning definitions (Tutorials 5, 18 and 33) focus on a ‘reward’ that is being optimised by an ‘agent’ based on feedback, highlighting the difference between reinforcement learning and supervised learning.
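A minimal sketch of such trial-and-error learning (our own illustration of a simple two-armed bandit, not drawn from any tutorial) shows an agent improving its estimate of each action’s reward using nothing but numeric feedback from a hypothetical environment:

```python
# Minimal reinforcement-learning sketch (our illustration): an agent improves a
# two-armed bandit policy by trial and error, using only reward feedback.
import random

random.seed(0)
true_win_prob = [0.3, 0.7]     # hypothetical environment: arm 1 pays off more often
value_estimate = [0.0, 0.0]    # the agent's learned estimate of each arm's reward
counts = [0, 0]

for step in range(1000):
    # epsilon-greedy: mostly exploit the best-looking arm, sometimes explore
    arm = random.randrange(2) if random.random() < 0.1 else value_estimate.index(max(value_estimate))
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0   # feedback from the environment
    counts[arm] += 1
    value_estimate[arm] += (reward - value_estimate[arm]) / counts[arm]  # incremental average

print(value_estimate)   # estimates roughly approach the true pay-off probabilities
```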

Data and Machine Learning

Following Samuel’s (1959) definition of ML as learning from data, it is worth considering how data is framed in the different tutorials. Overall, we found little discussion of the nature or significance of data. Those that engage with the term frame data as ‘any unprocessed fact, value, text, sound or picture that is not being interpreted and analyzed’ (T14) or describe data as ‘the new oil’, that is ‘precious but useful only when cleaned and processed’ (T13). Considering the small number of tutorials that engage with the term data, it is surprising that almost half of the tutorials (46%) mentioned data preparation as a topic. Six of the tutorials (15%) apply and explain data preparation techniques.

Regarding data in ML, it is noteworthy that the large majority of the tutorials do not explain that data presented to a model is a sample that may or may not be representative of a population. Only Tutorial 4 stated that ML systems require ‘a statistically significant random sample as training data’. Basic assumptions regarding the class distributions, which are crucial for the successful training of ML (Müller and Guido, 2016) and which can be a great catalyst for fairness problems in ML (Hardt, 2014), are rarely discussed. The practice of randomly shuffling data is also rarely mentioned (T38 and T40). Stratification, the practice of randomly shuffling the training and test sets while ensuring that the class distribution is the same in both, is only mentioned in Tutorial 29. However, it is merely called a good practice that ‘will ensure your training set looks similar to your test set’.
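For illustration (our own sketch, using the well-known Iris dataset rather than data from any tutorial), shuffling and stratification can be expressed in a few lines; the stratify argument keeps the class proportions of the training and test sets aligned:

```python
# Sketch of random shuffling and stratification (our illustration on the Iris dataset).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=42)

print(np.bincount(y_train) / len(y_train))   # class proportions in the training set
print(np.bincount(y_test) / len(y_test))     # ... match those in the test set
```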

Understanding data is mentioned as an important aspect of being a data scientist (T25). Tutorial 41 encourages practitioners to take a ‘peak at the data itself’ by looking at statistics like mean and median as well as class distributions, data visualisations, boxplots, histograms, and scatterplots. Tutorial 28 mentions that it is important for datasets with greater complexity to visualise the distribution of the data ‘in order to gain an understanding of the data’. Furthermore, Tutorial 29 stresses the importance of speaking to domain experts to gain a contextual understanding of the data and its origin.

Tutorial 20 discusses the importance of transforming the data into a form that is ‘useful’ for the ML system. The process of data preparation is framed as making data ‘even more valuable’ (T32). A frequently mentioned data preparation step is normalisation (T15, T16, T17, T26, and T29), which means subtracting the mean and dividing the data by the standard deviation, thus centring the data points at zero with unit variance. Tutorials 13 and 29 mention feature scaling as a similar practice aimed at making all features comparable by putting them on the same scale. Data representation is also discussed for specific application domains such as natural language processing (T31, T37, and T40). Surprisingly, Tutorial 39 is the only tutorial that addresses the issue of missing data and proposes imputation as a solution, i.e. how missing values can and should be replaced.
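These preparation steps are short to express in code. The following sketch (our own illustration on a hypothetical array with one missing value) performs mean imputation followed by the normalisation described above:

```python
# Sketch of mean imputation and z-score normalisation (our illustration, made-up data).
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, np.nan],     # a hypothetical missing value
              [3.0, 400.0]])

# Imputation: replace missing entries with the per-column mean of the observed values.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# Normalisation: centre each feature at zero with unit variance, so that
# features measured on very different scales become comparable.
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # approximately [0, 0] and [1, 1]
```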

Tutorials only rarely discuss the impact that the quality of the data can have on the performance of ML systems. Tutorial 39 addresses this issue very briefly by pointing out that the better the quality of the data, the more suitable it will be for modelling. Tutorial 4 observes that ‘real-world data’ is always ‘a little noisy’, which prevents the model from fitting the data ‘neatly on a straight line’. Lack of data and lack of diversity in the dataset are mentioned as primary challenges of ML in Tutorial 21. The tutorial further elaborates that ‘a machine needs to have heterogeneity to learn meaningful insights’.

Tutorial 20 evokes the notion that ‘hidden patterns’ exist in the data that can be identified by ML systems. Tutorial 24 warns that, when predicting the price of a house, ‘the function’ that an ML system may end up with is ‘totally dumb’. The ML system does not know what ‘square feet’ or ‘bedrooms’ are. According to the tutorial’s authors, a regression model is merely ‘stir[ring] in some amount of those numbers to get the correct answer’. They argue that if a human expert could not use the data to solve the problem manually, ‘a computer probably won’t be able to either’ (T24).
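This ‘stirring in some amount of those numbers’ corresponds, in the simplest case, to the weights of a linear regression. The following sketch (our own illustration with hypothetical listings, not code from Tutorial 24) shows that the model only learns one coefficient per input column, with no notion of what the columns mean:

```python
# Sketch of the "stir in some amount of those numbers" idea (our illustration, made-up data).
from sklearn.linear_model import LinearRegression

# Hypothetical listings: [square_feet, bedrooms] -> price in USD
X = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
y = [200_000, 270_000, 330_000, 405_000]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # the learned "amounts" to stir in
print(model.predict([[1800, 3]]))      # a weighted sum of the inputs plus the intercept
```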

Only one tutorial explicitly addresses the limitations of ML in relation to data. Tutorial 31 argues that the accuracy of the system they are training ‘seems to be a natural limit for this data with its given size’. This crucial concept – that there is a limit to what can be inferred from data – is only brought up here.

Machine Learning Algorithms

In addition to definitions of ML and the types of ML that are distinguished, this paper also explores the specific ML algorithms that are mentioned. Overall, we found that ML algorithms were mentioned in 31 of the 41 tutorials (76%). The most commonly mentioned ML algorithms are support vector machines, which are mentioned in 16 of the tutorials (39%). The second most frequently mentioned algorithms are neural networks, which are mentioned in 14 tutorials (34%).

The third most commonly mentioned ML algorithm is linear regression, mentioned in 13 tutorials (32%). Decision trees and naïve Bayes classifiers are mentioned in 11 tutorials (27%), logistic regression and k-nearest neighbours in 10 (24%). This means that most tutorials are focused on supervised ML models that perform classification or regression. That said, nine tutorials mentioned the unsupervised clustering algorithm k-means, while eight tutorials focused on reinforcement learning. For these algorithms, a long-tail phenomenon can be observed.

Thirty-eight algorithms are only mentioned once. These algorithms span a broad range, including inductive logic programming, Bayesian networks, extreme learning machines, long short-term memory networks, multi-armed bandits, and neural Turing machines.

The most commonly applied algorithm is linear regression, which is applied with a concrete application example in 4 of the 41 tutorials (10%). The second most commonly applied algorithm is logistic regression, which is applied in three tutorials. Neural networks and k-nearest neighbours are applied in two of the tutorials. Surprisingly, support vector machines, the most commonly mentioned algorithm, are applied only once.

Considering how ML algorithms are presented, we find that even in tutorials that are aimed at explaining ML, the underlying algorithms are presented as black boxes. The inner mechanics of the most commonly mentioned algorithms – support vector machines and neural networks – are rarely explained, and when they are, not in depth. Key concepts like gradient descent and backpropagation are not explained either. Few tutorials mention or explain concrete ways to measure the generalisation capabilities of ML-based systems. Metrics such as accuracy, precision, and recall are mentioned but rarely formally defined or applied. Widely used metrics such as Mean Average Precision (mAP) are never mentioned. The large body of work on the interpretability of ML, including visualisation of feature importance, is not mentioned in any of the tutorials that we reviewed.
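For reference, the three metrics named above take only a few lines to define and compute; the following sketch (our own illustration with made-up labels and predictions) shows them for a binary classifier:

```python
# Sketch of accuracy, precision and recall (our illustration, made-up predictions).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]   # hypothetical model predictions

print(accuracy_score(y_true, y_pred))    # (TP + TN) / all  = 6/8
print(precision_score(y_true, y_pred))   # TP / (TP + FP)   = 2/3
print(recall_score(y_true, y_pred))      # TP / (TP + FN)   = 2/3
```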

Expertise in Machine Learning

Considering our interest in agency, we also explored how tutorials framed the required expertise for successfully applying ML. Tutorial 30 argues that

[ML is] a lot like a car, you do not need to know much about how it works in order to get an incredible amount of utility from it.

Tutorial 30 further argues that people can ‘engage in ML very easily without almost any knowledge at all of how it works’, since the default settings of ML libraries can get 90-95% accuracy on many tasks.

However, to ‘push the limits in performance and efficiency’, Tutorial 30 recommends readers ‘to dig in under the hood’.

The sentiment that a thorough understanding of ML is not required can also be found in Tutorial 41, which tells its readers that they ‘do not need to understand everything (at least not right now)’. Tutorial 41 states that ML practitioners do not need to know how a model works, arguing that learning about the benefits and limitations of various algorithms can be done later. The idea that expertise is not needed is especially problematic considering that Tutorial 30 does not discuss model evaluation in the text and only refers to a follow-up tutorial on evaluation. That said, the tutorial argues that it is ‘important to know about the limitations and how to configure ML algorithms’.

Surprisingly, the potential of ML systems to overfit, i.e. to learn parameters that do not generalise beyond the training data, is rarely mentioned (T16 and T31). Only Tutorial 16 mentions the complexity of a model as a possible cause of overfitting. Tutorial 22 addresses the fit between the problem and the ML system, arguing that ‘it is a fact that no one ML model can solve every problem’. The tutorial stresses the importance of aspects such as the structure and size of a dataset in finding the most suitable model for a given problem. They also declare that ‘you can’t say that decision trees is [sic!] always better than neural networks and vice-versa’.
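As a brief illustration of overfitting, and of model complexity as one of its causes (our own sketch on synthetic, deliberately noisy data, not an example from the tutorials), an unconstrained decision tree memorises the training set while a depth-limited tree typically generalises better:

```python
# Sketch of overfitting (our illustration, synthetic noisy data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):   # None = grow the tree until the training data is fit perfectly
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```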

Applications of Machine Learning

Finally, our analysis provides an overview of the different applications of ML and how common they are. Overall, the 41 tutorials mention 160 distinct applications of ML. The most frequently mentioned application of ML is the detection of spam. It is brought up in 11 tutorials, which means that more than a quarter (27%) of the tutorials reference the detection of spam. The second most frequently mentioned application is self-driving cars, which are referred to in every fourth tutorial (10 mentions, 24%). The prediction of housing prices is the third most frequently used example, mentioned by nine tutorials and applied by one. Other applications include face recognition (6 mentions), sentiment analysis (5), and playing the game Go (5). Playing chess (4) and cancer detection (4) are mentioned in 10% of the tutorials.

The applications of ML follow a strong long-tail distribution: 111 applications are mentioned in only one tutorial, while twenty-nine applications are brought up in more than two tutorials. In the long tail of applications only mentioned once, we find examples such as Facebook’s News Feed curating news, an ML system creating art, industrial logistics in general, and a robot learning to fly.

Other applications include preventing jaywalkers, detecting pornography, robots doing backflips, network security anomaly detection, as well as text mining and social media analysis.

Considering how ML applications are presented in the tutorials, it is noteworthy how comparatively few tutorials show how to apply ML in practice. Only eight tutorials explain how to implement an ML system. The applied examples are unique, i.e. no application was implemented in more than one tutorial. The applications include regression problems like the prediction of housing prices or stock prices as well as problems like the classification of handwritten digits, fruits, flowers and the quality of wines. Clustering problems include the sorting of building bricks as well as the clustering of people based on specific attributes.