<div>&lt;span id="einführung-in-computational-intelligence-und-ai"&gt;&lt;/span&gt;</div><div>= Einführung in Computational Intelligence und AI =</div><div><br></div><div>Die Idee hinter der künstlichen Intelligenz, KI (Artificial Intelligence - AI), bestand darin, die Prozesse der menschlichen Wahrnehmung wie Denken, Lernen oder Mustererkennung nachzubilden. Das Erscheinen der KI in der akademischen Welt begann mit der Simulation des Verhaltens neuronaler Netze bei IBM im Jahr 1955. Dies führte zu einer Konferenz zu diesem Thema, die heute als Dartmouth-Konferenz bekannt ist und als Geburtsstunde der künstlichen Intelligenz gilt.</div><div><br></div><div>Die Geschichte der KI als akademische Disziplin hat einen langen, weitreichenden und abenteuerlichen Weg zurückgelegt, der optimistische und zweifelhafte, verrufene Phasen durchquerte und durch viele wissenschaftliche Bereiche führte. Es begann mit dem Studium der ,,formalen“Argumentation, das unter anderem zu Alan Turings Berechnungstheorie oder zur Programmiersprache Lisp sowie zum kommerziellen Erfolg von Expertensystemen in den frühen 1980er Jahren führte. Es folgte eine eher zweifelhafte Zeitraum, während andere Ansätze wuchsen, wie z.B. Bewegen von technischen Maschinen.</div><div><br></div><div>Mit der rasanten Entwicklung der Rechengeschwindigkeit und -fähigkeiten der Computer entstanden spätestens in den 1990er Jahren rechenintensive Zweige der KI. Computational Intelligence umfasst die biologisch motivierten Bereiche der Informationsverarbeitung, wie z.B. evolutionäre Algorithmen. Maschinelles Lernen ist ein umfassender Zweig der KI, der viele rechenintensive Teilbereiche umfasst. Eine davon ist die auf statistischen Ansätzen basierende Mustererkennung, die Ende der 1990er Jahre zu einem großen kommerziellen Erfolg bei Befehls- und Kontrollsystemen (command&amp;control systems), Spracherkennungssystemen mit großem Vokabular (Large Vocabulary Speech Recognition Systems - LVSRS) und ihren Anwendungen in Callcentern und in der Radiologie führte. Diese Ansätze haben sich in den 2000er Jahren abgeflacht, als sie die Sättigung des angewandten statistischen Ansatzes erreichten.</div><div><br></div><div>Mittlerweile führte die Entwicklung von Ideen zur Wissensdarstellung zu fortgeschritteneren Ansätzen wie Ontologie oder Wissensgraphen (Knowledge Graphs - KG).</div><div><br></div><div>Die wichtigste Entwicklung war jedoch die Wiederbelebung der neuronalen Netzwerkforschung, die zu vielen erfolgreichen Anwendungen führte, wie z.B. Erkennung handgeschriebener Ziffern. Der eigentliche Durchbruch bei der Verwendung neuronaler Netze kam mit Deep Learning, das erfolgreich auf die Klassifizierung großer Datenmengen angewendet werden kann, wie z.B. Bilder. Der Erfolg von Deep Learning führte zu einem enormen Anstieg des Interesses an und der Finanzierung von KI. Deep Reinforcement Learning ermöglicht die erfolgreiche Umsetzung automatischer Steuerungsaufgaben.</div><div><br></div><div>Die jüngste Entwicklung im Bereich der KI ist die Anwendung spezifischer großer Sprachmodelle (Large Language Model - LLM) und fortschrittlicher Darstellungstechniken in prädiktiven Transformatoren, die zu Produkten führen, wie z. B. ChatGPT 3.5.</div><div><br></div><div>Im Wesentlichen soll KI die unterschiedlichen Fähigkeiten der menschlichen Intelligenz umsetzen. Dazu gehören Lernen, Wahrnehmung, logisches Denken, Abstraktion sowie komplexere Fähigkeiten wie z.B. Zusammenfassung, Kontrolle, Aufgabenlösungsfähigkeiten und vieles mehr. 
One of the long-term goals is the creation of a general intelligence, the so-called Artificial General Intelligence (AGI), which could solve a wide variety of problems in a way similar to human intelligence.

Raymond Kurzweil, a pioneer of, among other things, Optical Character Recognition (OCR), speech synthesis and speech recognition, and a visionary of artificial intelligence, has predicted that (simply put) by 2029 at the latest AI will be smarter than humans. Critics say that he is probably right about the direction of future technological development, but that the predicted time span is wrong; he is too optimistic.

AI is an inevitable future direction of development and one of the core disciplines of the scientific field of Data Science. Note that in a business context Data Science is also known as Business Analytics.

== Foundations of Computational Intelligence and AI ==

=== Mathematical foundations of AI ===

There is no unified definition of AI. One possible expressive definition can be given as follows: AI is the field of study in computer science that develops and studies intelligent machines. However, AI also refers to the "intelligent machine" itself.

The goals of AI are to implement a subset of the possible kinds of intelligent human activities. These include reasoning, knowledge representation, planning, learning, natural language processing, perception and support for robotics.

In the following we give a brief mathematical description of some selected fundamental concepts of AI. These include

# Rule-based methods,
# Knowledge representation,
# Linear regression with the MSE framework,
# Classification - Bayesian decision theory,
# Classification - linear discriminant functions,
# Learning from examples,
# Learning by control - MDP.

'''''Rule-based methods'''''

Logical reasoning aims at the human ability to reason step by step and/or to draw logical conclusions. This is realized by rule-based methods. They provide a deterministic method for classification in problems in which the classes can be determined from relations between entities that represent properties of some entities. Such entity relations are usually represented by the broad class of if-then rules.
A simple if-then rule is:<br />
Eine einfache if-then rule lautet:&lt;br /&gt;</div><div>IF EinerDerFünfGrößtenOrteÖsterreichs(x) THEN Stadt(x)&lt;br /&gt;</div><div>was natürlich bedeutet, dass wenn ein Objekt x der Eigenschaft hat, einer der fünf größten Orte von Österreichs zu sein, dann ist x eine Stadt.</div><div><br></div><div>If-then rules können auch logische Verknüpfungen einbeziehen, wie es im folgenden Beispiel gezeigt wird:&lt;br /&gt;</div><div>IF Fahrzeug(x) AND AufSchienenFahrendes(x) THEN Bahnfahrzeug(x)&lt;br /&gt;</div><div>Darüber hinaus sind in if-then rules auch Funktionen erlaubt, die numerische Werte zurückgeben. Dies wird wie folgt veranschaulicht:&lt;br /&gt;</div><div>IF Female(x) AND (Age(x) &lt; 16) THEN Girl(x),&lt;br /&gt;</div><div>wobei der Age(x) ist eine Funktion, die das numerische Alter in Jahren zurückgibt, und der Ausdruck (Age(x) &lt; 16) gibt entweder logisch ,,True“ oder ,,False“ zurück.</div><div><br></div><div>Eine predicate (Aussage), wie z.B. Vögel(x), HatFlügel(x) and HatKörperbedeckungAusFeder(x) and HatSchnabel(x), ist ein Test, der entweder logisch ,,True“ oder ,,False“ zurückgibt. Es gibt zwei Haupttypen von if-then rules, die propositional (variablenfrei Aussage) and first-order Regeln. Die bisherige Beispiele sind alle first-order Regel, da alle sich auf eine Variable x beziehen. Ein Beispiel für propositional if-then rules ist&lt;br /&gt;</div><div>IF EineDerFünfGrößtenOrteVonÖsterreich (Wien) THEN Stadt(Wien),&lt;br /&gt;</div><div>da diese Regel eine Aussage nur für die logische Konstante (=bestimmte atomare Instanz) Wien beschreibt.</div><div><br></div><div>Die first-order if-then rules sind leistungsstarke logische Regeln, was durch folgendes Beispiel verdeutlicht wird:&lt;br /&gt;</div><div>IF Parent(x,y) AND Parent(y,z) THEN Grandchild(z,x)</div><div><br></div><div>If-then rules können auf algorithmischem Wege erlernt werden. Der Sequential Covering Learning Algorithmus basiert auf dem Lernen aus Trainingsdaten, die aus einer Menge positiver und negativer Aussagen (Prädikate) bestehen. Der Algorithmus erstellt Regeln, die die Menge der positiven Prädikate abdecken, aber keines der negativen Prädikate. Das Grundprinzip des Algorithmus besteht darin, jeweils eine Regel zu lernen und diesen Vorgang zu wiederholen um schrittweise alle positiven Prädikate abzudecken. Daher heißt der &quot;Sequential Covering&quot;. Entsprechend umfasst der Algorithmus drei Schritte: Lernen einer einzelnen Regel um am meisten positive Prädikate abzudecken (Learn-One-Rule), Löschen der dadurch abgedeckte Prädikate und Iterieren. Das Ergebnis ist ein disjunktes Regelwerk, das die gesamte Menge der positiven Prädikate abdeckt.</div><div><br></div><div>'''Beispiel .'''</div><div><br></div><div>Als Trainingsdaten werden die folgende Aussagen über Möbelwaren angegeben:&lt;br /&gt;</div><div><br></div><div><br></div><div>&lt;div class="center"&gt;</div><div><br></div><div>&lt;div id="tab:Td_Aussagen_über_Möbelwaren"&gt;</div><div><br></div><div>{| class="wikitable"</div><div>|+ Trainingsdaten - Aussagen über Möbelwaren</div><div>|-</div><div>! style="text-align: left;"| Aussage-ID&nbsp;&nbsp;</div><div>! style="text-align: center;"|&nbsp; &nbsp;Größe&nbsp;&nbsp;</div><div>! style="text-align: center;"|&nbsp; &nbsp;Möbelhaus&nbsp;&nbsp;</div><div>! 
style="text-align: center;"|&nbsp; &nbsp;Teuer</div><div>|-</div><div>| style="text-align: left;"| 1&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;groß&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;Möbelix&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;False</div><div>|-</div><div>| style="text-align: left;"| 2&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;klein&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;Möbelix&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;False</div><div>|-</div><div>| style="text-align: left;"| 3&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;groß&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;Kika&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;True</div><div>|-</div><div>| style="text-align: left;"| 4&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;klein&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;Kika&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;False</div><div>|-</div><div>| style="text-align: left;"| 5&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;groß&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;XXXLutz&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;True</div><div>|-</div><div>| style="text-align: left;"| 6&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;klein&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;XXXLutz&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;True</div><div>|}</div><div><br></div><div><br></div><div>&lt;/div&gt;</div><div><br></div><div>&lt;/div&gt;</div><div>Die Aussagen 3,5 und 6 bilden die Menge der positiven Prädikaten, während die anderen (1,2 und 4) der Menge der negativen Prädikaten gehören. Zuerst werden mittels Learn-One-Rule die beste Hypothese bestimmt, d.h. die Regel die am meisten positive Prädikate abdeckt:&lt;br /&gt;</div><div>IF Möbelhaus(x,XXXLutz) THEN Teuer(x),&lt;br /&gt;</div><div>wobei x die Variable für Möbelwaren ist. Dann werden die Prädikaten 5,6 aus dem Trainingsdaten gelöscht. Wendet man Learn-One-Rule noch einmal an, bekommt man die Regel&lt;br /&gt;</div><div>IF Größe(x,groß) AND Möbelhaus(x,Kika) THEN Teuer(x)&lt;br /&gt;</div><div>Damit alle positiven Prädikate wurde abgedekt, also endet der Algorithmus hier. Beachten Sie, dass die beiden gefundenen Regeln disjunkt sind, was aus dem Verlauf des Algorithmus folgt.</div><div><br></div><div>Die endgültige Regel lautet wie folgt:&lt;br /&gt;</div><div>IF Möbelhaus(x,XXXLutz) (deckt die Beispiele 5,6)&lt;br /&gt;</div><div>OR (Größe(x,groß) AND Möbelhaus(x,Kika)) (deckt das Beispiel 3)&lt;br /&gt;</div><div>Then Teuer(x)</div><div><br></div><div>Die Visualisierung des Sequential Covering Learning Algorithmus wird in Abbildung 1.1 gezeigt.</div><div><br></div><div>Abbildung 1.1 Visualisierung des Sequential Covering Learning Algorithmus</div><div><br></div><div>Das vorige Beispiel zeigt den Aufbau eines Regels mittels Sequential Covering Learning Algorithmus für Binäre Klassifizierung, welche die mögliche Prädikate in eine von zwei Klassen eingeteilt werden: Teuer(x) = True oder Teuer(x) = False. Die Regelbasierte Methoden können für Klassifizierung nicht nur in zwei sondern in mehrere Klassen angewendet werden. 
In the following, the sequential covering learning algorithm is shown for the general case of learning rules for classification into a finite number of classes.

----
'''Algorithm''' Sequential Covering Algorithm
----
Input:<br />
- the set T of training data, consisting of positive and negative predicates<br />
- the ordered set of classes <math>C=\{c_1,\ldots,c_k\}</math><br />
Output: an individual rule list <math>R_i</math> for each class <math>c_i</math>, <math>i=1,\ldots,k</math>,<br />
in the form of a disjunctively (OR) connected list of the found rules,<br />
which realize the classification by trying the rule lists <math>R_i</math><br />
in the order <math>i=1,\ldots,k</math> (the remaining class <math>c_k</math> acts as the default).<br />
----
1 Initialize the rule lists <math>R_i = \{\}</math>, <math>i=1,\ldots,k</math><br />
2 for i = 1:(k-1)<br />
3&nbsp;&nbsp; while not every positive predicate of class <math>c_i</math> is covered<br />
4&nbsp;&nbsp;&nbsp;&nbsp; <math>r</math> ← Learn-One-Rule<br />
5&nbsp;&nbsp;&nbsp;&nbsp; remove the predicates covered by <math>r</math> from the training data<br />
6&nbsp;&nbsp;&nbsp;&nbsp; add <math>r</math> to the end of the rule list <math>R_i</math><br />
7&nbsp;&nbsp; end<br />
8 end<br />
----

In this way possibly complex if-then rules are learned. This greedy algorithm is not necessarily optimal, i.e. it does not necessarily deliver the most compact rules. Therefore the resulting rules can be further simplified in a post-processing step by applying logical reformulation rules. The Learn-One-Rule step can be realized in different ways; one effective approach is beam search. One can speak of a family of sequential covering algorithms, since modified and extended variants of the original algorithm exist. Typical variants are the Ripper, Foil or CN2 algorithms. If-then rules can also be learned in other ways, e.g. by means of decision trees.
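The greedy core of the algorithm can be sketched in a few lines of Python. The sketch below is only an illustration under simplifying assumptions (attribute-value data, binary classification, a simple greedy Learn-One-Rule instead of beam search); applied to the furniture training data it reproduces the two rules found above.

<syntaxhighlight lang="python">
# Minimal, illustrative sequential covering for binary classification over
# attribute-value data. A rule is a dict of attribute -> required value.

def covers(rule, example):
    """True if the example satisfies every test of the rule (the empty rule covers all)."""
    return all(example[attr] == val for attr, val in rule.items())

def learn_one_rule(positives, negatives, attributes):
    """Greedily add attribute=value tests until no negative example is covered."""
    rule, pos, neg = {}, list(positives), list(negatives)
    while neg:                                   # still covering negatives -> specialize
        best_cand, best_key = None, None
        for attr in attributes:
            if attr in rule:
                continue
            for val in {e[attr] for e in pos}:
                cand = dict(rule, **{attr: val})
                p = sum(covers(cand, e) for e in pos)
                n = sum(covers(cand, e) for e in neg)
                key = (n, -p)                    # few covered negatives first, then many positives
                if p > 0 and (best_key is None or key < best_key):
                    best_cand, best_key = cand, key
        if best_cand is None:
            break
        rule = best_cand
        pos = [e for e in pos if covers(rule, e)]
        neg = [e for e in neg if covers(rule, e)]
    return rule

def sequential_covering(examples, attributes, target="Teuer"):
    """Learn a disjunctive list of rules covering all positive examples."""
    positives = [e for e in examples if e[target]]
    negatives = [e for e in examples if not e[target]]
    rules = []
    while positives:
        rule = learn_one_rule(positives, negatives, attributes)
        if not rule:
            break
        rules.append(rule)
        positives = [e for e in positives if not covers(rule, e)]
    return rules

data = [
    {"Größe": "groß",  "Möbelhaus": "Möbelix", "Teuer": False},
    {"Größe": "klein", "Möbelhaus": "Möbelix", "Teuer": False},
    {"Größe": "groß",  "Möbelhaus": "Kika",    "Teuer": True},
    {"Größe": "klein", "Möbelhaus": "Kika",    "Teuer": False},
    {"Größe": "groß",  "Möbelhaus": "XXXLutz", "Teuer": True},
    {"Größe": "klein", "Möbelhaus": "XXXLutz", "Teuer": True},
]
print(sequential_covering(data, ["Größe", "Möbelhaus"]))
# [{'Möbelhaus': 'XXXLutz'}, {'Größe': 'groß', 'Möbelhaus': 'Kika'}]
</syntaxhighlight>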
Rules have the advantage that they are easy to interpret and that they can be implemented in relational databases. However, the selection of the predicates for the training data strongly depends on the task and is in practice a more difficult problem than learning the rules. Rule-based methods are an integral part of expert systems in AI.

'''''Knowledge representation'''''

Representing knowledge on the word level can be realized by a probabilistic model of natural language, which is called a Language Model (LM). Usually an LM is used to estimate the probability of a given word sequence <math>w_1,w_2,\ldots,w_n</math>, i.e. <math>P(w_1,w_2,\ldots,w_n)</math>, which can be given as a product of conditional probabilities as<br />
<math>P(w_1,w_2,\ldots,w_n) = P(w_1)P(w_2|w_1)\cdots P(w_n|w_{n-1},\ldots,w_2,w_1)</math>.

In the N-gram LM, for practical reasons, the word sequence probability <math>P(w_1,w_2,\ldots,w_n)</math> is approximated as<br />
<math>P(w_1,w_2,\ldots,w_n) = \prod_{i=1}^{n} P(w_i|w_{i-1},w_{i-2},\ldots,w_{i-N+1})</math>,<br />
where the words with index smaller than 1 must be omitted from the probabilities. The conditional probabilities <math>P(w_i|w_{i-1},w_{i-2},\ldots,w_{i-N+1})</math> can be estimated from the relative frequencies of the N-grams and (N-1)-grams as<br />
<math>P(w_i|w_{i-1},w_{i-2},\ldots,w_{i-N+1}) = \frac{c(w_{i},w_{i-1},\ldots,w_{i-N+1})}{c(w_{i-1},w_{i-2},\ldots,w_{i-N+1})}</math>,<br />
where <math>c(x)</math> stands for the number of occurrences of the string x.

The probabilities <math>P(w_i)</math> for <math>N=1</math>, <math>P(w_i,w_j)</math> for <math>N=2</math> and <math>P(w_i,w_j,w_k)</math> for <math>N=3</math> are called unigram, bigram and trigram probabilities, respectively. The N-gram LM is a purely Statistical LM (SLM), since it is trained on N-gram statistics of the training corpora. For the estimates of the conditional probabilities <math>P(w_i|w_{i-1},w_{i-2},\ldots,w_{i-N+1})</math> to be reliable, the corresponding N-grams and (N-1)-grams must be present in the training corpora. This implies that the training corpus must be large enough. The SLM cannot assign any probability to N-grams which do not appear in the training corpus. This problem is usually solved by applying smoothing techniques, in which a small part of the probability mass is shifted to the unseen grams.
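As a small illustration (a sketch with an assumed toy corpus, not part of the original material), the relative-frequency estimate of the bigram probabilities can be computed as follows:

<syntaxhighlight lang="python">
# Illustrative sketch: maximum-likelihood bigram probabilities
# P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}), estimated from a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus[:-1])             # counts of the history word w_{i-1}
bigram_counts = Counter(zip(corpus, corpus[1:]))  # counts of the pairs (w_{i-1}, w_i)

def p_bigram(prev: str, word: str) -> float:
    """Relative-frequency estimate of P(word | prev); 0.0 for unseen bigrams."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("the", "cat"))   # 2/3: "the" is followed twice by "cat", once by "mat"
print(p_bigram("the", "dog"))   # 0.0: unseen bigram, which motivates smoothing
</syntaxhighlight>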
The most effective smoothing method is Kneser–Ney smoothing, in which a fixed value is subtracted from the conditional probabilities of the observed N-grams and the probability mass gained in this way is spread over the conditional probabilities of the non-appearing N-grams.
The equation for the conditional bigram probabilities is given by<br />
<math>P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_i)-\delta,0)}{\sum_{w_j} c(w_{i-1},w_j)} + \lambda_{w_{i-1}}P_{KN}(w_i)</math><br />
and<br />
<math>P_{KN}(w_i) = \frac{|\{w_j: 0 < c(w_j,w_i)\}|}{|\{(w_j,w_k): 0 < c(w_j,w_k)\}|}</math>.<br />
Here <math>\delta</math> is the constant discount value to be subtracted and <math>\lambda_{w_{i-1}}</math> is a normalization constant, which is set such that the sum of <math>P_{KN}(w_i|w_{i-1})</math> over all <math>w_i</math> equals one. The conditional probabilities of the bigrams that do not appear in the corpus are determined by the unusual probability term <math>P_{KN}(w_i)</math>, which realizes an estimate of the conditional probability of the unseen bigram from the other bigram counts related to <math>w_i</math>.
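The following sketch (an illustration only; the corpus and the discount <math>\delta</math> are assumed values) implements the two equations above for bigrams:

<syntaxhighlight lang="python">
# Illustrative sketch of interpolated Kneser-Ney smoothing for bigrams,
# following the two equations above; corpus and discount delta are assumptions.
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()
delta = 0.75

bigram_counts = Counter(zip(corpus, corpus[1:]))

# Continuation probability P_KN(w): number of distinct left contexts of w,
# normalized by the number of distinct bigram types.
num_bigram_types = len(bigram_counts)

def p_continuation(word: str) -> float:
    left_contexts = {prev for (prev, nxt) in bigram_counts if nxt == word}
    return len(left_contexts) / num_bigram_types

def p_kneser_ney(prev: str, word: str) -> float:
    """P_KN(word|prev) = max(c(prev,word) - delta, 0) / c(prev,.) + lambda(prev) * P_KN(word)."""
    total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    if total == 0:
        return p_continuation(word)                  # unseen history: back off completely
    discounted = max(bigram_counts[(prev, word)] - delta, 0.0) / total
    distinct_followers = len({nxt for (p, nxt) in bigram_counts if p == prev})
    lam = delta * distinct_followers / total         # makes the probabilities sum to one
    return discounted + lam * p_continuation(word)

print(p_kneser_ney("the", "cat"))   # seen bigram: discounted relative frequency plus back-off
print(p_kneser_ney("the", "on"))    # unseen bigram: positive probability from the back-off term
</syntaxhighlight>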
Besides LMs, other ways of knowledge representation are

* Semantic network - a graph representing semantic relations between concepts. The concepts are represented by the nodes and the edges specify the semantic relations. One reason to design semantic networks is to represent knowledge in machine-readable form. Knowledge graphs are semantic networks with a limited set of semantic relations.
* Ontology - a hierarchical representation of concepts and their relationships, which can be managed by a standard ontology language such as OWL or RDF.

'''''Linear regression with the MSE framework'''''

The task of linear regression is to give a linear prediction for the random scalar output <math>y \in \mathbb{R}</math> from the random input vector <math>{\bf x}</math>. Let <math>\hat{y}</math> be the predicted value of <math>y</math>. It is assumed that <math>y</math> depends linearly on the vector <math>{\bf x}</math>, hence the predicted value <math>\hat{y}</math> can be given as a linear combination of the components of <math>{\bf x}</math> as <math>\hat{y}= w_1 x_1 + \ldots + w_n x_n</math>, or in vector form

<math display="block">\hat{y}= {\bf w}^T {\bf x},</math>

where <math>{\bf w} = (w_1,\ldots,w_n)^T</math> is a column vector of parameters. Here <math>^T</math> stands for the transpose operation and vectors are by default column vectors. The parameters <math>w_i</math> can be seen as weights. If the input value <math>x_i</math> has a positive weight, then increasing <math>x_i</math> also increases the predicted value <math>\hat{y}</math>. Similarly, if <math>x_i</math> has a negative weight, then increasing <math>x_i</math> decreases the predicted value <math>\hat{y}</math>. If <math>w_i</math> is large in magnitude, then <math>x_i</math> has a large impact on <math>\hat{y}</math>. If <math>w_i=0</math>, then <math>x_i</math> has no effect on <math>\hat{y}</math>.

Suppose that <math>K</math> examples of the vector <math>{\bf x}</math> and the correct output value <math>y</math> for each of them are given. We arrange the input vectors in a matrix <math>{\bf X}</math> in such a way that the vector <math>{\bf x}_k</math> is placed in the <math>k</math>-th row of the matrix <math>{\bf X}</math>, <math>k=1,\ldots,K</math>. The output values and the predicted values are also arranged in column vectors <math>{\bf y}</math> and <math>{\bf \hat{y}}</math> such that the correct value <math>y_k</math> and the predicted value <math>\hat{y}_k</math> belonging to the input vector <math>{\bf x}_k</math> come to the <math>k</math>-th position of the vectors <math>{\bf y}</math> and <math>{\bf \hat{y}}</math>, respectively.

The optimal parameter vector <math>{\bf w}</math> can be determined from an optimization task with respect to a performance measure quantifying the quality of the prediction.

''Linear regression with mean squared error''

One possible choice for the performance measure is the Mean Squared Error (MSE) of the predictions, which is given by

<math display="block">MSE = \frac{1}{K} \sum_{k} (\hat{y}_k - y_k)^2 = \frac{1}{K} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2,</math>

where <math>\lVert z \rVert_2</math> stands for the <math>\mathbb{L}_2</math> norm of <math>z</math>.

Then the optimal value of the parameter vector <math>{\bf w}</math> is obtained by minimizing the MSE. The necessary condition for a local minimum of the MSE is a value of the parameter vector <math>{\bf w}</math> for which the gradient of the MSE is 0. It can be shown that in our case the MSE as a function of <math>{\bf w}</math> is a convex function, in which case there is only one such value of <math>{\bf w}</math>, and hence it is also the global minimum point.
Therefore the best value of the parameter vector <math>{\bf w}</math> is obtained by solving the equation

<math display="block">\nabla_{{\bf w}} MSE = \nabla_{{\bf w}} \left( \frac{1}{K} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2 \right) = 0,</math>

where <math>\nabla_{{\bf w}}</math> stands for the gradient with respect to <math>{\bf w}</math>. The set of predicted outputs can be given in vector-matrix form as <math>{\bf \hat{y}} = {\bf X}{\bf w}</math>. Applying this in the above relation gives the equation to be solved in terms of <math>{\bf w}</math> as

<math display="block">\nabla_{{\bf w}} MSE = \nabla_{{\bf w}} \left( \frac{1}{K} \lVert {\bf X}{\bf w} - {\bf y} \rVert_2^2 \right) = 0.</math>

Omitting the constant <math>\frac{1}{K}</math>, evaluating the gradient and performing several rearrangements, we get

<math display="block">\begin{aligned}
\nabla_{{\bf w}} \left( \lVert {\bf X}{\bf w} - {\bf y} \rVert_2^2 \right) &= \nabla_{{\bf w}} \left({\bf X}{\bf w} - {\bf y} \right)^T \left({\bf X}{\bf w} - {\bf y} \right) = \nabla_{{\bf w}} \left( {\bf w}^T {\bf X}^T{\bf X}{\bf w} - 2 {\bf w}^T {\bf X}^T {\bf y} + {\bf y}^T {\bf y}\right) \\
&= 2 {\bf X}^T{\bf X}{\bf w} - 2 {\bf X}^T {\bf y} = 0,
\end{aligned}</math>

where we used in the second equality that <math>{\bf y}^T {\bf X} {\bf w} = {\bf w}^T {\bf X}^T {\bf y}</math>, since <math>{\bf y}^T {\bf X} {\bf w}</math> is a scalar and thus equals its transpose, and in the last equality that the matrix <math>{\bf X}^T{\bf X}</math> is symmetric, which enables the above form of <math>\nabla_{{\bf w}}\left( {\bf w}^T {\bf X}^T{\bf X}{\bf w}\right)</math>.

Finally, the solution for the parameter vector <math>{\bf w}</math> can be expressed from the above equation as <math display="block">{\bf w} = \left({\bf X}^T{\bf X}\right)^{-1} {\bf X}^T {\bf y}.</math>

The matrix <math>{\bf X}^T{\bf X}</math> is square, which enables taking its inverse. Nevertheless, the inverse exists only if the matrix <math>{\bf X}^T{\bf X}</math> is non-singular, which usually holds, since the matrix <math>{\bf X}</math> is constructed from independent input examples.

The above equations in scalar form lead to a system of equations, which are called the normal equations.
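A compact numerical sketch of this closed-form solution is given below (the data are synthetic assumptions; the cross-check uses NumPy's least-squares solver, which minimizes the same MSE):

<syntaxhighlight lang="python">
# Illustrative sketch: solving the normal equations w = (X^T X)^{-1} X^T y
# on synthetic data; the data and the "true" weights are assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, n = 100, 3
X = rng.normal(size=(K, n))                 # one input example per row, as in the text
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=K)   # noisy linear outputs

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library least-squares solver, which minimizes the same MSE.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
print(np.allclose(w_hat, w_lstsq))          # True: both minimize ||Xw - y||_2^2
</syntaxhighlight>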
Linear regression is also used for a slightly extended prediction model, which is given as <math display="block">\hat{y}= {\bf w}^T {\bf x} + b,</math>

where the scalar <math>b</math> is called the bias and has the effect that the hyperplane described by the multivariate function <math>\hat{y}({\bf x})</math> does not pass through the origin.
This extended case can be led back to the base case by introducing the extended vectors <math>{\bf x}^* = ({\bf x}, 1)^T</math> and <math>{\bf w}^* = ({\bf w}, b)^T</math>, leading to the equivalent description

<math display="block">\hat{y}= {\bf w}^{*T} {\bf x}^*.</math>

An illustrating example with one-dimensional input values is shown in Figure 1.

''Connection of linear regression with MSE to maximum likelihood estimation''

Let <math>{\bf z} = (z_1,\ldots,z_M)</math> be a sequence of samples taken from an underlying population producing independent samples, each of them from the same random variable <math>Z</math> with a likelihood function <math>p(z|\boldsymbol{\theta})</math> from a known distribution family with unknown parameter vector <math>\boldsymbol{\theta}</math>. The likelihood function equals the probability (mass) function for discrete distributions and the probability density function (pdf) for continuous distributions. The Maximum Likelihood Estimation (ML estimation or MLE) of the parameter vector <math>\boldsymbol{\theta}</math> is the value which maximizes the probability of the observed sequence of samples. In other words,

<math display="block">\boldsymbol{\theta}_{ML} = \arg \max_{\boldsymbol{\theta}} p({\bf z}|\boldsymbol{\theta}) = \arg \max_{\boldsymbol{\theta}} p(z_1,\ldots,z_M|\boldsymbol{\theta}) = \arg \max_{\boldsymbol{\theta}} \prod_{m=1}^{M} p(z_m|\boldsymbol{\theta}),</math>

where we utilized that the samples are taken independently from the same random variable <math>Z</math>.

The product over many likelihoods can cause numerical issues, such as numerical underflow. Therefore it is usual to use the logarithm of the likelihood instead of the likelihood itself. This modification does not change the argmax and the max, since the logarithm function is monotone. This gives the usual form of the ML estimate of the parameter vector <math>\boldsymbol{\theta}</math> as

<math display="block">\boldsymbol{\theta}_{ML} = \arg \max_{\boldsymbol{\theta}} \log \left( \prod_{m=1}^{M} p(z_m|\boldsymbol{\theta})\right) = \arg \max_{\boldsymbol{\theta}}\sum_{m=1}^{M} \log p(z_m|\boldsymbol{\theta}).</math>

Applying the principle of MLE to a conditional likelihood function <math>p(y|{\bf x})</math>, the conditional likelihood function from a known distribution family with unknown parameter vector <math>\boldsymbol{\theta}</math> becomes <math>p(y|{\bf x}, \boldsymbol{\theta})</math>.
Thus the ML estimate of the parameter vector <math>\boldsymbol{\theta}</math> in this case is given by

<math display="block">\boldsymbol{\theta}_{ML} = \arg \max_{\boldsymbol{\theta}}\sum_{k=1}^{K} \log p(y_k|{\bf x}_k, \boldsymbol{\theta}).</math>

Now we select the normal distribution as the known distribution family and set the mean of the distribution to the predicted output value <math>\hat{y}</math> and the variance of the distribution to a fixed value <math>\sigma^2</math>. Then the conditional likelihood function, which in this case is a conditional pdf, becomes <math>p(y|{\bf x},\boldsymbol{\theta}) = N(y| \hat{y}({\bf x},{\bf w}),\sigma^2)</math>, where <math>N(y|\mu,\sigma^2)</math> stands for the pdf of the normal distribution with mean <math>\mu</math> and variance <math>\sigma^2</math>. Applying the expression of the pdf of the normal distribution, <math>N(y|\mu,\sigma^2)= \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{1}{2} \frac{(y-\mu)^2}{\sigma^2} \right)</math>, we get for the ML estimate of the predicted output values <math>{\bf \hat{y}}</math>

<math display="block">\begin{aligned}
{\bf \hat{y}} &= \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} \log N(y_k| \hat{y}_k({\bf x}_k,{\bf w}),\sigma^2) \\
&= \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} \log \left(\frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{1}{2} \frac{(y_k-\hat{y}_k)^2}{\sigma^2} \right)\right) \\
&= \arg \max_{\bf \hat{y}} \left(-K \log \sigma -\frac{K}{2} \log(2 \pi) - \sum_{k=1}^{K} \frac{(y_k-\hat{y}_k)^2}{2\sigma^2} \right) = \arg \max_{\bf \hat{y}} \left(-\sum_{k=1}^{K} (y_k-\hat{y}_k)^2\right) \\
&= \arg \min_{\bf \hat{y}} \sum_{k=1}^{K} (y_k-\hat{y}_k)^2 = \arg \min_{\bf \hat{y}} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2.
\end{aligned}</math>

Assuming the linear relation between <math>\hat{y}</math> and <math>{\bf x}</math>, <math>{\bf \hat{y}} = {\bf X}{\bf w}</math>, the above optimization with respect to <math>{\bf w}</math> results in the same estimate for <math>{\bf w}</math> as minimizing the MSE. Consequently, minimizing the MSE can be seen as an ML estimation, which justifies the use of the MSE due to the desirable properties of the ML estimation.
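This equivalence can also be checked numerically. The following sketch (an illustration with assumed synthetic data and a fixed <math>\sigma</math>) maximizes the Gaussian log-likelihood over <math>{\bf w}</math> with a generic optimizer and compares the result with the normal-equations solution:

<syntaxhighlight lang="python">
# Illustrative numerical check of the ML/MSE connection: maximizing the Gaussian
# log-likelihood over w (sigma fixed) gives the same w as the normal equations.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
K, n = 200, 2
X = rng.normal(size=(K, n))
y = X @ np.array([1.5, -0.7]) + 0.2 * rng.normal(size=K)
sigma = 0.2

def neg_log_likelihood(w):
    # -sum_k log N(y_k | x_k^T w, sigma^2); the constant terms do not depend on w
    resid = y - X @ w
    return 0.5 * K * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(n)).x
w_mse = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_ml, w_mse, atol=1e-4))   # True up to the optimizer tolerance
</syntaxhighlight>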
'''''Classification - Bayesian decision theory'''''

Bayesian decision theory is a fundamental probabilistic approach to the task of classification. This approach uses probabilistic arguments and losses associated with decisions.

''Bayes decision rule - minimizing the error probability''

Let <math>c_1</math> and <math>c_2</math> be two classes, e.g. two types of sea fish, such as herring and mackerel. Suppose that an observer watching fish arriving along a conveyor belt tries to predict which type of fish comes next. The type of the arriving fish is random. It can be assumed that there is a probability <math>P(c_1)</math> that the next fish is a herring and a probability <math>P(c_2)=1-P(c_1)</math> that the next fish is a mackerel. These probabilities are called a priori probabilities (or priors), reflecting that they represent the only knowledge about the occurrence of these types of fish. Having only these probabilities available for the decision, it seems logical to use the decision rule: decide <math>c_1</math> if <math>P(c_1) > P(c_2)</math> and otherwise decide <math>c_2</math>. This decision rule makes sense if there is only one fish whose type has to be decided. However, applying this rule to a sequence of fish seems to be incorrect: although we know that both types of fish appear, we always decide that the next one is of the same type. In this way we introduce an error whose probability is always the smaller one of <math>P(c_1)</math> and <math>P(c_2)</math>.

Fortunately, in most practical situations we do not have to make decisions with so little information. Let us assume that information on several features of the fish types, e.g. their length, weight, etc., is available in advance. These features can be measured for the next emerging fish and can be organized in a vector <math>{\bf x} \in \mathbb{R}^d</math>. Now we assume the more general case of having <math>C</math> classes, <math>c_1,\ldots, c_C</math>. The information on the features belonging to class <math>c_i</math> can be given by the class-conditional density function <math>p({\bf x}|c_i)</math>. Suppose that both the a priori probabilities <math>P(c_i)</math> and the class-conditional density functions <math>p({\bf x}|c_i)</math> are given. Then the probability of each class, taking into account the information observed in the features, can be given by the conditional probability of that class given the observed feature values, <math>P(c_i|{\bf x})</math>. This conditional probability can be expressed by means of the Bayes formula as

<math display="block">P(c_i|{\bf x}) = \frac{p({\bf x}|c_i)P(c_i)}{p({\bf x})} = \frac{p({\bf x}|c_i)P(c_i)}{\sum_{j=1}^C p({\bf x}|c_j)P(c_j)}.</math>

The probability <math>P(c_i|{\bf x})</math> is called the a posteriori probability (or posterior), reflecting the incorporation of the information observed in the features into the probability.
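As a small numerical illustration of the Bayes formula (all numbers below are assumed toy values, e.g. a single "fish length" feature with Gaussian class-conditional densities), the posteriors can be computed as follows:

<syntaxhighlight lang="python">
# Illustrative sketch: posteriors from priors and class-conditional densities via
# the Bayes formula above, for a one-dimensional feature; all values are toy assumptions.
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                                # P(c_1), P(c_2)
means, stds = np.array([30.0, 40.0]), np.array([4.0, 5.0])   # class-conditional N(mu, sigma^2)

def posteriors(x: float) -> np.ndarray:
    """P(c_i | x) = p(x | c_i) P(c_i) / sum_j p(x | c_j) P(c_j)."""
    likelihoods = norm.pdf(x, loc=means, scale=stds)          # p(x | c_i)
    joint = likelihoods * priors
    return joint / joint.sum()

print(posteriors(33.0))   # roughly [0.79, 0.21] -> decide c_1
print(posteriors(41.0))   # the posterior mass shifts towards c_2 for larger x
</syntaxhighlight>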
The likelihood <math>p({\bf x}|c_i)</math> indicates that, if the priors of the classes are equal, then the class with the highest likelihood is most likely the right class.

At this point it seems logical to establish the decision rule:<br />
Decide <math>c_i</math> with <math>P(c_i|{\bf x}) > P(c_j|{\bf x})</math> for every <math>j \neq i</math>.<br />
The expected value of the probability of error for this decision rule is given by

<math display="block">P(error) = \int_{\mathbb{R}^d} P(error,{\bf x} )d{\bf x} = \int_{\mathbb{R}^d} P(error|{\bf x}) p({\bf x}) d{\bf x}.</math>

Deciding the class <math>c_i</math> with the highest <math>P(c_i|{\bf x})</math> leads to an error with probability <math>1 - P(c_i|{\bf x})</math>, for which <math>1 - P(c_i|{\bf x}) < 1 - P(c_j|{\bf x})</math> for every <math>j \neq i</math>. Therefore <math>1 - P(c_i|{\bf x}) = \min P(error|{\bf x})</math>, i.e. this error is the minimal one for the given <math>{\bf x}</math>. It ensures that <math>P(error|{\bf x})</math> in the above integral is as small as possible, and hence the integral and thus also <math>P(error)</math> is minimal. So the above decision rule minimizes the probability of error. In fact, it is the Bayes decision rule minimizing the probability of error.

''The general Bayes decision rule''

We would like to generalize the Bayes decision rule by introducing a loss function instead of the probability of error. The decision among the classes is performed by taking an action <math>\alpha_i</math>, <math>i=1,\ldots, A</math>. However, if the true class is <math>c_j</math>, then we incur a loss <math>\lambda(\alpha_i|c_j)</math>. Hence the expected loss of taking the action <math>\alpha_i</math> is given by

<math display="block">R(\alpha_i|{\bf x}) = \sum_{j=1}^{C} \lambda(\alpha_i|c_j) P(c_j|{\bf x}).</math>

In decision-theoretic terminology, <math>R(\alpha_i|{\bf x})</math> is called the conditional risk. Let <math>\alpha({\bf x})</math> be the decision function taking one of the <math>A</math> actions <math>\alpha_i</math> for each <math>{\bf x}</math>. The overall risk is given by

<math display="block">R = \int_{\mathbb{R}^d} R(\alpha({\bf x})|{\bf x}) p({\bf x}) d{\bf x}.</math>

The goal is to find the decision function which minimizes the overall risk. To achieve this, for each <math>{\bf x}</math> the action must be taken which minimizes the conditional risk for that <math>{\bf x}</math>.
In other words, take <math>\alpha_i</math> with <math>i = \arg\min_{i} R(\alpha_i|{\bf x})</math>. This leads to the Bayes decision rule (also called the Bayes decision procedure):<br />
To minimize the overall risk,

# compute the conditional risk <math>R(\alpha_i|{\bf x}) = \sum_{j=1}^{C} \lambda(\alpha_i|c_j) P(c_j|{\bf x})</math> for <math>i=1,\ldots, A</math> and
# select the action <math>\alpha_i</math> for which <math>R(\alpha_i|{\bf x})</math> is minimal, in other words select <math>\alpha_i</math> with <math>i = \arg\min_{i} R(\alpha_i|{\bf x})</math>.

The overall risk resulting from applying the Bayes decision rule is the Bayes risk, denoted by <math>R^*</math>, which is the lowest achievable risk.

''The general case with two classes''

Let <math>\alpha_1</math> and <math>\alpha_2</math> be interpreted as deciding class <math>c_1</math> and <math>c_2</math>, respectively. This enables the simplified notation <math>\lambda_{i,j}= \lambda(\alpha_i|c_j)</math>, <math>i,j = 1,2</math>. The conditional risks for <math>\alpha_1</math> and <math>\alpha_2</math> can be given as

<math display="block">\begin{aligned}
R(\alpha_1|{\bf x}) &= \lambda_{1,1} P(c_1|{\bf x}) + \lambda_{1,2} P(c_2|{\bf x}), \\
R(\alpha_2|{\bf x}) &= \lambda_{2,1} P(c_1|{\bf x}) + \lambda_{2,2} P(c_2|{\bf x}).
\end{aligned}</math>

The Bayes decision rule becomes

<math display="block">\mbox{Decide~} c_1 \mbox{~if~} R(\alpha_1|{\bf x}) < R(\alpha_2|{\bf x}), \mbox{~otherwise~} c_2.</math>

In terms of the posterior probabilities this becomes <math display="block">\mbox{Decide~} c_1 \mbox{~if~} (\lambda_{2,1} - \lambda_{1,1}) P(c_1|{\bf x}) > (\lambda_{1,2} - \lambda_{2,2}) P(c_2|{\bf x}), \mbox{~otherwise~} c_2.</math>

Usually making an error should lead to a higher loss, which means that both <math>(\lambda_{2,1} - \lambda_{1,1})</math> and <math>(\lambda_{1,2} - \lambda_{2,2})</math> are positive.

The above inequality can be rearranged by applying the Bayes rule in it, which gives <math>(\lambda_{2,1} - \lambda_{1,1}) p({\bf x}|c_1 ) P(c_1) > (\lambda_{1,2} - \lambda_{2,2})p({\bf x}|c_2 ) P(c_2)</math>.
Hence the Bayes decision rule can be alternatively formulated as <math display="block">\mbox{Decide~} c_1 \mbox{~if~} \frac{p({\bf x}|c_1 )}{p({\bf x}|c_2 )} > \frac{\lambda_{1,2} - \lambda_{2,2}}{\lambda_{2,1} - \lambda_{1,1}} \frac{P(c_2)}{P(c_1)}, \mbox{~otherwise~} c_2.</math>

The quantities <math>p({\bf x}|c_i)</math> are likelihoods, which makes it possible to interpret this Bayes decision rule as deciding <math>c_1</math> if the likelihood ratio <math>\frac{p({\bf x}|c_1 )}{p({\bf x}|c_2 )}</math> exceeds a threshold that is independent of <math>{\bf x}</math>.

''Criteria for specific Bayes decision rules - minimum error-rate classification''

From now on <math>\alpha_i</math> is interpreted as deciding class <math>c_i</math> and the simplified notation <math>\lambda_{i,j}= \lambda(\alpha_i|c_j)</math> is used. Let us consider the loss given by the symmetrical or zero-one loss function:

<math display="block">\lambda_{i,j} = \begin{cases} 0, & \mbox{if~} i=j \\ 1, & \mbox{if~} i \neq j \end{cases} \qquad i,j = 1, \ldots, C.</math>

This loss function assigns 0 loss to a correct decision and a unit loss to any erroneous decision, i.e. all errors are punished equally. Thus this loss function can be seen as a function indicating the errors. The conditional risk is given by

<math display="block">R(\alpha_i|{\bf x}) = \sum_{j=1}^C \lambda_{i,j} P(c_j|{\bf x}) = \sum_{j \neq i}P(c_j|{\bf x}) = 1 - P(c_i|{\bf x}),</math>

which is exactly the probability of error, i.e. the error rate. Minimizing the probability of error is equivalent to maximizing the posterior probability. Therefore the Bayes decision rule minimizing the conditional risk gives the decision rule

<math display="block">\mbox{Decide~} c_1 \mbox{~if~} P(c_1|{\bf x}) > P(c_2|{\bf x}), \mbox{~otherwise~} c_2.</math>

This is the Bayes decision rule minimizing the error rate, i.e. the error probability, which we already introduced before. Hence this logical decision rule corresponds to choosing the loss function to be the zero-one function indicating the errors.
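The two-step Bayes decision rule from the list above can be illustrated with a few lines of code (the loss values and the posteriors are assumptions for the sketch; with the zero-one loss it reduces to picking the maximum posterior):

<syntaxhighlight lang="python">
# Illustrative sketch of the general Bayes decision rule: compute the conditional
# risks R(alpha_i|x) from a loss matrix and the posteriors, then take the action
# with minimal conditional risk. The loss values are assumed for the sketch.
import numpy as np

# loss[i, j] = loss of taking action alpha_i when the true class is c_j
loss_zero_one = np.array([[0.0, 1.0],
                          [1.0, 0.0]])
loss_asymmetric = np.array([[0.0, 10.0],   # deciding c_1 when c_2 is true is expensive
                            [1.0,  0.0]])

def bayes_decision(posteriors: np.ndarray, loss: np.ndarray) -> int:
    """Return the index i minimizing R(alpha_i|x) = sum_j loss[i, j] * P(c_j|x)."""
    conditional_risks = loss @ posteriors
    return int(np.argmin(conditional_risks))

post = np.array([0.7, 0.3])                   # P(c_1|x), P(c_2|x) for some observed x
print(bayes_decision(post, loss_zero_one))    # 0: zero-one loss picks the maximum posterior
print(bayes_decision(post, loss_asymmetric))  # 1: the high loss for missing c_2 flips the decision
</syntaxhighlight>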
''Criteria for specific Bayes decision rules - minimax criterion''

The principle of the minimax criterion will be shown for classification with two classes. We show that the overall risk is a linear function of the prior, i.e. it is linear in e.g. <math>P(c_1)</math>. Let <math>\mathcal{R}_1</math> and <math>\mathcal{R}_2</math> be the (unknown) regions where the classifier decides <math>c_1</math> and <math>c_2</math>, respectively. This enables expressing the overall risk in terms of the conditional risks as

<math display="block">\begin{aligned}
R &= \int_{\mathcal{R}_1} \left( \lambda_{1,1} P(c_1) p({\bf x}|c_1 ) + \lambda_{1,2} P(c_2) p({\bf x}|c_2 )\right) d{\bf x} \\
&\quad + \int_{\mathcal{R}_2} \left( \lambda_{2,1} P(c_1) p({\bf x}|c_1 ) + \lambda_{2,2} P(c_2) p({\bf x}|c_2 )\right) d{\bf x}.
\end{aligned}</math>

Using the relations <math>P(c_2) = 1 - P(c_1)</math> and <math>\int_{\mathcal{R}_1}p({\bf x}|c_i )d{\bf x} + \int_{\mathcal{R}_2}p({\bf x}|c_i )d{\bf x} = 1</math>, for <math>i=1,2</math>, the above expression of the overall risk can be rearranged as

<math display="block">\begin{aligned}
R &= \lambda_{1,1} P(c_1) \left(1- \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} \right) + \lambda_{1,2} \left(1-P(c_1)\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \\
&\quad + \lambda_{2,1} P(c_1) \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} + \lambda_{2,2} \left(1- P(c_1)\right) \left(1 - \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \right) \\
&= \lambda_{2,2} + \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \\
&\quad + P(c_1) \bigg(\left(\lambda_{1,1}-\lambda_{2,2}\right) + \left(\lambda_{2,1}-\lambda_{1,1}\right) \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} - \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x}\bigg).
\end{aligned}</math>

Therefore the overall risk generally depends on the prior <math>P(c_1)</math> in a linear way. Sometimes it is necessary to have a classifier which performs well over a range of prior probabilities. This is the situation, for example, if the prior probabilities of the types of fish arriving along a conveyor belt vary or if we do not know the prior probabilities at all. Such a classifier can be designed by setting the coefficient of <math>P(c_1)</math> in the above expression of the overall risk to <math>0</math>, i.e. by setting<br />
<math>\left(\lambda_{1,1}-\lambda_{2,2}\right) + \left(\lambda_{2,1}-\lambda_{1,1}\right) \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} - \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} = 0</math>,<br />
in which case the overall risk becomes independent of the prior <math>P(c_1)</math>. This setting is called the minimax solution and the resulting overall risk <math>\lambda_{2,2} + \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x}</math> is the minimax risk.
It can be seen that the resulting risk is the maximum among the prior-dependent Bayes risks (= the minimum risks for a given prior), which explains the name minimax for this criterion.

''Estimation of the posteriors''

The design of a classifier based on the Bayes decision rule requires knowledge of the posterior probabilities <math>P(c_i|{\bf x})</math>, for <math>i=1,\ldots,C</math>, which are determined by the priors <math>P(c_i)</math> and the likelihoods <math>p({\bf x}|c_i )</math>. Usually the likelihoods are computed by applying parametric statistical estimation, in which the distribution family is assumed to be either unimodal (such as the multivariate normal density) or multimodal.

'''''Classification - linear discriminant functions'''''

''General discriminant functions''

A general way of representing a classifier is to specify it by a set of discriminant functions <math>d_i({\bf x})</math>, for <math>i=1,\ldots,C</math>. The classifier decides to assign the class <math>c_i</math> to the input feature vector <math>{\bf x}</math> if

<math display="block">d_i({\bf x}) > d_j({\bf x}) \mbox{~for~all~} j \neq i.</math>

The Bayes classifier can also be represented by means of discriminant functions. In the general case the discriminant functions can be chosen as

<math display="block">d_i({\bf x}) = - R(\alpha_i|{\bf x}).</math>

The maximal one among the <math>C</math> discriminant functions then determines the class with the minimum conditional risk. A still simpler discriminant function can be given for the minimum error-rate classifier as

<math display="block">d_i({\bf x}) = P(c_i|{\bf x}).</math>

This can be further simplified by utilizing that the denominator of <math>P(c_i|{\bf x})= \frac{p({\bf x}|c_i ) P(c_i)}{\sum_{j=1}^{C} p({\bf x}|c_j ) P(c_j) }</math> is independent of <math>i</math>, which leads to the alternative discriminant function

<math display="block">d_i({\bf x}) = p({\bf x}|c_i ) P(c_i).</math>

Any monotone function can be applied to the discriminant functions, since it does not change which of them is the maximal one. Therefore a further discriminant function can be defined for the minimum error-rate classifier with the help of the monotone <math>\ln()</math> function as <math display="block">d_i({\bf x}) = \ln p({\bf x}|c_i ) + \ln P(c_i).</math>

In this case all the above different discriminant functions define the same decision rule. The discriminant functions divide the feature space into disjoint decision regions <math>{\mathcal{R}_1}, \ldots, {\mathcal{R}_C}</math>.
A feature vector <math>{\bf x}</math> falls into the decision region <math>{\mathcal{R}_i}</math> if <math>d_i({\bf x}) > d_j({\bf x})</math> for every <math>j \neq i</math>. Thus the decision region <math>{\mathcal{R}_i}</math> represents the set of feature vectors for which the classifier decides class <math>c_i</math>. The decision regions are separated by decision boundaries in the <math>d</math>-dimensional space of feature vectors.

In the special case of a classifier with two classes it is common to use a single discriminant function

<math display="block">d({\bf x}) = d_1({\bf x}) - d_2({\bf x})</math>

instead of <math>d_1({\bf x})</math> and <math>d_2({\bf x})</math>. Hence the decision rule becomes

<math display="block">\mbox{Decide~} c_1 \mbox{~if~} d({\bf x}) > 0, \mbox{~otherwise~} c_2.</math>

Figure z.

''Linear discriminant functions''

In the case of discriminant functions for the Bayes classifier, the form of the underlying probability distribution must be assumed to be known. However, discriminant functions can also be designed without such an assumption. Hence, in a limited, statistical sense, the procedure of determining such discriminant functions can be seen as nonparametric. An important class of such discriminant functions are the linear discriminant functions.

Linear discriminant functions are linear in the components of <math>{\bf x}</math> and can be given as a linear combination of the components of <math>{\bf x}</math>, in other words:

<math display="block">d({\bf x}) = {\bf w}^T {\bf x} + w_0,</math>

where <math>{\bf w}</math> is the weight vector and <math>w_0</math> is the bias or threshold weight. The decision boundary becomes a decision surface.

For the case of a classifier with two classes there is only one discriminant function, <math>d({\bf x}) = {\bf w}^T {\bf x} + w_0</math>. Class <math>c_1</math> is decided if <math>{\bf w}^T {\bf x}</math> exceeds the threshold <math>-w_0</math>, and class <math>c_2</math> otherwise.

The principle of operation of the linear classifier with two classes and a d-dimensional feature vector is shown in Figure w1.

Figure w1.

The decision surface is a hyperplane. Let us consider two feature vectors <math>{\bf x}_1</math> and <math>{\bf x}_2</math>, both on the decision surface.
Then &lt;math display="inline"&gt;d({\bf x}_1) = 0 = d({\bf x}_2)&lt;/math&gt; holds, which means &lt;math display="inline"&gt;{\bf w}^T {\bf x}_1 + w_0 = {\bf w}^T {\bf x}_2 + w_0&lt;/math&gt;, from which</div><div><br></div><div>&lt;math display="block"&gt;{\bf w}^T \left({\bf x}_1 - {\bf x}_2\right) = 0.&lt;/math&gt;</div><div><br></div><div>It follows that &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; is perpendicular to any vector lying on the decision surface and hence the decision surface is a hyperplane with &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; as its normal vector.</div><div><br></div><div>The discriminant function &lt;math display="inline"&gt;d({\bf x})&lt;/math&gt; is directly related to the signed distance &lt;math display="inline"&gt;r&lt;/math&gt; of &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; to the hyperplane. This can be seen by expressing &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; as the sum of its projection &lt;math display="inline"&gt;{\bf x}_p&lt;/math&gt; onto the hyperplane and &lt;math display="inline"&gt;r&lt;/math&gt; times the unit vector in the direction of &lt;math display="inline"&gt;{\bf w}&lt;/math&gt;, i.e.</div><div><br></div><div>&lt;math display="block"&gt;{\bf x} = {\bf x}_p + r \frac{{\bf w}}{\lVert {\bf w} \rVert}.&lt;/math&gt;</div><div><br></div><div>Using &lt;math display="inline"&gt;d({\bf x}_p) = 0&lt;/math&gt;, for &lt;math display="inline"&gt;d({\bf x})&lt;/math&gt; we get</div><div><br></div><div>&lt;math display="block"&gt;d({\bf x}) = {\bf w}^T {\bf x}_p + {\bf w}^T r \frac{{\bf w}}{\lVert {\bf w} \rVert} + w_0 = d({\bf x}_p) + r \frac{\lVert {\bf w} \rVert^2}{\lVert {\bf w} \rVert} = r \lVert {\bf w} \rVert,&lt;/math&gt; from which &lt;math display="block"&gt;r = \frac{d({\bf x})}{\lVert {\bf w} \rVert}.&lt;/math&gt;</div><div><br></div><div>The normal vector &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; points into the decision region &lt;math display="inline"&gt;{\mathcal{R}_1}&lt;/math&gt;: if &lt;math display="inline"&gt;{\bf x} \in {\mathcal{R}_1}&lt;/math&gt;, then &lt;math display="inline"&gt;d({\bf x}) &gt; 0&lt;/math&gt; and hence &lt;math display="inline"&gt;r &gt; 0&lt;/math&gt;, i.e. &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; lies on the side of the hyperplane into which &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; points.</div><div><br></div><div>The signed distance of the origin to the hyperplane is &lt;math display="inline"&gt;\frac{w_0}{\lVert {\bf w} \rVert}&lt;/math&gt;. The origin falls into the decision region &lt;math display="inline"&gt;{\mathcal{R}_1}&lt;/math&gt; if &lt;math display="inline"&gt;w_0 &gt; 0&lt;/math&gt; and into &lt;math display="inline"&gt;{\mathcal{R}_2}&lt;/math&gt; if &lt;math display="inline"&gt;w_0 &lt; 0&lt;/math&gt;. 
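</div><div><br></div><div>To make the geometry concrete, the short sketch below evaluates a two-class linear discriminant and the signed distance &lt;math display="inline"&gt;r = d({\bf x})/\lVert {\bf w} \rVert&lt;/math&gt; for a hypothetical weight vector and bias; all numbers are made up for illustration.</div><div><br></div><div>&lt;pre&gt;
import numpy as np

# Minimal sketch: two-class linear discriminant d(x) = w^T x + w_0 and the
# signed distance of x to the decision hyperplane (hypothetical numbers).
w  = np.array([3.0, 4.0])   # normal vector of the hyperplane
w0 = -5.0                   # bias / threshold weight

def d(x):
    return w @ x + w0

x = np.array([2.0, 1.0])
r = d(x) / np.linalg.norm(w)              # signed distance of x to the hyperplane
decision = "c_1" if d(x) &gt; 0 else "c_2"
print(d(x), r, decision)                  # 5.0, 1.0, c_1
&lt;/pre&gt;</div>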
<div><br></div><div>If &lt;math display="inline"&gt;w_0= 0&lt;/math&gt;, then the hyperplane passes through the origin and the linear discriminant function &lt;math display="inline"&gt;d({\bf x})&lt;/math&gt; has the homogeneous form &lt;math display="inline"&gt;d({\bf x})= {\bf w}^T {\bf x}&lt;/math&gt;.</div><div><br></div><div>The above properties are illustrated geometrically in Figure w2.</div><div><br></div><div>Figure w2</div><div><br></div><div>For the general case of a classifier with &lt;math display="inline"&gt;C&lt;/math&gt; classes, the linear discriminant functions are given as</div><div><br></div><div>&lt;math display="block"&gt;d_i({\bf x}) = {\bf w}_i^T {\bf x} + w_{i0}.&lt;/math&gt;</div><div><br></div><div>The decision rule is the same as the one for the general discriminant functions</div><div><br></div><div>&lt;math display="block"&gt;\mbox{~Decide~} c_i \mbox{~if~} d_i({\bf x}) &gt; d_j({\bf x})&nbsp; \mbox{~for~all~} j \neq i.&lt;/math&gt;</div><div><br></div><div>If the values of two or more discriminant functions are the same for some &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, which is called a tie (as in statistics), the decision is undefined. The classifier with &lt;math display="inline"&gt;C&lt;/math&gt; linear discriminant functions is also called a linear machine. It divides the d-dimensional feature space into &lt;math display="inline"&gt;C&lt;/math&gt; disjoint decision regions. On the boundary between two neighboring decision regions &lt;math display="inline"&gt;{\mathcal{R}_i}&lt;/math&gt; and &lt;math display="inline"&gt;{\mathcal{R}_j}&lt;/math&gt;, &lt;math display="inline"&gt;d_i({\bf x}) = d_j({\bf x})&lt;/math&gt; holds, or equivalently</div><div><br></div><div>&lt;math display="block"&gt;\left({\bf w}_i-{\bf w}_j\right)^T {\bf x} + \left(w_{i0}- w_{j0}\right) = 0.&lt;/math&gt;</div><div><br></div><div>It follows from the considerations made for the classifier with two classes that the boundary is a hyperplane with normal vector &lt;math display="inline"&gt;\left({\bf w}_i-{\bf w}_j\right)&lt;/math&gt;. Furthermore the signed distance of &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; to this hyperplane is given as &lt;math display="inline"&gt;\frac{d_i({\bf x}) - d_j({\bf x})}{\lVert {\bf w}_i - {\bf w}_j&nbsp; \rVert}&lt;/math&gt;. These expressions show that in a linear machine the differences of the weights matter, not the weights themselves. The total number of hyperplanes can be less than the &lt;math display="inline"&gt;\frac{C (C-1)}{2}&lt;/math&gt; possible number of pairs. It can be shown that the decision regions are convex in &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;.</div><div><br></div><div>Generalized linear discriminant functions are linear in some given functions of &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; and have the form</div><div><br></div><div>&lt;math display="block"&gt;d({\bf x}) = \sum_{i=1}^{d} a_i y_i({\bf x}).&lt;/math&gt;</div><div><br></div><div>Truncated series expansions of &lt;math display="inline"&gt;d({\bf x})&lt;/math&gt; with such &lt;math display="inline"&gt;y_i&lt;/math&gt;-s lead to polynomial discriminant functions as a subclass of generalized linear discriminant functions. 
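</div><div><br></div><div>One way to realize such a generalized linear discriminant in code is to map the input into the chosen functions &lt;math display="inline"&gt;y_i({\bf x})&lt;/math&gt; and then apply an ordinary linear discriminant in that space. The following minimal Python sketch does this for quadratic features of a 2-dimensional input; the weights are hypothetical and chosen only for illustration.</div><div><br></div><div>&lt;pre&gt;
import numpy as np

# Minimal sketch: a generalized linear discriminant d(x) = sum_i a_i * y_i(x)
# using quadratic features of a 2-dimensional input (hypothetical weights).
def quadratic_features(x):
    x1, x2 = x
    # y(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

a = np.array([-1.0, 0.5, 0.5, 1.0, 0.0, 1.0])   # weights a_i (made up)

def d(x):
    # Linear in the features y_i(x), but quadratic in the components of x.
    return a @ quadratic_features(x)

print(d(np.array([0.2, 0.1])))   # negative value: this x falls on the c_2 side
&lt;/pre&gt;</div>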
<div><br></div><div>An example is the quadratic discriminant function, which can be given in the form</div><div><br></div><div>&lt;math display="block"&gt;d({\bf x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j.&lt;/math&gt;</div><div><br></div><div>While the decision regions are convex in the space of the mapped features &lt;math display="inline"&gt;{\bf y}&lt;/math&gt;, they can be arbitrary (in particular non-convex) regions in &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;. Hence generalized linear discriminant functions can describe more general decision boundaries, which is their beneficial property and the motivation for using them.</div><div><br></div><div>&lt;span&gt; ''Finding linear discriminant functions''&lt;/span&gt;</div><div><br></div><div>The parameters of the classifier with linear discriminant functions are the weights. Thus training the classifier means finding proper weights, i.e. proper linear discriminant functions. This is done by formulating the task as an optimization. A natural choice for the criterion of the optimization is to minimize the training error. To perform such a minimization in a general setting, well-known numerical multivariate optimization algorithms, e.g. gradient methods, are used and adapted to the specifics of finding linear discriminant functions.</div><div><br></div><div>&lt;span&gt; '''''Lernen aus Beispielen'''''&lt;/span&gt; Learning from examples is a fundamental approach in machine learning. The goal is to build up knowledge from examples by training a statistical model, which represents the knowledge also for unseen examples. In this way the statistical model represents a generalization of the knowledge extracted from the examples. In some contexts such learning is also called training.</div><div><br></div><div>The classical form of such learning assumes that the correct output for the input is provided in each example. It is usual to say that one has a labelled training set, which means that in each example the input is labelled by the correct output. This type of learning is called supervised learning.</div><div><br></div><div>Example - estimation of the transition probability matrix of a DTMC (Beispiel: Schätzung der Übergangswahrscheinlichkeiten einer DTMC (mittels des Verfahrens der Lagrange-Multiplikatoren)) As an example of learning from examples we consider the estimation of the transition probability matrix of a DTMC from observed state sequences.</div><div><br></div><div>&lt;span&gt; ''The model''&lt;/span&gt;</div><div><br></div><div>Let &lt;math display="inline"&gt;\mathcal{S}=\{s_1, \ldots, s_{|\mathcal{S}|}\}&lt;/math&gt; be the set of states of the DTMC. Furthermore let &lt;math display="inline"&gt;\overrightarrow{\bf z} = (z_0, z_1,\ldots,z_T)&lt;/math&gt; denote an observed state sequence. The transition probability matrix of the DTMC is denoted by &lt;math display="inline"&gt;{\bf A}&lt;/math&gt;, i.e. &lt;math display="inline"&gt;{\bf A}_{i,j}= P(z_t=s_j|z_{t-1}=s_i)&lt;/math&gt; for &lt;math display="inline"&gt;s_i, s_j \in \mathcal{S}&lt;/math&gt;. 
Observe that a given state sequence &lt;math display="inline"&gt;\overrightarrow{\bf z}&lt;/math&gt; implicitly includes many labelled examples in the form of consecutive pairs &lt;math display="inline"&gt;(z_{t-1}, z_t)&lt;/math&gt;, where the input is &lt;math display="inline"&gt;z_{t-1}&lt;/math&gt; and the next state &lt;math display="inline"&gt;z_t&lt;/math&gt; gives the corresponding correct output.</div><div><br></div><div>&lt;span&gt; ''Problem formulation''&lt;/span&gt; The task of estimating the parameter matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; can be formulated as an optimization problem as</div><div><br></div><div>&lt;math display="block"&gt;{\bf A}^* = \arg\max_{{\bf A}} P(\overrightarrow{\bf z}|{\bf A}),&lt;/math&gt; where &lt;math display="inline"&gt;{\bf A}^*&lt;/math&gt; is the estimated parameter matrix. We are going to estimate the parameter matrix &lt;math display="inline"&gt;{\bf A}^*&lt;/math&gt; by applying the Maximum-Likelihood statistical point estimation method. Therefore we maximize the log-likelihood of &lt;math display="inline"&gt;\overrightarrow{\bf z}|{\bf A}&lt;/math&gt;, which has its maximum at the same point as the likelihood.</div><div><br></div><div>By omitting the condition to get a simpler notation, the log-likelihood &lt;math display="inline"&gt;\log P(\overrightarrow{\bf z})&lt;/math&gt; can be expressed as</div><div><br></div><div>&lt;math display="block"&gt;\begin{aligned}
\log P(\overrightarrow{\bf z}) &= \log P(z_0,\ldots,z_T) = \log P(z_T|z_{T-1},\ldots, z_0)P(z_{T-1}|z_{T-2},\ldots, z_0) \ldots P(z_1|z_0) P(z_0) \\
&= \log P(z_T|z_{T-1}) P(z_{T-1}|z_{T-2})\ldots P(z_1|z_0)P(z_0) = \log \prod_{t=1}^{T}P(z_{t}|z_{t-1}) P(z_0) \\
&= \log \prod_{t=1}^{T}{\bf A}_{z_{t-1},z_{t}}P(z_0) = \sum_{t=1}^{T} \log {\bf A}_{z_{t-1},z_{t}} + \log P(z_0),
\end{aligned}&lt;/math&gt; where we used the multiplication theorem in the first and the Markov property in the second equality.</div><div><br></div><div>This can be further rearranged by introducing an indicator variable as</div><div><br></div><div>&lt;math display="block"&gt;\log P(\overrightarrow{\bf z})=\sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j} + \log P(z_0).&lt;/math&gt;</div><div><br></div><div>Omitting the term &lt;math display="inline"&gt;\log P(z_0)&lt;/math&gt; does not influence the location of the maximum of the log-likelihood, since it does not depend on matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt;. Furthermore some constraints on matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; must be taken into account in the optimization. Matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; is stochastic and hence 1. &lt;math display="inline"&gt;{\bf A}_{i,j} \geq 0&lt;/math&gt;, i.e. the elements of matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; are non-negative, 2. &lt;math display="inline"&gt;\sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1&lt;/math&gt; for &lt;math display="inline"&gt;i = 1, \ldots, |\mathcal{S}|&lt;/math&gt;, i.e. the row sums of matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; are equal to 1.</div><div><br></div><div>We are going to apply the method of Lagrange multipliers for the optimization, imposing only the row-sum constraints; it will turn out that the resulting matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; is automatically non-negative. 
Therefore the task can be formulated as the following constrained optimization problem:</div><div><br></div><div>&lt;math display="block"&gt;\arg\max_{{\bf A}} \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j}, ~ \mbox{~subject~to~}~\sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1, \mbox{~for~} i = 1, \ldots, |\mathcal{S}|.&lt;/math&gt;</div><div><br></div><div>&lt;span&gt; ''Solution by applying the method of Lagrange multipliers''&lt;/span&gt;</div><div><br></div><div>Applying the method of Lagrange multipliers requires the introduction of &lt;math display="inline"&gt;|\mathcal{S}|&lt;/math&gt; Lagrange multipliers, &lt;math display="inline"&gt;\alpha_i&lt;/math&gt; for &lt;math display="inline"&gt;i = 1, \ldots, |\mathcal{S}|&lt;/math&gt;, which can be arranged in a vector &lt;math display="inline"&gt;\boldsymbol{\alpha}&lt;/math&gt;. Thus the Lagrange function can be given as</div><div><br></div><div>&lt;math display="block"&gt;\mathcal{L}({\bf A}, \boldsymbol{\alpha}) = \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j} + \sum_{i=1}^{|\mathcal{S}|} \alpha_i \left(1 - \sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} \right).&lt;/math&gt;</div><div><br></div><div>Taking the first derivative of the Lagrange function with respect to &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt; and setting it equal to 0 yields &lt;math display="block"&gt;\frac{\partial \mathcal{L}}{\partial {\bf A}_{i,j}} = \frac{1}{{\bf A}_{i,j}}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} - \alpha_i = 0, ~ i,j = 1, \ldots, |\mathcal{S}|,&lt;/math&gt; from which &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt; can be expressed as &lt;math display="block"&gt;{\bf A}_{i,j} = \frac{1}{\alpha_i}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}, ~ i,j = 1, \ldots, |\mathcal{S}|.&lt;/math&gt;</div><div><br></div><div>Now taking the first derivative of the Lagrange function with respect to &lt;math display="inline"&gt;\alpha_i&lt;/math&gt;, setting it equal to 0 and applying the above expression of &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt; yields &lt;math display="block"&gt;\frac{\partial \mathcal{L}}{\partial \alpha_i} = 1 - \sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1 - \sum_{j=1}^{|\mathcal{S}|} \frac{1}{\alpha_i}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} = 0,&lt;/math&gt; from which we get the solution for &lt;math display="inline"&gt;\alpha_i&lt;/math&gt; as &lt;math display="block"&gt;\alpha_i = \sum_{j=1}^{|\mathcal{S}|} \sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} = \sum_{t=1}^{T} 1_{z_{t-1} = s_i}, ~ i = 1, \ldots, |\mathcal{S}|.&lt;/math&gt;</div><div><br></div><div>Applying the final expression of &lt;math display="inline"&gt;\alpha_i&lt;/math&gt; in the expression of &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt;, we get for the estimate of &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt; &lt;math display="block"&gt;{\bf A}_{i,j} = \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\sum_{t=1}^{T} 1_{z_{t-1} = s_i}}, ~ i,j = 1, \ldots, |\mathcal{S}|.&lt;/math&gt;</div><div><br></div><div>This estimate of &lt;math display="inline"&gt;{\bf A}_{i,j}&lt;/math&gt; can be interpreted as the count of transitions &lt;math display="inline"&gt;s_i \rightarrow s_j&lt;/math&gt; divided by the count of visits to state &lt;math display="inline"&gt;s_i&lt;/math&gt;, i.e. the number of times the DTMC was observed in state &lt;math display="inline"&gt;s_i&lt;/math&gt;, which fits the intuition. 
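</div><div><br></div><div>The following minimal Python sketch applies this counting estimate to a hypothetical observed state sequence; the states and the sequence are made up for illustration.</div><div><br></div><div>&lt;pre&gt;
import numpy as np

# Minimal sketch: maximum-likelihood estimate of a DTMC transition matrix
# A[i, j] = (# transitions s_i -&gt; s_j) / (# visits to s_i), as derived above.
def estimate_transition_matrix(sequence, num_states):
    counts = np.zeros((num_states, num_states))
    for prev, nxt in zip(sequence[:-1], sequence[1:]):
        counts[prev, nxt] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0.0, 1.0, row_sums)  # avoid 0/0

# Hypothetical observed sequence over the states {0, 1, 2}.
z = [0, 1, 1, 2, 0, 1, 2, 2, 0]
print(estimate_transition_matrix(z, 3))
&lt;/pre&gt;</div>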
<div><br></div><div>&lt;span&gt; '''''Learning durch Steuerung - MDP'''''&lt;/span&gt; Learning by means of control is an iterative process, in which at each step the actor reacts with an action to the environment’s response to its previous action. Thus the actor learns its behaviour stepwise. Such processes arise very often in nature, e.g. in controlling the movement of animals while looking for food.</div><div><br></div><div>Such a learning process can be described by the mathematical model of the Markov-Entscheidungsprozess (MEP) - Markov decision process (MDP). An MDP is a discrete-time stochastic control process. In each state &lt;math display="inline"&gt;s&lt;/math&gt; of the MDP, the actor can take an action &lt;math display="inline"&gt;a&lt;/math&gt; from the set of actions allowed in that state. This causes the MDP to move into the state &lt;math display="inline"&gt;s^{'}&lt;/math&gt; and to give the actor a reward &lt;math display="inline"&gt;R_a(s,s^{'})&lt;/math&gt; corresponding to the action and the state transition, both at the next time step. In this way the MDP realizes learning based on actions and reward feedback. The workflow of the MDP model is shown in Figure x.</div><div><br></div><div>Abbildung x</div><div><br></div><div>The mathematical definition of an MDP is as follows. An MDP is a 4-tuple &lt;math display="inline"&gt;(\mathcal{S}, \mathcal{A}, P, \mathcal{R})&lt;/math&gt;, where</div><div><br></div><div>* &lt;math display="inline"&gt;\mathcal{S}&lt;/math&gt; is the state space, i.e. the set of states,</div><div>* &lt;math display="inline"&gt;\mathcal{A}&lt;/math&gt; is the action space, i.e. the set of actions,</div><div>* &lt;math display="inline"&gt;P&lt;/math&gt;, in function form &lt;math display="inline"&gt;P_a(s,s^{'}) = P(s(t+1) = s^{'}| s(t) = s, a(t) = a)&lt;/math&gt;, describes the state transition from &lt;math display="inline"&gt;s&lt;/math&gt; to &lt;math display="inline"&gt;s^{'}&lt;/math&gt; while the action &lt;math display="inline"&gt;a&lt;/math&gt; is taken and</div><div>* &lt;math display="inline"&gt;\mathcal{R}&lt;/math&gt; is the set of possible rewards, and &lt;math display="inline"&gt;R_a(s,s^{'})&lt;/math&gt; is the reward for the state transition from &lt;math display="inline"&gt;s&lt;/math&gt; to &lt;math display="inline"&gt;s^{'}&lt;/math&gt; when the action &lt;math display="inline"&gt;a&lt;/math&gt; is taken.</div><div><br></div><div>An example diagram of an MDP is shown in Figure y.</div><div><br></div><div>Abbildung y</div><div><br></div><div>An MDP is an extension of the Discrete-Time Markov Chain (DTMC) obtained by adding actions (allowing choices) and rewards (providing motivating feedback) to it. This implies that an MDP with only one action and a common reward reduces to a DTMC. MDPs can be used for solving optimization tasks and are usually solved by dynamic programming.</div><div><br></div><div>&lt;span id="ci"&gt;&lt;/span&gt;</div><div>=== CI ===</div><div><br></div><div>Computational Intelligence (CI) is a subarea of AI covering biologically motivated and hence computationally intensive methodologies and approaches.</div><div><br></div><div>&lt;span&gt; '''''Definition von CI'''''&lt;/span&gt;</div><div><br></div><div>There is no commonly accepted unique definition of CI. 
The first definition originates from 1994 by Bezdek and defines CI as having all of the following characteristics:</div><div><br></div><div>* deals with numerical data,</div><div>* has a pattern-recognition component instead of represented knowledge,</div><div>* has computationally adaptive and fault-tolerant methodology and</div><div>* approximates human-like performance in terms of speed and error-rate.</div><div><br></div><div>CI can also be defined by means of its major characteristics:</div><div><br></div><div>* a sub-field of KI,</div><div>* a set of nature-inspired approaches,</div><div>* addresses complex real-world problems, which may contain some uncertainties or might have a stochastic nature.</div><div><br></div><div>&lt;span&gt; '''''Positionierung von CI gegenüber KI und seiner Geschichte zur Geschichte der KI'''''&lt;/span&gt;</div><div><br></div><div>CI and KI both have the long-term goal of reaching general intelligence. However, KI is based on hard computing techniques following binary logic with two values, 0 and 1. This can be interpreted as an element either being included in a set or not. In contrast, CI is based on soft computing techniques following fuzzy logic, which enables real values between 0 and 1 that can be seen as the degree of membership of an element in a set. This is a clear difference between KI and CI. On the other hand, CI can be seen as a sub-area of KI in a broader sense.</div><div><br></div><div>While the history of KI began already in the 1950s, the term CI was first used in 1990 by the IEEE Neural Networks Council, which dealt with the development of biological and artificial neural networks. In 2001 the Council became the IEEE Computational Intelligence Society, and several years later new areas such as fuzzy systems and evolutionary computation were included in the society's area of interest.</div><div><br></div><div>&lt;span&gt; '''''Hauptkomponenten der CI'''''&lt;/span&gt;</div><div><br></div><div>The major components of CI are</div><div><br></div><div># Artificial Neural Networks based on the biological ones, which</div><div>#* enable processing of and learning from experiential data and</div><div>#* provide fault tolerance to some extent.</div><div># Fuzzy logic, one of the major disciplines of CI, which</div><div>#* enables soft computing techniques for modelling real-life complex processes by applying probabilistic methods and</div><div>#* can be applied for approximate reasoning, but not for learning.</div><div># Evolutionary computation (EC) providing algorithms (Evolutionary Algorithms - EA) for global optimization, which</div><div>#* are inspired by biological evolution and</div><div>#* are usually based on a population of candidate solutions.</div><div><br></div><div>&lt;span&gt; '''''Evolutionäre Algorithmen (EA)'''''&lt;/span&gt;</div><div><br></div><div>In this subsection we give brief mathematical descriptions of the two most broadly known and applied EA algorithms:</div><div><br></div><div>* Particle Swarm Optimisations (PSO) and</div><div>* Genetic Algorithms.</div><div><br></div><div>&lt;span&gt; '''''Partikelschwarmoptimierungen - Particle Swarm Optimisations (PSO)'''''&lt;/span&gt;</div><div><br></div><div>&lt;span&gt; '''''Genetischer Algorithmus - Genetic Algorithms (+ Programming ?)'''''&lt;/span&gt;</div><div><br></div><div>&lt;span id="anwendungsgebiete-von-ai"&gt;&lt;/span&gt;</div><div>== Anwendungsgebiete von AI ==</div><div><br></div><div>&lt;span id="anwendungen-von-kg"&gt;&lt;/span&gt;</div><div>=== Anwendungen von KG ===</div><div><br></div><div>In this subsection we discuss the applications of KGs.</div><div><br></div><div>(1) Information retrieval (IR) systems include tasks like image retrieval, music retrieval, web search, domain-specific retrieval (e.g. geographic, legal), cross-lingual retrieval and text summarization (mainly as extracted parts of the original text).</div><div><br></div><div>(2) Semantic search is realized by integrating a KG into the results of the search engine. This leads to an improved ,,Big Data“ search engine. It extends the conventional search engine with the following new features: offering more relevant information, identifying and disambiguating entities in text, and providing links to related entities (exploratory search). Examples include the Google search engine, implemented by integrating Google’s KG, and Bing, the Microsoft search engine, realized by integrating Microsoft’s KB Satori.</div><div><br></div><div>(3) Question Answering (QA) systems incorporate semantic information from KGs in order to enhance search results for semantic questions. Therefore they provide semantically aware question answering services. Examples include Watson, IBM’s QA system, which uses among others YAGO, DBpedia and Freebase, and digital/virtual assistants like Siri, Cortana and Google Now. QA systems can be classified as extractive or generative: in extractive QA the answer is extracted from the context (BERT-like models), while in generative QA the model generates free text based on the context, using text generation models. QA systems can also be classified as open or closed: in open QA the answer is taken from a context, while in closed QA no context is provided and thus the answer is completely generated. Fast QA over many documents works by first ranking the documents using passage ranking models. There are QA datasets provided for academic benchmark tests; the most used QA dataset for academic benchmarks of extractive QA systems is the Stanford Question Answering Dataset (SQuAD), consisting of 100,000+ QA pairs on 500+ articles.</div><div><br></div><div>(4) Recommendation/recommender systems give personalized recommendations based on the user’s behaviour, common preferences and interests. Based on analysing user clicks, content platforms can suggest further content for users to view or read. In such applications KGs help to improve accuracy, increase the diversity of recommended items and bring interpretability to recommendations. Typical examples of such KG applications are content platforms, e.g. Netflix, Search Engine Optimization (SEO), which is a subfield of search engine marketing, or social media.</div><div><br></div><div>(5) Natural Language Processing (NLP) addresses processing and understanding text and spoken speech by applying mainly machine learning techniques. Information extraction (IE) is an NLP technique targeting the extraction of structured information from typically unstructured text. More precisely, IE is located at the intersection of IR and NLP. There exists a mutually beneficial relationship between KGs and NLP. On the one hand, KGs are used to realize NLP techniques, like e.g. text summarization. 
On the other hand, NLP techniques, like (named) entity recognition and relation extraction, are used to create KGs.</div><div><br></div><div>(6) Enterprise knowledge management is about applying KGs in the industrial sector in order to utilize the benefits provided by KGs. These benefits include applying big data in order to create new business value, creating business viewpoints on different granularity levels, providing new insights through the hierarchical inclusion of company-relevant data and providing advanced accessibility of company information to employees.</div><div><br></div><div>(7) Domain-specific applications enable creating added value by applying KGs in different fields. Biomedical KG applications enable question answering and decision support in the life sciences by integrating multiple sources of biomedical information. Medical applications utilize KGs for integrating textual medical knowledge; one such example is retrieving specific information via inference (Schlussfolgerung). In cybersecurity applications a KG is used to store modelling patterns in large datasets of transactions and organisations. This serves as a base for building applications, like e.g. detecting and predicting attacks, identifying potential frauds (i.e. various kinds of economic crime) and identifying security threats via suspicious transactions, abnormal user behaviour or fake accounts. In e-commerce applications a KG is used to store semantic information about the interactions between customers and products, and their characteristics. These can be used for a variety of tasks including finding similar or compatible products, identifying product classes, retrieving relevant products for a given search term and retrieving similar search terms. In financial applications a KG is created by identifying named entities from company news and by extracting business links between relevant stocks. Such KGs are used for tasks like e.g. predicting a stock’s price movement. In the field of news, KGs are used to implement fake news detection. In the field of education, KG nodes are used to store instructional concepts, which are used in tasks like learning resource recommendation and concept visualisation. In geoscience applications a KG stores the textual geoscience data used for information extraction and knowledge discovery services.</div><div><br></div><div>&lt;span id="anwendungen-des-mls"&gt;&lt;/span&gt;</div><div>=== Anwendungen des MLs ===</div><div><br></div><div>ML is a technology with many applications, both commonly known ones and ones in a wide variety of fields, including finance, healthcare, agriculture and many more.</div><div><br></div><div>In the following we give brief descriptions of several commonly known applications.</div><div><br></div><div>Speech recognition is the process of converting spoken speech into written text. It is also called speech-to-text (STT) or automatic speech recognition (ASR). Speech recognition is used e.g. in speech-automated call centers, voice dialling (also called speech-enabled dialling) or in Apple’s speech assistant Siri.</div><div><br></div><div>Speaker recognition (also called voice recognition) is the task of identifying the speaker from a short (usually a few seconds long) speech sample.</div><div><br></div><div>Speech synthesis covers the task of converting written text into human speech. 
It is also called text-to-speech (TTS) and is the reverse task of speech recognition.</div><div><br></div><div>Language translation, like e.g. Google Translate, can translate written text in one language into the corresponding text in another language. It is based on Natural Language Processing (NLP) techniques including POS tagging and Named Entity Recognition (NER).</div><div><br></div><div>Image recognition is the task of identifying some pattern in an image or video. It can be used among others for face detection, face recognition or for further analysis.</div><div><br></div><div>Traffic pattern prediction. This task generates predictions about the upcoming traffic based on the current one, which is used e.g. to identify and suggest the fastest route to the required destination.</div><div><br></div><div>Data mining has the goal of extracting information from a (usually large) data set and converting it into a structure appropriate for further use. Data mining is applied e.g. by secret services or for outlier detection in statistics.</div><div><br></div><div>E-commerce product recommendations. This is a marketing application generating product recommendations based on collected data on past customer behaviour and past shopping.</div><div><br></div><div>E-mail spam detection (email filtering) works by means of spam filters using ML algorithms to classify incoming e-mails as spam or not.</div><div><br></div><div>Malware detection is a pattern recognition system trained on features extracted by analysing suspicious activities.</div><div><br></div><div>Computer vision uses ML techniques to process, analyse, visualize or interpret high-dimensional real-life data.</div><div><br></div><div>Transportation (Uber) uses ML to analyse traffic conditions in order to compute the Estimated Time of Arrival (ETA) at the required destination.</div><div><br></div><div>ML applied to business problems is also called predictive analytics. Here we give short descriptions of several ML application examples in finance.</div><div><br></div><div>* Stock Market and Day Trading. In this application ML is trained to predict the evolution of the prices in the stock market.</div><div>* Focused Account Holder Targeting. Here ML algorithms classify the account holders into segments with predefined balances and loans.</div><div>* Fraud Detection (in banking transactions) is implemented by an ML algorithm giving a score for each transaction, which represents the probability of it being a fraud. The training of the algorithm is based on patterns of unusual behaviour identified from large amounts of transactional data. This is the most important use of ML in the banking and finance area.</div><div>* Loan Eligibility Prediction is realized by different ML classifiers (e.g. Random Forest) to judge customers' eligibility for being granted a loan.</div><div><br></div><div>Next, several applications of ML in healthcare are briefly considered.</div><div><br></div><div>* In Personalized Treatment/Medication ML is used to identify patient gene patterns (=response markers) that could enable targeted therapies.</div><div>* In Genetics and Genomics ML is used to identify gene sequences in genome sequencing, gene modification and in general in genetic research.</div><div>* In Cancer Prognosis and Prediction (especially with ANNs) ML is used to construct prediction models to support decisions on therapy and predict its evolution.</div><div>* In Drug development the process of drug discovery can be sped up by applying ML techniques.</div><div><br></div><div>Artificial Neural Networks (ANN) combine ML with the NN model. The big majority of applications has benefited from using ANNs instead of pure ML algorithms. Such applications are e.g. image recognition, medical diagnosis, speech recognition, machine translation, computer vision, cancer prognosis and prediction, social network filtering, and playing board and video games.</div><div><br></div><div>Embedded Machine Learning is a subfield of ML, in which ML is applied to embedded systems with limited resources. Such embedded systems include microcontrollers, wearable computers and edge devices. Using ML methods in embedded systems makes transferring and storing data on cloud servers unnecessary. Embedded Machine Learning techniques include among others approximate computing and hardware acceleration.</div><div><br></div><div>For more details on applications of ML see e.g. the Wikipedia site [https://en.wikipedia.org/wiki/Machine_learning].</div><div><br></div><div>&lt;span id="weitere-anwendungsgebiete-von-ai"&gt;&lt;/span&gt;</div><div>=== Weitere Anwendungsgebiete von AI ===</div><div><br></div><div>Further application areas of AI include</div><div><br></div><div># Educational sector e.g. by creating automated messages to students, or by designing content based on the user’s interest (Smart Content Creation),</div><div># Robotics e.g. by making real-time decisions possible based on NLP, Object Recognition and Human-Robotics Interaction (HRI),</div><div># Navigation e.g. by computing the best route based on GPS and AI technologies,</div><div># Healthcare e.g. by patient monitoring or surgical assistance,</div><div># Automobiles e.g. by Advanced Driving Assistance System (ADAS) or Autonomous Driving,</div><div># Agriculture e.g. by crop monitoring, supply chain maintenance or weather forecasting,</div><div># Human Resources e.g. by screening,</div><div># Lifestyle e.g. by Personalized Recommendation, Virtual Assistance,</div><div># Gaming e.g. by Animation,</div><div># Astronomy e.g. by analysis of astronomical data and detection of e.g. exoplanets,</div><div># Travel and Transport e.g. enabling truck platooning,</div><div># Military e.g. by detecting cyberattacks and decision support e.g. for resource allocation.</div><div><br></div><div>Feeding the recent developments of NNs back into older AI concepts resulted in modern AI model combinations. Perhaps the two most important ones are Neural Language Models and Large Language Models.</div><div><br></div><div>Neural Language Models (NLM) are LMs based on recurrent NNs. They are also called continuous space language models. They transform the LM representation of words into a continuous low-dimensional vector space, called the embedding space. 
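</div><div><br></div><div>Closeness in the embedding space is usually measured with cosine similarity. The following minimal sketch compares hypothetical 3-dimensional embeddings; real embedding vectors have hundreds of dimensions and are learned from data.</div><div><br></div><div>&lt;pre&gt;
import numpy as np

# Minimal sketch: cosine similarity between (hypothetical) word embeddings.
def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

emb = {
    "king":  np.array([0.8, 0.1, 0.3]),   # made-up 3-dimensional vectors
    "queen": np.array([0.7, 0.2, 0.3]),
    "car":   np.array([0.1, 0.9, 0.0]),
}
print(cosine_similarity(emb["king"], emb["queen"]))  # high similarity
print(cosine_similarity(emb["king"], emb["car"]))    # low similarity
&lt;/pre&gt;</div>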
<div><br></div><div>Such an embedding has the advantage that semantically similar units are located closer to each other in the embedding space. Such units can be parts of a corpus, sentences, sub-words or characters. Furthermore NLMs avoid the data sparsity problem of SLMs since they represent words as non-linear combinations of weights in a neural net. NLMs can be categorized as using either Non-contextual Embeddings or Contextual Embeddings. Non-contextual Embeddings apply the same embedding for a given semantic unit regardless of the given context. Examples of Non-contextual Embeddings include Word2Vec or Wikipedia2Vec. In contrast, Contextual Embeddings can represent different semantics of the semantic units in different contexts. Examples of Contextual Embeddings include Bidirectional Encoder Representations from Transformers (BERT), introduced by Google, or Sentence-BERT (SBERT), a fine-tuned version of BERT. The transformer model is a specific NN architecture introduced by Vaswani et al.</div><div><br></div><div>Large Language Models (LLM) are transformer-NN-based LMs, which are pre-trained using self-supervised learning, learn billions of parameters during training and need large computational resources both for training and during operation. They can achieve general-purpose language understanding and generate answers in the form of human-like text. These LLMs are used in generative AI systems. Recent versions can complete specific tasks via prompt engineering, i.e. by providing a short structured description of the task that can be interpreted by the LLM. The most prominent examples are OpenAI’s GPT-3.5 and GPT-4 models (used in ChatGPT) and Google’s PaLM (used in Bard).</div><div><br></div><div>&lt;span id="ethik-in-der-ki"&gt;&lt;/span&gt;</div><div>== Ethik in der KI ==</div><div><br></div><div>Ethics is in general a set of moral rules and guidelines that help people decide between right and wrong. It is to be expected that AI will have a considerable impact on all of humanity and the world in the near future. Therefore it is important to pay significant attention to ethical questions related to AI.</div><div><br></div><div>In recent years organizations have deployed various big data algorithms and AI solutions, mostly in order to improve their business results through automation and data-driven decision making. In doing so, some companies were confronted with unexpected negative consequences of their AI applications, in particular due to unfair outcomes and the use of biased datasets. This has led to leading companies as well as research and data science communities in the field of AI having to engage in depth with the ethical aspects of AI. 
A lack of appropriate rules in this area can lead to reputational damage, high fines as well as regulatory and legal problems.</div><div><br></div><div>The ethics of AI (AI ethics for short) is a subfield of applied ethics that deals, among others, with the following questions:</div><div><br></div><div>* the intended role of AI systems and the ethical aspects that enable their use according to these roles,</div><div>* ethical rules and guidelines for people who design, build, test, certify and use AI systems,</div><div>* the desired ethical behaviour of AI systems (machine ethics).</div><div><br></div><div>As a guideline for ethics in experimental research and the development of algorithms, the Belmont Report ([https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-508c_FINAL.pdf]) is widely used in the academic community. The three essential principles of the Belmont Report are:</div><div><br></div><div># Respect for persons</div><div># Beneficence</div><div># Justice (regarding fairness and equality)</div><div><br></div><div>Although a multitude of ethical elements, principles and guidelines for AI have been proposed, there are currently no uniform guidelines for AI ethics. However, there is a certain consensus about integrating the following two elements into AI ethics:</div><div><br></div><div>* Governance - to ensure compliance with legal regulations in cooperation with government authorities.</div><div>* Explainability - to explain how AI systems work (transparency) in order to create trust in AI systems.</div><div><br></div><div>There are several organizations whose goal is to promote AI ethics. These include the following.</div><div><br></div><div>* CHAI: The Center for Human-Compatible Artificial Intelligence ([https://humancompatible.ai/]) is a cooperation of various universities and institutes that promotes trustworthy AI and provably beneficial systems.</div><div>* DARPA: The Defense Advanced Research Projects Agency of the US Department of Defense ([https://www.darpa.mil/work-with-us/ai-next-campaign]) promotes research on explainable AI.</div><div>* NSCAI: The National Security Commission on Artificial Intelligence ([https://www.nscai.gov/]) is a US commission ,,that examines the methods and means necessary to advance the development of artificial intelligence, machine learning and associated technologies in order to comprehensively address the national security and defense needs of the United States.“</div><div>* AlgorithmWatch: A non-profit organization ([https://algorithmwatch.org/en/]) that aims at the use of explainable and traceable decision processes and algorithms.</div><div>* AI Now Institute: A non-profit organization at New York University ([https://ainowinstitute.org/]) that studies the social implications of AI.</div><div><br></div><div>According to Luciano Floridi and Josh Cowls, there is broad agreement that it is possible to build the ethical principles for a socially beneficial use of AI on the four principles of medical ethics and an additional fifth principle, explainability:</div><div><br></div><div>,,1. 
Beneficence: AI technology should benefit humanity, promote well-being, preserve human dignity and serve the preservation of the planet.&lt;br /&gt;</div><div>2. Non-maleficence: Negative consequences of an excessive or abusive use of AI must be avoided; AI should only be used within a safe framework.&lt;br /&gt;</div><div>3. Autonomy: People must have full freedom of decision about whether and which tasks they delegate to AI systems; their freedom to set their own norms and standards must be preserved.&lt;br /&gt;</div><div>4. Justice: Preserving solidarity, avoiding injustice, and equal access to the benefits of AI.&lt;br /&gt;</div><div>5. Explainability (intelligibility): Principles 1 to 4 can only be realized if laypersons can also understand what benefit or harm AI has for society and who is responsible for which effects.“</div><div><br></div><div>There are currently a number of ethical discussions regarding AI systems. Such debates involving AI ethics occur in several areas. Autonomous driving is a prime example. Automated systems are expected to achieve at least a reduction of harm compared to human driving performance. Since current AI systems rarely, but nevertheless occasionally, produce faulty reactions, it is at least questionable whether autonomous vehicles can fulfil this expectation. Moreover, it is also problematic how an autonomous vehicle should react in dilemma situations that cannot be normed, such as life against life in suddenly arising dangers. Similar debates also occur in the areas of autonomous weapon systems and machine ethics. In general, the following sources triggering such debates can be identified:</div><div><br></div><div># AI systems may make mistakes - due to limitations in data collection and documentation as well as in the algorithmic rules, and due to the incompleteness of the corpora used in decision making (it is not possible to include all information channels in the corpora that would be necessary for the AI to make the same decision a human would make).</div><div># Concerns regarding the design goals of the system, e.g. creating better business results instead of following the public interest. Another debate is formed by the considerations about the technological singularity, i.e. when AI surpasses human intelligence. Although this superintelligence is not imminent, it is associated with fear of danger to humanity. However, this fear is at least partly unfounded, since it is triggered by the fact that the way AI systems work is unknown to most people, i.e. by the lack of transparency.</div><div><br></div><div>&lt;span id="wissensgraphen"&gt;&lt;/span&gt;</div><div>= Wissensgraphen =</div><div><br></div><div>&lt;span id="grundlagen-von-wissensgraphen"&gt;&lt;/span&gt;</div><div>== Grundlagen von Wissensgraphen ==</div><div><br></div><div>&lt;span id="einleitung"&gt;&lt;/span&gt;</div><div>=== Einleitung ===</div><div><br></div><div>Basically a Knowledge Graph (KG) is a graph representing entities, like e.g. &quot;Josef Strauss&quot; or &quot;Vienna&quot;, with a limited set of semantic relations among them, like e.g. &quot;place_of_birth&quot;. 
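</div><div><br></div><div>As an illustration, the minimal sketch below stores such facts as subject-predicate-object triples and answers a simple relation query; the entities and relations are the hypothetical ones from the example above.</div><div><br></div><div>&lt;pre&gt;
# Minimal sketch: a knowledge graph as a set of (subject, predicate, object)
# triples, using the hypothetical entities and relation mentioned above.
triples = {
    ("Josef Strauss", "place_of_birth", "Vienna"),
    ("Vienna", "instance_of", "City"),
    ("Josef Strauss", "instance_of", "Person"),
}

def objects(subject, predicate):
    # All objects connected to 'subject' via the edge labelled 'predicate'.
    return {o for (s, p, o) in triples if s == subject and p == predicate}

print(objects("Josef Strauss", "place_of_birth"))  # {'Vienna'}
&lt;/pre&gt;</div>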
<div><br></div><div>The entities are represented by the nodes and the edges specify the semantic relations. A node can also represent a group of similar entities, like e.g. &quot;Person&quot;, which is called a concept or class. A KG can be defined by its properties as</div><div><br></div><div># it describes real-world entities and their relations in a graph form,</div><div># it defines concepts and possible relations of entities,</div><div># it enables potentially relating any entities with each other and</div><div># it covers numerous topical domains.</div><div><br></div><div>Beispiel (from paper3)</div><div><br></div><div>KGs originate from the project called Knowledge Graphs, which was created by two universities in the Netherlands in the 1980s in order to design specific semantic networks with edges restricted to a limited set of relations, so as to facilitate algebras on the graph. A Knowledge Base (KB) is a database which can store the same knowledge as a KG. Hence a knowledge graph realizes a Knowledge Base (KB) based on a graph-structured data model.</div><div><br></div><div>Related fields of KGs are data mining, the Semantic Web (SW), databases and NLP. The same or similar problems exist in these different fields under community-specific names, e.g. Entity Resolution (ER) vs. schema matching, type matching or ontology alignment.</div><div><br></div><div>The roots of KGs lie in several fields:</div><div><br></div><div>* NLP: in the domain of information extraction (IE), from the 1990s. IE research involved Named Entity Recognition (NER), or simply Entity Recognition (ER), and Relation Extraction (RE).</div><div>* Semantic Web (SW): one goal is to model and represent knowledge, which led to the Resource Description Framework (RDF), 1998, where the data is a set of triples (subject, predicate, object) modelling facts.</div><div>* Knowledge Discovery in Databases (KDD) and Data Mining: an application are recommendation systems, where using a KG improves performance due to involving external knowledge. A recent advance is deep learning on KGs for recommendation systems via Graph Neural Networks (GNNs).</div><div><br></div><div>&lt;span id="elemente-von-kg"&gt;&lt;/span&gt;</div><div>=== Elemente von KG ===</div><div><br></div><div>&lt;span id="kg-in-der-praxis"&gt;&lt;/span&gt;</div><div>=== KG in der Praxis ===</div><div><br></div><div>&lt;span id="qualität-von-kg"&gt;&lt;/span&gt;</div><div>=== Qualität von KG ===</div><div><br></div><div>&lt;span id="beispiele"&gt;&lt;/span&gt;</div><div>=== Beispiele ===</div><div><br></div><div>&lt;span id="erzeugen-von-wissensgraphen"&gt;&lt;/span&gt;</div><div>== Erzeugen von Wissensgraphen ==</div><div><br></div><div>&lt;span id="klassifizierung-der-kg-konstruktion"&gt;&lt;/span&gt;</div><div>=== Klassifizierung der KG-Konstruktion ===</div><div><br></div><div>&lt;span id="liste-der-aufgaben-der-kg-konstruktion"&gt;&lt;/span&gt;</div><div>=== Liste der Aufgaben der KG-Konstruktion ===</div><div><br></div><div>&lt;span id="schritte-der-kg-konstruktion"&gt;&lt;/span&gt;</div><div>=== Schritte der KG-Konstruktion ===</div><div><br></div><div>&lt;span id="methoden-für-hauptaufgaben-in-der-kg-konstruktion"&gt;&lt;/span&gt;</div><div>=== Methoden für Hauptaufgaben in der KG-Konstruktion ===</div><div><br></div><div>&lt;span id="maschinelles-lernen-und-deep-learning"&gt;&lt;/span&gt;</div><div>= Maschinelles Lernen und Deep Learning =</div><div><br></div><div>&lt;span id="ansätze-des-maschinellen-lernens"&gt;&lt;/span&gt;</div><div>== Ansätze des maschinellen Lernens ==</div><div><br></div><div>&lt;span id="einführung-in-maschinelles-lernen"&gt;&lt;/span&gt;</div><div>=== Einführung in Maschinelles Lernen 
===</div><div><br></div><div>The idea of machine learning is building up knowledge from examples and generalizing it.</div><div><br></div><div>&lt;span&gt; '''''Definition von maschinellem Lernen'''''&lt;/span&gt; Machine learning deals with developing and studying statistical algorithms for learning knowledge from examples and generalizing it for performing tasks automatically.</div><div><br></div><div>&lt;span&gt; '''''Interpretation des Prozesses des maschinellen Lernens'''''&lt;/span&gt; The knowledge is built up by training a statistical model from examples, which also enables the correct evaluation of a subset of unseen data. This is called transfer of learning. However, another part of the unseen data might not be evaluated correctly, due either to a lack of sufficient training data or to an improper parameterization of the underlying statistical model.</div><div><br></div><div>&lt;span&gt; '''''Positionierung und angewandte Methoden des maschinellen Lernens'''''&lt;/span&gt; The term machine learning was proposed in 1959 by Arthur Samuel. Terms like self-teaching computers or computing machinery were also used in the beginning. Machine learning is a subfield of AI. In the early times it followed the goal of realizing artificial intelligence and applied symbolic methods. ML became a field of its own in the 1990s as it turned to practical problems and moved from symbolic methods (like logic programming) to fuzzy logic, computational statistics, probability theory, pattern recognition and information retrieval. Pattern recognition based on the statistical approach achieved big commercial success in the 1990s in practical applications like Large Vocabulary Speech Recognition Systems. In a business context ML is known as predictive analytics. Related fields are information extraction, data mining and statistical learning, in which some ML methods are adapted to statistics.</div><div><br></div><div>&lt;span&gt; '''''Tasks of ML algorithms'''''&lt;/span&gt; In a mathematical sense, ML algorithms realize the following tasks: regression, classification, representation learning or optimal control.</div><div><br></div><div>Regression and classification can be used for predicting the output based on the trained ML models.</div><div><br></div><div>&lt;span&gt; '''''Methods used in ML'''''&lt;/span&gt; Numerous methods are used in ML. For the regression task, besides linear regression, also polynomial regression and kernel regression are used. For classification by means of statistical models, locating the classes in some observable space, dynamic programming search and decision trees can be used. Powerful methods for realizing discriminant functions for arbitrary classification problems include support vector machines and multilayer NNs with special structure. For a more detailed description of these methods see section IV.</div><div><br></div><div>&lt;span id="learning-ansätze"&gt;&lt;/span&gt;</div><div>=== Learning Ansätze ===</div><div><br></div><div>* überwachtes (maschinelles) Lernen - Supervised learning (SL)</div><div>* unüberwachtes Lernen - Unsupervised learning</div><div>* Halb-überwachtes Lernen - Semi-supervised learning (SSL)</div><div>* Verstärkungslernen - Reinforcement learning (RL)</div><div><br></div><div>The classical learning approaches/paradigms of ML are usually categorized as</div><div><br></div><div>* supervised learning</div><div>* unsupervised learning</div><div>* reinforcement learning</div><div><br></div><div>In supervised learning (SL) the machine learns from a dataset consisting of labeled examples. 
The typical tasks of supervised learning in the mathematical sense are classification and regression. The examples represent the input, usually as a sequence of vectors. Each example is labeled and the label represents the right answer belonging to that example, e.g. the right class in the case of classification. The vectors in the examples either represent the whole input or they are feature vectors representing essential information. An example of input vectors representing the whole input is image classification, where each example is a picture represented as a sequence (or matrix) of vectors, each of them describing a pixel. Another type of example is speech recognition, in which sequences of features extracted from the audio data are used in the examples. Besides extracting only the essential information needed for the considered task, feature vectors are usually a much more compact representation of the input than the whole input data.</div><div><br></div><div>In unsupervised learning the machine learns from examples without labels. The examples represent the input usually as a sequence of feature vectors. In general the task of unsupervised learning is representation learning. Representation learning transforms the input data into a representation preserving the essential part of the information about the input data, while keeping the representation simpler, lower-dimensional and in many cases sparser. More concretely, the typical tasks of unsupervised learning are density estimation and learning the structure of the input dataset, like e.g. in the case of clustering. In the discrete case density estimation learns an estimate of the probability function of the input dataset, while in the continuous case the probability density function (pdf) of the input dataset is estimated, in both cases representing them by fewer parameters than the size of the input data. Clustering can also be seen as classification with automatically determined classes.</div><div><br></div><div>Reinforcement learning (RL) is different from supervised and unsupervised learning in the sense that it does not need a dataset for learning. Instead the machine iteratively interacts with its environment and learns from the feedback of the environment to its actions. The task realized by reinforcement learning is optimal control, which arises also in biological systems and has connections to neuroscience and psychology.</div><div><br></div><div>Note that all three basic paradigms of ML (supervised learning, unsupervised learning and reinforcement learning) can be formulated as an optimization task.</div><div><br></div><div>The dataset is usually, but not always, represented in the form of a design matrix. In this matrix each row stores a different example and each column represents a different feature. For example the Iris dataset, which is one of the oldest datasets studied in ML, contains four features in each line. These features describe different sizes of the plants, like sepal width, sepal length, etc. Such a matrix representation of the dataset is possible only if the examples, as vectors, have the same size. In the case of example vectors with different sizes, the dataset can be represented as a set of example vectors. 
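</div><div><br></div><div>As a small illustration of the design matrix idea, the sketch below builds a 4-example, 4-feature matrix with made-up Iris-like measurements; rows are examples and columns are features such as sepal length and sepal width.</div><div><br></div><div>&lt;pre&gt;
import numpy as np

# Minimal sketch: a design matrix with one example per row and one feature per
# column (hypothetical Iris-like measurements, in cm).
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 4.7, 1.6],
    [5.8, 2.7, 5.1, 1.9],
])
y = np.array([0, 0, 1, 2])   # labels for supervised learning (class indices)

print(X.shape)         # (4, 4): 4 examples, 4 features
print(X[:, 0].mean())  # mean of the first feature over all examples
&lt;/pre&gt;</div>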
<div><br></div><div>The typical tasks of supervised and unsupervised learning can be roughly but mathematically described by applying typical statistical learning theory assumptions about the data generating process. According to these, it is assumed that each example, represented by a random vector &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, is generated from a background population and therefore these random vectors are independent and identically distributed. These assumptions together are commonly referred to as the i.i.d. assumptions.</div><div><br></div><div>Then the density estimation task of unsupervised learning can be described as observing samples of a random vector &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; in the dataset and learning the probability function/density &lt;math display="inline"&gt;p({\bf x})&lt;/math&gt;. On the other hand, the classification or regression tasks of supervised learning can be described as observing samples of random vector pairs &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; and learning how to predict &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; from &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, in most cases by learning the conditional probability function/density &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt;. Here &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; represents a label, which can also consist of several components. In the case of classification the vectors &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; are values from a discrete space, while in the case of regression they are vectors from a continuous space. The term supervised learning comes from the view that an instructor provides the right target &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; to the input &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; to show the machine what is to be predicted. Supervised and unsupervised learning are not formally defined and separated. This can be seen by converting the learning of &lt;math display="inline"&gt;p({\bf x})&lt;/math&gt; into learning several conditional distributions and by converting the learning of &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt; into the learning of a joint distribution. The former can be done by applying the multiplication theorem of probability as</div><div><br></div><div>&lt;math display="block"&gt;p({\bf x}) = \prod_{i=1}^m p(x_i|x_1,\ldots,x_{i-1}),&lt;/math&gt; meaning that an unsupervised problem can be converted into &lt;math display="inline"&gt;m&lt;/math&gt; supervised ones. Here &lt;math display="inline"&gt;p(x_1|x_0)=p(x_1)&lt;/math&gt; by definition. The latter can be shown by using the Bayes rule as</div><div><br></div><div>&lt;math display="block"&gt;p({\bf y}|{\bf x}) = \frac{p({\bf x}, {\bf y})}{\sum_{{\bf \upsilon}} p({\bf x}, {\bf \upsilon})}.&lt;/math&gt; Thus supervised learning can be performed by the unsupervised learning of &lt;math display="inline"&gt;p({\bf x}, {\bf y})&lt;/math&gt; and then applying the above relation. Transforming a supervised learning task into unsupervised learning tasks arises in practice when no labeling is available and hence it is not possible to learn &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt; directly from the data. Such cases are e.g. the tasks of denoising or synthesis (see below).</div><div><br></div><div>There are many tasks which can be traced back to, or are based on, classification or any of the typical tasks of supervised or unsupervised learning. Below we give a list and a brief discussion of several such tasks.</div><div><br></div><div>* Classification with missing input. 
This task is solved by using several functions mapping from the input to the classes instead of only one. Each function realizes a classification with a different subset of missing inputs. In this way the task is reduced to several classification tasks.</div><div>* Anomaly detection. This task learns, besides many usual input patterns, also several unusual ones. Thus essentially it is also a classification. This task has very important applications, like e.g. credit card fraud detection.</div><div>* Transcription. This task converts unstructured input data into text. It covers e.g. speech recognition or optical character recognition. Text consists of a finite number of units (letters, words) and hence this task is ultimately also a classification.</div><div>* Machine translation. This task performs a conversion from an input sequence of units (like e.g. words) to an output sequence of units, like translating one natural language into another one. Translators usually use an abstract level to establish connections between the units of the input and output. Predicting the units of the output from the input sequence is also realized by means of classification.</div><div>* Sampling. This task generates new examples similar to the examples of the dataset used for learning. It can be realized by a combination of classification, representation learning and random number generation.</div><div>* Denoising. In this task the machine must predict the clean example &lt;math display="inline"&gt;{\bf x}&lt;/math&gt; from the corrupted example &lt;math display="inline"&gt;{\bf \tilde{x}}&lt;/math&gt;. Hence this would be a classical supervised learning task. However the clean examples are not available and hence this task is realized indirectly by means of unsupervised learning.</div><div>* Synthesis. This task generates a new kind of output from the input. A typical example is speech synthesis, in which synthesized speech is generated from the textual input. Basically it is also pattern recognition, like in the case of transcription, but several variants of the output are generated instead of only one, in order to support the selection of the most natural and realistic variant. In principle synthesis represents a supervised learning task. However usually there are no speech labels prepared and thus synthesis is realized indirectly by means of unsupervised learning.</div><div>* Generating structured output. In this task the components of the output vector have interrelationships and/or they fit into a structure. A typical example is parsing, i.e. structuring the units of an input natural language sentence into a tree describing the grammatical structure of the sentence in terms of nouns, verbs, adverbs, etc. It can be realized by means of grammar rules, which represents supervised learning. Synthesis also generates structured output.</div><div><br></div><div>There are also other ML learning approaches, which can be seen as some mixture of supervised and unsupervised learning.</div><div><br></div><div>In semi-supervised learning only a subset of the examples have labels.</div><div><br></div><div>In multi-instance learning the dataset is partitioned into disjoint groups of examples and the groups are labeled but the individual examples are not. For example in binary classification such labels can carry the information whether the group contains an example of a class or not.</div><div><br></div><div>Self-supervised learning (SSL) is a result of combining artificial neural networks (ANN) with the classical ML approaches.
It utilizes the capability of ANNs to extract some internal structure from unstructured text. SSL consists of an unsupervised learning phase followed by a supervised one. In the first phase some structural information present in the input, like embedded metadata, domain knowledge or correlations, is extracted by means of unsupervised learning. Then this information is used for automatically labeling the input. In the second phase a supervised learning is performed on this labeled input.</div><div><br></div><div>&lt;span id="typical-ml-modelle"&gt;&lt;/span&gt;</div><div>=== Typical ML Models ===</div><div><br></div><div>Three characteristic examples of ML models are</div><div><br></div><div>* Linear regression model,</div><div>* Hidden Markov model (HMM) and</div><div>* Artificial Neural Network (ANN) models.</div><div><br></div><div>The linear regression model and the HMM are statistical models. They learn the conditional probability distribution &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt;. In contrast to them, ANN models learn discriminant functions to perform classification tasks.</div><div><br></div><div>&lt;span&gt; '''''Linear regression model'''''&lt;/span&gt;</div><div><br></div><div>Recall that the maximum likelihood estimate of the parameters &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; of the normally distributed random variable with pdf &lt;math display="inline"&gt;\mathbb{N}(y| \hat{y}({\bf x},{\bf w}),\sigma^2)&lt;/math&gt; equals its MSE estimate, assuming the linear relation &lt;math display="inline"&gt;{\bf \hat{y}} = {\bf X}{\bf w}&lt;/math&gt;. Here &lt;math display="inline"&gt;\hat{y}({\bf x},{\bf w})&lt;/math&gt; is the mean and &lt;math display="inline"&gt;\sigma^2&lt;/math&gt; is the given fixed variance of the normally distributed random variable. Based on this, the conditional probability density &lt;math display="inline"&gt;p(y|{\bf x})&lt;/math&gt; of the linear regression model, which predicts the one-dimensional &lt;math display="inline"&gt;y&lt;/math&gt; from the input vector &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, can be seen as that of a normally distributed random variable with pdf &lt;math display="inline"&gt;\mathbb{N}(y| \hat{y}({\bf x},{\bf w}),\sigma^2)&lt;/math&gt;. In other words</div><div><br></div><div>&lt;math display="block"&gt;p(y|{\bf x}) = \mathbb{N}(y| \hat{y}({\bf x},{\bf w}),\sigma^2).&lt;/math&gt;</div><div><br></div><div>The training of the model implements the maximum likelihood estimation of the parameters as &lt;math display="block"&gt;{\bf w} = \arg\max_{{\bf w}} p(y|{\bf x},{\bf w}).&lt;/math&gt;</div><div><br></div><div>&lt;span&gt; '''''Hidden Markov model''''' &lt;/span&gt; Hidden Markov Models are used to realize a classifier. During the classification the probabilities of the possible outputs are computed and the most probable one is chosen to be the output. In this way the probabilities of the possible outputs realize discriminant functions. Additionally, the trained parameters of the statistical model implicitly locate the classes in some observable space.</div><div><br></div><div>In an HMM there is a set of possible output symbols, the output alphabet &lt;math display="inline"&gt;V&lt;/math&gt;. In each discrete time epoch the HMM emits an output symbol, producing a sequence of outputs, &lt;math display="inline"&gt;\overrightarrow{\bf x}=(x_1,...,x_T)&lt;/math&gt;, &lt;math display="inline"&gt;x_i \in V&lt;/math&gt;, where &lt;math display="inline"&gt;i=1,...,T&lt;/math&gt; denotes the discrete time.
Moreover there is an underlying discrete-time Markov chain (DTMC) with the state space &lt;math display="inline"&gt;S&lt;/math&gt; and transition probability matrix &lt;math display="inline"&gt;{\bf A}&lt;/math&gt;. The sequence of states, &lt;math display="inline"&gt;\overrightarrow{\bf y}=(y_1,\ldots,y_T)&lt;/math&gt;, &lt;math display="inline"&gt;y_i \in S&lt;/math&gt;, &lt;math display="inline"&gt;i=1,\ldots,T&lt;/math&gt;, cannot be directly observed. Instead the sequence of output symbols can be observed. If there is a probabilistic connection between the emitted output symbol at time &lt;math display="inline"&gt;t&lt;/math&gt; and the sequence of states of the underlying DTMC, &lt;math display="inline"&gt;(y_1,...,y_T)&lt;/math&gt;, then this connection together with the DTMC determines the sequence of output symbols, &lt;math display="inline"&gt;(x_1,...,x_T)&lt;/math&gt;, in a stochastic sense. In the simplest case the output symbols are emitted independently of each other and thus the emission process can be characterized by the conditional probabilities &lt;math display="inline"&gt;b_{j,k} = P(x_t = v_j|y_t = s_k)&lt;/math&gt; for &lt;math display="inline"&gt;v_j \in V&lt;/math&gt;, &lt;math display="inline"&gt;s_k \in S&lt;/math&gt; and &lt;math display="inline"&gt;t=1,...,T&lt;/math&gt;, i.e. the emitted output symbol at time &lt;math display="inline"&gt;t&lt;/math&gt; is fully determined by the state of the underlying DTMC at that time &lt;math display="inline"&gt;t&lt;/math&gt;. These conditional probabilities can be organized in a matrix &lt;math display="inline"&gt;{\bf B} = [b_{j,k}]&lt;/math&gt;. Then the matrices &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf B}&lt;/math&gt; are the parameters of the HMM. When the HMM is used as a classifier, the given sequence of observations &lt;math display="inline"&gt;\overrightarrow{\bf x}&lt;/math&gt; serves as input and the underlying sequence of states &lt;math display="inline"&gt;\overrightarrow{\bf y}&lt;/math&gt; corresponds to the class. The conditional probability density &lt;math display="inline"&gt;p(\overrightarrow{\bf y}|\overrightarrow{\bf x})&lt;/math&gt; of the HMM model is given by</div><div><br></div><div>&lt;math display="block"&gt;p(\overrightarrow{\bf y}|\overrightarrow{\bf x}) = \frac{p(\overrightarrow{\bf y},\overrightarrow{\bf x})}{\sum_{\overrightarrow{\bf y}}p(\overrightarrow{\bf y},\overrightarrow{\bf x})}= \frac{\prod_{t=1}^{T} {\bf A}_{y_{t-1},y_{t}} \prod_{t=1}^{T} {\bf B}_{x_{t},y_{t}} }{\sum_{\overrightarrow{\bf y}}\prod_{t=1}^{T} {\bf A}_{y_{t-1},y_{t}} \prod_{t=1}^{T} {\bf B}_{x_{t},y_{t}} },&lt;/math&gt;</div><div><br></div><div>where &lt;math display="inline"&gt;{\bf A}_{y_{0},y_{1}} = P(y_1)&lt;/math&gt; by definition.</div><div><br></div><div>A given output sequence &lt;math display="inline"&gt;\overrightarrow{\bf x}&lt;/math&gt; can be created by many sequences of states of the underlying DTMC, each with some probability. Therefore, during classification, the sequence of states generating the given output sequence &lt;math display="inline"&gt;\overrightarrow{\bf x}&lt;/math&gt; with the highest probability, &lt;math display="inline"&gt;\overrightarrow{\bf y}^*&lt;/math&gt;, also called the best path, is selected to be the corresponding class.
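<div>For very short sequences this selection can be illustrated directly by enumerating all state sequences. The following sketch uses assumed toy values for &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf B}&lt;/math&gt; and follows the indexing convention of &lt;math display="inline"&gt;b_{j,k}&lt;/math&gt; above; the vector pi plays the role of &lt;math display="inline"&gt;{\bf A}_{y_0,\cdot}&lt;/math&gt;.</div><div><br></div><div>&lt;syntaxhighlight lang="python"&gt;
import itertools
import numpy as np

# Toy HMM (assumed values, for illustration only): 2 hidden states, 3 output symbols.
pi = np.array([0.6, 0.4])        # initial state probabilities, i.e. A_{y0,k} = P(y_1 = s_k)
A  = np.array([[0.7, 0.3],       # A[j, k] = P(y_t = s_k | y_{t-1} = s_j)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.1],       # B[j, k] = P(x_t = v_j | y_t = s_k), columns sum to 1
               [0.4, 0.3],
               [0.1, 0.6]])

x = [0, 2, 1]                    # observed output sequence (indices into the alphabet V)

def joint_prob(y, x):
    """p(y, x) = prod_t A_{y_{t-1}, y_t} * prod_t B_{x_t, y_t}, with the first factor from pi."""
    p = pi[y[0]] * B[x[0], y[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[x[t], y[t]]
    return p

# Brute-force search over all 2^T state sequences for the best path.
paths = list(itertools.product(range(2), repeat=len(x)))
best_path = max(paths, key=lambda y: joint_prob(y, x))
posterior = joint_prob(best_path, x) / sum(joint_prob(y, x) for y in paths)
print(best_path, posterior)
&lt;/syntaxhighlight&gt;</div><div><br></div><div>For realistic sequence lengths this brute-force enumeration is infeasible, which is why the Viterbi algorithm mentioned below is used instead.</div>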
Formally,</div><div><br></div><div>&lt;math display="block"&gt;\overrightarrow{\bf y}^* = \arg\max_{\overrightarrow{\bf y}} p(\overrightarrow{\bf y}|\overrightarrow{\bf x}) .&lt;/math&gt;</div><div><br></div><div>In practice the best path is usually computed by means of the Viterbi algorithm, which implements a computationally efficient solution of the above optimization task.</div><div><br></div><div>The training of the HMM model means learning the parameter matrices &lt;math display="inline"&gt;{\bf A}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf B}&lt;/math&gt;. This is done by the maximum likelihood estimation of these parameters from the observed output sequence, with the hidden state sequence marginalized out, as &lt;math display="block"&gt;({\bf A},{\bf B}) = \arg\max_{{\bf A},{\bf B}} p(\overrightarrow{\bf x}| {\bf A},{\bf B}).&lt;/math&gt;</div><div><br></div><div>A computationally efficient realization of the above optimization task can be developed by utilizing the EM algorithm. This results in the Baum-Welch algorithm, which is the standard algorithm used in practice for training an HMM model. Such an HMM model can be used e.g. for weather forecasting or for implementing the acoustic part of a speech recognition task.</div><div><br></div><div>&lt;span&gt; '''''Artificial Neural Network (ANN)'''''&lt;/span&gt; Artificial Neural Networks (ANN) can realize discriminant functions not necessarily having any probabilistic interpretation. This includes also non-linear functions. The realized functions can be seen as approximations determined by the weight parameters of the ANN. In this way ANNs can be considered a generalization of the conventional ML models, which motivates their use for classification tasks. An ANN is a network of artificial neurons, which is built on the principles of biological neural networks. These principles were already studied in the 1950s, and the layered network of perceptrons introduced by Frank Rosenblatt in 1958 can be considered the first ANN. ANNs consisting of at least three layers can also be used for non-linearly separable classification tasks.</div><div><br></div><div>&lt;span id="basic-concepts-of-ml"&gt;&lt;/span&gt;</div><div>=== Basic concepts of ML ===</div><div><br></div><div>In this subsection we discuss the basic concepts of ML and several related parts of statistical learning theory. These apply to learning from a dataset, i.e. the concepts of supervised and unsupervised learning are covered.</div><div><br></div><div>&lt;span&gt; '''''Generalization ability'''''&lt;/span&gt;</div><div><br></div><div>The overarching goal of machine learning from a dataset is that the ML algorithm must perform the considered task well also on unseen input examples. More concretely it means that the distribution &lt;math display="inline"&gt;p({\bf x})&lt;/math&gt; or &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt; must be learned well enough to be valid also for unseen examples, i.e. sequences of vectors &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;. Hence the challenge of any ML algorithm learning from a dataset is to establish a generalization ability.</div><div><br></div><div>&lt;span&gt; ''Training and generalization error''&lt;/span&gt;</div><div><br></div><div>Establishing a generalization ability of course requires that the ML algorithm performs the considered task well enough at least on the dataset used for learning.
Controlling these abilities first requires being able to quantify them by computing some measures. The dataset used for learning is called the training data. The dataset used for measuring the performance of the ML algorithm on new data contains unseen examples and is called the test set. The performance of the ML algorithm on the training data can be measured by the training error, which is the expected value of the error on the training set. Similarly, the test error is the expected value of the error on the test set. It measures the performance of the ML algorithm on the test set and hence also the generalization ability of the ML algorithm. Because of this the generalization error is defined to be the test error. In practice the expected value of the error is approximated by the statistical mean of the errors of the individual examples. For example, in the linear regression model with one-dimensional output the weight parameters of the regression model were trained by minimizing the MSE, which equals the training error: &lt;math display="block"&gt;\mbox{MSE} = \frac{1}{K} \lVert {\bf X}^{train}{\bf w} - {\bf y}^{train} \rVert_2^2.&lt;/math&gt;</div><div><br></div><div>The test error is expected to be greater than or equal to the training error. Thus the generalization error can be decomposed as&lt;br /&gt;</div><div>generalization error = training error + gap between training and generalization error.&lt;br /&gt;</div><div>Therefore an ML algorithm with high generalization ability can be achieved by</div><div><br></div><div># making the training error small and</div><div># making the gap between training and generalization error small.</div><div><br></div><div>&lt;span&gt; ''Overfitting, underfitting and capacity''&lt;/span&gt;</div><div><br></div><div>Two potential problems of training, related to the above two issues of training error and of the gap between training and generalization error, are overfitting and underfitting. Overfitting occurs when the machine’s model is fitted not only to the examples from the underlying distribution &lt;math display="inline"&gt;p({\bf x})&lt;/math&gt; or &lt;math display="inline"&gt;p({\bf y}|{\bf x})&lt;/math&gt; but follows some behaviour between the examples which differs from the underlying distribution. This results in a low training error but a high test error, since the test set probably also contains examples located between the training examples. Hence in case of overfitting the gap between training and generalization error is large. Underfitting occurs when the machine’s model cannot be fitted to all of the seen training examples. As a result the training error cannot be sufficiently low. Overfitting and underfitting are related to and therefore can be controlled by the representation capacity (or simply capacity) of the model. Roughly speaking it specifies the modeling capability of the model, e.g. in terms of the set of functions on its input.</div><div><br></div><div>We explain the relation of capacity to overfitting and underfitting in more detail with the help of an illustrative example. Let us assume that for one-dimensional &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, the underlying &lt;math display="inline"&gt;p(y|x)&lt;/math&gt; concentrates around a parabolic function of its input &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;.
In this case, applying a linear regression as the machine’s model &lt;math display="block"&gt;y = wx + b,&lt;/math&gt; this linear function of x, with the two parameters w and b, cannot cover all the examples being parabolic in x. This leads to underfitting. Taking a higher-degree function as the machine’s model, e.g. a polynomial function of x with degree 7 &lt;math display="block"&gt;y = \sum_{i=1}^7 w_i x^i + b,&lt;/math&gt; gives a model with eight parameters (&lt;math display="inline"&gt;w_1,\ldots,w_7&lt;/math&gt; and b) and a more complex set of functions to describe the machine’s model. The set of functions specifying the modelling capability of the machine is also called the hypothesis space. Observe that although this model is a polynomial of degree 7 in x, it is still a linear function of the parameters &lt;math display="inline"&gt;w_1,\ldots,w_7&lt;/math&gt; and b, and thus the theoretical framework of MSE can still be applied to it, so the model can still be trained by using the closed-form normal equations. Applying this model as the machine’s model can cover all the given (fewer than 7) examples from the parabolic population, but between the example x points it follows a curve that is a polynomial of degree 7 instead of a parabola. This leads to overfitting, i.e. a low training error but a high test error. The optimal model of an ML algorithm is, roughly speaking, the one whose capacity matches the capacity of the underlying distribution of the training data, or more generally the one whose capacity is appropriate for the real complexity of the task and the available amount of training data. High-capacity models are necessary to solve complex tasks, but if the needed capacity is lower than that of the machine’s model, this can lead to overfitting. In our illustrating example the optimal machine’s model is the quadratic function of the input, in other words &lt;math display="block"&gt;y = w_2 x^2 + w_1 x + b.&lt;/math&gt; Figure c shows all three cases for 5 training examples from the underlying parabolic distribution of the training data. It can be seen that the linear model yields underfitting, since the linear function cannot fit the parabolic one. The degree 7 model can cover all the examples, but it can represent infinitely many functions (not only the parabolic one) passing through the training points, because it has more parameters than the number of available training examples. Hence it can lead to overfitting.</div><div><br></div><div>Finding the best function among the set of functions specified by the representation capacity gives an optimization problem, which in most cases can be solved only numerically. In practice this implies that the ML training does not provide the theoretically best, globally optimal function, but instead a function that considerably reduces the training error. This additional limitation of the numerical optimization algorithm causes the experienced, so-called effective capacity of the model to be usually less than the model’s representational capacity.</div><div><br></div><div>The ideas about establishing the best generalization ability of ML models can be seen as an application of more general thoughts of philosophers to statistical learning theory. The principle of parsimony, nowadays widely known as Occam’s razor, states that among the potential hypothesis spaces capable of covering the observations, the simplest one should be selected.
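<div>The underfitting/overfitting behaviour of Figure c can also be reproduced numerically. The following sketch (with assumed toy data: 5 noisy training points drawn from a parabola) fits polynomials of degree 1, 2 and 7 with NumPy and compares training and test MSE:</div><div><br></div><div>&lt;syntaxhighlight lang="python"&gt;
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: 5 training and 100 test points from y = 2x^2 - x + 1 plus noise.
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = 2 * x**2 - x + 1 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = sample(5)
x_test, y_test = sample(100)

for degree in (1, 2, 7):
    # Degree 7 with only 5 points is underdetermined; polyfit returns one of the
    # infinitely many interpolating solutions (and may emit a RankWarning).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
&lt;/syntaxhighlight&gt;</div><div><br></div><div>Typically the linear fit shows a clearly higher training error (underfitting), while the degree-7 fit drives the training error to almost zero but generalizes worse than the quadratic model.</div>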
This principle of parsimony dates back to the 13th-14th century and was formulated more precisely in the 20th century by several scientists who were founders of statistical learning theory, like Vapnik, Chervonenkis or Blumer.</div><div><br></div><div>The capacity of an ML model acting as a classifier is formulated exactly for some cases by statistical learning theory. The Vapnik-Chervonenkis dimension provides the capacity of a binary classifier as the largest number m of different examples in a training set which can be labeled by the classifier arbitrarily.</div><div><br></div><div>There is a trade-off between the training error and the gap between training and generalization errors as a function of capacity. Typically the training error decreases as the capacity increases, and converges to an asymptotic minimum training error. On the other hand the generalization error is always greater than or equal to the training error, and the gap between them increases in the overfitting region as the capacity increases. This results in a U-shaped curve for the generalization error as a function of capacity. The optimal capacity should be high enough to enable a sufficiently complex hypothesis space yielding a low training error (i.e. to avoid underfitting as much as possible) and should be low enough to avoid overfitting as much as possible and thus lead to a low gap between training and generalization errors. The situation is illustrated in Figure u.</div><div><br></div><div>Figure u.</div><div><br></div><div>&lt;span&gt; ''Dependency on training data size''&lt;/span&gt;</div><div><br></div><div>If the machine knew the true underlying distribution of the training data, one might expect no error. However, even in this case there would be some error due to noise in the distribution, e.g. because &lt;math display="inline"&gt;{\bf y}&lt;/math&gt; depends also on other variables besides &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;. This remaining error, which is the theoretical minimum of the test error, is called the Bayes error.</div><div><br></div><div>When the model capacity is optimal, increasing the training data size implies that the test error asymptotically reaches the Bayes error, but the training error can fall below the Bayes error. This is because the training error is measured on the trained examples and hence the noise causing the Bayes error does not necessarily arise. When the model’s capacity lies in the underfitting region, increasing the training data size causes the training error to increase and simultaneously the test error to decrease. The first effect arises because a larger dataset is more difficult to fit. In this case the training error must increase at least up to the Bayes error. The second effect is due to the more constrained hypothesis set fitting the training data, i.e. fewer hypotheses are consistent with the training data.
Due to underfitting the test error asymptotes to an error higher than the Bayes error as the training data size increases.</div><div><br></div><div>The most important trends of the generalization error, as well as its relation to training error and capacity, can be summarized as</div><div><br></div><div>* The expected generalization error is greater than or equal to the expected training error.</div><div>* The optimal capacity is located at the minimum point of the generalization error as a function of capacity.</div><div>* The expected generalization error is monotonically decreasing with increasing training data size.</div><div>* With increasing training data size the generalization error goes asymptotically to</div><div>** the Bayes error when the model capacity is optimal and</div><div>** an error value exceeding the Bayes error when the capacity lies in the underfitting region.</div><div><br></div><div>It follows that even with a model having optimal capacity it is still possible that the gap between the training and generalization error is large. In this case this large gap, and hence also the generalization error, can be reduced by increasing the training data size.</div><div><br></div><div>On the other hand, when using extremely large training data, overfitting cannot happen, so the dominant concern is underfitting.</div><div><br></div><div>&lt;span&gt; '''''ML training relevant properties of statistical point estimation'''''&lt;/span&gt;</div><div><br></div><div>In many machine models statistical point estimation is used to compute the optimal values of the ML parameters. This is the case e.g. in the linear regression model or the HMM. Let &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\theta$}}_n&lt;/math&gt; and &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt; denote the estimate computed from &lt;math display="inline"&gt;n&lt;/math&gt; examples (i.e. samples of &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;) and the true values of the ML parameters, respectively. Recall that the estimate &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\theta$}}_n&lt;/math&gt; is a statistic of &lt;math display="inline"&gt;n&lt;/math&gt; samples taken from the underlying data-generating distribution, and hence it is itself a random quantity, whose distribution is determined by the underlying data-generating distribution. Relevant properties of the point estimation include</div><div><br></div><div>* bias,</div><div>* variance and</div><div>* consistency.</div><div><br></div><div>&lt;span&gt; '''''Bias, variance and MSE'''''&lt;/span&gt;</div><div><br></div><div>The bias of the estimator &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\theta$}}&lt;/math&gt; is the difference between its expectation over a data set and its true value, in other words &lt;math display="block"&gt;Bias(\hat{\mbox{\boldmath$\theta$}}_n) = E[\hat{\mbox{\boldmath$\theta$}}_n]- {\mbox{\boldmath$\theta$}}.&lt;/math&gt;</div><div><br></div><div>An estimator is unbiased if its bias equals &lt;math display="inline"&gt;0&lt;/math&gt; and asymptotically unbiased if &lt;math display="inline"&gt;\lim_{n \rightarrow \infty } Bias(\hat{\mbox{\boldmath$\theta$}}_n) = 0&lt;/math&gt;. For example the sample mean &lt;math display="inline"&gt;\frac{1}{n}\sum_{i=1}^n {\bf x}^{(i)}&lt;/math&gt; of the examples &lt;math display="inline"&gt;{\bf x}^{(i)}&lt;/math&gt;, &lt;math display="inline"&gt;i=1,\ldots,n&lt;/math&gt;, is an unbiased estimator of the mean.
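<div>This unbiasedness can be checked empirically; the following minimal sketch (with an assumed true mean and standard deviation) estimates the bias of the sample mean by simulation over many repeated draws:</div><div><br></div><div>&lt;syntaxhighlight lang="python"&gt;
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, repeats = 3.0, 2.0, 10, 200_000   # assumed true parameters

# Draw many datasets of size n and compute the sample mean of each.
sample_means = rng.normal(mu, sigma, size=(repeats, n)).mean(axis=1)

# The average of the sample means approximates E[mu_hat_n]; its deviation
# from mu estimates the bias, which should be close to 0.
print("estimated bias:", sample_means.mean() - mu)
&lt;/syntaxhighlight&gt;</div>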
In contrast, the sample variance &lt;math display="inline"&gt;\frac{1}{n}\sum_{i=1}^n \left({\bf x}^{(i)} - \hat{\mbox{\boldmath$\mu$}}_n\right)^2&lt;/math&gt;, where &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\mu$}}_n&lt;/math&gt; is the sample mean, has a non-zero bias. It needs a slight modification to give the unbiased sample variance as &lt;math display="inline"&gt;\frac{1}{n-1}\sum_{i=1}^n \left({\bf x}^{(i)} - \hat{\mbox{\boldmath$\mu$}}_n\right)^2&lt;/math&gt;.</div><div><br></div><div>The variance of the estimator &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\theta$}}&lt;/math&gt; is simply its variance, &lt;math display="block"&gt;Var(\hat{\mbox{\boldmath$\theta$}}_n).&lt;/math&gt; The variance of the estimator expresses the uncertainty of the estimation. The standard error of the estimator is the square root of its variance and is denoted by &lt;math display="inline"&gt;SE(\hat{\mbox{\boldmath$\theta$}}_n)&lt;/math&gt;. The standard error of the sample mean can be given as</div><div><br></div><div>&lt;math display="block"&gt;SE(\frac{1}{n}\sum_{i=1}^n {\bf x}^{(i)}) = \sqrt{Var(\frac{1}{n}\sum_{i=1}^n {\bf x}^{(i)})} = \sqrt{\frac{1}{n^2}\sum_{i=1}^n Var({\bf x}^{(i)})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}},&lt;/math&gt; where &lt;math display="inline"&gt;\sigma^2&lt;/math&gt; is the true variance of the underlying distribution. This means that the variance of the sample mean decreases inversely proportionally to the number of samples. Equivalently, the standard error of the sample mean decreases with the square root of the number of samples, i.e. slower than linearly. In practice the standard error usually has higher relevance than the variance.</div><div><br></div><div>The MSE of an estimate &lt;math display="inline"&gt;\hat{\mbox{\boldmath$\theta$}}_n&lt;/math&gt; is given by &lt;math display="block"&gt;MSE(\hat{\mbox{\boldmath$\theta$}}_n) = E[(\hat{\mbox{\boldmath$\theta$}}_n - {\mbox{\boldmath$\theta$}})^2].&lt;/math&gt;</div><div><br></div><div>There is a meaningful relation between the bias, the variance and the MSE of an estimate as &lt;math display="block"&gt;MSE(\hat{\mbox{\boldmath$\theta$}}_n) = Bias^2(\hat{\mbox{\boldmath$\theta$}}_n) + Var(\hat{\mbox{\boldmath$\theta$}}_n).&lt;/math&gt;</div><div><br></div><div>This can be derived simply from the definitions as &lt;math display="block"&gt;\begin{aligned}</div><div>MSE(\hat{\mbox{\boldmath$\theta$}}_n)&nbsp; &&\hspace{-5mm}= E[(\hat{\mbox{\boldmath$\theta$}}_n - {\mbox{\boldmath$\theta$}})^2] = E[\hat{\mbox{\boldmath$\theta$}}_n^2]-2 E[\hat{\mbox{\boldmath$\theta$}}_n] {\mbox{\boldmath$\theta$}} + {\mbox{\boldmath$\theta$}}^2&nbsp; \nonumber \\</div><div>&nbsp;&&\hspace{-5mm}= \left(E[\hat{\mbox{\boldmath$\theta$}}_n^2] - E^2[\hat{\mbox{\boldmath$\theta$}}_n]\right) + \left(E^2[\hat{\mbox{\boldmath$\theta$}}_n] -2 E[\hat{\mbox{\boldmath$\theta$}}_n] {\mbox{\boldmath$\theta$}} + {\mbox{\boldmath$\theta$}}^2\right) \nonumber \\</div><div>&nbsp;&&\hspace{-5mm}= Var(\hat{\mbox{\boldmath$\theta$}}_n) + \left(E[\hat{\mbox{\boldmath$\theta$}}_n] - {\mbox{\boldmath$\theta$}}\right)^2 = Var(\hat{\mbox{\boldmath$\theta$}}_n) + Bias^2(\hat{\mbox{\boldmath$\theta$}}_n).&nbsp; \nonumber</div><div>\end{aligned}&lt;/math&gt;</div><div><br></div><div>When the generalization error is measured not as the error in the output &lt;math display="inline"&gt;{\bf y}&lt;/math&gt;, but instead as the MSE of the ML parameters, then the dependency of the generalization error on capacity still has the same
U-shaped curve, since the goodness of the ML training is reflected at every error level (error in the output, error in the ML parameters, MSE of the ML parameters, ...) in a similar way. Then bias and variance are the components of the generalization error. An increase in capacity also results in an increase in variance (recall that the number of degree-&lt;math display="inline"&gt;n&lt;/math&gt; polynomials fitting the same number of training examples increases with &lt;math display="inline"&gt;n&lt;/math&gt;, as discussed in the linear regression task). Therefore the bias has to decrease with increasing capacity. The relations among bias, variance, MSE and capacity are shown in Figure dd.</div><div><br></div><div>Figure dd.</div><div><br></div><div>&lt;span&gt; '''''Consistency'''''&lt;/span&gt;</div><div><br></div><div>An estimator is consistent if the estimated value goes to the true parameter value as the number of samples &lt;math display="inline"&gt;n&lt;/math&gt; goes to infinity. In other words &lt;math display="block"&gt;\lim_{n \rightarrow \infty } \hat{\mbox{\boldmath$\theta$}}_n = {\mbox{\boldmath$\theta$}}.&lt;/math&gt;</div><div><br></div><div>If the convergence is in probability then the consistency property is called weak consistency.</div><div><br></div><div>Maximum likelihood estimation has the nice property that under some conditions it is consistent. Recall that maximum likelihood estimation is often used in ML training, like e.g. in the case of the linear regression model or the HMM.</div><div><br></div><div>&lt;span&gt; '''''On the specificity level of ML algorithms'''''&lt;/span&gt;</div><div><br></div><div>From the point of view of statistical learning theory an ML algorithm can generalize from a finite number of seen training examples. The question arises whether an ML algorithm can be established which can generalize for every possible ML task. Intuitively this seems to be contradictory, since to infer also on unseen examples the machine’s model would need information about every possible example, e.g. in a form covering the characteristics of the underlying population completely.</div><div><br></div><div>&lt;span&gt; ''No free lunch theorem''&lt;/span&gt;</div><div><br></div><div>The answer to this question is given by the no free lunch theorem. It states that, averaged over all possible underlying data-generating distributions, every classification algorithm has the same test error. Hence, as expected, there is no universal best classifier.</div><div><br></div><div>Fortunately this result holds only over all possible underlying distributions. For a real-world task we can restrict the admissible data-generating distributions by making some assumptions about it. This encourages us to incorporate such assumptions, which are specific to the considered task, as further information into the design of the ML algorithm for that task. Such information can be e.g.
some knowledge about the form of the underlying distribution (like its distribution family or the true capacity describing the set of functions needed to capture the underlying distribution completely).</div><div><br></div><div>As a conclusion, the right approach to designing an ML algorithm is not to look for a universal best algorithm, but instead to design a specific ML algorithm by building in some preferences to narrow its scope to the task-relevant underlying distributions.</div><div><br></div><div>&lt;span&gt; ''Hyperparameters''&lt;/span&gt; The usual direct objective of the learning is to determine several parameters of the ML model by minimizing the training error based on the machine’s model. For example in the case of linear regression the parameters in the vector &lt;math display="inline"&gt;{\bf w}&lt;/math&gt; are determined by minimizing the squared error on the training data. However, as we have seen, the model with the parameters determined this way does not necessarily lead to the best generalization error, since it depends also on the representation capacity of the model. Hence the optimal capacity must also be determined, usually by iteratively repeating the training with different capacity settings. Consequently the capacity as a model parameter is not determined by the ML algorithm itself.</div><div><br></div><div>Such model parameters, which are used to control the ML algorithm but are not determined by the ML algorithm itself, are called hyperparameters. The capacity is one example of a hyperparameter, but ML algorithms can have many other hyperparameters. A setting is assigned to be a hyperparameter when learning it on the training set is not appropriate</div><div><br></div><div>* because the resulting optimization would be difficult, or</div><div>* more often, because that setting cannot be learned by the ML algorithm on the training data.</div><div><br></div><div>An example of the second case is the capacity. If the ML algorithm tried to learn the capacity on the training data, then the maximum available capacity would always be chosen, since the training error decreases with increasing capacity. So it would result in overfitting.</div><div><br></div><div>To resolve this issue, another set of data is used, which was not observed during the training and also does not overlap with the test set. This set of data is called the validation set. Usually the validation set is split from the training set. The remaining part of the training set is used to learn the parameters and the validation set is used to measure the generalization error after the training and to update the hyperparameters accordingly.</div><div><br></div><div>Note that the validation is also part of the &quot;training&quot; of the whole model including the hyperparameters, and hence the validation error typically underestimates the generalization error measured on the test set.</div><div><br></div><div>&lt;span&gt; ''Regularization''&lt;/span&gt;</div><div><br></div><div>The capacity as a hyperparameter in the linear regression task shapes the hypothesis space by excluding sets of functions from it. Another and more general way of controlling the capacity is to give preference to one subset of the hypothesis space over another one.
This can be realized by modifying the training criterion for linear regression by adding a so-called weight decay to it as</div><div><br></div><div>&lt;math display="block"&gt;\arg\min_{{\bf w}} MSE({\bf w}) + \lambda {\bf w}^T {\bf w},&lt;/math&gt;</div><div><br></div><div>where &lt;math display="inline"&gt;\lambda&lt;/math&gt; is a hyperparameter controlling the strength of preferring smaller weights (i.e. smaller &lt;math display="inline"&gt;{\bf w}^T {\bf w}&lt;/math&gt;). Here &lt;math display="inline"&gt;MSE({\bf w})&lt;/math&gt; allows also functions of higher degree in &lt;math display="inline"&gt;{\bf x}&lt;/math&gt;, but still linear in &lt;math display="inline"&gt;{\bf w}&lt;/math&gt;. The above minimization forces the resulting weights to realize a tradeoff between fitting the training examples well and having a small norm. Applying it to the example of Figure c, setting &lt;math display="inline"&gt;\lambda= 0&lt;/math&gt; would lead to overfitting, but with a moderate value of &lt;math display="inline"&gt;\lambda&lt;/math&gt; the weight decay would make the model fit a simpler function with smaller coefficients. Hence setting a moderate value of &lt;math display="inline"&gt;\lambda&lt;/math&gt; would result in a function that is a good approximation of the true function describing the data set.</div><div><br></div><div>More generally, regularizing an ML model learning the function &lt;math display="inline"&gt;y({\bf x},{\mbox{\boldmath$\theta$}})&lt;/math&gt; means adding a regularizer to it in order to set preferences for different solutions. There are also other ways of expressing preferences for some solutions over others. In general, regularization refers to any modification made to the ML training algorithm aiming to reduce the generalization error but usually not the training error. In the above weight decay example, reducing the gap between generalization and training error means finding the optimal capacity and thus achieving the best generalization ability.</div><div><br></div><div>Note that regularization does not eliminate the necessity of having hyperparameters, but enables an alternative, more general way of controlling the generalization ability. According to the no free lunch theorem there is no universal regularization; instead the regularization must be designed to be well suited to the particular task. However, general-purpose forms of regularization can be suitable for a wide range of tasks, and therefore such regularization can be seen as a step towards generalization of ML algorithms.</div><div><br></div><div>&lt;span id="optimization-framework-for-training-ml-models"&gt;&lt;/span&gt;</div><div>=== Optimization framework for training ML models ===</div><div><br></div><div>&lt;span&gt; '''''From mathematical optimization to ML problem'''''&lt;/span&gt;</div><div><br></div><div>The goal of ML can be described as minimizing the error between the predicted values produced by the trained ML model and the true values on the test data.
This can be formalized into a general optimization framework, for the case of a one-dimensional output, by allowing a more general per-example loss function &lt;math display="inline"&gt;L({\bf x},y,{\mbox{\boldmath$\theta$}})&lt;/math&gt; instead of an error measure, as</div><div><br></div><div>&lt;math display="block"&gt;R^*({\mbox{\boldmath$\theta$}}) = E_{({\bf x},y)\sim p } [L({\bf x},y,{\mbox{\boldmath$\theta$}})],&lt;/math&gt;</div><div><br></div><div>where &lt;math display="inline"&gt;E[]&lt;/math&gt; stands for the expectation, which is taken over the true underlying data-generating distribution &lt;math display="inline"&gt;p&lt;/math&gt;. The quantity &lt;math display="inline"&gt;R^*({\mbox{\boldmath$\theta$}})&lt;/math&gt; is called the risk. Thus the optimization task formalizing the goal of ML can be given as</div><div><br></div><div>&lt;math display="block"&gt;{\mbox{\boldmath$\theta$}} = \arg\min_{{\mbox{\boldmath$\theta$}}} R^*({\mbox{\boldmath$\theta$}}) = \arg\min_{{\mbox{\boldmath$\theta$}}} E_{({\bf x},y)\sim p } [L({\bf x},y,{\mbox{\boldmath$\theta$}})].&lt;/math&gt;</div><div><br></div><div>However, in an ML problem the true underlying data-generating distribution &lt;math display="inline"&gt;p&lt;/math&gt; is not known. Instead we can compute the expected loss on the training data, resulting in the empirical risk &lt;math display="inline"&gt;R({\mbox{\boldmath$\theta$}})&lt;/math&gt; as</div><div><br></div><div>&lt;math display="block"&gt;R({\mbox{\boldmath$\theta$}}) = E_{({\bf x},y) \sim \hat{p} } [L({\bf x},y,{\mbox{\boldmath$\theta$}})] \approx \frac{1}{m} \sum_{i=1}^{m} L({\bf x}^{(i)}, y^{(i)}, {\mbox{\boldmath$\theta$}} ),&lt;/math&gt;</div><div><br></div><div>where &lt;math display="inline"&gt;\hat{p}&lt;/math&gt; is the training data-generating distribution, &lt;math display="inline"&gt;m&lt;/math&gt; is the number of training examples, and &lt;math display="inline"&gt;{\bf x}^{(i)}&lt;/math&gt; and &lt;math display="inline"&gt;y^{(i)}&lt;/math&gt; are the input and output of the &lt;math display="inline"&gt;i-&lt;/math&gt;th example, &lt;math display="inline"&gt;i=1,\ldots,m&lt;/math&gt;, respectively. Hence the ML training can be formalized into an optimization task as</div><div><br></div><div>&lt;math display="block"&gt;\begin{aligned}</div><div>{\mbox{\boldmath$\theta$}} &&\hspace{-5mm}= \arg\min_{{\mbox{\boldmath$\theta$}}} R({\mbox{\boldmath$\theta$}}) = \arg\min_{{\mbox{\boldmath$\theta$}}} E_{({\bf x},y) \sim \hat{p} } [L({\bf x},y,{\mbox{\boldmath$\theta$}})]\nonumber \\</div><div>&&\hspace{-5mm} \approx \arg\min_{{\mbox{\boldmath$\theta$}}} \frac{1}{m} \sum_{i=1}^{m} L({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}}) = \arg\min_{{\mbox{\boldmath$\theta$}}} \sum_{i=1}^{m} L({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}}).\nonumber</div><div>\end{aligned}&lt;/math&gt;</div><div><br></div><div>When maximum likelihood estimation is used in ML training, like e.g.
in the HMM model, then its optimization task formalization is given by &lt;math display="block"&gt;\begin{aligned}</div><div>&nbsp;{\mbox{\boldmath$\theta$}} &&\hspace{-5mm}= \arg\max_{{\mbox{\boldmath$\theta$}}} \sum_{i=1}^{m} \log p_{\mbox{~distribution~family~}}({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}})\nonumber \\</div><div>&nbsp;&&\hspace{-5mm}= \arg\min_{{\mbox{\boldmath$\theta$}}} \frac{1}{m}\sum_{i=1}^{m} \left(-\log p_{\mbox{~distribution~family~}}({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}})\right)\nonumber \\</div><div>&nbsp;&&\hspace{-5mm}\approx&nbsp; \arg\min_{{\mbox{\boldmath$\theta$}}} E_{({\bf x},y)\sim \hat{p} } \left[-\log p_{\mbox{~distribution~family~}} ({\bf x},y,{\mbox{\boldmath$\theta$}})\right].\nonumber</div><div>\end{aligned}&lt;/math&gt;</div><div><br></div><div>Hence the loss function corresponds to the negative log likelihood. Other typical choices of the loss function include</div><div><br></div><div>* the squared error &lt;math display="inline"&gt;(\hat{\theta}({\bf x})-y)^2&lt;/math&gt;, e.g. in MSE minimization for linear regression,</div><div>* the squared error of the output &lt;math display="inline"&gt;= E[(\hat{y}({\bf x},{\mbox{\boldmath$\theta$}})-y)^2]&lt;/math&gt; or</div><div>* the error as the difference between predicted and true output, i.e. &lt;math display="inline"&gt;\hat{y}({\bf x},{\mbox{\boldmath$\theta$}})-y&lt;/math&gt;, like e.g. in most classification tasks.</div><div><br></div><div>If the machine learning problem were a pure mathematical optimization task, then it could be solved by applying a mathematical optimization algorithm. However, the machine learning problem differs from pure mathematical optimization in the following aspects:</div><div><br></div><div># The true underlying data-generating distribution &lt;math display="inline"&gt;p&lt;/math&gt; is not known&lt;br /&gt;</div><div>&lt;math display="inline"&gt;\Rightarrow&lt;/math&gt; The empirical risk is used instead of the risk.</div><div># In ML problems loss functions can arise which cannot be optimized efficiently by numerical optimization methods. &lt;math display="inline"&gt;\Rightarrow&lt;/math&gt; In such cases a surrogate loss function is used instead of the original loss function.</div><div># ML training stops according to an early stopping criterion instead of reaching a minimum. &lt;math display="inline"&gt;\Rightarrow&lt;/math&gt; Early stopping is usually based on detecting the start of overfitting. &lt;math display="inline"&gt;\Rightarrow&lt;/math&gt; The (surrogate) loss function can still have large derivatives at stopping.</div><div># The objective function of the optimization in an ML problem is usually an expectation of the loss over the training data. &lt;math display="inline"&gt;\Rightarrow&lt;/math&gt; The (approximation of the) objective function usually has a form including a sum over the training examples.</div><div><br></div><div>&lt;span&gt; '''''Organizing the training process'''''&lt;/span&gt;</div><div><br></div><div>The optimization in ML training is performed by means of numerical optimization techniques. First-order methods only need to compute the gradient of the objective function in each iteration. Second-order methods additionally compute the Hessian matrix, or an approximation of it, in each iteration.
In either case the gradient is computed in each iteration.</div><div><br></div><div>Several arguments show that it is advisable to take into account the following recommendations when organizing the training process.</div><div><br></div><div># Each iteration in a training process should be organized over a smaller number of training examples, so-called minibatches, instead of the full, large training set.</div><div>#* The gradient of the objective function must be computed on every training example used in an iteration. Therefore smaller minibatches reduce the per-iteration computation. However, to retain a good approximation of the real gradient, the training examples of the minibatch have to be selected randomly from the training data. This is also the basic idea of the stochastic gradient descent algorithm.</div><div>#* Recall that the standard error of the sample mean decreases slower than linearly (&lt;math display="inline"&gt;\frac{\sigma}{\sqrt{n}}&lt;/math&gt;) with the number of samples. This speaks for not using too many training examples, since above a certain range of &lt;math display="inline"&gt;n&lt;/math&gt; it does not pay off.</div><div>#* Also the potential redundancy present in the training data motivates estimating the gradient from a small number of samples rather than computing it from a larger training set.</div><div># It is advisable to shuffle the examples in the training set before training the ML model.</div><div>#* In many fields the datasets contain correlated examples due to the natural way of arranging them, e.g. in time order. This is an intrinsic property of such datasets. To ensure the i.i.d. assumption of the underlying data-generating distribution, it is necessary to randomize the training data.</div><div>#* The random composition of a minibatch would require uniform sampling of the training set at each iteration. This would be impractical in case of very large training sets. Instead it is enough to shuffle the dataset only once, store it and use it for the rest of the training process.</div><div><br></div><div>Optimization algorithms in ML are classified according to the used portion of the training set as</div><div><br></div><div>* Batch or deterministic (gradient) methods - process all training examples in each iteration, leading to a large batch.</div><div>* Minibatch or minibatch stochastic methods - use a smaller number of training examples, a minibatch, in each iteration.</div><div>* Stochastic methods - use only one training example in each iteration.</div><div>* Online methods - take the examples from a stream that creates examples continually.</div><div><br></div><div>Nowadays it is common to call minibatch stochastic methods simply stochastic methods.</div><div><br></div><div>The proper minibatch size is influenced by the following factors:</div><div><br></div><div># The most important factor is the order of the applied numerical optimization algorithm.
First-order methods based only on the gradient (&lt;math display="inline"&gt;{\bf g}&lt;/math&gt;) are robust and need smaller batch sizes, while second-order methods, which also compute the Hessian matrix (&lt;math display="inline"&gt;{\bf H}&lt;/math&gt;) in each iteration, require much larger batch sizes to average out fluctuations in &lt;math display="inline"&gt;{\bf H}^{-1}{\bf g}&lt;/math&gt;, which is used for updating the ML parameters.</div><div># Hardware related considerations</div><div>#* Efficient utilization of multicore architectures requires a minimum batch size.</div><div>#* On special hardware, like Graphics Processing Units (GPU), minibatch sizes that are powers of 2 usually produce better run times. Typical sizes range from 32 to 256.</div><div>#* In case of parallel processing of minibatches, the amount of necessary memory increases with the minibatch size. Thus the available memory provides an upper limit for the minibatch size.</div><div># Small minibatch sizes cause a regularizing effect due to implicitly adding some noise to the learning process. However, they produce high variance and hence a very small learning rate must be set to keep the training process stable. This, together with the small minibatch size, results in a very long training time.</div><div><br></div><div>&lt;span&gt; '''''Stochastic gradient descent'''''&lt;/span&gt;</div><div><br></div><div>Stochastic gradient descent (SGD) is the most widely used numerical optimization method in ML. The major idea of SGD is to compose a small minibatch for each iteration in the training by randomly sampling the training set. Due to the random sampling, computing the gradient of the objective function as the mean of the gradients of the loss of such sampled examples gives an estimate of the gradient of the mean loss over the true data-generating distribution, i.e. the gradient of the generalization error (when loss = error). In other words:</div><div><br></div><div>&lt;math display="block"&gt;\frac{1}{m}\sum_{i=1}^{m}\nabla_{{\mbox{\boldmath$\theta$}}} L({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}}) \approx E_{({\bf x},y)\sim p } [\nabla_{{\mbox{\boldmath$\theta$}}} L({\bf x},y,{\mbox{\boldmath$\theta$}})] = \nabla_{{\mbox{\boldmath$\theta$}}} E_{({\bf x},y)\sim p } [L({\bf x},y,{\mbox{\boldmath$\theta$}})].&lt;/math&gt;</div><div><br></div><div>Of course this holds only if new examples are used in each minibatch. When the same training dataset is passed through multiple times, each run over the training set is called an epoch. In this case only the first epoch estimates the gradient of the generalization error; the additional epochs rather estimate the gradient of the training error. However, usually the additional epochs decrease the training error more strongly than they increase the gap between training and generalization error caused by harming the independence of the examples in the consecutive epochs. Therefore most implementations of SGD first shuffle the training set once and then pass through it multiple times.</div><div><br></div><div>Similar to batch gradient descent, SGD also has a learning rate parameter, &lt;math display="inline"&gt;\epsilon&lt;/math&gt;. However, while batch gradient descent can work with a fixed &lt;math display="inline"&gt;\epsilon&lt;/math&gt;, the proper use of SGD requires the learning rate to decrease gradually over time. This is necessary because the random sampling used in SGD introduces some noise in each iteration. With a non-decreasing learning rate this noise would prevent the updates from becoming small as the number of iterations increases.
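<div>As a minimal concrete sketch of such a minibatch SGD loop (assuming a generic per-example gradient function grad_loss that returns the mean gradient over a minibatch, and using the linearly decaying learning rate schedule discussed further below), one possible Python implementation is:</div><div><br></div><div>&lt;syntaxhighlight lang="python"&gt;
import numpy as np

def sgd(theta, X, Y, grad_loss, m=32, eps0=0.01, epsT=0.001, T=500, n_iters=1000, seed=0):
    """Minibatch SGD with a linearly decaying learning rate (sketch).

    grad_loss(X_batch, Y_batch, theta) is assumed to return the mean gradient
    of the per-example loss over the minibatch, as an array shaped like theta.
    The minibatch size m is assumed to be at most the training set size.
    """
    rng = np.random.default_rng(seed)
    for k in range(n_iters):
        # learning rate: linear decay from eps0 to epsT over the first T iterations
        alpha = min(k / T, 1.0)
        eps_k = (1 - alpha) * eps0 + alpha * epsT
        # sample a random minibatch of m examples from the training data
        idx = rng.choice(len(X), size=m, replace=False)
        g = grad_loss(X[idx], Y[idx], theta)
        # gradient step on the parameters
        theta = theta - eps_k * g
    return theta
&lt;/syntaxhighlight&gt;</div><div><br></div><div>Here a fixed number of iterations replaces the stopping criterion of the schematic algorithm below; in practice stopping is usually based on detecting the start of overfitting, as discussed above.</div>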
The schematic operation of the SGD algorithm is shown in Algorithm [[#tab:alg_sgd|[tab:alg_sgd]]]</div><div><br></div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Algorithm&nbsp; Stochastic gradient descent&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Input:&lt;br /&gt;</div><div>- Learning rate parameters describing the sequence of learning rates &lt;math display="inline"&gt;\epsilon_1,\epsilon_2,\ldots&lt;/math&gt;&lt;br /&gt;</div><div>- Initial parameter &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_{\mbox{init}}&lt;/math&gt;&lt;br /&gt;</div><div>Output: the estimated parameters &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt;.&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>1 Initialization: &lt;math display="inline"&gt;k=1&lt;/math&gt;, &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_1 = {\mbox{\boldmath$\theta$}}_{\mbox{init}}&lt;/math&gt; and set &lt;math display="inline"&gt;\epsilon_1&lt;/math&gt;&lt;br /&gt;</div><div>2 while stop criterion is NOT fulfilled&lt;br /&gt;</div><div>3&nbsp; &nbsp;sample &lt;math display="inline"&gt;m&lt;/math&gt; examples of the training data to compose&lt;br /&gt;</div><div>the minibatch &lt;math display="inline"&gt;({\bf x}^{(1)}, {\bf y}^{(1)}) \ldots, ({\bf x}^{(m)},{\bf y}^{(m)})&lt;/math&gt;&lt;br /&gt;</div><div>4&nbsp; &nbsp;compute the gradient of the actual loss function as &lt;math display="inline"&gt;{\bf g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_{{\mbox{\boldmath$\theta$}}}&nbsp; L({\bf x}^{(i)}, {\bf y}^{(i)}, {\mbox{\boldmath$\theta$}}_k)&lt;/math&gt;&lt;br /&gt;</div><div>5&nbsp; &nbsp;Update: compute the next value of &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt; as &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_{k+1} = {\mbox{\boldmath$\theta$}}_k - \epsilon_k {\bf g}&lt;/math&gt;&lt;br /&gt;</div><div>6&nbsp; &nbsp;Increment &lt;math display="inline"&gt;k&lt;/math&gt; as &lt;math display="inline"&gt;k=k+1&lt;/math&gt; and set &lt;math display="inline"&gt;\epsilon_k&lt;/math&gt;&lt;br /&gt;</div><div>7 end &lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div><br></div><div><br></div><div>The only remaining question is how to specify the decreasing sequence of learning rates &lt;math display="inline"&gt;\epsilon_1,\epsilon_2,\ldots&lt;/math&gt;. First of all, to ensure the convergence of SGD, the following sufficient condition must be taken into account:</div><div><br></div><div>&lt;math display="block"&gt;\begin{aligned}</div><div>&&\hspace{-5mm}\sum_{k=1}^{\infty}\epsilon_k = \infty, \mbox{~and~} \nonumber \\</div><div>&&\hspace{-5mm}</div><div>\sum_{k=1}^{\infty}\epsilon_k^2 &lt; \infty.
\nonumber</div><div>\end{aligned}&lt;/math&gt;</div><div><br></div><div>A commonly used way in practice is to let the learning rate decrease linearly in the first &lt;math display="inline"&gt;T&lt;/math&gt; iterations and then let it remain constant, as &lt;math display="block"&gt;\epsilon_k = \left\{</div><div>\begin{array}{l}</div><div>(1- \frac{k}{T}) \epsilon_0 + \frac{k}{T} \epsilon_T, \mbox{~if~} k=1,\ldots,T&nbsp; \\</div><div>\epsilon_T, \mbox{~if~} k &gt; T</div><div>\end{array}</div><div>\right\}.&lt;/math&gt;</div><div><br></div><div>Best practices for setting the values of &lt;math display="inline"&gt;T&lt;/math&gt;, &lt;math display="inline"&gt;\epsilon_0&lt;/math&gt; and &lt;math display="inline"&gt;\epsilon_T&lt;/math&gt; are provided as</div><div><br></div><div>* usually &lt;math display="inline"&gt;T&lt;/math&gt; is set so that several hundred iterations are performed until stopping,</div><div>* &lt;math display="inline"&gt;\epsilon_0 \approx \frac{1}{100}&lt;/math&gt; and</div><div>* &lt;math display="inline"&gt;\epsilon_0&lt;/math&gt; is set somewhat higher than the learning rate that would lead to the best performance after the first 100 iterations, but not too high, in order to avoid strong oscillations in the objective function.</div><div><br></div><div>Setting &lt;math display="inline"&gt;\epsilon_0&lt;/math&gt; too small should be avoided, since it would slow down the learning process.</div><div><br></div><div>The convergence of SGD can be characterized by the so-called excess error, which is the difference between the value of the objective function after &lt;math display="inline"&gt;k&lt;/math&gt; iterations and its minimum value at the convergence of the algorithm. SGD makes rapid initial progress in the excess error, which then becomes asymptotically slow. For convex problems the excess error decreases with &lt;math display="inline"&gt;\mathcal{O}(\frac{1}{\sqrt{k}})- \mathcal{O}(\frac{1}{k})&lt;/math&gt;.</div><div><br></div><div>The most important advantages of SGD can be summarized as</div><div><br></div><div># SGD training minimizes the generalization error instead of the training error, as long as training examples are not reused.</div><div># In SGD the computation time of an update, i.e. one pass through a minibatch, does not grow with the size of the training data, since it does not depend on it. This ensures convergence in a practically reasonable time span also when the training data size becomes very large.</div><div># SGD makes rapid progress after a small number of steps due to evaluating the gradient from very few examples.</div><div><br></div><div>&lt;span id="on-design-of-ml-algorithm"&gt;&lt;/span&gt;</div><div>=== On design of ML algorithm ===</div><div><br></div><div>The elements of an ML algorithm are</div><div><br></div><div>* model,</div><div>* loss function,</div><div>* numerical optimization method and</div><div>* dataset.</div><div><br></div><div>The machine model and the loss function have to be chosen to be appropriate for the particular task. Traditionally the objective function and constraints of the ML problem (without ANNs) are designed to result in a convex optimization problem.
<div><br></div><div>&lt;span id="on-design-of-ml-algorithm"&gt;&lt;/span&gt;</div><div>=== On the design of ML algorithms ===</div><div><br></div><div>The elements of an ML algorithm are</div><div><br></div><div>* the model,</div><div>* the loss function,</div><div>* the numerical optimization method and</div><div>* the dataset.</div><div><br></div><div>The machine model and the loss function have to be chosen appropriately for the particular task. Traditionally the objective function and the constraints of the ML problem (without ANNs) are designed to result in a convex optimization problem. Thus the difficulties of a general optimization problem do not arise and hence both first-order and second-order numerical optimization methods can be applied.</div><div><br></div><div>Several best practices on the partitioning and the necessary sizes of the dataset are:</div><div><br></div><div># Partitioning the dataset</div><div>#* partition the dataset into training, validation and test sets,</div><div>#* typically about 80 % of the training examples are used for training and 20 % for validation.</div><div># The necessary magnitude of the minibatch size in the training:</div><div>#* &lt;math display="inline"&gt;\approx&lt;/math&gt; 100 - in case of using first-order numerical optimization methods, like (stochastic) gradient descent,</div><div>#* &lt;math display="inline"&gt;\approx&lt;/math&gt; 10000 - in case of using second-order numerical optimization methods.</div><div># Tip: if only a small test set is available, use cross-validation.</div><div><br></div><div>The schematic algorithm of determining the parameters of the machine’s model is given in Algorithm [[#tab:schem_alg_mod_pars|[tab:schem_alg_mod_pars]]]; a code sketch follows after the algorithm.</div><div><br></div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Algorithm&nbsp; Schematic algorithm of determining the model parameters&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Input:&lt;br /&gt;</div><div>- machine’s model,&lt;br /&gt;</div><div>- loss function,&lt;br /&gt;</div><div>- numerical optimization method,&lt;br /&gt;</div><div>- training data, validation data, test data.&lt;br /&gt;</div><div>Output: optimal model parameters.&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>1 Initialize hyperparameters&lt;br /&gt;</div><div>2 while stop criterion based on a pass through the validation set is NOT fulfilled&lt;br /&gt;</div><div>3&nbsp; &nbsp;update hyperparameters&lt;br /&gt;</div><div>4&nbsp; &nbsp;initialize &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt;&lt;br /&gt;</div><div>5&nbsp; &nbsp;while stop criterion based on the training error is NOT fulfilled&lt;br /&gt;</div><div>6&nbsp; &nbsp; &nbsp;update &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt; by means of an iteration step based on a pass through the training data&lt;br /&gt;</div><div>7&nbsp; &nbsp;end&lt;br /&gt;</div><div>8 end&lt;br /&gt;</div><div>9 compute the test error based on a pass through the test set&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div>
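<div><br></div>
<div>The following sketch illustrates, under stated assumptions, the structure of this schematic algorithm for a very simple model: the data are partitioned into training, validation and test sets, an outer loop updates a hyperparameter, an inner step determines the model parameters, and the test error is computed once at the end. The chosen model (ridge regression, whose iterative inner training loop is replaced by a closed-form solution for brevity), the split ratios and the hyperparameter grid are assumptions made only for this example.</div>
<div><br></div>
<div>&lt;pre&gt;</div>
<div>import numpy as np</div>
<div><br></div>
<div>def split_data(X, y, train=0.8, val=0.1, seed=0):</div>
<div>    """Shuffle and partition the data, here into 80 % training, 10 % validation, 10 % test."""</div>
<div>    rng = np.random.default_rng(seed)</div>
<div>    idx = rng.permutation(len(X))</div>
<div>    n_tr, n_val = int(train * len(X)), int(val * len(X))</div>
<div>    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]</div>
<div>    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])</div>
<div><br></div>
<div>def mse(X, y, theta):</div>
<div>    return float(np.mean((X @ theta - y) ** 2))</div>
<div><br></div>
<div>def fit_ridge(X, y, lam):</div>
<div>    """Inner parameter fit; the iterative training loop is replaced by the closed-form solution."""</div>
<div>    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)</div>
<div><br></div>
<div>X = np.random.randn(500, 4)</div>
<div>y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * np.random.randn(500)</div>
<div>train, val, test = split_data(X, y)</div>
<div><br></div>
<div>best = None</div>
<div>for lam in [0.01, 0.1, 1.0, 10.0]:       # outer loop: update the hyperparameter</div>
<div>    theta = fit_ridge(*train, lam)        # inner step: determine the model parameters</div>
<div>    err = mse(*val, theta)                # validation error drives the model selection</div>
<div>    if best is None or err &lt; best[0]:</div>
<div>        best = (err, lam, theta)</div>
<div><br></div>
<div>print("test error:", mse(*test, best[2]))</div>
<div>&lt;/pre&gt;</div>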
<div><br></div><div><br></div><div>&lt;span id="differences-between-machine-learning-and-deep-learning"&gt;&lt;/span&gt;</div><div>== Differences between machine learning and deep learning ==</div><div><br></div><div>Formulated compactly in one sentence: Deep Learning (DL) is machine learning that applies ANNs with at least one hidden layer. However, this addition causes many differences in the operation and design of DL algorithms compared to conventional ML algorithms.</div><div><br></div><div>&lt;span id="deep-learning-versus-conventional-machine-learning"&gt;&lt;/span&gt;</div><div>=== Deep Learning versus conventional machine learning ===</div><div><br></div><div>Although conventional ML and DL apply the same learning approaches, e.g. supervised or unsupervised learning, their mathematical models are different. While conventional ML mainly applies statistical or deterministic models, which realize discriminant functions as probability functions over the hypothesis space, DL applies ANNs with hidden layers, which can approximate any kind of, also non-linear, discriminant functions. Accordingly, feature extraction in DL models is implicit, and the way of operation can be described in a mathematically tractable way only partly. As a consequence, DL is seen by many as a black box whose internal operation is at most partly understandable. On the other hand, the ANNs applied in DL were originally inspired by observing the structure of the human brain, while the conventional ML models are based on human-created mathematical abstractions. The major similarities and differences between DL and conventional ML are summarized in terms of their properties in Table [[#tab:dl_vs_conv_ml|2]].</div><div><br></div><div>&lt;div class="center"&gt;</div><div><br></div><div>&lt;div id="tab:dl_vs_conv_ml"&gt;</div><div><br></div><div>{| class="wikitable"</div><div>|+ DL versus conventional ML</div><div>|-</div><div>! style="text-align: left;"| Properties</div><div>! style="text-align: center;"| DL</div><div>! style="text-align: center;"| conventional ML</div><div>|-</div><div>| style="text-align: left;"| Learning approaches</div><div>| style="text-align: center;"| supervised/unsupervised and reinforcement</div><div>| style="text-align: center;"| supervised/unsupervised and reinforcement</div><div>|-</div><div>| style="text-align: left;"| Models</div><div>| style="text-align: center;"| ANN</div><div>| style="text-align: center;"| statistic or deterministic</div><div>|-</div><div>| style="text-align: left;"| Discriminant functions realized</div><div>| style="text-align: center;"| any functions, including non-linear ones</div><div>| style="text-align: center;"| probability distributions</div><div>|-</div><div>| style="text-align: left;"| Feature extraction</div><div>| style="text-align: center;"| implicit</div><div>| style="text-align: center;"| explicit</div><div>|-</div><div>| style="text-align: left;"| Mathematically describable in a tractable way</div><div>| style="text-align: center;"| only partly</div><div>| style="text-align: center;"| yes</div><div>|-</div><div>| style="text-align: left;"| Understandable</div><div>| style="text-align: center;"| not fully</div><div>| style="text-align: center;"| fully</div><div>|-</div><div>| style="text-align: left;"| Based on</div><div>| style="text-align: center;"| observing the human brain structure</div><div>| style="text-align: center;"| human-created mathematical abstractions</div><div>|-</div><div>| style="text-align: left;"| Optimization task for parameter learning</div><div>| style="text-align: center;"| general non-convex</div><div>| style="text-align: center;"| convex</div><div>|}</div><div><br></div><div><br></div><div>&lt;/div&gt;</div><div><br></div><div>&lt;/div&gt;</div>
left;"|&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;human brain structure&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; mathematical abstractions</div><div>|-</div><div>| style="text-align: left;"| Optimization task&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;general non-convex&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;convex</div><div>|-</div><div>| style="text-align: left;"| for parameter learning&nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp; &nbsp;&nbsp;</div><div>| style="text-align: center;"|&nbsp;&nbsp;</div><div>|}</div><div><br></div><div><br></div><div>&lt;/div&gt;</div><div><br></div><div>&lt;/div&gt;</div><div>DL algorithms are intended for more complex ML tasks, which can be too difficult for conventional ML algorithms. Accordingly the corresponding optimization task for parameter learning in DL problem is general non-convex, i.e. not a convex one any more. It has far-reaching consequences for the optimization design of such algorithms.</div><div><br></div><div>In the following we review the potential issues in optimization due to its general non-convex nature and go through on established and promising algorithmic strategies, which can cope with the identified challenges.</div><div><br></div><div>&lt;span id="potential-issues-in-dl-optimization"&gt;&lt;/span&gt;</div><div>=== Potential issues in DL optimization ===</div><div><br></div><div>Since most optimization task for parameter learning in DL problem is general non-convex, the majority of the used numerical optimization methods are only first-order ones. An list of major potential issues by numerical optimization of general non-convex problems is given as</div><div><br></div><div>* Poorly conditioned Hessian matrix.</div><div>* Ill-conditioning of the Hessian matrix.</div><div>* Finding local mininum instead of the global one.</div><div>* Saddle points and flat regions</div><div>* Steep regions</div><div><br></div><div>&lt;span&gt; '''''Poorly conditioned Hessian matrix'''''&lt;/span&gt;</div><div><br></div><div>A quadratic matrix is poorly conditioned if the ratio of the absolute values of its highest and lowest eigenvalue, the so-called condition number of the matrix, is high. The eigenvalue with the highest and lowest absolute value of a Hessian matrix equals to the maximum and minimum second derivative (in some directions) of the function at that point, respectively. Therefore in case of poorly conditioned Hessian matrix, in one direction the derivative increases fast, while there is another direction, in which it changes slowly. In this case gradient descent can perform poorly. This is because the algorithm can waste lot of time by descending canyon wall being the steepest direction, instead of descending in the direction with less but longer descending direction going to globally less values of the objective function. The situation is illustrated in Figure z.</div><div><br></div><div>Figure z (=Figure 4.6 from Dl book) An example shows a minimizing a quadratic function with Hessian matrix having condition number 6. The steps of the gradient descents are marked with red.</div><div><br></div><div>Being unaware of the second derivatives providing some information on the more global structure of the objective function, the gradient descent can not choose a direction in which the objective function decreases perhaps slowly but longer. 
<div><br></div><div>&lt;span&gt; '''''Ill-conditioning of the Hessian matrix'''''&lt;/span&gt;</div><div><br></div><div>The second-order Taylor expansion of the empirical risk &lt;math display="inline"&gt;R({\mbox{\boldmath$\theta$}})&lt;/math&gt; around &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}^{(0)}&lt;/math&gt; can be given as &lt;math display="block"&gt;R({\mbox{\boldmath$\theta$}}) \approx R({\mbox{\boldmath$\theta$}}^{(0)}) + ({\mbox{\boldmath$\theta$}} - {\mbox{\boldmath$\theta$}}^{(0)})^T {\bf g} + \frac{1}{2}({\mbox{\boldmath$\theta$}} - {\mbox{\boldmath$\theta$}}^{(0)})^T&nbsp; {\bf H} ({\mbox{\boldmath$\theta$}} - {\mbox{\boldmath$\theta$}}^{(0)}),&lt;/math&gt;</div><div><br></div><div>where &lt;math display="inline"&gt;{\bf g}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf H}&lt;/math&gt; are the gradient and the Hessian matrix at &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}^{(0)}&lt;/math&gt;, respectively. Applying the recursive relation of gradient descent, &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}} = {\mbox{\boldmath$\theta$}}^{(0)} - \epsilon {\bf g}&lt;/math&gt;, to the above approximation yields</div><div><br></div><div>&lt;math display="block"&gt;R({\mbox{\boldmath$\theta$}}) \approx R({\mbox{\boldmath$\theta$}}^{(0)}) - \epsilon {\bf g}^T {\bf g} + \frac{1}{2}\epsilon^2 {\bf g}^T&nbsp; {\bf H} {\bf g}.&lt;/math&gt;</div><div><br></div><div>Thus the change of the objective function &lt;math display="inline"&gt;R({\mbox{\boldmath$\theta$}})&lt;/math&gt; in a gradient descent step equals &lt;math display="block"&gt;- \epsilon {\bf g}^T {\bf g} + \frac{1}{2}\epsilon^2 {\bf g}^T&nbsp; {\bf H} {\bf g}.&lt;/math&gt;</div><div><br></div><div>The required effect of the (stochastic) gradient descent algorithm is to decrease the value of the objective function in each step. Ill-conditioning happens when</div><div><br></div><div>&lt;math display="block"&gt;\frac{1}{2}\epsilon^2 {\bf g}^T&nbsp; {\bf H} {\bf g} &gt; \epsilon {\bf g}^T&nbsp; {\bf g},&lt;/math&gt; and hence the value of the objective function increases instead of decreasing. Ill-conditioning can cause SGD to get stuck, i.e. even small steps increase the objective function. Ill-conditioning can be detected by monitoring both terms &lt;math display="inline"&gt;{\bf g}^T {\bf g}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf g}^T&nbsp; {\bf H} {\bf g}&lt;/math&gt;. When &lt;math display="inline"&gt;{\bf g}^T&nbsp; {\bf H} {\bf g}&lt;/math&gt; grows faster than &lt;math display="inline"&gt;{\bf g}^T {\bf g}&lt;/math&gt;, the learning rate &lt;math display="inline"&gt;\epsilon&lt;/math&gt; must be shrunk, which slows the learning process down.</div><div><br></div><div>Note that growth of &lt;math display="inline"&gt;{\bf g}^T {\bf g}&lt;/math&gt; alone does not necessarily mean a wrong evolution of the learning process. The objective function can decrease, and thus the training process can be successful, also while &lt;math display="inline"&gt;{\bf g}^T {\bf g}&lt;/math&gt; is growing. This is the case, e.g., in a concave region.</div>
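<div><br></div>
<div>The following sketch shows one possible way to monitor these two terms during training. The finite-difference Hessian-vector product and the small quadratic objective are illustrative assumptions; in a real DL setting the gradient function would come from the network and the current minibatch.</div>
<div><br></div>
<div>&lt;pre&gt;</div>
<div>import numpy as np</div>
<div><br></div>
<div>def hvp(grad_fn, theta, v, h=1e-5):</div>
<div>    """Finite-difference Hessian-vector product: H v ~ (g(theta + h v) - g(theta - h v)) / (2 h)."""</div>
<div>    return (grad_fn(theta + h * v) - grad_fn(theta - h * v)) / (2 * h)</div>
<div><br></div>
<div>def monitor_step(grad_fn, theta, eps):</div>
<div>    """Return g^T g, g^T H g and the predicted change of the objective for one step."""</div>
<div>    g = grad_fn(theta)</div>
<div>    gTg = float(g @ g)</div>
<div>    gHg = float(g @ hvp(grad_fn, theta, g))</div>
<div>    predicted_change = -eps * gTg + 0.5 * eps ** 2 * gHg</div>
<div>    return gTg, gHg, predicted_change</div>
<div><br></div>
<div># Illustrative quadratic objective with Hessian diag(50, 1); its gradient is H theta.</div>
<div>H = np.diag([50.0, 1.0])</div>
<div>grad_fn = lambda th: H @ th</div>
<div><br></div>
<div>print(monitor_step(grad_fn, np.array([1.0, 1.0]), eps=0.05))</div>
<div># If 0.5 * eps^2 * g^T H g exceeds eps * g^T g, the predicted change is positive,</div>
<div># i.e. the step would increase the objective and the learning rate has to be reduced.</div>
<div>&lt;/pre&gt;</div>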
<div><br></div><div>&lt;span&gt; '''''Finding a local minimum instead of the global one'''''&lt;/span&gt;</div><div><br></div><div>For a long time it was believed that the existence of local minima is a problem for optimizing the parameters of NNs. However, it turned out that the objective functions of NNs have multiple local minima with the same objective function value. This is due to</div><div><br></div><div>* weight space symmetry and</div><div>* non-uniqueness of scale parameters.</div><div><br></div><div>It follows from the structure of NNs that a NN has many equivalent forms. Swapping the incoming weights of units &lt;math display="inline"&gt;i&lt;/math&gt; and &lt;math display="inline"&gt;j&lt;/math&gt; of the same layer exchanges the roles of units &lt;math display="inline"&gt;i&lt;/math&gt; and &lt;math display="inline"&gt;j&lt;/math&gt;, but does not change the overall structure of the NN. Hence a NN with &lt;math display="inline"&gt;m&lt;/math&gt; layers and &lt;math display="inline"&gt;n&lt;/math&gt; units in each of them has &lt;math display="inline"&gt;n!^m&lt;/math&gt; equivalent forms, leading to &lt;math display="inline"&gt;n!^m&lt;/math&gt; equivalent points in the hypothesis space of the parameter vector consisting of all weights of the NN units. This is called weight space symmetry.</div><div><br></div><div>Scaling the incoming weights of a unit by &lt;math display="inline"&gt;\alpha&lt;/math&gt; and its outgoing weights by &lt;math display="inline"&gt;\frac{1}{\alpha}&lt;/math&gt; at the same time does not change the resulting value of the represented objective function in several specific NNs, like e.g. rectified linear networks. Therefore there are infinitely many local minima, corresponding to infinitely many parameter settings with different scales, each of them with the same value of the objective function. In this sense one can speak about the non-uniqueness of scale parameters.</div><div><br></div><div>In all the above cases a local minimum is not a problem, because the common value of the objective function at these local minima is probably the lowest value. However, a local minimum can be a problem if the value of the objective function at that point is much higher than the value at the global minimum. To check whether this is the case is very difficult, especially in high-dimensional hypothesis spaces.</div><div><br></div><div>&lt;span&gt; '''''Saddle points and flat regions'''''&lt;/span&gt;</div><div><br></div><div>A saddle point of a multivariate function is a point which is a minimum in one direction, while there is another direction in which it is a maximum. For an example see Figure c.</div><div><br></div><div>Figure c: An example of a saddle point.</div><div><br></div><div>Saddle points are a potential danger for numerical optimization algorithms, since iterating in a descent direction can find one of them instead of a local minimum and could stop there due to the gradient being close to zero. Many types of high-dimensional functions show the interesting property that they have many more saddle points than minima. This can be understood with the help of the relation between the eigenvalues of the Hessian matrix and the critical points (local minima, saddle points, local maxima). The Hessian matrix has only positive eigenvalues at a local minimum. However, at a saddle point it has both positive and negative eigenvalues. In the case of an &lt;math display="inline"&gt;n&lt;/math&gt;-dimensional function there are &lt;math display="inline"&gt;n&lt;/math&gt; possibly different eigenvalues. Roughly speaking, assuming each eigenvalue can be positive or negative with probability &lt;math display="inline"&gt;0.5&lt;/math&gt;, the probability of a point being a minimum is &lt;math display="inline"&gt;0.5^n&lt;/math&gt;, while the probability of it being a saddle point is &lt;math display="inline"&gt;1-2 \cdot 0.5^n&lt;/math&gt;, which is &lt;math display="inline"&gt;2^n-2&lt;/math&gt; times higher than the probability of being a minimum (the case of an eigenvalue being exactly &lt;math display="inline"&gt;0&lt;/math&gt; is neglected in this consideration). Therefore the ratio of the number of saddle points to the number of minima grows exponentially with the dimension of the objective function. A small numerical illustration is given below.</div>
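<div><br></div>
<div>The following short sketch illustrates both the eigenvalue-based classification of a critical point and the counting argument above. The example function and the independence assumption on the eigenvalue signs are chosen only for illustration.</div>
<div><br></div>
<div>&lt;pre&gt;</div>
<div>import numpy as np</div>
<div><br></div>
<div>def classify_critical_point(hessian):</div>
<div>    """Classify a critical point from the signs of the Hessian eigenvalues."""</div>
<div>    eig = np.linalg.eigvalsh(hessian)       # the Hessian is symmetric</div>
<div>    if np.all(eig &gt; 0):</div>
<div>        return "local minimum"</div>
<div>    if np.all(eig &lt; 0):</div>
<div>        return "local maximum"</div>
<div>    if np.any(eig &gt; 0) and np.any(eig &lt; 0):</div>
<div>        return "saddle point"</div>
<div>    return "degenerate (some eigenvalue is zero)"</div>
<div><br></div>
<div># f(x, y) = x^2 - y^2 has a critical point at the origin; its Hessian there is diag(2, -2).</div>
<div>print(classify_critical_point(np.diag([2.0, -2.0])))   # -&gt; saddle point</div>
<div><br></div>
<div># Counting argument: with independent random signs of the n eigenvalues,</div>
<div># P(minimum) = 0.5**n while P(saddle point) = 1 - 2 * 0.5**n.</div>
<div>n = 10</div>
<div>print(0.5 ** n, 1 - 2 * 0.5 ** n)</div>
<div>&lt;/pre&gt;</div>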
<div><br></div><div>First-order algorithms, which use only the gradient, can either stop at a saddle point (due to the small gradient at that point) or escape from the environment of the saddle point if the gradient is still high enough there. Hence saddle points are a potential danger for first-order algorithms, but in many cases they do not cause a real problem. However, saddle points are a definite problem for Newton’s method, since it is designed to reach a point with zero gradient.</div><div><br></div><div>Local maximum points usually do not cause problems for the optimization algorithms, since such points become exponentially rare as the dimension of the function increases, and a minimizing algorithm visits such points with very low probability due to the high value of the objective function around them.</div><div><br></div><div>Flat regions are problematic for all optimization algorithms in the general non-convex case. In these regions the algorithms get stuck due to a zero gradient and a zero Hessian matrix. If the objective function has a higher value in the flat region than at its global minimum, then an algorithm reaching the flat region fails to find the minimum. However, in a convex problem flat regions are valid minimum points, since the objective function must have the same value there as at the global minimum, otherwise the function would not be convex.</div><div><br></div><div>&lt;span&gt; '''''Steep regions'''''&lt;/span&gt;</div><div><br></div><div>Extremely steep regions are problematic for gradient descent. The gradient is large in an extremely steep region, and hence the actual iteration step causes a far jump in the parameter space, usually jumping over the local cliff structure. In this way the algorithm cannot come close enough to the bottom of the steep region. This problem arises when approaching the steep region both from above and from below.</div><div><br></div><div>&lt;span id="algorithmic-strategies-for-dl"&gt;&lt;/span&gt;</div><div>=== Algorithmic strategies for DL ===</div><div><br></div><div>To cope with the challenges due to the issues in DL optimization discussed above, several modifications of the stochastic gradient descent algorithm, several second-order algorithms and several parameter initialization strategies have been proposed.</div><div><br></div><div>&lt;span&gt; '''''Modifications of the gradient descent algorithm'''''&lt;/span&gt;</div><div><br></div><div>We give a brief discussion of the modifications of the stochastic gradient descent algorithm which are most widely used for training DL models. These include</div>
<div><br></div><div>* Momentum,</div><div>* Nesterov momentum algorithm,</div><div>* AdaGrad algorithm,</div><div>* RMSProp algorithm,</div><div>* RMSProp algorithm with Nesterov momentum and</div><div>* Adam algorithm.</div><div><br></div><div>&lt;span&gt; ''Momentum''&lt;/span&gt;</div><div><br></div><div>The basic idea of the method of momentum is to use an accumulated, weighted sum of the past gradients with decayed weights instead of the actual one. This attenuates the effect of the actual gradient, which is beneficial when the actual gradient does not point in the globally best descending direction or includes some noise.</div><div><br></div><div>The method of momentum improves gradient descent in the case of</div><div><br></div><div>* a poorly conditioned Hessian matrix leading to high curvature and</div><div>* variance, i.e. random noise in the stochastic gradient,</div><div><br></div><div>by accelerating the learning.</div><div><br></div><div>Formally, the method of momentum applies the variable &lt;math display="inline"&gt;{\bf v}&lt;/math&gt; for keeping and updating the accumulated weighted sum of the past gradients with decayed weights. This changes the update rule to</div><div><br></div><div>&lt;math display="block"&gt;\begin{aligned}</div><div>&nbsp; &&\hspace{-5mm}{\bf v} = \alpha {\bf v}&nbsp; - \epsilon \left( \frac{1}{m}\sum_{i=1}^{m}\nabla_{{\mbox{\boldmath$\theta$}}} L({\bf x}^{(i)},y^{(i)},{\mbox{\boldmath$\theta$}}) \right) \nonumber \\</div><div>&nbsp; &&\hspace{-5mm} {\mbox{\boldmath$\theta$}} = {\mbox{\boldmath$\theta$}} + {\bf v}. \nonumber</div><div>\end{aligned}&lt;/math&gt;</div><div><br></div><div>The parameter &lt;math display="inline"&gt;\alpha&lt;/math&gt; is an additional hyperparameter determining how quickly the effect of past gradients fades away. Values of &lt;math display="inline"&gt;\alpha&lt;/math&gt; commonly used in practice are 0.5, 0.9 and 0.99. In SGD the step size was the norm of the gradient multiplied by the learning rate. Here, besides the norm, the step size also depends on how strongly the subsequent gradients are aligned. The largest step size occurs when the subsequent gradients point in the same direction.</div>
<div>In this case the step size is</div><div><br></div><div>&lt;math display="block"&gt;\epsilon \frac{\lVert {\bf g} \rVert}{1- \alpha}.&lt;/math&gt;</div><div><br></div><div>The schematic operation of the SGD with momentum algorithm is shown in Algorithm [[#tab:alg_sgd_mom|[tab:alg_sgd_mom]]].</div><div><br></div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Algorithm&nbsp; Stochastic gradient descent with momentum&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>Input:&lt;br /&gt;</div><div>- Learning rate parameter &lt;math display="inline"&gt;\epsilon&lt;/math&gt;, momentum parameter &lt;math display="inline"&gt;\alpha&lt;/math&gt;&lt;br /&gt;</div><div>- Initial parameter &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_{\mbox{init}}&lt;/math&gt;, initial velocity &lt;math display="inline"&gt;{\bf v}_{\mbox{init}}&lt;/math&gt;&lt;br /&gt;</div><div>Output: the estimated parameters &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt;.&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div>1 Initialization: &lt;math display="inline"&gt;k=1&lt;/math&gt;, &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_1 = {\mbox{\boldmath$\theta$}}_{\mbox{init}}&lt;/math&gt; and &lt;math display="inline"&gt;{\bf v}_1 = {\bf v}_{\mbox{init}}&lt;/math&gt;&lt;br /&gt;</div><div>2 while stop criterion is NOT fulfilled&lt;br /&gt;</div><div>3&nbsp; &nbsp;sample &lt;math display="inline"&gt;m&lt;/math&gt; examples of the training data to compose&lt;br /&gt;</div><div>the minibatch &lt;math display="inline"&gt;({\bf x}^{(1)}, {\bf y}^{(1)}) \ldots, ({\bf x}^{(m)},{\bf y}^{(m)})&lt;/math&gt;&lt;br /&gt;</div><div>4&nbsp; &nbsp;compute the gradient of the actual loss function as &lt;math display="inline"&gt;{\bf g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_{{\mbox{\boldmath$\theta$}}}&nbsp; L({\bf x}^{(i)}, {\bf y}^{(i)}, {\mbox{\boldmath$\theta$}}_k)&lt;/math&gt;&lt;br /&gt;</div><div>5&nbsp; &nbsp;Update: compute the next value of &lt;math display="inline"&gt;{\bf v}&lt;/math&gt; as &lt;math display="inline"&gt;{\bf v}_{k+1} = \alpha {\bf v}_k - \epsilon {\bf g}&lt;/math&gt;&lt;br /&gt;</div><div>6&nbsp; &nbsp;Update: compute the next value of &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}&lt;/math&gt; as &lt;math display="inline"&gt;{\mbox{\boldmath$\theta$}}_{k+1} = {\mbox{\boldmath$\theta$}}_k + {\bf v}_{k+1}&lt;/math&gt;&lt;br /&gt;</div><div>7&nbsp; &nbsp;Increment &lt;math display="inline"&gt;k&lt;/math&gt; as &lt;math display="inline"&gt;k=k+1&lt;/math&gt;&lt;br /&gt;</div><div>8 end&lt;br /&gt;</div><div>&lt;span&gt;'''—————————————————————————————'''&lt;/span&gt;&lt;br /&gt;</div><div><br></div><div><br></div><div>The name momentum comes from a physical analogy of a particle moving in the parameter space. Here &lt;math display="inline"&gt;{\bf v}&lt;/math&gt; is the velocity and the negative gradient is a force moving the particle. Assuming unit mass, the momentum equals the velocity, which is added to the parameter.</div>
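<div><br></div>
<div>A minimal NumPy sketch of this update rule, analogous to the plain SGD sketch given earlier, is shown below. The loss gradient is again passed in as a function, and the values of &lt;math display="inline"&gt;\epsilon&lt;/math&gt; and &lt;math display="inline"&gt;\alpha&lt;/math&gt; are illustrative assumptions.</div>
<div><br></div>
<div>&lt;pre&gt;</div>
<div>import numpy as np</div>
<div><br></div>
<div>def sgd_momentum(X, y, theta, grad_loss, eps=0.01, alpha=0.9,</div>
<div>                 n_iters=1000, batch_size=32, seed=0):</div>
<div>    """Minimal SGD with momentum: v accumulates a decayed sum of past gradients."""</div>
<div>    rng = np.random.default_rng(seed)</div>
<div>    v = np.zeros_like(theta)                   # initial velocity</div>
<div>    for _ in range(n_iters):</div>
<div>        idx = rng.choice(len(X), size=batch_size, replace=False)</div>
<div>        g = grad_loss(X[idx], y[idx], theta)   # averaged minibatch gradient</div>
<div>        v = alpha * v - eps * g                # velocity update</div>
<div>        theta = theta + v                      # parameter update</div>
<div>    return theta</div>
<div><br></div>
<div># Illustrative use with the least-squares data and gradient from the SGD sketch above:</div>
<div># theta_hat = sgd_momentum(X, y, np.zeros(5), grad_mse)</div>
<div>&lt;/pre&gt;</div>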
<div><br></div><div>&lt;span&gt; ''Nesterov momentum''&lt;/span&gt;</div><div><br></div><div>&lt;span&gt; '''''Second-order algorithms'''''&lt;/span&gt;</div><div><br></div><div>Second-order methods require some considerations and in some cases significant modifications in order to be suitable for parameter learning in neural networks. The second-order methods relevant for training DL models are</div><div><br></div><div>* the modified Newton’s method,</div><div>* the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and</div><div>* the non-linear conjugate gradient algorithm.</div><div><br></div><div>&lt;span&gt; '''''Parameter initialization'''''&lt;/span&gt;</div>