
https://doi.org/10.1007/s10032-021-00377-1

SPECIAL ISSUE PAPER

A two-step framework for text line segmentation in historical Arabic and Latin document images

Olfa Mechi¹ (olfamechi@yahoo.fr) · Maroua Mehri¹ (maroua.mehri@eniso.u-sousse.tn) · Rolf Ingold² (rolf.ingold@unifr.ch) · Najoua Essoukri Ben Amara¹ (najoua.benamara@eniso.rnu.tn)

1 LATIS-Laboratory of Advanced Technology and Intelligent Systems, ENISo-National Engineering School of Sousse, Sousse University, 4023 Sousse, Tunisia

2 DIVA Group, University of Fribourg, 1700 Fribourg, Switzerland

Received: 22 November 2020 / Revised: 30 April 2021 / Accepted: 28 May 2021 / Published online: 11 June 2021

© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract

One of the most important preliminary tasks in a transcription system of historical document images is text line segmentation.

Nevertheless, this task remains complex due to the idiosyncrasies of ancient document images. In this article, we present a complete framework for text line segmentation in historical Arabic or Latin document images. A two-step procedure is described. First, a deep fully convolutional network (FCN) architecture is applied to extract the main area covering the text core. In order to select the highest performing FCN architecture, a thorough performance benchmarking of the most recent and widely used FCN architectures for segmenting text lines in historical Arabic or Latin document images has been conducted. Then, a post-processing step, based on topological structural analysis, is introduced to extract complete text lines (including the ascender and descender components). This second step aims at refining the obtained FCN results and at providing sufficient information for text recognition. Our experiments have been carried out using a large number of Arabic and Latin document images collected from the Tunisian national archives as well as other benchmark datasets.

Quantitative and qualitative assessments are reported in order, first, to pinpoint the strengths and weaknesses of the different FCN architectures and, second, to illustrate the effectiveness of the proposed post-processing method.

Keywords Historical documents · Text line segmentation · Pixel-wise classification · Benchmark · FCN architectures · Topological structural analysis

1 Introduction

Since the late twentieth century, researchers, historians and archivists working on cultural heritage documents have pointed out growing needs closely related to the preservation and exploitation of archival documents, and the dissemination of their content by providing world-wide access to larger document collections and by proposing global virtual libraries. Hence, numerous initiatives through research projects and studies have been taken to develop robust and accurate document transcription, indexing and retrieval tools.

In this context, this work has been carried out as part of a research project with the support of the Tunisian Ministry of Higher Education and Scientific Research and the collaboration of the Tunisian national archives (ANT)¹ and our industry partner “Smart Information Trade”. Due to the ever-increasing amount of available digitized document images, the ANT is looking for novel solutions able to optimize the accessibility and navigability of huge masses of document images. Furthermore, providing robust and accurate text recognition systems has been pointed out by the ANT as a primary necessity [1,2]. Therefore, our research project focuses on developing a smart solution able to automatically characterize document layouts and contents on the one hand, and to ensure multilingual text transcription and indexing of handwritten and printed archival documents on the other hand.

1 http://www.archives.nat.tn/.

Researchers working on historical document image analysis (HDIA) are continuing to propose more efficient and robust text transcription systems [3–5]. Some systems have been proposed to deal with document images, requiring a preliminary text line segmentation [6–10], while others have focused on word recognition based on pre-segmented text lines [11,12]. Furthermore, current text recognition systems are hindered by many issues related to the performance of the text line segmentation task. Indeed, text line segmentation has always been considered a determining prerequisite for achieving a high text recognition accuracy. Nevertheless, segmenting ancient document images into text lines is not a straightforward task due to the idiosyncrasies of the digital collections of the ANT (cf. Fig. 1). The digital collections of the ANT range from ancient manuscripts through early printed and manuscript books to typewritten administrative documents of the twentieth century, in Arabic and Latin. Moreover, text line segmentation is considered a tricky task due to the presence of significant degradation levels, different kinds of noise (e.g., yellowed pages, ink stains and back-to-front interference) and scanning defects (e.g., curvature and lighting defects) on the one hand, and the unavailability of a priori knowledge about the document image characteristics (e.g., layout, content and digitization resolution) on the other hand [13].

Recently, fueled by the increase in computer hardware power, a new field of machine learning research called representation learning, also known as deep learning, has gained great attention from many researchers working on sub-fields and tasks related to computer vision and pattern recognition [14,15]. Many applications in various areas (e.g., medical imaging and computer vision) have benefited from deep architectures. Deep models have shown outstanding performance in semantic segmentation [16]. They have the advantage of analyzing entire arbitrarily-sized images without any a priori knowledge.

Besides, their effectiveness has been proved for image classification [17], segmentation [16] and detection [18] tasks, etc. Furthermore, deep solutions have recently become an interesting alternative to classical or conventional image processing methods (e.g., projection profiles, clustering and filtering) for tackling many issues related to HDIA tasks [19]. More particularly, the use of deep architectures for text line segmentation has been shown to be efficient on pages having complex layouts (e.g., variations in spacing between characters, words, lines, paragraphs and margins) [10]. Indeed, deep methods have addressed the challenges of classical image processing. Many researchers have stated that the most efficient methods used to segment historical documents into text lines are based on deep architectures. They have clearly argued that proposing a deep method is a consistent choice for meeting the need to segment a page into text lines under significant degradation and different noise levels and kinds [20].

Fig. 1 Examples of archival document images collected by the ANT

They have also demonstrated that deep methods outperform the existing classical state-of-the-art ones. Moreover, it has been shown that deep solutions perform well even for skewed document images and handwritten text, as well as for curved and arbitrarily oriented text lines [21,22]. For instance, Renton et al. [23] proposed a deep method based on fully convolutional networks (FCN) for handwritten text line segmentation in document images that clearly outperformed steerable filters. Furthermore, three of the five participating methods in the ICDAR 2017 competition on baseline detection (cBAD 2017)² are based on deep architectures [24].

In the literature, many researchers working on text line segmentation have focused on proposing end-to-end deep-based methods without introducing a post-processing step [25], while others have proposed text line segmentation methods based on combining deep architectures and other image processing techniques [21]. Faced with the wide variety of deep architectures, many questions arise.

2 https://scriptnet.iit.demokritos.gr/competitions/5/.


For example, which are the most adequate deep architectures for segmenting text into lines in historical Arabic and Latin document images? Which deep architecture represents the best compromise between performance and complexity? Is there a need for a post-processing step to refine the results of a deep architecture?

To answer these questions, we present in this article a complete framework for text line segmentation in historical Arabic and Latin document images. The proposed framework is composed of two steps. The first step aims at determining the highest performing FCN architecture able to extract the main area covering the text core.

Then, a post-processing step based on topological structural analysis aims at extracting whole text lines (including the ascender and descender components). Our experiments have been conducted using a large number of ancient document images collected from the ANT and different benchmark datasets provided in the context of recent open competitions at the ICDAR and ICFHR conferences [24,26,27].

Quantitative and qualitative results, along with the computational cost (resources in terms of the number of parameters and time consumption), are reported in order to highlight the strengths and weaknesses of the different assessed FCN architectures as well as the effectiveness of the proposed post-processing method.

The remainder of this article is structured as follows.

Section 2 reviews the main recent solutions proposed for text line segmentation in historical document images. Section 3 details the proposed solution. In Sect. 4, we detail the experimental corpora and the experimental protocol used to compare the four investigated deep architectures and to evaluate the effectiveness of the proposed post-processing method. Section 5 presents first the different computed performance evaluation metrics and subsequently the obtained results. Finally, our conclusions and further work are given in Sect. 6.

2 Related work

In the literature, four main text line representations are defined to evaluate a text line segmentation method: set of pixels, enclosing polygon, baseline and X-height (cf. Fig. 2):

Set of pixels corresponds to the pixels belonging to the textual content;

Enclosing polygon corresponds to a geometric representation in the shape of a polygon enclosing the connected pixels representing textual content;

Baseline corresponds to a virtual line on which most characters rest, while descenders remain below it;

X-height corresponds to the area covering the text core without considering its ascenders and descenders [10,23,24].

Fig. 2 Illustration of the four text line representations used in the literature

The state-of-the-art methods used for text line segmentation may be classified into three main categories: ad hoc, deep and hybrid.

2.1 Ad hoc approaches

The ad hoc approaches to text line segmentation are based on combining different image analysis techniques (e.g., clustering, projection profiles, filtering and smearing-based analysis). In the literature, these approaches have been categorized into three classes: global, local and hybrid.

Global methods are based on estimating the text line zones by first grouping the components of character sequences, and then splitting those belonging to multiple text lines. For instance, the constrained seam carving technique was proposed by Zhang et al. [28] for text line segmentation in handwritten documents. It was based on computing the energy map and determining the horizontal seams that correspond to the text line positions. Shi et al. [29] used steerable directional filters to extract text lines, and then applied a few heuristic-based techniques to separate the connected lines. Alaie et al. [30] determined the inter-line gaps in order to split the document images into vertical strips, then applied a pixel-wise filtering technique on each detected vertical strip, and finally applied a thinning algorithm to localize handwritten text lines.

Local methods are first based on determining local units such as the connected components (CC), and then grouping them in order to localize text lines. For instance, Louloudis et al. [31] used the Hough technique on a set of selected image points to extract text lines. The Hough-based approach is only well-adapted for printed document images. The superpixel technique was applied by Ryu et al. [32] to extract the CC, and then a cost function was used to group the extracted CC that form text lines. Ryu et al. [32] demonstrated that their method achieved satisfying results only for Latin and Chinese handwritten document images.

Hybrid methods are based on combining both the local and global methods to extract text lines. For instance, a hybrid method was proposed by Kiumarsi et al. [33]. This method was based on first determining the separator lines by identifying the CC sequences, and then applying an adaptive projection profile to extract the handwritten text lines. Kiumarsi et al. [33] showed that their method achieved competitive results with a low computational cost. Nevertheless, their method was not able to correctly extract the text lines in document images having small inter-line gaps.

Likforman-Sulem et al. [34] presented a comprehensive survey of the ad hoc approaches used for extracting text lines in handwritten documents. They stated that these approaches gave unsatisfactory performance, especially for document images having complex layouts, which is the case for historical documents.

2.2 Deep approaches

Since the results achieved by the ad hoc approaches used for text line segmentation are still not satisfying compared with the new competitive deep learning-based approaches, especially when dealing with historical document images, researchers are continuing to propose and evaluate novel deep text line segmentation methods. Deep approaches have become the most suitable choice for solving different pixel-wise HDIA tasks, and particularly for the text line segmentation one. Indeed, using a deep architecture, a pixel-wise classification is carried out to assign each pixel to either the text line class or the background one. Different variants of deep architectures have been proposed in the literature for solving many issues in HDIA tasks. For instance, the use of FCN models, which are variants of the classical deep convolutional neural networks (CNN), is pervasive in text line segmentation [21,22]. FCN are composed only of locally connected layers (e.g., convolution, pooling and upsampling); dense layers are excluded. One of the major advantages of FCN is their ability to cope with input images having different resolutions. Moreover, FCN can tackle both the classification and regression tasks by integrating and excluding the dense layers, respectively. Hence, FCN can have a reduced numerical complexity (i.e., a reduced number of trainable parameters) when the dense layers are excluded, as in the case of the classification task [16,23].

Over the past five years, numerous text line segmentation methods based on deep architectures have been proposed. For instance, a generic CNN-based framework, called dhSegment, was proposed by Oliveira et al. [35] for addressing different HDIA tasks such as layout analysis, baseline extraction and page extraction. Oliveira et al. [35] did not show the generalization of dhSegment, since the same experimental corpus was used for both the prediction and training phases. Barakat et al. [36] presented an FCN model for Arabic handwritten text line detection. They used a sliding window to handle the text line extraction as a pixel-wise classification task. They showed the effectiveness of their method on challenging handwritten document images. Nevertheless, their method had a higher computational complexity because overlapping areas were processed. Mechi et al. [37] proposed an adaptive U-Net architecture for text line segmentation in Arabic and Latin handwritten documents. They demonstrated the robustness of their deep model by evaluating it on different historical document image datasets having various layouts (complex and simple) and contents (scripts). Barakat et al. [38] proposed a CNN-based method to extract text lines from Arabic handwritten documents. Their method had the advantage of using unlabeled document images as network input. Vo et al. [39] proposed an adaptive FCN model based on exploring the spatial coordinates. They showed that their adaptive FCN model was well-suited for different kinds of input data. Kundu et al. [8] used generative adversarial networks (GAN) for text line extraction. GAN achieved competitive results, especially for handwritten document images. Nevertheless, the GAN model was highly sensitive to the input data; hence, hyper-parameter fine-tuning of the GAN model was required.

2.3 Hybrid approaches

The hybrid approaches are based on combining deep architectures with classical image processing techniques (e.g., smearing, projection and CC analysis). For instance, Kiessling et al. [40] first proposed a fully convolutional encoder-decoder architecture in order to classify each pixel as either baseline or background. Then, a script- and layout-agnostic post-processing step was carried out to extract baselines in Persian and Arabic handwritten documents. Kiessling et al. [40] stated that deep-based segmentation methods could be applied to a wide variety of scripts when coupled with appropriate script-agnostic post-processing steps.

Vanilla ResNet-18 was first applied by Alberti et al. [41] for semantic segmentation at the pixel level. Then, the seam carving algorithm was used to extract the polygons surrounding text lines. The main advantage of Alberti et al. [41]'s method was the use of semantic segmentation as a pre-processing phase for image denoising. Alberti et al. [41] showed that their method gave satisfying results; however, an additional pre-processing phase was required to handle document images having double columns. Neche et al. [22] proposed to couple a deep RU-Net architecture (a variant of the U-Net architecture extended with a residual structure) with a CC analysis step in order to extract baselines from Arabic handwritten documents. Neche et al. [22]'s method achieved high performance compared with the state-of-the-art ones. Nevertheless, only a limited number of document images (50 images) was used to assess their method.

Grüning et al. [21] presented a hybrid method based on combining a deep neural network called ARU-Net (a variant of the U-Net architecture extended with an attention model and a residual structure), a bottom-up clustering method and a few image processing techniques to extract baselines from Latin handwritten documents. Two stages were proposed by Grüning et al. [21]. The first one, based on the ARU-Net architecture, focused on classifying each pixel into one of three categories: separator, baseline and background. Indeed, ARU-Net generated two maps as output: the first map determined the pixels belonging to the baseline class, while the second one defined the beginning and end of each text line. Then, a second stage was applied to select a set of superpixels from the baseline heatmap and deduce the superpixel states by computing the inter-line gaps and orientations. Finally, the second stage was completed by computing the Delaunay neighborhood, projection profile, data cost and data energy in order to determine the superpixels belonging to text lines. Grüning et al. [21] showed that their method was able to extract curved and arbitrarily oriented text lines. However, many extensive and heavy post-processing phases were introduced in the second stage.

3 Proposed framework

In this section, we present our framework for text line segmentation in archival document images. The proposed framework aims at extracting text lines in handwritten document images using a pixel-wise classification task. Indeed, each document pixel is assigned either to the text class or to the background one. The proposed framework is composed of two steps. First, a deep fully convolutional network (FCN) architecture is applied to extract the main area covering the text core (cf. Sect. 3.1). In order to select the highest performing FCN architecture, a thorough performance benchmarking of the most recent and widely used FCN architectures for segmenting text lines in historical Arabic or Latin document images has been conducted. This step aims at determining which FCN architecture among the four assessed in this article represents a constructive compromise between text line segmentation performance and computational cost. More specifically, we explore the performance of each FCN architecture according to the document script (Arabic or Latin) and layout (simple or complex). Second, we present in Sect. 3.2 a post-processing method based on topological structural analysis. This second step extracts whole text lines. It aims at refining the FCN results on the one hand, and at providing sufficient information for text recognition on the other hand. Renton et al. [23], Neche et al. [22] and Mechi et al. [37] claimed that it is more efficient to propose a two-step solution to extract text lines in historical documents. It has recently been demonstrated that deep text recognition methods that take whole text lines (including the ascender and descender components) as input outperform those based on word or character input [42]. Subsequently, a post-processing step is required to refine and improve the text line segmentation results at X-height level, and thereby to provide sufficient information for an end-to-end text recognition framework able to transcribe handwritten text lines in Arabic and Latin. Figure 3 illustrates the scheme of the proposed framework.

3.1 X-height-based text line extraction step

Since it has been shown in the literature that FCN architectures are efficient for semantic segmentation of document images, and particularly for text line segmentation, we focus our work on investigating and comparing different FCN variants in order to propose an efficient framework for text line segmentation [16]. Faced with the wide variety of FCN models, many questions arise. For example, which are the most adequate FCN models for segmenting text into lines in Arabic and Latin document images? Which FCN model represents the best compromise between performance and complexity? To answer these questions, we have conducted a comparative study of four recent FCN models used for text line segmentation in Arabic and Latin historical document images. After reviewing the most recent and widely used FCN architectures for text line segmentation in the literature, we propose a comparative study of the four following FCN variants: classical U-Net [21], dilated FCN [23], RU-Net [22] and adaptive U-Net [37]. This study evaluates the performance of each FCN variant according to the document script (Arabic or Latin) and layout (simple or complex). It aims at determining the FCN variant having the best trade-off between the lowest computational cost and the highest performance in segmenting archival document images into lines at X-height level.

Renton et al. [23], Neche et al. [22] and Mechi et al. [37] used the X-height as a text line representation for evaluating their FCN-based text line segmentation methods.


Fig. 3 Scheme of the proposed framework for text line segmentation in historical documents (Arabic and Latin scripts)

Compared to other text line representations, the X-height one has several advantages. First, it provides sufficient and reliable information for the text recognition task (which is within the scope of our research project), since it covers the text core. Indeed, it gives for each line its associated image, which can ease the preparation of the data required for optical character recognition systems. Then, it deals with overlapping-line issues (particularly in the case of handwritten documents written in Arabic) [23]. Moreover, the X-height representation has the advantage of containing sufficient information to deduce the other text line representations defined in the literature (set of pixels, enclosing polygon and baseline). Thus, in the first step of the proposed framework, we have chosen the X-height as the text line representation.

3.1.1 Classical U-Net

The classical U-Net architecture is a variant of FCN, introduced by Ronneberger et al. [43] for segmenting medical images. It is composed of a contracting (downsampling) path and an expansive (upsampling) path. The contracting path is used for feature extraction, while the expansive one ensures accurate localization by combining the contextual information captured from the contracting path. It thus operates both as an encoder and a decoder. Ronneberger et al. [43] used upsampling in the expansive path in order to increase the output resolution. Grüning et al. [21] compared the performance of the classical U-Net with other FCN variants (RU-Net, ARU-Net and LARU-Net) for text line detection in historical document images. They stated that ARU-Net outperformed the classical U-Net. The classical U-Net architecture is described in more detail in [43].
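As an illustration of this encoder-decoder structure, the following is a minimal sketch in Keras (the framework used in our experiments, cf. Sect. 4.2) of a U-Net-style network for pixel-wise text/background classification; the depth, filter counts and input size are illustrative assumptions, not the configuration of any architecture compared in this article.

from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(608, 608, 3)):
    inp = layers.Input(shape=input_shape)
    # Contracting path: convolutions followed by max pooling.
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)
    # Bottleneck.
    b = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)
    # Expansive path: upsampling plus skip connections from the encoder,
    # which combine contextual and localization information.
    u2 = layers.UpSampling2D(2)(b)
    c3 = layers.Conv2D(64, 3, activation="relu", padding="same")(
        layers.Concatenate()([u2, c2]))
    u1 = layers.UpSampling2D(2)(c3)
    c4 = layers.Conv2D(32, 3, activation="relu", padding="same")(
        layers.Concatenate()([u1, c1]))
    # One-channel sigmoid heatmap: per-pixel probability of the
    # X-height (text core) class.
    out = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inp, out)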

3.1.2 Dilated FCN

The dilated FCN is a CNN in which dilated convolutions are used in the decoder part and whose dense layers in the encoder part have been removed. Renton et al. [23] stated that the dilated FCN has numerous advantages. First, by removing the dense layers, the number of parameters in the training phase is reduced on the one hand, and images of variable sizes can be fed as input on the other hand. Second, by using dilated convolutions, the image resolution is kept unchanged in both the training and prediction phases. The dilated FCN is described in more detail in [23].
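To make the resolution-preserving property concrete, the following sketch (Keras, with illustrative filter counts and dilation rates of our choosing) stacks dilated convolutions whose receptive field grows with the dilation rate while the feature-map resolution never shrinks; the None spatial dimensions reflect that, without dense layers, variable-sized images can be fed as input.

from tensorflow.keras import layers, Model

def dilated_stack(input_shape=(None, None, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    for rate in (1, 2, 4, 8):
        # "same" padding keeps the spatial resolution unchanged while
        # the dilation rate enlarges the receptive field.
        x = layers.Conv2D(32, 3, dilation_rate=rate,
                          padding="same", activation="relu")(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inp, out)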

3.1.3 RU-Net

The residual U-Net, which is called RU-Net, is a variant of the U-Net architecture. The RU-Net was based on residual con- nections between layers having similar spatial dimensions.

The RU-Net is described in more detail in [22].
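A minimal sketch of the residual-connection idea (Keras; the filter count is an illustrative assumption): the block's input is added back to its convolutional output, which is only possible when the two tensors share the same spatial dimensions.

from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # Project the shortcut with a 1x1 convolution if the channel
    # counts differ, so the addition is well-defined.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))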

3.1.4 Adaptive U-Net

The adaptive U-Net architecture is a variant of the classical U-Net, which was introduced by Mechi et al. [37] for text line segmentation in historical Arabic and Latin document images. It is based on using the deconvolution for the decoder part in order to keep the same resolution on both the input and output of the network architecture. They stated that the adaptive U-Net required less trainable parameters than the classical one. Mechi et al. [37]’ method achieved competitive results for images having simple and complex layouts. The adaptive U-Net is described in more detail in [37].
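As a sketch of the deconvolution-based decoder step (Keras; the filter count is an illustrative assumption), a stride-2 transposed convolution doubles the height and width of a feature map, which is how the decoder can restore the resolution of the input:

from tensorflow.keras import layers

# A stride-2 transposed convolution ("deconvolution") doubles the
# spatial size of its input feature map, e.g. 152x152 -> 304x304.
upsample = layers.Conv2DTranspose(64, kernel_size=3, strides=2,
                                  padding="same", activation="relu")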


3.2 Post-processing step

Once we have manually selected the FCN variant having the best trade-off between the lowest computational cost and the highest performance, we apply a post-processing method able to refine the obtained FCN results by extracting the whole text lines. It has recently been demonstrated that deep text recognition methods that take whole text lines as input outperform those based on analytical approaches (i.e., segmentation-based methods) [42]. Figure 4a, b illustrates the outputs of the X-height-based text line extraction step (i.e., text lines at X-height level) and of the post-processing step (i.e., whole text lines including the ascender and descender components), respectively.

The proposed post-processing step is composed of the two following modular processes: (i) the modified run-length smoothing algorithm (RLSA) and (ii) ascender and descender component detection.

3.2.1 Modified RLSA

This stage focuses on detecting the foreground contours of the binary image of the analyzed document by extracting larger CC that correspond as much as possible to a sub-word or a few adjacent sub-words. For this purpose, a binarization step is first carried out by applying Otsu's algorithm on the analyzed document [44]. It has recently been shown that applying Otsu's algorithm ensures a fast and adequate foreground/background separation [45]. Then, the CC are extracted from the binary image of the analyzed document. Afterward, the heights and widths of the extracted CC are analyzed in order to determine the appropriate vertical and horizontal run-length smoothing values. Indeed, in our work the horizontal and vertical run-length smoothing values depend on the median values of the widths and heights of the extracted CC, respectively. Then, the RLSA is applied to fill the space between the extracted CC by linking the neighboring black areas [46]. The RLSA is carried out row-wise and column-wise by replacing a horizontal (respectively, vertical) sequence of background pixels with foreground ones when the number of background pixels in the sequence is equal to or smaller than a predefined horizontal threshold Th (respectively, vertical threshold Tv). The horizontal (Th) and vertical (Tv) thresholds are given by Equations 1 and 2, respectively:

Th = ch × mwh (1)

Tv = cv × mhv (2)

where mwh and mhv correspond to the median values of the widths and heights of the extracted CC, respectively; ch and cv have been experimentally set to 2.5 and 2.0, respectively.

Once Th and Tv are fixed, the modified RLSA is applied on the binary image of the analyzed document, and subsequently the foreground contours are extracted.

Fig. 4 Illustration of the outputs of the X-height-based text line extraction step (i.e., text line at X-height level) and of the post-processing step (i.e., whole text line): (a) X-height-based text line extraction step; (b) post-processing step
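The following Python sketch illustrates this stage under stated assumptions: OpenCV conventions (text becomes white after inverse Otsu binarization), and a logical AND to combine the row-wise and column-wise smearing results, as in the classical RLSA formulation; the function names and the combination rule are ours, not taken from the paper.

import cv2
import numpy as np

def modified_rlsa_contours(gray):
    # Otsu binarization; inverse thresholding makes text white (255).
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Connected components and their bounding-box widths and heights
    # (label 0 is the background and is skipped).
    _, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    widths = stats[1:, cv2.CC_STAT_WIDTH]
    heights = stats[1:, cv2.CC_STAT_HEIGHT]
    # Adaptive thresholds of Eqs. 1 and 2: Th = 2.5 * median CC width,
    # Tv = 2.0 * median CC height.
    th = int(2.5 * np.median(widths))
    tv = int(2.0 * np.median(heights))

    def smear(img, max_gap):
        # Fill background runs not longer than max_gap along each row.
        out = img.copy()
        for row in out:
            start = None
            for x, v in enumerate(row):
                if v == 0 and start is None:
                    start = x
                elif v != 0 and start is not None:
                    if x - start <= max_gap:
                        row[start:x] = 255
                    start = None
        return out

    horiz = smear(binary, th)                           # row-wise
    vert = np.ascontiguousarray(smear(binary.T, tv).T)  # column-wise
    smeared = cv2.bitwise_and(horiz, vert)
    # The foreground contours are extracted from the smeared image.
    contours, _ = cv2.findContours(smeared, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return contours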

3.2.2 Ascender and descender component detection

This stage focuses on computing and analyzing the intersection areas between each extracted X-height contour of the output of the selected FCN variant and each foreground contour determined using the modified RLSA. The foreground contour extraction task consists in determining the convex hull of a finite set of pixels that define the boundaries of a foreground content (i.e., text, graphic or noise). Hence, we simply use a contour extraction function that draws a curve joining all the boundary pixels of a foreground content.

Three possible scenarios related to the computed intersection areas can be envisaged (cf. Fig. 5a):

Number of intersection areas = 0: the analyzed foreground contour most likely does not correspond to textual content, i.e., it corresponds to non-textual content (noise or a graphic).

Number of intersection areas = 1: a single X-height contour overlaps with exactly one foreground contour. This happens when the analyzed foreground contour corresponds exactly to a single text line.

Number of intersection areas > 1: the same foreground contour overlaps at least two X-height ones, because the analyzed foreground belongs to at least two text lines.

In the third scenario, where the number of intersection areas exceeds 1, we propose to split the extracted foreground contour into smaller ones in order to assign each foreground contour to exactly one X-height. The split process starts by assigning to the analyzed foreground contour the X-height (Xh) having the largest intersection area with it. Then, the distances between each point belonging to the foreground contour and the medians of all X-heights intercepted by the analyzed foreground contour are computed. Afterward, only the foreground contour points having minimal distance to Xh are retained, and a novel foreground contour is defined by only the retained points. The proposed algorithm is recursively applied until each foreground contour point has been assigned to exactly one X-height. Once all foreground contours have been assigned to their associated X-height contour, the ascender and descender components of each extracted text line can finally be defined. For each text line, the ascender and descender components correspond to the highest and lowest points of all foreground contours assigned to exactly one X-height contour, respectively. The descender and ascender components are determined according to Eqs. 3 and 4, respectively:

Descender = LPF − LPX (3)

Ascender = HPF − HPX (4)

where HPF denotes the highest point belonging to the foreground contour, HPX the highest point belonging to the X-height contour, LPF the lowest point belonging to the foreground contour, and LPX the lowest point belonging to the X-height contour.

Once the descender and ascender components assigned to the detected text lines at X-height level are determined, the whole text lines are localized. The proposed method for detecting the descender and ascender components is detailed in Algorithm 1 and Fig. 5. Figure 5a illustrates the output of the modified RLSA by highlighting the three possible scenarios related to the computed intersection areas (foreground and X-height contours), while Fig. 5b depicts the output of the ascender and descender component detection task.
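As a small illustration, the intersection areas driving the three scenarios could be computed by rasterizing both contours as filled masks and counting the overlapping pixels; this helper is our assumption of a plausible implementation, not the paper's exact procedure.

import cv2
import numpy as np

def intersection_area(foreground_contour, xheight_contour, image_shape):
    # Rasterize each contour as a filled binary mask.
    mask_fc = np.zeros(image_shape, dtype=np.uint8)
    mask_xc = np.zeros(image_shape, dtype=np.uint8)
    cv2.drawContours(mask_fc, [foreground_contour], -1, 255, cv2.FILLED)
    cv2.drawContours(mask_xc, [xheight_contour], -1, 255, cv2.FILLED)
    # The intersection area is the number of pixels covered by both masks.
    return int(np.count_nonzero(cv2.bitwise_and(mask_fc, mask_xc)))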

Figure 6 illustrates the post-processing step used for extracting the whole text lines.

4 Experiments

To evaluate the performance of the proposed framework, a set of experiments has been conducted on a large number of historical Arabic and Latin document images collected from different datasets. In this section, our experimental corpora are first presented (cf. Sect. 4.1). Second, the experimental protocol is briefly outlined (cf. Sect. 4.2). Finally, an overview of the different performance evaluation metrics used is presented (cf. Sect. 4.3).

Algorithm 1: Ascender and descender detection

Inputs:     FC  /* Foreground contours */
            XC  /* X-height contours */
Parameters: NewContours
Outputs:    Descender and ascender components for each text line

NewContours ← ∅

Function AssignContours(FCi, XC):
    Areas ← ∅
    Centers ← ∅
    for j in XC do
        Determine the center Cj of XCj
        Compute the intersection area A between FCi and XCj
        if (A ≠ 0) then
            Add A to Areas
            Add Cj to Centers
        end
    end
    if (length(Areas) = 0) then
        Remove FCi from FC
    end
    if (length(Areas) = 1) then
        Add FCi to NewContours
    end
    if (length(Areas) > 1) then
        Retain from Centers only the medians of the X-heights intercepted by FCi (IXC)
        Select the X-height median (XHm) having the largest intersection area
        FCi1 ← ∅   /* Contour grouping the points having minimum distance from XHm */
        FCi2 ← ∅   /* Contour grouping the points not having minimum distance from XHm */
        for Pi in FCi do
            for Cj in IXC do
                if (d(Pi, XHm) < d(Pi, Cj)) then
                    Add Pi to FCi1
                else
                    Add Pi to FCi2
                end
            end
        end
        Add FCi1 to NewContours
        if (FCi2 ≠ ∅) then
            AssignContours(FCi2, XC)
        end
    end
    return NewContours

for i in FC do
    AssignContours(FCi, XC)
end
for i in XC do
    NewFCi ← Gather from NewContours the contours assigned to XCi
    Descenderi ← lowest point in NewFCi − lowest point in XCi
    Ascenderi ← highest point in NewFCi − highest point in XCi
end

Fig. 5 Illustration of the proposed method for detecting the descender and ascender components: (a) extracted X-height and foreground contours; (b) detected whole text lines. In (a), the foreground contours are colored in red while the X-height contours are colored in blue. In (b), the detected whole text lines are surrounded by rectangular bounding boxes having random colors (color figure online)

4.1 Experimental corpora

To analyze the performance of the four FCN architectures evaluated in this work, our experiments have been carried out with qualitative and quantitative observations deduced first from historical document images written in Arabic and Latin, collected from different benchmark datasets dedicated to the text line segmentation task and provided in the context of recent open competitions at the ICDAR and ICFHR conferences, and second from a large number of private document images collected from the ANT. Hence, our experimental corpus is composed of the five following datasets: cBAD, READ, DIVA-HisDB, RASM and ANT.

The cBAD dataset² was released in the context of the cBAD 2017 competition (cf. Fig. 7a). It is a freely available dataset containing 2188 handwritten historical document images written in Latin and digitized at 300/400 dpi. It was collected from nine different European archives, covering documents drafted between 1470 and 1930. It is composed of two sub-datasets differentiated by the layout complexity level: cBAD (Track A) and cBAD (Track B) for simple and complex documents, respectively. The cBAD (Track A) dataset contains pages having a simple layout (with neither marginal notes nor tables), whereas the cBAD (Track B) dataset is composed of more challenging document images (degraded document images that have rotated text lines, multi-column text, marginal notes, tables, etc.) [24].

The READ dataset³ was released in the context of the READ research project⁴ (cf. Fig. 7b). It is a private multi-writer corpus containing 10,000 German handwritten document images, digitized in color and collected from several European archives. Some images of the READ dataset have low resolution and poor-quality digitization [27].

The DIVA-HisDB dataset⁵ is a publicly available dataset (cf. Fig. 7c). It is a multi-writer dataset (the number of writers was not specified). It contains 150 Latin handwritten document images that were digitized at 600 dpi. It is composed of three medieval manuscripts: CSG18, CSG863 and CB55. The documents composing the DIVA-HisDB dataset have complex layouts and contain decorations, a main text block and comments written in different calligraphy styles [26].

3 https://scriptnet.iit.demokritos.gr/competitions/~icdar2017htr/.

4 https://read.transkribus.eu/.

5 https://diuf.unifr.ch/main/hisdoc/diva-hisdb.


Fig. 6 Illustrative scheme of the post-processing step used for extracting the whole text lines


The RASM dataset⁶ was released in the context of the ICFHR 2018 competition on recognition of historical Arabic scientific manuscripts (RASM 2018). It is composed of 100 images of scientific handwritten manuscripts written in Arabic between the tenth and twentieth centuries (cf. Fig. 7d). It was collected from the Qatar digital library. Some documents of the RASM dataset contain single-column handwritten text, while others have graphical as well as textual content. Furthermore, the documents composing the RASM dataset present various particularities (e.g., presence of marginal and handwritten notes, decorations and stamps, non-straight text lines, varying font sizes and column widths) that complicate the analysis process of such documents [47].

6 https://www.primaresearch.org/RASM2018/.

Fig. 7 Examples of historical document images used in our experiments: (a) cBAD; (b) READ; (c) DIVA-HisDB; (d) RASM; (e) ANT-A; (f) ANT-L

The ANT dataset is a private corpus collected from the ANT¹ (cf. Fig. 7e, f). It is composed of two sub-datasets of handwritten document images, ANT-A and ANT-L, which are differentiated by the text script. The ANT-A and ANT-L sub-datasets contain 180 and 116 documents written in Arabic and Latin, respectively. Both sub-datasets were obtained from the digitization of the constitution of the republic of Tunisia, written in the seventeenth century, digitized at 300 dpi and made available as high-resolution color images. The digitized documents of the ANT dataset have many particularities, such as the presence of various types of noise and degradation (e.g., black borders), a large variability of calligraphic styles and cursive lines [37].

4.2 Experimental protocol

To evaluate the four FCN architectures (classical U-Net, dilated FCN, RU-Net and adaptive U-Net) investigated in our work for text line segmentation, the ground truth at X-height level of our experimental corpora is required. Only the X-height annotations of the cBAD and READ datasets are available. On the other hand, the ground truths of the DIVA-HisDB, RASM and ANT datasets are not publicly available at X-height level. Hence, we have defined the ground truths at X-height level according to the PAGE format using Aletheia,⁷ a document image annotation tool [48]. Indeed, the X-height regions have been ground-truthed by zoning each text line and manually labeling the spatial X-height boundaries of text lines.

In our experiments, we have first pre-trained each FCN architecture using the dataset having the largest data volume, the READ dataset (2499 images), in order to overcome the issues related to the limited number of images in the training phase (e.g., model convergence). Afterward, we have fine-tuned each FCN architecture on the training sub-dataset of each dataset of our experimental corpora. Ratios of 80% and 20% have been used to split each dataset of our experimental corpora into training and test sub-datasets, respectively. Furthermore, to analyze the dependency of any given method on the training data, we have also fine-tuned each FCN architecture on the training sub-dataset of the cBAD dataset. The latter contains 755 document images (176, 40 and 539 document images for training, validation and test, respectively). Finally, the performances of the four FCN architectures have been computed on the different test sub-datasets of the cBAD (Track A), cBAD (Track B), DIVA-HisDB, RASM, ANT-A and ANT-L datasets.

In the training phase, the FCN architecture input for all evaluated datasets is the whole original image with its associated ground truth defined at X-height level. In the test phase, however, the architecture is fed only the whole original image, while the output is the obtained heatmap (probability matrix), which determines the pixels belonging to the X-height areas.

In order to reduce the computational complexity in terms of memory requirements (the number of trainable parameters) in the training phase, we have resized all document images used in our experiments to a smaller resolution (608×608 pixels). We have trained the four FCN architectures for a maximum number of training iterations set to 700 and a batch size set to 1.

7 https://www.primaresearch.org/tools.

All the experiments were conducted using the Keras framework on a TITAN X GPU with 12 GB of allocated memory. The benchmarking issues related to the four evaluated FCN architectures have been deduced after carrying out our experiments on 2499 images of the READ dataset for a maximum number of training iterations set to 10 and a batch size set to 1. To evaluate the performance of the proposed post-processing method, our experiments have been conducted on the ANT-A and ANT-L datasets, which contain 736 and 848 text lines, respectively.
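A hedged sketch of this protocol is given below, assuming a Keras model such as the tiny_unet shown in Sect. 3.1.1 and placeholder NumPy arrays (x_read, y_read, x_train, y_train) holding the resized 608×608 images and their X-height ground-truth masks; the optimizer and loss are our assumptions, as the paper does not state them.

# tiny_unet and the data arrays are the placeholders described above.
model = tiny_unet(input_shape=(608, 608, 3))
model.compile(optimizer="adam", loss="binary_crossentropy")

# Pre-training on the READ dataset to mitigate the limited amount of
# training data in the target datasets.
model.fit(x_read, y_read, batch_size=1, epochs=1)

# Fine-tuning on the 80% training split of the target dataset
# (at most 700 training iterations with batch size 1).
model.fit(x_train, y_train, batch_size=1, epochs=1)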

4.3 Performance evaluation metrics

Visual assessment of the effectiveness and robustness of a text line segmentation method is inherently subjective. Thus, it is important to quantitatively evaluate the performance of each evaluated FCN architecture according to the document script (Arabic or Latin) and layout (simple or complex).

In order to present a constructive comparison of the four evaluated FCN architectures with the different participating methods in the cBAD 2017 competition, we have computed the same per-pixel accuracy metrics used in the context of the cBAD 2017 competition: precision (P), recall (R) and F-measure (F). The higher the values of the computed performance evaluation metrics, the better the results. Furthermore, visually comparing the effectiveness of the proposed post-processing method is not sufficient. Hence, five performance evaluation metrics (“Match”, “Merge”, “Miss”, “Split” and “False alarm”), introduced by Galibert et al. [49], have been computed to quantitatively assess the performance of the proposed post-processing method. Figure 8 illustrates the five metrics computed to evaluate the performance of the post-processing method.

Fig. 8 Illustration of the performance evaluation metrics computed to evaluate the post-processing method
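For reference, a minimal NumPy sketch of the per-pixel precision, recall and F-measure; the boolean masks pred and gt mark the predicted and ground-truth X-height pixels. This is our reading of the per-pixel metrics, not the competition's reference implementation.

import numpy as np

def per_pixel_prf(pred, gt):
    # True positives: pixels predicted as X-height that really are.
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f_measure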

5 Results

In this section, the performances of the four evaluated FCN architectures as well as those of the post-processing method are presented and discussed.

5.1 X-height-based text line extraction step

To analyze the performance of the four investigated FCN architectures and provide additional insights into their numerical complexity, qualitative and quantitative results, followed by the computational cost of each FCN architecture, are first presented. Afterward, based on the obtained results, several observations and recommendations about the FCN architectures having the best trade-off between the highest performance and the lowest computational cost are discussed.

5.1.1 Qualitative and quantitative results

Figures 9 and 10 illustrate the resulting images obtained when using the classical U-Net [21], dilated FCN [23], RU-Net [22] and adaptive U-Net [37] architectures on a test image of the RASM and DIVA-HisDB datasets, respectively.

We note that both the dilated FCN and the classical U-Net architectures classify more non-textual content (representing different kinds of noise such as ink stains, back-to-front interference and bleed-through) as text lines than the adaptive U-Net and RU-Net architectures do. This misclassification of non-textual content could lead to biased assessments of the text recognition task. Hence, a post-processing method is required to refine the obtained results and thereby provide sufficient and reliable information for the text recognition task.

In order to objectively evaluate the performance of each FCN architecture on the different test sub-datasets of the cBAD (Track A), cBAD (Track B), DIVA-HisDB, RASM, ANT-A and ANT-L datasets, the precision (P), recall (R) and F-measure (F) metrics have been computed (cf. Tables 1 and 2). Table 1 presents the performances obtained by fine-tuning each FCN architecture on the training sub-dataset of each dataset of our experimental corpora. In order to investigate the versatility of the different evaluated architectures (i.e., how they handle heterogeneous datasets), Table 2 presents the performances obtained by fine-tuning each FCN architecture on the training sub-dataset of the cBAD dataset. In the tables below, the values quoted in italic and bold are the lowest and highest, respectively.

Fig. 9 Resulting images of applying the four FCN architectures on an example document image of the RASM dataset (extracted text lines at X-height level): (a) input image; (b) classical U-Net; (c) dilated FCN; (d) RU-Net; (e) adaptive U-Net

Fig. 10 Resulting images of applying the four FCN architectures on an example document image of the DIVA-HisDB dataset (extracted text lines at X-height level): (a) input image; (b) classical U-Net; (c) dilated FCN; (d) RU-Net; (e) adaptive U-Net

Table 1 Performance evaluation on the test dataset of the four assessed FCN architectures by fine-tuning each FCN architecture on the training sub-dataset of each dataset of our experimental corpora in the training phase

Dataset T P(%) R(%) F(%)

Classical U-Net

cBAD (Track A) 0.1 79.8 85.2 81.6

cBAD (Track B) 0.1 86.4 52.8 64.5

DIVA-HisDB 0.2 91.1 92.5 91.7

ANT-L 0.3 72.5 94.4 86.3

RASM 0.1 88.5 89.8 89.1

ANT-A 0.1 90.8 89.4 90.0

Average metric 84.8 84.0 83.8

Dilated FCN

cBAD (Track A) 0.3 77.9 85.4 80.6

cBAD (Track B) 0.1 73.7 81.2 75.4

DIVA-HisDB 0.5 92.6 91.9 92.2

ANT-L 0.7 86.5 89.7 87.9

RASM 0.4 87.9 88.4 88.0

ANT-A 0.4 88.4 90.8 89.4

Average metric 84.5 87.9 85.5

RU-Net

cBAD (Track A) 0.1 78.7 85.6 81.3

cBAD (Track B) 0.1 82.7 61.8 69.2

DIVA-HisDB 0.1 91.6 91.6 91.5

ANT-L 0.9 80.4 92.3 85.8

RASM 0.1 89.0 88.4 88.6

ANT-A 0.1 89.9 89.4 89.5

Average metric 85.3 84.8 84.3

Adaptive U-Net

cBAD (Track A) 0.1 80.6 83.9 81.4

cBAD (Track B) 0.1 82.0 59.1 67.4

DIVA-HisDB 0.1 90.7 92.7 91.6

ANT-L 0.9 81.9 92.4 86.7

RASM 0.1 89.0 90.6 89.7

ANT-A 0.1 90.9 90.3 90.5

Average metric 85.8 84.8 84.5

The values quoted in italic and bold are the lowest and highest, respectively.

Each FCN architecture output (heatmap) is represented by a probabilistic value that determines the pixels belonging to the X-height areas according to different rejection threshold values. Hence, for each document of the different test sub-datasets of the experimental corpus, we have automatically varied the rejection threshold value (T ∈ [0.1, 0.9]) and then explored the changes in the evaluation performance. For each test sub-dataset, the rejection threshold value (T) that maximizes the average F metric is retained. It is worth noting that the retained rejection threshold value may differ from one dataset to another depending on the particularities of the analyzed dataset.
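A sketch of this sweep, reusing the per_pixel_prf helper from Sect. 4.3 (an illustrative assumption): each heatmap is binarized at every candidate T, and the T maximizing the average F-measure over the test sub-dataset is retained.

import numpy as np

def best_rejection_threshold(heatmaps, ground_truths):
    # Candidate rejection thresholds T in [0.1, 0.9], step 0.1.
    candidates = np.round(np.arange(0.1, 1.0, 0.1), 1)
    mean_f = []
    for t in candidates:
        fs = [per_pixel_prf(h >= t, g)[2]
              for h, g in zip(heatmaps, ground_truths)]
        mean_f.append(np.mean(fs))
    # Retain the threshold that maximizes the average F-measure.
    return candidates[int(np.argmax(mean_f))]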

Table 2 Performance evaluation on the test dataset of the four assessed FCN architectures by fine-tuning each FCN architecture on the training sub-dataset of the cBAD dataset

Dataset T P(%) R(%) F(%)

Classical U-Net

cBAD (Track A) 0.1 76.0 83.7 78.3

cBAD (Track B) 0.1 72.4 72.0 69.9

DIVA-HisDB 0.1 74.4 81.4 77.3

ANT-L 0.8 82.5 89.5 85.6

RASM 0.3 72.0 82.3 76.5

ANT-A 0.1 65.0 79.3 71.2

Dilated FCN

cBAD (Track A) 0.3 74.5 81.1 75.9

cBAD (Track B) 0.2 68.8 74.9 66.9

DIVA-HisDB 0.2 75.2 79.5 76.9

ANT-L 0.7 79.8 86.8 82.9

RASM 0.3 66.8 74.4 70.2

ANT-A 0.4 65.4 78.2 71.0

RU-Net

cBAD (Track A) 0.2 76.6 83.0 78.3

cBAD (Track B) 0.1 69.0 74.9 71.8

DIVA-HisDB 0.1 72.6 80.6 75.6

ANT-L 0.7 81.9 87.7 84.4

RASM 0.5 70.7 82.9 75.7

ANT-A 0.5 62.0 80.4 69.7

Adaptive U-Net

cBAD (Track A) 0.3 75.9 85.1 79.1

cBAD (Track B) 0.2 67.4 80.5 70.5

DIVA-HisDB 0.9 68.6 84.6 74.9

ANT-L 0.6 82.5 88.7 85.2

RASM 0.7 77.2 84.7 79.8

ANT-A 0.9 70.6 80.1 74.4

The values quoted in italic and bold are the lowest and highest, respectively.

In Table 1, we observe that the best performance in terms of F is obtained by the dilated FCN architecture (75.4%, 92.2% and 87.9%) on the cBAD (Track B), DIVA-HisDB and ANT-L datasets, respectively. However, the classical U-Net architecture achieves the best performance, with F equal to 81.6%, on the cBAD (Track A) dataset. We also note that the adaptive U-Net architecture achieves the best performance, with F equal to 89.7% and 90.5%, on the RASM and ANT-A datasets, respectively. Hence, the dilated FCN architecture is well-suited for Latin document images, whereas the adaptive U-Net architecture is well-adapted for Arabic document images. Moreover, the classical U-Net architecture is well-suited for simple document images.

In Table 2, we observe that the best performance in terms of F is obtained by the adaptive U-Net architecture (79.1%, 79.8% and 74.4%) on the cBAD (Track A), RASM and ANT-A datasets, respectively. However, the classical U-Net architecture achieves the best performance, with F equal to 77.3% and 85.6%, on the DIVA-HisDB and ANT-L datasets, respectively. This can be justified by the robustness of the classical U-Net architecture in characterizing small patterns in the document images of the DIVA-HisDB and ANT-L datasets. Furthermore, we observe that the lowest performances in terms of F are obtained by the dilated FCN architecture (75.9%, 66.9%, 82.9% and 70.2%) on the cBAD (Track A), cBAD (Track B), ANT-L and RASM datasets, respectively. This can be explained by the fact that the dilated FCN is not well-adapted for strongly degraded document images.

By comparing the performances of the four evaluated FCN architectures on the cBAD dataset with those obtained by the five participating methods in Track A and Track B of the cBAD 2017 competition (DMRZ, based on the U-Net architecture; UPVLC, based on the interest point clustering technique; BYU, based on an FCN architecture; IRISA, based on a blurred image combined with connected component analysis and contextual alignment rules; and LITIS, based on the dilated FCN architecture) (cf. Table 3), we note that for both tracks the best performance in terms of F is obtained by the DMRZ team (97.1% and 85.9% for Track A and Track B, respectively).

Table 3 Performance comparison of the participating methods in both tracks of the cBAD 2017 competition with the four evaluated FCN architectures

Method P(%) R(%) F(%)

Track A

DMRZ 97.3 97.0 97.1

UPVLC 93.7 85.5 89.4

BYU 87.8 90.7 89.2

IRISA 88.3 87.7 88.0

LITIS 78.0 83.6 80.7

Adaptive U-Net 75.9 85.1 79.1

Classical U-Net 76.0 83.7 78.3

RU-Net 76.6 83.0 78.3

Dilated FCN 74.5 81.1 75.9

Track B

DMRZ 85.4 86.3 85.9

BYU 77.3 82.0 79.6

IRISA 69.2 77.2 73.0

RU-Net 69.0 74.9 71.8

Adaptive U-Net 67.4 80.5 70.5

UPVLC 83.3 60.6 70.2

Classical U-Net 72.4 72.0 69.9

Dilated FCN 68.8 74.9 66.9

The values quoted in italic and bold are the lowest and highest, respectively.

We also observe that the lowest values of F for both tracks are obtained by the dilated FCN (75.9% and 66.9% for Track A and Track B, respectively). We note a drop in performance on Track B compared with Track A. This can be justified by the particularities of the Track B documents, which are more complex and challenging than those of Track A. Moreover, we observe that the methods based on deep architectures provide better performance than the ad hoc ones (based on different image processing techniques, such as CC analysis). Finally, the four investigated architectures are not among the top-ranked ones. This can be explained by the fact that we did not use the same experimental protocol: the baseline representation was used in the ground-truth definition for the cBAD competition.

5.1.2 Computational cost

Table 4 presents the computational cost (i.e., the learning time, the number of trainable parameters and the inference time) of the four FCN architectures investigated in our work.

The ‘++’, ‘+’, ‘-’ and ‘--’ marks are placed in Table 4 after comparing the F values of the four evaluated architectures according to the document category (Arabic, Latin, simple and complex), independently of the performances on the other document categories. Table 4 is deduced by reading horizontally across the rows of Table 2 (which shows the performances obtained by fine-tuning each FCN architecture on the training sub-dataset of the cBAD dataset). For example, we observe the following F values: 69.9%, 66.9%, 71.8% and 70.5% when using the classical U-Net, dilated FCN, RU-Net and adaptive U-Net architectures, respectively, on the cBAD (Track B) dataset. Hence, in Table 4 we have put ‘-’, ‘--’, ‘++’ and ‘+’ for the category of complex documents. On the other hand, we observe the following F values: 78.3%, 75.9%, 78.3% and 79.1% when using the classical U-Net, dilated FCN, RU-Net and adaptive U-Net architectures, respectively, on the cBAD (Track A) dataset. Hence, in Table 4 we have put ‘+’, ‘-’, ‘+’ and ‘++’ for the category of simple documents. From another angle, if we compare the obtained F values of the four architectures for simple and complex documents in Table 2, we confirm that the four architectures perform better for simple documents than for complex ones.

We note that the learning time and the number of trainable parameters are not fully congruent. Usually, the learning time depends on the number of trainable parameters. Nevertheless, we observe that the dilated FCN architecture has the lowest number of trainable parameters but the highest learning time. This can be justified by the adjustment
