A synthetic population for the greater São Paulo metropolitan region


Working Paper


Author(s):

Sallard, Aurore; Balać, Miloš; Hörl, Sebastian

Publication Date:

2020-08

Permanent Link:

https://doi.org/10.3929/ethz-b-000429951

Rights / License:

In Copyright - Non-Commercial Use Permitted


Aurore Sallard
IVT, ETH Zürich, 8093 Zürich, Switzerland
phone: +41-44-633-38-01
email: aurore.sallard@ivt.baug.ethz.ch
orcid: 0000-0001-6465-858X

Miloš Balać
email: milos.balac@ivt.baug.ethz.ch
orcid: 0000-0002-6099-7442

Sebastian Hörl
email: sebastian.hoerl@ivt.baug.ethz.ch
orcid: 0000-0002-9018-432X

Words: 7468 words

ABSTRACT

This paper presents an open-source, generalized pipeline for creating a synthetic population for the Greater São Paulo Metropolitan Region, based entirely on open data. The pipeline first developed for and applied to the Île-de-France region is used as a baseline. Using data-driven algorithms, the pipeline creates a path from raw data to the synthetic population and, further, to the final mobility scenario.

A definite advantage of this approach is that it makes it easy to reproduce not only synthetic populations but also transportation studies. The synthetic population of São Paulo, which comprises as many agents as there are inhabitants in the area, is created using this framework and then analyzed. All considered indicators suggest that this approach is able to model the population well, even if certain gaps could be filled with additional information.

INTRODUCTION

Agent-based models can represent complex interactions of single entities in large systems. In transportation, they are used to model, on a large scale, the interactions between individual travelers, their impact on the transportation system, and the different transportation providers and their operational decisions.

Agent-based models have recently gained popularity for simulating human behavior. The main reasons are emerging mobility solutions, such as shared and on-demand mobility, and dynamically changing demand and supply, which create the need for short-term operational decisions. Agent-based models can therefore be a very useful tool to model new mobility solutions and their operational challenges. However, to do this, they require substantial input data. This data can usually be separated into transport supply and demand. The core of the transportation demand consists of the individuals who perform activities in the study area and their activity patterns. In the literature, this demand is usually referred to as the synthetic population with activity chains.

This paper presents an open-source approach that uses only open data to create the synthetic population with activity chains for the Greater São Paulo Metropolitan Region, the largest urban area in South America and the ninth largest in the world, with an estimated population of 21 million inhabitants. This metropolis spreads over an area of 8 thousand square kilometers, connects 39 municipalities, and lies at the center of the São Paulo Macrometropolis, a megalopolis of more than 30 million inhabitants. São Paulo is the cultural, economic and financial center of Brazil and alone accounts for 10.7% of the Brazilian GDP.

While the authors’ main goal is to use the generated synthetic population as an input to an agent-based model, it can be used for other purposes as well. To this end, the synthetic population is provided in different formats and data frames, which are only later transformed into the format required by an agent-based model.

The rest of the paper guides the reader through the open data used for this work, the different stages of the population synthesis pipeline developed in the eqasim framework, and the validation results, before concluding with a discussion of the methodology and results.

BACKGROUND

Aggregated four-step models (Ortuzar and Willumsen (1)) have been used in transportation for decades to assess the impact of new policies or investments. However, they do not consider individuals’ decisions and their interactions, and they do not capture the fact that the demand for travel comes from the necessity to perform activities, as shown in Chapin (2). This is why the more recent activity-based and agent-based models have been applied successfully in the field of transportation science.

Activity-based models (for early reviews, see Kitamura (3), Axhausen and Gärling (4) and Recker (5)) emerged as an answer to the drawbacks of the four-step models. This approach is based on the work presented in the 1970s in Chapin (2) and Hägerstrand (6), the latter formulating that an individual’s activities are limited by social and personal constraints.

Activity-based models allow for scheduling activities and making mode and destination choices at the scale of the individual, within the household context. Several methods, presented in Chu et al. (7), can be applied to synthesize daily activity patterns. Wen (8) developed an operational econometric model for generating complex daily patterns, taking interdependencies within households into account and including activity location assignments and travel mode choices. Lee et al. (9) used the Household Travel Survey conducted in the Tucson area in fall 2000 to construct models that help to better understand trip-chaining behavior within five different categories of households. A third approach, featuring discrete choice models, was proposed in Bowman (10) and Bowman (11). In those works, the daily activity pattern is seen as a set of tours, each one being characterized by a primary (here meaning “most important”) activity. This approach was applied to the Portland area; the results include a synthesized, detailed daily activity pattern for each individual in the population.

Unlike the four-step approach, activity-based models enable insights into various aspects of the results. For instance, with the discrete choice approach from Bowman (10) and Bowman (11), the results can be aggregated either according to some socio-demographic attribute or at a zonal level. Nevertheless, to achieve this, one needs to estimate and calibrate sophisticated econometric models. Furthermore, activity-based models have mostly been developed for a small number of regions, which makes them difficult to transfer to other areas. They are also often not open-source or lack documentation. A notable exception is ActivitySim (12), an open-source platform for activity-based travel modeling, developed and used by multiple transportation agencies in the USA.

Agent-based models (described in e.g. Bonabeau (13)), on the other hand, are founded on a synthetic population in which the agents’ attributes reflect the distributions observed in the actual population. For this purpose, several different data sets, which may have been collected at different times, often have to be combined. This requires mathematical procedures such as iterative proportional fitting (IPF, see Wong (14) or Norman (15)) or the more recent iterative proportional updating (IPU, see Ye et al. (16)). For instance, ActivitySim is based on PopulationSim, a framework developed by the same teams, which creates a synthetic population from marginal data obtained from the USA census.

Agent-based models aim at simulating the agents’ behavior and their competition to access and use transport infrastructure. Such models make it possible to model congestion patterns and interactions between individuals, a characteristic often needed today as several transportation services coexist and not only compete with each other, but can also be used in a complementary way. Moreover, they allow for modeling highly dynamic services and interactions on a shorter time scale than activity-based models can provide. However, here too, the lack of documentation of the processes leading to the creation of the synthetic population often makes the scenarios not reproducible or not verifiable. Furthermore, the data on which those scenarios are based are rarely open.

In Hörl and Balać (17), the authors provide an integrated and open-source pipeline for generating a synthetic population from raw data. Thanks to its modularity, this framework can be adapted and extended with ease. A first application of this pipeline to Île-de-France, the region around Paris, is described. First, the input data are presented: the national census, two household travel surveys (one regional and one national), the national tax registry, and a data bank containing all work places, shops and leisure-related places, all of them being open data sets. Then, the process leading from this raw data to the final synthetic population is documented in detail. Afterwards, an error analysis is performed, setting the theoretical basis for further assessment of the quality of the synthetic population. This pipeline has been applied to other case studies, namely California (Balać and Hörl (18)) and Switzerland (Hörl et al. (19)), the latter being an exception as it is not based on open data.

This paper presents another use case of the eqasim framework presented in Hörl and Balać (17). As in the Île-de-France scenario, all the data used here are open. The goal of the paper is to present a way to create a synthetic population of the São Paulo region from raw open data, with minimal calibration effort, that can be used for further behavioral, socio-demographic and transportation analysis.

INPUT DATA

Input data are the essence of each agent-based scenario. They can be divided into two categories, the first one representing the transport supply in the study area, while the second focuses on mobility demand. In the context of this paper, the emphasis is placed on the mobility demand.

The demand consists of a synthesized population, namely a set of agents characterized by their attributes and their plans. A plan is an activity chain describing an agent’s typical schedule during an average working day. It also contains information on the desired times and locations at which the agent wishes to perform those activities and on the trips linking one activity to the next. The attributes describe the socioeconomic condition of the agents and provide information on the transport modes that they can access. Agents are grouped into households, which are themselves characterized by certain attributes.

The transport supply typically consists of a street network and a public transport network. Information concerning transit schedules is required as well, and it is necessary to supplement the road network with facility locations, a facility being a place where an agent can perform an activity.

In this section, the different sources used to create the synthetic population for the Greater São Paulo Metropolitan Region are presented.

Zonal system

Figure 3 shows the extent of the study area, which corresponds to the administrative borders of the Greater São Paulo Metropolitan Region. In spite of its contribution to the traffic flows in the study area, the city of Santos could not be included in the model because the household travel survey data does not cover this area. Despite its proximity to the Atlantic Ocean, São Paulo is located on a plateau with an average elevation of about 800 meters above sea level.

The study area was divided into 633 zones, depicted in Figure 1. This zonal system is the one that was used in the census (which will be described below). It has been used since the 1970s and is reviewed regularly. It divides the territory into zones according to geographical characteristics, such as population density, concentration of activities and presence of historical monuments and natural spaces. Moreover, this system ensures a rather homogeneous distribution of the population among the zones: in each of them, the number of residents is between 20 000 and 55 000.

Facility locations

Facility locations (including homes, work places, shops and leisure-related places) were retrieved from OpenStreetMap (20) (OSM).

In neighborhoods where no home place could be found through OSM, home locations were assigned along residential or living streets. Moreover, as OSM data lacks a substantial number of educational places in the study area, a data set from the São Paulo Ministry of Education, “Dados Abertos da Educação” (21), was employed to fill this gap. This data set contains in particular the geographical coordinates of all education places in the state of São Paulo, but the level of education offered is unfortunately missing.

Mobility demand – the population

Two main data sources were used as inputs to create the population. The first one is a census conducted in 2010 in Brazil (22). After removing all samples with a home place outside the State of São Paulo, 3 622 779 weighted samples remained. For each of them, information is provided on the individual’s age, gender, personal income, and employment and/or student status. Many other attributes are available, but they were not used in the present study. Household-related attributes are available as well, such as total household income, car and motorcycle availability, number of household members, and municipality and area codes of the place of residence. Among those individuals, 1 211 311 live in the study area and their weights sum up to 19 918 293, which is approximately the total number of inhabitants in the Greater São Paulo Metropolitan Region at the time the census was conducted. This census is necessary to make sure that the attribute distributions (whether individual or household related) in the synthesized population accurately reflect the real ones. Moreover, it shows the diversity of São Paulo’s population. Figure 1, for instance, depicts the average (weighted) personal income per administrative zone in the study area, computed as the total household income divided by the number of individuals in the household. The wealth inequalities are obvious: in the most peripheral neighborhoods, the average personal income appears to be lower than 1 000 BRL, whereas it can reach more than 6 000 BRL in the most central districts. For comparison, as of January 2020, the legal minimum salary in São Paulo is 1 163 BRL; 1 BRL is equivalent to 0.18 USD or 0.17 EUR (exchange rate accessed on May 6th, 2020).
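The per-zone income indicator described above can be illustrated with a minimal sketch. It assumes person-level records with hypothetical field names (`zone`, `household_income`, `household_size`, `weight`); the actual census columns and the pipeline's own implementation may differ.

```python
from collections import defaultdict

def average_income_per_zone(persons):
    """Weighted average of per-capita income (household income / household size) by zone.

    The field names used here are hypothetical, chosen for illustration only.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for person in persons:
        per_capita = person["household_income"] / person["household_size"]
        weighted_sum[person["zone"]] += per_capita * person["weight"]
        weight_total[person["zone"]] += person["weight"]
    return {zone: weighted_sum[zone] / weight_total[zone] for zone in weighted_sum}

# Toy records, not real census data.
persons = [
    {"zone": "A", "household_income": 4000.0, "household_size": 4, "weight": 30.0},
    {"zone": "A", "household_income": 3000.0, "household_size": 1, "weight": 10.0},
    {"zone": "B", "household_income": 1800.0, "household_size": 3, "weight": 5.0},
]
income_by_zone = average_income_per_zone(persons)
```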

FIGURE 1 Average personal income, computed from the weighted household income, depending on the residence administrative zone, in BRL.

Background map ©OpenStreetMap contributors

The second data source is the household travel survey (HTS) conducted in the Greater São Paulo Metropolitan Region in 2017 (23). It contains 84 889 weighted samples whose weights sum to 20 508 979, approximately the number of inhabitants in the area in 2017. For each sample, individual attributes are provided, namely age, gender, personal income and employment status, together with information related to the household, for instance household income and the number of available cars and bikes. The most important part of the survey is the travel diaries of the interviewed individuals. They make it possible to track each individual’s schedule during an average working day. Each entry in the data set corresponds to a trip linking two given activities, which take place at locations known at the coordinate level. Moreover, the characteristics of each trip are available: departure and arrival time and chosen mode. Some respondents also answered questions about the type of parking they used and how much they paid for it. However, those persons were too few to make further use of this information possible.

Origin–Destination matrices

An origin–destination matrix is a matrix in which each cell represents the number of trips from an origin zone (given by the corresponding row of the matrix) to a destination zone (column), or the percentage of trips starting in the origin zone that reach the destination zone. Such matrices can be created from the household travel survey. In this study, one weighted origin–destination matrix was generated for work trips.
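As an illustration of both readings of the matrix (weighted trip counts and row-wise shares), the following sketch builds an origin–destination matrix from a list of weighted trips. The trip tuples and zone names are invented for the example, and the dictionary-of-dictionaries layout is only one possible representation.

```python
from collections import defaultdict

def weighted_od_matrix(trips):
    """Sum the survey weights of all trips for each (origin, destination) pair."""
    matrix = defaultdict(lambda: defaultdict(float))
    for origin, destination, weight in trips:
        matrix[origin][destination] += weight
    return matrix

def row_shares(matrix):
    """Convert each row into the share of trips from the origin reaching each destination."""
    shares = {}
    for origin, row in matrix.items():
        total = sum(row.values())
        shares[origin] = {destination: w / total for destination, w in row.items()}
    return shares

# Toy work trips: (origin zone, destination zone, survey weight).
work_trips = [
    ("Z1", "Z1", 120.0),
    ("Z1", "Z2", 80.0),
    ("Z2", "Z1", 50.0),
    ("Z1", "Z2", 40.0),
]
od = weighted_od_matrix(work_trips)
od_shares = row_shares(od)
```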

CREATION OF A SYNTHESIZED POPULATION

The goal of this section is to present the process leading to the creation of a synthesized population using the data presented in the previous section. The main steps of this process are summarized in Figure 2. The pipeline is available as a public GitHub repository (24). Apart from the framework generating the synthetic population, which is described below, this repository also provides scripts embedding this population synthesis into a transport simulation realized with MATSim (25) using the discrete mode choice extension (26).

Generate agents from census → Match them with HTS samples → Create synthetic households and assign them home locations → Assign locations of primary activities to the synthetic agents → Assign locations of secondary activities to the synthetic agents → The synthetic population is ready

FIGURE 2 Overview of the population synthesis

The first step is to pre-process the input data in order to keep only relevant persons and trips. Then, the synthesized agents and households are created from the census and matched to the HTS individuals according to a number of attributes. Those agents are then assigned to a specific home location. Finally, the agents’ plans are finalized with the imputation of activity locations.

Pre-processing the input data

While most of the data sets are used in their original form, some of the information from the HTS needed to be adapted to reduce complexity. These adaptations are presented in what follows.

Employment, transport mode and trip purpose categories

In the HTS, respondents were allowed to choose among many different transportation modes. In order to simplify the modeling tasks, they were all merged into eight modes, including public transport, car, car passenger, walk, bike, taxi and ride-hailing.

Similarly, the trip purposes (the activities done at the trip destination) were merged into six categories (home, work, shopping, leisure, education, and other). It should be noted that trips made by non-studying adults to escort their children to or from school were considered as “education” trips in the original data set. Those activities were changed to “other” so that the prevalence of the activity chains in the output data is more reliable.

With regard to the socio-demographic attributes, it was also decided to reduce the number of employment categories from eight to three: employed, not employed and student.

Comparing the employment numbers in the HTS with those in the census revealed a large disparity in the number of unemployed persons. The HTS contains an additional variable about current school enrollment. Therefore, we checked whether those going to school are classified as students. While a substantial number are, some individuals were classified as either “jobless” or “has never worked”. For these, we changed the employment status to “student”. As a result, the respective shares of students, employed and unemployed individuals in the HTS are closer to those observed in the census, as Figure 5 shows.

Adding information on residence area

One’s mobility patterns are also influenced by one’s residential environment. For instance, in less densely inhabited zones, a trip tends to be longer than in a highly populated neighborhood, and car prevalence tends to decrease in the most urbanized areas, mostly due to the difficulty of finding (affordable) parking. It was decided to capture this phenomenon by creating a new attribute, which splits all individual samples from the census and the household travel survey into three groups depending on the location of their home. Figure 3 shows the three zones that were defined. As the figure shows, a purely geographical definition of those three zones was chosen. Within the pipeline, these zones can easily be replaced by new ones defined by different criteria.

Creating synthetic households

After cleaning the census and the household travel survey, it is possible to create synthesized agents by directly expanding the census data according to their weights. As the census is anonymized by providing only a home zone, the exact home location is assigned later in the pipeline. In the next step, each sampled individual is matched to an observation from the household travel survey using hot-deck matching ((27), (17)).
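Expanding weighted census records into individual agents can be sketched as follows. Rounding each weight to the nearest integer is an assumption made for this illustration; the pipeline may treat fractional weights differently, and the field names are hypothetical.

```python
def expand_by_weight(samples):
    """Create one agent per unit of (rounded) sample weight.

    Rounding is an assumption of this sketch, not necessarily the pipeline's rule.
    """
    agents = []
    for sample in samples:
        copies = int(round(sample["weight"]))
        for _ in range(copies):
            # Copy all attributes except the weight, which is consumed by the expansion.
            agent = {key: value for key, value in sample.items() if key != "weight"}
            agent["agent_id"] = len(agents)
            agents.append(agent)
    return agents

# Toy census samples with hypothetical attribute names.
census_samples = [
    {"age": 34, "gender": "f", "zone": "Z1", "weight": 2.6},
    {"age": 61, "gender": "m", "zone": "Z2", "weight": 1.2},
]
agents = expand_by_weight(census_samples)
```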

The idea is to find all source observations (i.e. all samples from the household travel survey) that match the target observations (i.e. synthetic agents previously sampled from the census) on a list of given matching attributes, and then to randomly sample one of those source observations. To avoid over-fitting, if too few source observations are found for a given target observation, some matching attributes are removed to enlarge the set of matching source observations.

FIGURE 3 The three residential areas defined in the Greater São Paulo Metropolitan Region. The red, inner zone corresponds to the city center of São Paulo; the orange one to the administrative borders of the City of São Paulo and the yellow zone to the rest of the district.

Background map ©OpenStreetMap contributors

The attributes taken into account for the matching are age class, gender, employment status and availability of a car in the household. In addition, observations that are similar with respect to the residence area (as defined in the section “Adding information on residence area”) are preferred.
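The matching logic can be sketched as below. This is a simplified illustration, not the implementation from (17) or (27): candidates are drawn uniformly rather than by survey weight, the preference for the same residence area is left out, and the threshold `min_candidates` and all attribute names are hypothetical.

```python
import random

def hot_deck_match(target, sources, attributes, min_candidates=2, rng=random):
    """Pick a random source observation that agrees with the target on the attributes.

    `attributes` is ordered from most to least important; when fewer than
    `min_candidates` sources match, the least important attribute is dropped.
    Assumes `sources` is not empty.
    """
    active = list(attributes)
    while True:
        candidates = [s for s in sources if all(s[a] == target[a] for a in active)]
        if len(candidates) >= min_candidates or not active:
            break
        active.pop()  # relax the matching by removing the last attribute
    return rng.choice(candidates), active

# Toy household travel survey observations.
hts = [
    {"id": 1, "age_class": "30-44", "gender": "f", "employed": True, "car": True},
    {"id": 2, "age_class": "30-44", "gender": "f", "employed": True, "car": False},
    {"id": 3, "age_class": "30-44", "gender": "f", "employed": True, "car": False},
    {"id": 4, "age_class": "15-29", "gender": "m", "employed": False, "car": True},
]
agent = {"age_class": "30-44", "gender": "f", "employed": True, "car": True}
match, used_attributes = hot_deck_match(
    agent, hts, ["age_class", "gender", "employed", "car"], rng=random.Random(1)
)
```

In this toy case only one observation matches on all four attributes, so car availability is dropped and the match is drawn from the three remaining candidates.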

Imputing primary locations

Once the agents have been assigned a daily plan based on the household travel survey, a location for each of their primary activities (home, work and education) has to be defined. The aim of this step is twofold: first, a correct number of agents should commute from one zone to another; second, the commute distances should fit the activity chains that have been assigned to the agents in the previous step. While only an overview of the algorithms is given here, more details can be found in (17).

Imputing home locations

The next step consists of assigning each synthesized household to a home location. The administrative zone in which each agent lives is known from the census. As all admissible home locations are available from the facility locations database, it is straightforward to impute a home place to each synthesized household or agent by randomly selecting one among all available locations in that zone.
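A minimal sketch of this random selection is shown below, assuming a hypothetical mapping from zone identifiers to candidate home coordinates. Drawing with replacement (several households may share a location) is an assumption of the sketch.

```python
import random

def assign_home_locations(household_zones, homes_by_zone, rng=random):
    """Draw one home location per household among the candidates of its zone."""
    return {
        household_id: rng.choice(homes_by_zone[zone])
        for household_id, zone in household_zones.items()
    }

# Toy data: household id -> zone, and zone -> candidate home coordinates.
household_zones = {"hh1": "Z1", "hh2": "Z1", "hh3": "Z2"}
homes_by_zone = {
    "Z1": [(0.0, 0.0), (10.0, 5.0), (3.0, 8.0)],
    "Z2": [(100.0, 100.0)],
}
home_locations = assign_home_locations(household_zones, homes_by_zone, rng=random.Random(0))
```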

Imputing work locations

Once the agents are assigned a home location, they can be provided with work locations if they have a work-related trip in their activity chain. For this purpose, the origin–destination (OD) matrices are used.

Given the residence district of an agent, their workplace district is sampled from the corresponding row of the weighted OD matrix. This yields, for each pair of districts $(k, k')$, the exact number of agents living in zone $k$ and commuting to zone $k'$, denoted by $f_{k,k'}$. One can then sample $f_{k,k'}$ exact destinations from the data set containing all available work places in zone $k'$. The set of coordinates resulting from this step is denoted by $C_{k,k'}$. Those coordinate sets are then aggregated by home district $k$:

$$C_k := \bigcup_{k'} C_{k,k'}.$$
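The two sampling steps above (work zones first, then candidate work places) can be sketched as follows. The OD shares, zone names and coordinates are invented, and sampling work places with replacement is an assumption of this sketch rather than a documented property of the pipeline.

```python
import random
from collections import Counter

def sample_flows(home_zones, od_shares, rng):
    """Sample a work zone for each agent and count the flows f[(home, work)]."""
    flows = Counter()
    for home in home_zones:
        row = od_shares[home]
        zones = list(row)
        work = rng.choices(zones, weights=[row[z] for z in zones], k=1)[0]
        flows[(home, work)] += 1
    return flows

def sample_candidates(flows, workplaces_by_zone, rng):
    """Build C_k: for each home zone, the sampled work place coordinates over all work zones."""
    candidates = {}
    for (home, work), count in flows.items():
        # With replacement: several agents may share one work place (assumption).
        drawn = rng.choices(workplaces_by_zone[work], k=count)
        candidates.setdefault(home, []).extend(drawn)
    return candidates

rng = random.Random(0)
od_row_shares = {"Z1": {"Z1": 0.5, "Z2": 0.5}, "Z2": {"Z1": 1.0}}
workplaces_by_zone = {"Z1": [(0.0, 0.0), (1.0, 1.0)], "Z2": [(5.0, 5.0)]}
agent_home_zones = ["Z1"] * 6 + ["Z2"] * 3
flows = sample_flows(agent_home_zones, od_row_shares, rng)
work_candidates = sample_candidates(flows, workplaces_by_zone, rng)
```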

The last step consists of finding a bijective function that maps each person $u$ to the coordinates of a work place $c \in C_k$ such that the distance between the agent’s home and their work place corresponds to the commute distance found in the household travel survey. If there is no direct trip between home and work places in the household travel survey, a random distance is drawn from the commute distances found in this survey.
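One simple way to build such a one-to-one assignment is a greedy heuristic, sketched below. It is given only to make the idea concrete and is not claimed to be the algorithm used in the pipeline (see (17) for the actual method); it assumes as many candidate work places as persons.

```python
import math

def match_by_distance(persons, candidates):
    """Greedy one-to-one assignment of work places to persons.

    persons: list of (person_id, home_xy, target_distance)
    candidates: list of work place coordinates, same length as persons.
    Each candidate is used exactly once, so the mapping is a bijection.
    """
    remaining = list(candidates)
    assignment = {}
    for person_id, home, target in persons:
        # Choose the unused candidate whose distance to home is closest to the target.
        best = min(
            range(len(remaining)),
            key=lambda i: abs(math.dist(home, remaining[i]) - target),
        )
        assignment[person_id] = remaining.pop(best)
    return assignment

commuters = [("u1", (0.0, 0.0), 5.0), ("u2", (0.0, 0.0), 1.0)]
work_places = [(1.0, 0.0), (3.0, 4.0)]
assignment = match_by_distance(commuters, work_places)
```

A greedy pass depends on the order in which persons are processed; an optimal assignment would require a dedicated matching algorithm.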

Imputing education locations

The imputation of the education locations followed a different approach. For the less dense districts, too few observations were registered, which led to biased OD matrices. Moreover, the facility data sets obtained from the Ministry of Education did not provide enough information about the category of education facility (kindergarten, primary or high school, or university). Another method was therefore implemented.

All education-related trips from the household travel survey were first split into several groups depending, first, on the residence area type (see the section “Adding information on residence area”) the agent lives in, second, on the agent’s gender, and, third, on the age of the individual who made the trip (and thus on the category of education facility the individual visited: pre-school or elementary school for children aged 14 or less, high school or technical school for teenagers aged 14 to 18, university for people aged 18 to 30, and various places for agents aged 30 or more). For each of these groups, it was then possible to construct the histogram of the distances separating the education place from the home of the sampled individuals. Finally, a probability density function corresponding to each histogram was obtained.

For each agent, a target distance was drawn from the probability density function of the group (age and type of residence area) the agent belongs to. Using a two-dimensional k-d tree, an education place was then selected such that the distance separating it from the agent’s home location was as close to the target distance as possible.
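The distance-based selection can be sketched as follows. For brevity, the sketch scans all facilities instead of using a k-d tree, which would only speed up the search; the bin width, group labels and coordinates are hypothetical, and the piecewise-uniform density is one simple way to turn a histogram into a sampling distribution.

```python
import math
import random
from collections import Counter, defaultdict

BIN_WIDTH = 500.0  # meters; hypothetical bin size

def build_histograms(trips):
    """trips: iterable of (group, distance_m). Returns {group: Counter(bin index -> count)}."""
    histograms = defaultdict(Counter)
    for group, distance in trips:
        histograms[group][int(distance // BIN_WIDTH)] += 1
    return histograms

def draw_target_distance(histogram, rng):
    """Draw a bin proportionally to its count, then a uniform distance inside that bin."""
    bins = list(histogram)
    chosen = rng.choices(bins, weights=[histogram[b] for b in bins], k=1)[0]
    return (chosen + rng.random()) * BIN_WIDTH

def pick_facility(home, facilities, target):
    """Return the facility whose distance to home is closest to the target distance."""
    return min(facilities, key=lambda f: abs(math.dist(home, f) - target))

group = ("center", "f", "14 or less")
education_trips = [(group, 300.0), (group, 450.0), (group, 1200.0)]
histograms = build_histograms(education_trips)
target = draw_target_distance(histograms[group], random.Random(0))
schools = [(400.0, 0.0), (0.0, 1300.0), (5000.0, 0.0)]
chosen_school = pick_facility((0.0, 0.0), schools, 1250.0)
```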

Imputing secondary locations

The imputation of secondary locations, i.e. places in which leisure, shopping or other activities are performed, follows a method described in (28) or, more briefly, in (17). Here, only the basic idea is given, so as to provide the reader with some intuition about the algorithm.

While primary activities (home, work or education) have fixed locations, which were determined in the previous paragraphs, secondary activities (shopping, leisure and other) have not yet been assigned particular locations. The activity chains can be split into smaller chains in which two fixed activities, the first and last ones, are separated only by assignable activities. From the household travel survey, one ideally knows how long the trips of each sub-chain should be.

First, all trips present in the household travel survey are divided into bins of modes and travel times. Then, given the transport mode and the ideal travel time of each trip that has to be assigned a location, a distance is sampled from the bins previously created. Afterwards, a gravity model is used to assign the variable activities to locations, defined by coordinates, such that the resulting distances resemble the sampled ones. Finally, the closest facility of the target activity type is selected from the facility data sets (for instance, if an agent has to go “shopping”, the sampled coordinates are snapped to the nearest available shop).
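Two of these steps, the sampling of a distance from bins of mode and travel time and the final snapping to the nearest facility, can be sketched as below. The gravity-model step of (28) is deliberately left out, the 10-minute bin size is a hypothetical choice, and the sketch assumes that every requested bin contains at least one observed trip.

```python
import math
import random
from collections import defaultdict

TIME_BIN_MINUTES = 10  # hypothetical bin size

def build_bins(trips):
    """Group observed trip distances by (mode, travel time bin)."""
    bins = defaultdict(list)
    for mode, travel_time_min, distance_m in trips:
        bins[(mode, int(travel_time_min // TIME_BIN_MINUTES))].append(distance_m)
    return bins

def sample_distance(bins, mode, travel_time_min, rng):
    """Draw an observed distance from the bin matching the mode and travel time."""
    return rng.choice(bins[(mode, int(travel_time_min // TIME_BIN_MINUTES))])

def snap_to_nearest(point, facilities):
    """Return the facility closest to the sampled coordinates."""
    return min(facilities, key=lambda f: math.dist(point, f))

observed_trips = [("car", 12, 5000.0), ("car", 18, 8000.0), ("walk", 12, 900.0)]
distance_bins = build_bins(observed_trips)
car_distance = sample_distance(distance_bins, "car", 15, random.Random(0))
walk_distance = sample_distance(distance_bins, "walk", 10, random.Random(0))
shops = [(0.0, 0.0), (950.0, 20.0), (4000.0, 4000.0)]
nearest_shop = snap_to_nearest((900.0, 0.0), shops)
```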

INSIGHT INTO THE SYNTHESIZED POPULATION

The process described above enabled the creation of a synthesized population in which the agents have been given activity chains obtained from the household travel survey and in which those activities are performed in places drawn with various sampling methods from the facility databases.

Because the census is very accurate and the synthetic agents and households are directly sampled from this data set, a validation step to assess the accuracy of the distribution of socio-demographic attributes in the synthesized population is not necessary. This is why this point is not addressed below.

Comparison of the activity chains in the synthesized and actual populations

Figure 4 shows the distribution of activity chains in the synthesized population and compares it to the observed distribution obtained from the household travel survey.

This graph suggests that the synthesis process was quite accurate: the activity chains appear in the correct order of prevalence and the observed differences between the actual population and the synthesized one are always lower than two percentage points.

It can however be seen that chains containing at least one “work” activity (like "h-w-h" or "h-w-l-w-h" in Figure 4) are more frequent in the synthesized population than in the HTS population. The reason is that the two surveys employed for this study were not conducted in the same year. Indeed, the distribution of the population among the three employment categories, namely “employed”, “unemployed” (which includes retired people as well) and “student”, changed during the seven years separating the census (conducted in 2010) from the household travel survey (conducted in 2017). This is shown in Figure 5.

FIGURE 4 Activity chains comparison.

h stands for “home”, w for “work”, e for “education”, l for “leisure”, s for “shopping” and o for “other”.

FIGURE 5 Distribution among employed, unemployed and currently studying persons in the census and in the household travel survey

The employment rate as well as the percentage of students in the population dropped between 2010 and 2017. The comparison is performed between the synthesized population, sampled from the census conducted in 2010, and the activity chains present in the household travel survey of 2017, when the unemployment rate had increased. This is why the plans containing one or more work or education activities are slightly over-represented in the synthetic population. For the same reason, the share of agents that do not leave their home (those whose activity chain is only "h") tends to be higher in the household travel survey than in the synthetic population derived from the census. Official sources confirm the quite dramatic rise of unemployment in São Paulo: the unemployment rate in the metropolis was around 7% in 2010 (Instituto Brasileiro de Geografia e Estatística (29)) and increased to 13.4% in 2017 (Instituto Brasileiro de Geografia e Estatística (30)).

Number of activities in the activity chains and per purpose

It is of interest to look at the number of activities performed by the agents, which is shown in Figure 6. A number of activities equal to zero means that the agent did not conduct any trip during the day; otherwise, this number was computed by excluding the starting and ending “home” activity. For instance, the chain "h-w-h-o-l-h" is considered to have four activities.
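The counting rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_activities` and the dash-separated string encoding of a chain are assumptions made for this example and are not taken from the published pipeline.

```python
def count_activities(chain: str) -> int:
    """Count the activities in a chain such as "h-w-h-o-l-h".

    Assumption: every chain starts and ends at home. These two
    bracketing "home" activities are excluded, while intermediate
    returns home are counted like any other activity.
    """
    activities = chain.split("-")
    # An agent who stays at home all day ("h") performs no activity.
    if len(activities) <= 1:
        return 0
    return len(activities) - 2
```

With this rule, "h-w-h-o-l-h" yields four activities (w, h, o, l), and a chain consisting only of "h" yields zero.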

FIGURE 6 Comparison of the number of activities in the agent’s activity chains between the hts and the synthetic population

It can be observed that the relative prevalence order of the activity counts is well respected in the synthetic population. Furthermore, this order makes sense in itself: the majority of the agents (around 55%) have only one activity, namely work or education for most of them. Then follow agents with no activity, which is consistent with Figure 4, then agents with 3 activities; a great number of them are working or studying and have lunch at home. The other activity counts are much less represented.

These observations are consistent with Figure 7, which shows the prevalence of activity counts per purpose in the synthetic population and compares it to the hts.
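A per-purpose indicator of this kind can be computed as follows. The sketch assumes that chains are available as dash-separated strings; the function name and the toy sample are hypothetical and serve only to illustrate the indicator.

```python
from collections import Counter


def share_by_purpose_count(chains, purpose):
    """Share of agents performing `purpose` 0, 1, 2, ... times per day."""
    counts = Counter(chain.split("-").count(purpose) for chain in chains)
    total = len(chains)
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical toy sample, not data from the hts or the synthetic population.
chains = ["h-w-h", "h-w-h-w-h", "h", "h-e-h"]
shares = share_by_purpose_count(chains, "w")
```

Applying the same function to the hts chains and to the synthetic chains gives the two series that are compared for each purpose.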

Comparison of the distance distribution in the synthesized and actual populations

Comparing how far agents have to travel to perform a given activity with what is observed in reality provides evidence of the effectiveness of the stages in which facility locations are assigned to them. The results of this comparison are presented in Figure 8.

Figure 8(b) shows that the distance distributions fit reasonably well. The average distances, shown in Figure 8(a), are satisfactory as well.
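As an illustration of the two indicators, the following sketch computes a crowfly distance and one point of an empirical cumulative distribution. It assumes that coordinates are given in a projected coordinate system with metres as units; both function names are hypothetical.

```python
import math


def crowfly_distance(origin, destination):
    """Euclidean distance between two (x, y) points, in the units of the
    coordinates (assumed here: metres in a projected coordinate system)."""
    return math.hypot(destination[0] - origin[0], destination[1] - origin[1])


def empirical_cdf(distances, threshold):
    """Share of trips whose crowfly distance does not exceed `threshold`."""
    return sum(d <= threshold for d in distances) / len(distances)
```

Evaluating `empirical_cdf` on a grid of thresholds, once for the hts and once for the synthetic population, gives two curves of the kind compared per purpose.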

Comparison of travel purposes and distances between male and female agents

As described in the previous section, the activity chains present in the hts are correlated with the sociodemographic attributes of the interviewees and, thanks to the matching process, those chains are distributed in a meaningful way among the synthetic agents. Figure 9 compares the prevalence of the most frequent activity chains in the hts and the synthetic population for male and female agents between 18 and 40 years old. The figure shows that the chain h-w-h (going from home to work and then back home) is the most prevalent for both groups of agents. However, in the hts as well as in the synthetic population, the observed frequency among males is more than 10 percentage points above the frequency observed among female agents (55-57% versus 42-45%). As a consequence, the chain distribution observed for women seems to be slightly more heavy-tailed than the one characterizing men.

FIGURE 7 Comparison of the number of activities per purpose in the agents’ activity chains between the hts and the synthetic population.

Interpretation: both in the hts and in the synthetic population, around 33% of the agents go to work once in the day, while 6% go to work twice.

This indicates a larger variety of activity patterns for women, a phenomenon that has already been investigated in (31). This observation is confirmed by Figure 10, which shows the number of activities in the hts and the synthetic population for male and female agents between 18 and 40 years old, and by Figure 11, which illustrates the number of activities per purpose in the same population.

It can also be noticed that the fifth most prevalent activity chain differs between the male and the female population: it is “h-w-h-w-h” for men and “h-o-h-o-h” for women. Further analysis reveals that this chain was originally “h-e-h-e-h” for women; those “education” activities were changed into “other” ones during the cleaning step, because some agents who are not studying were assigned activity chains with education-related trips if they escorted their children to school. This difference in the activity chain distribution between men and women thus reflects a division of activities among household members: women who stay at home take care of the children, whereas men are more often employed and some of them return home for lunch.
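The relabelling performed in the cleaning step can be pictured as follows. This is a minimal sketch under stated assumptions: the function name, the boolean `is_student` flag and the string encoding of chains are hypothetical, and the exact rule applied in the pipeline may differ.

```python
def clean_chain(chain: str, is_student: bool) -> str:
    """Relabel education activities as "other" for agents who do not study.

    Assumption: an "e" activity held by a non-student corresponds to an
    escort trip (for example, bringing a child to school).
    """
    if is_student:
        return chain
    return "-".join("o" if activity == "e" else activity
                    for activity in chain.split("-"))
```

Under this rule, a non-studying agent with the chain "h-e-h-e-h" ends up with "h-o-h-o-h", which is the pattern observed among women in Figure 9.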

Figure 12 compares the average travelled distances for different purposes in the same population. As before, it can be seen that the reference distributions, obtained from the hts, are well reflected by the synthetic population.

(a) Average distances (b) Distance cumulative distributions

FIGURE 8 Crowfly distances towards a facility by activity purpose

Whereas the travelled distances are, on average, similar between men and women, it can be observed that men travel on average 1 to 1.5 km farther than women when they travel to an educational place. The travelled distance to home is impacted by this phenomenon: it amounts to around 5.8 km for women and to more than 6 km for men. This suggests that women tend to make more trips related to education, but that the places where they study are located nearer to their homes than is the case for men.

Comparison of distance from home to the education facility

As special attention was paid to the assignment of education facilities to students and pupils, the resulting distribution of distances between an agent’s home and the education place they were assigned to was examined. This is shown in Figure 13.

(a) Most frequent activity chains among female agents (b) Most frequent activity chains among male agents

FIGURE 9 Most frequent activity chains, comparison between the hts and the synthetic population, split between men and women aged 18 to 40.

(a) Female agents (b) Male agents

FIGURE 10 Number of activities in the chains, comparison between the hts and the synthetic population, split between men and women aged 18 to 40.

(a) Female agents (b) Male agents

FIGURE 11 Number of activities per purpose, comparison between the hts and the synthetic population, split between men and women aged 18 to 40.

(a) Average travelled distances by female agents (b) Average travelled distances by male agents

FIGURE 12 Average travelled distances, comparison between the hts and the synthetic population, split between men and women aged 18 to 40.

FIGURE 13 Comparison of the average distance between an agent’s home and the education place they were assigned to, according to the agent’s age and in the entire population

From the figure, it is clear that the gap between the distances obtained from the hts and those observed in the synthetic population is small, but it increases with the age of the agents. This is because, for instance, there are many more samples of children aged 14 or less going to school than of students aged 25 and more, so the facility sampling process could not achieve the same level of accuracy for all age groups.

Figure 14(a) and Figure 14(b) show the average distances between home and education facility for agents, according to their gender and category of residence area, as those were the two other factors taken into account during the sampling process.

With gaps always smaller than 200 meters for target distances around 3 km, it can be concluded that the approach used for assigning education facilities to students was successful.
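The gap indicator used in this comparison can be sketched as below. The record layout, a list of (group, distance in km) pairs, and both function names are assumptions made for this example; groups can be age classes, genders or residence areas.

```python
from collections import defaultdict


def mean_distance_by_group(records):
    """Average home-education distance per group from (group, distance) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, distance in records:
        sums[group] += distance
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


def gaps(reference, synthetic):
    """Absolute gap between reference and synthetic means, for shared groups."""
    ref = mean_distance_by_group(reference)
    syn = mean_distance_by_group(synthetic)
    return {group: abs(ref[group] - syn[group]) for group in ref if group in syn}
```

A gap below 0.2 km for a reference mean around 3 km corresponds to the level of agreement reported above.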

(a) Average distances according to the agent’s gender (b) Average distances according to the agent’s residence area. “Downtown” designates the agents living in the central area of São Paulo, “city” those who live in the city but not in the downtown, and “state” those living in other parts of the study area, according to the zones defined in subsubsection 5.1.2

FIGURE 14 Comparison of the average distance between an agent’s home and the education place they were assigned to, according to the agent’s gender or residence area and in the entire population

DISCUSSION

While the generated synthetic population matches the reference data quite well, some of the observed discrepancies and limitations have to be pointed out.

Input data

The available input surveys (the census and the household travel survey, conducted in 2010 and 2017 respectively) were not carried out in the same year and, during the time span separating them, the population structure evolved in many respects. The unemployment rate in Brazil rose by four percentage points between 2010 and 2017 (32), and the observed mobility patterns were influenced by the latest developments of the public transport network (like the construction of new metro lines (33)) or the start of operations of well-known ride-hailing platforms, like Uber in June 2014 as reported in (34).

The population was thus synthesized from two distinct populations, which is why its observed mobility patterns sometimes do not exactly match the ones that were taken as a reference. This can explain most of the differences observed in the previous section.

Moreover, the household travel survey only allows local personal trips to be modeled: for instance, neither freight nor tourism is taken into account in the presented approach, due to the lack of data.

Imputing categories to facilities

A few issues arose concerning the creation of the facility data sets. OpenStreetMap has a poor representation of educational places, and the data gathered by the Brazilian Ministry of Education does not separate education places into kindergartens, primary and high schools, and universities. Therefore, the first attempts to assign education places to the synthesized agents were erroneous: the distribution of the distances that the agents cover from their home to their study location was too dissimilar to the targeted distribution. As presented in section 5, the proposed solution was to differentiate those distributions according to the agents’ age. This ensures that the distributions are respected, but no guarantee can be offered that each agent is actually linked to an education facility matching his or her age.
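The idea of differentiating target distances by age can be illustrated with the following sketch. It is not the algorithm of the pipeline, which is described in section 5. It is a simplified stand-in that draws a target distance from the reference sample of the agent’s age group and selects the facility whose crowfly distance is closest to it. All names and the data layout are hypothetical.

```python
import random


def assign_facility(home, age_group, facilities, reference_distances, rng):
    """Pick the facility whose crowfly distance to `home` is closest to a
    distance drawn from the reference sample of the agent's age group.

    home: (x, y) in metres; facilities: list of (x, y) in metres;
    reference_distances: dict mapping age group to observed distances.
    """
    target = rng.choice(reference_distances[age_group])

    def gap(facility):
        dx, dy = facility[0] - home[0], facility[1] - home[1]
        return abs((dx * dx + dy * dy) ** 0.5 - target)

    return min(facilities, key=gap)
```

As in the limitation noted above, nothing in such a procedure checks the type of the selected facility, so a young pupil may still be linked to a university if it happens to lie at the sampled distance.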

Further improvements

As pointed out in the previous pages, there is still room for improvement, which would lead to more accurate results and a better representation of the average mobility demand in the São Paulo Metropolitan Region. Most of it depends strongly on data availability:

• As mentioned in the introduction, Santos is a major city with a population of more than 400 000 inhabitants. As it is home to the largest seaport of Latin America and located only 80 km away from São Paulo, it is obvious that it contributes to the observed transport flows in the megacity. In particular, taking into account commuter flows from one city to the other would enhance the travel survey and, as a result, improve the quality of the modeled transport demand.

• Freight traffic and commercial agents’ routes are also missing from the current trip data sets. As the impact of such trips on the overall transport situation may not be negligible, taking them into account would benefit later transport simulations.

• Currently, household structure is not considered in the process of matching activity chains to individuals. As all household members are interviewed in the Household Travel Survey, it would however be possible to maintain the interactions existing within the households in the matching phase. This would ensure, first, that joint trips are modeled properly and, second, that shared resources (cars or bicycles, for instance) are distributed appropriately among the household members. For example, this would guarantee that, if an adult member leaves home with the only car available to the household, then no other member can take the car to go shopping before the first one is back.
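The shared-car constraint mentioned in the last item can be expressed as a simple consistency check. The sketch below is hypothetical and not part of the presented pipeline. It assumes a household with a single car and car tours given as (member, departure, return) tuples with times in seconds after midnight.

```python
def car_conflicts(tours):
    """Return pairs of household members whose car tours overlap in time.

    tours: list of (member, departure, return) tuples for a household
    that owns a single car; times are in seconds after midnight.
    """
    ordered = sorted(tours, key=lambda tour: tour[1])
    conflicts = []
    for i, (member_a, _, return_a) in enumerate(ordered):
        for member_b, departure_b, _ in ordered[i + 1:]:
            # A later tour may only start once the earlier one has returned.
            if departure_b < return_a:
                conflicts.append((member_a, member_b))
    return conflicts
```

A matching procedure that accounts for household structure would have to produce plans for which this list is empty.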

CONCLUSION

This paper presented a process to generate a synthetic population for São Paulo based on a new pipeline that makes it possible, among other things, to obtain an operational scenario directly from raw data. All the data sets used here, as well as the framework itself, are open-source and, consequently, if access to raw data is provided, the results are entirely reproducible by others.

The proposed approach, based on rather simple algorithms, can be easily adapted to the generation of other scenarios. It could therefore serve as a benchmark for future improvements. It also shows that reliable outputs can be obtained even if the input data is not the most suitable: in the case at hand, the census data was collected seven years before the household travel survey was conducted.

Different parts of the article indicate potential future work. Taking into account other variables, such as parking costs, freight traffic or commuter flows from and to the neighboring city of Santos, or considering household structure when matching activity chains to individuals, are possible avenues for improvement. The hope is to continue to collect data to improve the quality of the results and, furthermore, to expand this open-source and open-data approach to new scenarios.

ACKNOWLEDGMENT

We would like to acknowledge Airbus Urban Mobility GmbH, whose funding has supported the development of a synthetic agent-based scenario for the Greater São Paulo Metropolitan Region.

AUTHOR CONTRIBUTION

The authors confirm contribution to the paper as follows: study conception and design: A. Sallard, M. Balać, S. Hörl; data collection: A. Sallard, M. Balać; analysis and interpretation of results: A. Sallard, M. Balać; draft manuscript preparation: A. Sallard, M. Balać. All authors reviewed the results and approved the final version of the manuscript.

REFERENCES

1. Ortuzar, J. and L. Willumsen (2011) Modelling Transport, 4th edition.

2. Chapin, F. S. (1974) Human activity patterns in the city: Things people do in time and in space, vol. 13, Wiley-Interscience.

3. Kitamura, R. (1988) An evaluation of activity-based travel analysis, Transportation 15, 9–34.

4. Axhausen, K. and T. Gärling (1992) Activity-based approaches to travel analysis: conceptual frameworks, models, and research problems, Transport reviews 12, 323–341.

5. Recker, W. (1995) The household activity pattern problem: general formulation and solution, Transportation Research Part B: Methodological 29, 61–77.

6. Hägerstrand, T. (1970) What about people in regional science?, paper presented at the Papers of the Regional Science Association, vol. 24.

7. Chu, Z., L. Cheng and H. Chen (2012) A review of activity-based travel demand modeling, in CICTP 2012: Multimodal Transportation Systems—Convenient, Safe, Cost-Effective, Efficient, 48–59.

8. Wen, C.-H. (1998) Development of stop generation and tour formation models for the analysis of travel/activity behavior.

9. Lee, Y., M. Hickman and S. Washington (2007) Household types and structure, time-use pattern, and trip-chaining behavior, Transportation Research Part A: Policy and Practice, 41 (10) 1004–1020.

10. Bowman, J. L. (1995) Activity based travel demand model system with daily activity schedules, Ph.D. Thesis, Massachusetts Institute of Technology.

11. Bowman, J. L. (1998) The day activity schedule approach to travel demand analysis, Ph.D. Thesis, Massachusetts Institute of Technology.

12. ActivitySim (2020) An open platform for activity-based travel modeling, https://activitysim.github.io/.

13. Bonabeau, E. (2002) Agent-based modeling: Methods and techniques for simulating human systems, Proceedings of the National Academy of Sciences 99, 7280–7287.

14. Wong, D. W. (1992) The reliability of using the iterative proportional fitting procedure, The Professional Geographer, 44 (3) 340–348.

15. Norman, P. (1999) Putting iterative proportional fitting on the researcher’s desk.

16. Ye, X., K. Konduri, R. M. Pendyala, B. Sana and P. Waddell (2009) A methodology to match distributions of both household and person attributes in the generation of synthetic populations, paper presented at the 88th Annual Meeting of the Transportation Research Board, Washington, DC.

17. Hörl, S. and M. Balać (2020) Reproducible scenarios for agent-based transport simulation: A case study for Paris and Île-de-France, May 2020.

18. Balać, M. and S. Hörl (2020) Synthetic population for the state of California based on open-data: examples of San Francisco Bay Area and San Diego County. Submitted for presentation at TRB 2021.

19. Hörl, S., F. Becker, T. Dubernet and K. Axhausen (2019) Induzierter Verkehr durch autonome Fahrzeuge: Eine Abschätzung (traffic induced by autonomous vehicles: an estimation), SVI 2016/001, Schriftenreihe 1650.

20. OpenStreetMap contributors (2017) Planet dump retrieved from https://planet.osm.org, https://www.openstreetmap.org.

21. Coordenadoria de Informação, Evidência, Tecnologia e Matrícula (CITEM) (2020) Endereços de escolas (addresses of the schools), https://dados.educacao.sp.gov.br/dataset/endereços-de-escolas.

22. Instituto Brasileiro de Geografia e Estatística (2011) The 2010 population census summary, https://www.ibge.gov.br/en/statistics/social/population/18391-2010-population-census.html?edicao=19720&t=publicacoes.

23. Transportes Metropolitanos (2017) Resultados finais da pesquisa origem e destino 2017 (final results of the 2017 origin-destination survey), http://www.metro.sp.gov.br/pesquisa-od/.

24. Balać, M. and S. Hörl (2020) Eqasim, https://eqasim.org/.

25. Horni, A., K. Nagel and K. W. Axhausen (2016) The Multi-Agent Transport Simulation MATSim, Ubiquity Press, London.

26. Hörl, S., M. Balać and K. W. Axhausen (2018) A first look at bridging discrete choice modeling and agent-based microsimulation in MATSim, Procedia computer science, 130, 900–907.

27. D’Orazio, M., M. Di Zio and M. Scanu (2012) Statistical matching of data from complex sample surveys, paper presented at the Proceedings of the European Conference on Quality in Official Statistics-Q2012, vol. 29.

28. Hörl, S. and K. W. Axhausen (2020) Relaxation-discretization algorithm for spatially constrained secondary location assignment, paper presented at the 99th Annual Meeting of the Transportation Research Board.

29. Instituto Brasileiro de Geografia e Estatística (2016) Principais destaques da evolução do mercado de trabalho nas regiões metropolitanas abrangidas pela pesquisa (Main highlights of the evolution of the labor market in the metropolitan regions covered by the survey), ftp://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Mensal_de_Emprego/Evolucao_Mercado_Trabalho/retrospectiva2003_2011.pdf.

30. Instituto Brasileiro de Geografia e Estatística (2017) Pesquisa nacional por amostra de domicílios contínua, quarto trimestre de 2017 (national household sample survey, fourth quarter 2017), https://biblioteca.ibge.gov.br/visualizacao/periodicos/2421/pnact_2017_4tri.pdf.

31. Scheiner, J. and C. Holz-Rau (2017) Women’s complex daily lives: a gendered look at trip chaining and activity pattern entropy in Germany, Transportation, 44 (1) 117–138.

32. Plecher, H. (2019) Brazil: Unemployment rate from 1999 to 2019, https://www.statista.com/statistics/263711/unemployment-rate-in-brazil/.

33. G1 São Paulo (2014) Primeiro trecho da Linha 15-Prata do monotrilho é aberto em São Paulo (first section of the metro line 15 is opened in São Paulo), http://g1.globo.com/sao-paulo/noticia/2014/08/primeiro-trecho-da-linha-15-prata-do-monotrilho-e-aberto-em-sao-paulo.html.

34. Zanatta, R. A. and B. Kira (2018) Regulation of Uber in São Paulo: from conflict to regulatory experimentation, International Journal of Private Law, 9 (1-2) 83–94.